How To Prevent Screen Scraping: Policy, Contracts and Technology Evaluation | Imperva

How To Prevent Screen Scraping: Policy, Contracts and Technology Evaluation

Note: This is a guest post on Clareity Consulting, a group that provides management and information technology consulting for the real estate and related industries.

When organizations create policies requiring the prevention and monitoring of screen scraping, web scraping, and other automated attacks, those policies must be specific enough that compliance with them can be measured in some way.

It is equally important for the organization to ensure that its technology contracts contain clear and explicit terms that implement those policies.

For those unfamiliar with this subject, screen scraping can be stopped in multiple ways. The following list includes some background reading regarding web scraping and screen scraping in the real estate industry:

If policies and contracts do not contain specific anti-scraping technology requirements, one can easily end up in an argument over whether the steps taken to prevent screen scraping are sufficient, even if those steps are demonstrably ineffective.

For example, a website provider might implement a “CAPTCHA” on login and say, “This should be enough to prove the humanity of the user. It’s not a computer program using the website.”

However, many CAPTCHA tests are easy for computers to defeat (it’s an arms race!). Moreover, if all a data pirate needs to do is have a human being log in and/or complete a CAPTCHA test once per day, and then have a computer capture the cookie (containing session information) for use in scraping data, the CAPTCHA is not a very high barrier.

Likewise, an anti-scraping solution might block an IP address as being used by a scraper if the website gets more than 20 or 30 information requests per minute from that address. That seems like a reasonable step, but these days the more advanced scrapers spin up a hundred servers on different IP addresses, have each of them grab the data from just a few pages, and then move those servers to different IP addresses.
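To make the limitation concrete, here is a minimal sketch of a per-IP limiter in Python. The class name, the 30-requests-per-minute threshold, and the example addresses (taken from documentation-only IP ranges) are assumptions chosen for illustration, not a description of any particular product. It compares one scraper sending 300 requests from a single address with the same 300 requests spread over 100 addresses.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: reject an IP once it exceeds max_requests per window."""

    def __init__(self, max_requests=30, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_blocked(limiter, requests):
    """requests: iterable of (ip, timestamp) pairs. Returns how many were rejected."""
    return sum(0 if limiter.allow(ip, ts) else 1 for ip, ts in requests)


# One scraper on a single address: 300 requests within one minute.
single_ip = [("203.0.113.7", i * 0.2) for i in range(300)]

# The same 300 requests spread over 100 addresses, 3 requests each.
distributed = [(f"198.51.100.{i % 100}", i * 0.2) for i in range(300)]

blocked_single = count_blocked(PerIpRateLimiter(), single_ip)
blocked_distributed = count_blocked(PerIpRateLimiter(), distributed)
print(blocked_single, blocked_distributed)
```

Under these assumptions the single-address scraper is cut off after its first 30 requests, while the distributed scraper never comes near the threshold on any one address, which is why a per-IP rate limit cannot carry the whole defense on its own.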

Thus, anti-scraping is difficult, and while the mechanisms mentioned above might play a part in a solution, one must include a more comprehensive solution if one wishes to actually have a reasonable chance of stopping the screen scrapers.

Moreover, that more comprehensive solution should be detailed explicitly both in the terms of any contracts executed by the defending organization and in any policies that organization implements regarding reasonable security measures.

Anti-scraping requirements might look something like the following:

The display (website, app’s API) must implement technology that prevents, detects, and proactively mitigates scraping. This means implementing an effective combination of the countermeasures defined in the “OWASP Automated Threat Handbook” “Figure 7: Automated Threat Countermeasure Classes” (reproduced below and available at https://www.owasp.org/index.php/OWASP_Automated_Threats_to_Web_Applications).

Those countermeasures must be demonstrably effective against commercial scraping services as well as advanced and evolving scraping techniques.

The anti-scraping solution must comprise multiple countermeasures in each of the three classes of countermeasures (prevent, detect, and recover) as defined by OWASP, sufficient to address all aspects of the security threat model, including at least complete implementations of all of the following: Fingerprinting, Reputation, Rate, Monitoring, and Instrumentation.

Those fielding displays and APIs requiring anti-scraping technology must demonstrate compliance with the above requirements using technology they have built or a commercial product/service.

It must be demonstrated that the technology meets those requirements and that it has been properly configured to effectively address scraping.

Following is some more detail about each countermeasure:

| Countermeasure | Brief Description | Prevent | Detect | Recover (Mitigate) |
|---|---|---|---|---|
| Requirements | Define relevant threats and assess effects on site’s performance toward business objectives | X | X | X |
| Obfuscation | Hide assets, add overhead to screen scraping and hinder theft of assets | X | | |
| Fingerprinting | Identify automated usage by user agent string, HTTP request format, and/or device fingerprint content | X | X | |
| Reputation | Use reputation analysis of user identity, user behavior, resources accessed, not accessed, or repeatedly accessed | X | X | |
| Rate | Limit number and/or rate of usage per user, IP address/range, device ID/fingerprint, etc. | X | X | |
| Monitoring | Monitor errors, anomalies, function usage/sequencing, and provide alerting and/or monitoring dashboard | | X | |
| Instrumentation | Perform real-time attack detection and automated response | X | X | X |
| Response | Use incident data to feed back into adjustments to countermeasures (e.g. Requirements, Testing, Monitoring) | X | | X |
| Sharing | Share fingerprints and bot detection signals across infrastructure and clients | X | X | |
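One way to make such requirements measurable is to encode the table above as data and check a proposed solution against it. The Python sketch below is illustrative only: the names `Countermeasure` and `review_solution` are hypothetical, and reading “multiple countermeasures” in each class as “at least two” is an assumption about the requirement, not something OWASP defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Countermeasure:
    name: str
    prevent: bool
    detect: bool
    recover: bool


# Class membership copied from the countermeasure table above.
COUNTERMEASURES = [
    Countermeasure("Requirements", True, True, True),
    Countermeasure("Obfuscation", True, False, False),
    Countermeasure("Fingerprinting", True, True, False),
    Countermeasure("Reputation", True, True, False),
    Countermeasure("Rate", True, True, False),
    Countermeasure("Monitoring", False, True, False),
    Countermeasure("Instrumentation", True, True, True),
    Countermeasure("Response", True, False, True),
    Countermeasure("Sharing", True, True, False),
]

# The five countermeasures the sample requirement names explicitly.
REQUIRED = frozenset(
    {"Fingerprinting", "Reputation", "Rate", "Monitoring", "Instrumentation"}
)
CLASSES = ("prevent", "detect", "recover")


def review_solution(implemented, min_per_class=2):
    """Return (missing required countermeasures, classes with too few countermeasures).

    min_per_class=2 is an assumed reading of "multiple countermeasures".
    """
    implemented = set(implemented)
    missing = sorted(REQUIRED - implemented)
    counts = {cls: 0 for cls in CLASSES}
    for cm in COUNTERMEASURES:
        if cm.name in implemented:
            for cls in CLASSES:
                counts[cls] += getattr(cm, cls)
    thin = [cls for cls in CLASSES if counts[cls] < min_per_class]
    return missing, thin


print(review_solution(REQUIRED))
print(review_solution(REQUIRED | {"Response"}))
```

Under that reading, the five named countermeasures alone leave the recover class with a single entry (Instrumentation), so a solution would also need something like Response to pass. A check like this only confirms that countermeasures are claimed; it says nothing about whether they are configured effectively, which still has to be demonstrated.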

The appendix that follows provides more thorough OWASP countermeasure definitions.

Appendix: OWASP Countermeasure Definitions

The following is excerpted from “OWASP Automated Threat Handbook Web Applications”:

  • Requirements. Identify relevant automated threats in security risk assessment, and assess effects of alternative countermeasures on functionality usability and accessibility. Use this to then define additional application development and deployment requirements.
  • Obfuscation. Hinder automated attacks by dynamically changing URLs, field names and content, or limiting access to indexing data, or adding extra headers/fields dynamically, or converting data into images, or adding page and session-specific tokens.
  • Fingerprinting. Identification and restriction of automated usage by automation identification techniques, including utilization of user agent string, and/or HTTP request format (e.g. header ordering), and/or HTTP header anomalies (e.g. HTTP protocol, header inconsistencies), dynamic injections, and/or device fingerprint content to determine whether a user is likely to be a human or not. As a result of these countermeasures, for example, browsers automated via tools such as Selenium must certainly be blocked. The technology should use machine learning or behavioral analysis to detect automation patterns and adapt to the evolving threat on an ongoing basis.
  • Reputation. Identification and restriction of automated usage by utilizing reputation analysis of user identity (e.g. web browser fingerprint, device fingerprint, username, session, IP address/range/geolocation), and/or user behavior (e.g. previous site, entry point, time of day, rate of requests, rate of new session generation, paths through application), and/or types of resources accessed (e.g. static vs dynamic, invisible/hidden links, robots.txt file, paths excluded in robots.txt, honey trap resources, cache-defined resources), and/or types of resources not accessed (e.g. JavaScript generated links), and/or types of resources repeatedly accessed. As a result of these countermeasures, for example, known commercial scraping tools and the use of data center IP addresses must certainly be identified and blocked.
  • Rate. Set upper and/or lower limits and/or trend thresholds, and limit number and/or rate of usage per user, per group of users, per IP address/range, and per device ID/fingerprint. Note that this kind of countermeasure cannot stand alone as hackers commonly utilize a slow crawl from many rotating IP addresses that can simulate the activity of legitimate users.
  • Monitoring. Monitor errors, anomalies, function usage/sequencing, and provide alerting and/or monitoring dashboard.
  • Instrumentation. Build in application-wide instrumentation to perform real-time attack detection and automated response including locking users out, blocking, delaying, changing behavior, altering capacity/capability, enhanced identity authentication, CAPTCHA, penalty box, or other technique needed to ensure that automated attacks are unsuccessful.
  • Response. Define actions in an incident response plan for various automated attack scenarios. Consider automated responses once an attack is detected. Consider using actual incident data to feed back into other countermeasures (e.g. Requirements, Testing, Monitoring).
  • Sharing. Share information about automated attacks, such as IP addresses or known violator device fingerprints, with others in the same sector, with trade organizations, and with national CERTs.
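As a rough illustration of the Fingerprinting and Reputation definitions above, the following Python sketch collects a few simple signals from a single request: a user agent string that names a known automation tool, a missing header that browsers normally send, a hit on a honey trap resource, and access to a path excluded in robots.txt. The marker list, paths, and function name are all hypothetical, and the checks are deliberately naive. Every one of these headers can be forged, so signals like these would feed a broader scoring system rather than decide anything alone.

```python
# Hypothetical values, chosen only for illustration.
AUTOMATION_UA_MARKERS = ("python-requests", "curl", "scrapy", "headlesschrome")
HONEY_TRAP_PATHS = {"/listings/hidden-link"}    # linked invisibly; humans never click it
ROBOTS_DISALLOWED_PREFIXES = ("/private/",)     # paths excluded in robots.txt


def automation_signals(headers, path):
    """Return a list of signal names suggesting the request may be automated."""
    # Header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()

    signals = []
    if not user_agent:
        signals.append("missing-user-agent")
    elif any(marker in user_agent for marker in AUTOMATION_UA_MARKERS):
        signals.append("automation-user-agent")
    if "accept-language" not in normalized:
        signals.append("missing-accept-language")
    if path in HONEY_TRAP_PATHS:
        signals.append("honey-trap-hit")
    if path.startswith(ROBOTS_DISALLOWED_PREFIXES):
        signals.append("robots-excluded-path")
    return signals


browser_like = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}
script_like = {"User-Agent": "python-requests/2.31.0"}

print(automation_signals(browser_like, "/listings/123"))
print(automation_signals(script_like, "/listings/hidden-link"))
```

A scraper that copies a browser’s headers and avoids the trap paths would produce no signals here, which is consistent with the requirement that fingerprinting be combined with behavioral analysis, rate limits, monitoring, and instrumentation.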
