There are a number of mitigation techniques and anti-scraping services built to control web scrapers; some are more effective than others:
1. Robot exclusion standard
This is one approach that will not work, because it relies on etiquette. The robot exclusion standard (or protocol), better known as robots.txt, is used by websites to tell web crawlers and other good bots which areas of the website should not be processed or scanned. However, web scrapers and other bad bots need not cooperate.
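To illustrate why robots.txt only restrains cooperative clients, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the paths and the `ExampleBot` user agent are hypothetical. The check runs entirely on the crawler's side, so a scraper that never calls it is unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pricing/
Disallow: /listings/
"""


def polite_can_fetch(user_agent: str, url: str) -> bool:
    """Return True if a cooperating crawler would treat the URL as allowed."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved bot asks first; a scraper simply skips this step.
    print(polite_can_fetch("ExampleBot", "https://example.com/pricing/today"))
    print(polite_can_fetch("ExampleBot", "https://example.com/about"))
```

Nothing in this exchange is enforced by the server: the file is advice, and compliance is voluntary.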
2. Manual IP blocking
Dealing with web scrapers manually involves many man-hours spent on a game of whack-a-mole, trawling through logs, identifying tell-tale behaviour, blocking IP addresses and rewriting firewall rules (see next point). All this effort may work for a short while before the web scrapers are back, with new IP addresses or hiding behind new proxies. Blocking IP addresses can also affect legitimate users coming via the same service provider as the web scrapers.
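The log trawling described above can be sketched roughly as follows. This assumes access-log lines whose first whitespace-separated field is the client IP address (as in the Common Log Format); the function name and the threshold are hypothetical choices for illustration.

```python
from collections import Counter


def top_talkers(log_lines, threshold):
    """Count requests per client IP and return those at or above threshold.

    Assumes the client IP is the first whitespace-separated field of
    each log line; blank lines are ignored.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if parts:
            counts[parts[0]] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512',
        '203.0.113.5 - - [01/Jan/2024:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512',
        '203.0.113.5 - - [01/Jan/2024:10:00:02 +0000] "GET /p/3 HTTP/1.1" 200 512',
        '198.51.100.9 - - [01/Jan/2024:10:00:03 +0000] "GET / HTTP/1.1" 200 1024',
    ]
    print(top_talkers(sample, threshold=3))
```

A list produced this way goes stale as soon as the scraper rotates addresses, and any address it contains may be shared with legitimate users, which is the weakness the paragraph above describes.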
3. Web application firewalls (WAF)
WAFs are designed to protect web applications from being exploited through common software vulnerabilities. Web scrapers are not targeting vulnerabilities but aiming to mimic real users. Therefore, other than being programmed to block manually identified IP addresses (see previous point), WAFs are of little use for controlling web scraping.
4. Login enforcement
Some websites require a login to access the most valued data. However, this is no protection against web scrapers, as it is easy for the perpetrators to create their own accounts and program their web scrapers accordingly. Strong authentication or CAPTCHAs (see next point) can be deployed, but these introduce more inconvenience for legitimate users, whose initial casual interest may be dispelled by the commitment of account creation.
5. Are you a human?
One obvious way to vet web scraping is to ask users to prove they are human. This is the aim of CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). They annoy some users who find them hard to interpret and, needless to say, workarounds have been developed. One of the bad-bot activities described by OWASP is CAPTCHA Bypass (OAT-009). There are also CAPTCHA farms, where the test posed by the CAPTCHA is outsourced to teams of low-cost humans via sites on the dark web.
6. Geo-fencing
Geo-fencing means websites are only exposed within the geographic locations in which they conduct business. This will not stop web scraping per se, but will mean the perpetrators have to go to the extra effort of seeming to run their web scrapers within a specific geographic location. This may simply involve using a VPN link to a local point of presence (PoP).
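As a rough sketch of the geo-fencing check, the example below maps client addresses to countries using a small hard-coded table. The table, the country codes and the function names are all hypothetical; a real deployment would look addresses up in a GeoIP database instead. The address ranges are reserved documentation ranges.

```python
import ipaddress

# Hypothetical network-to-country table, for illustration only.
COUNTRY_NETWORKS = {
    "NL": [ipaddress.ip_network("198.51.100.0/24")],
    "GB": [ipaddress.ip_network("203.0.113.0/24")],
}

# Assumed list of countries in which the site does business.
ALLOWED_COUNTRIES = {"NL"}


def country_of(ip: str):
    """Return the country code for an address, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for country, networks in COUNTRY_NETWORKS.items():
        if any(addr in net for net in networks):
            return country
    return None


def is_allowed(ip: str) -> bool:
    """Allow only requests that appear to originate in an allowed country."""
    return country_of(ip) in ALLOWED_COUNTRIES
```

The check sees only the address the request arrives from, so a scraper using a VPN exit inside an allowed network passes it exactly as a local customer would.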
7. Flow enforcement
Enforcing the route legitimate users take through a website can ensure they are validated each step of the way. Web scrapers are often hardwired to go after high-value targets and encounter problems if forced to follow a typical user’s predetermined flow.
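One way flow enforcement might be sketched is with a signed token issued at each step and required at the next. The step names, the secret key and the function names below are assumptions made for illustration, not a description of any particular product.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # assumption: server-side secret
FLOW = ["search", "results", "detail"]      # hypothetical page sequence


def issue_token(session_id: str, step: str) -> str:
    """Sign the fact that this session has completed the given step."""
    msg = f"{session_id}:{step}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()


def may_enter(session_id: str, step: str, token=None) -> bool:
    """Allow a step only if the token proves the previous step was visited."""
    if step not in FLOW:
        return False
    index = FLOW.index(step)
    if index == 0:
        return True  # the first step needs no proof
    if token is None:
        return False
    expected = issue_token(session_id, FLOW[index - 1])
    return hmac.compare_digest(expected.encode(), token.encode())
```

A scraper that jumps straight to the `detail` step has no valid token for `results` and is refused, while a user following the normal route collects the tokens without noticing.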
8. Direct bot detection and mitigation
The aim here is the direct detection of scrapers through a range of techniques including behavioral analysis and digital fingerprinting using specific bot detection and control technology designed for the task.
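Behavioral analysis can be illustrated, in a very simplified form, by scoring a client on its request timing and declared user agent. The signals and thresholds below are arbitrary assumptions chosen for the example; commercial bot detection combines many more signals, and this sketch does not represent any vendor's method.

```python
from statistics import mean, pstdev

# Assumed substrings that suggest a scripted client; illustrative only.
SCRIPT_AGENT_HINTS = ("python", "curl", "wget")


def bot_score(timestamps, user_agent):
    """Return a score from 0 to 3; higher means more bot-like.

    timestamps: request times in seconds, in ascending order.
    """
    score = 0
    agent = (user_agent or "").lower()
    if not agent or any(hint in agent for hint in SCRIPT_AGENT_HINTS):
        score += 1  # missing or script-like user agent
    if len(timestamps) >= 3:
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        if mean(gaps) < 1.0:
            score += 1  # sustained rate above one request per second
        if pstdev(gaps) < 0.05:
            score += 1  # suspiciously regular intervals
    return score
```

A client requesting a page every half second with a `python-requests` user agent would score 3 here, while a browser with irregular gaps of several seconds would score 0. A scraper that mimics human timing and a browser user agent would also score 0, which is why fingerprinting and other signals are used alongside.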
As the sheer volume, sophistication and damage potential of web scraping grow, it puts a costly and unmanageable strain on security staff and resources. Although many software vendors claim to help enterprises address this critical problem, only Distil Networks' unique, more holistic approach provides a truly effective anti-scraping service.
Through a potent combination of superior technology, human expertise and a wealth of knowledge and experience, Distil provides unprecedented protection from web scraping without affecting the flow of business-critical traffic. Finally, security professionals can rest easy, knowing their defense is as sophisticated, adaptable and vigilant as the threat itself.
Many customers have benefited from Distil's anti-scraping services. These include Funda, a Dutch property sales site, which integrated Distil with AWS and F5 to prevent web scraping of leads; easyJet, which put in place a 24-hour automated system for detecting and acting on web scraping activity; and Move, which spent over a year building its own web scraping prevention tools before realising Distil could adapt more rapidly to the changing bot-scape.
Turning the tables with anti-scraping services
With direct bot mitigation in place it is even possible to turn the tables on unwanted web scrapers and get them to work to your advantage. For example, if you are about to change prices, you can temporarily turn off bot protection so that web scrapers collect and price-match your old prices, then turn protection back on before making the changes. Web scrapers can also be fed false information by directing bots to bogus web pages not seen by customers.
About the Author
Bob joined Quocirca in 2002. His main area of coverage is route to market for ITC vendors, but he also has an additional focus on IT security, network computing systems management and managed services. Bob writes regular analytical columns for Computing, Computer Weekly, TechRepublic and Computer Reseller News (CRN), and has written for The Times, Financial Times and The Daily Telegraph. Bob blogs for Computing, Info Security Advisor and IT-Director.com. He also provides general comment for the European IT and business press. Bob has extensive knowledge of the IT industry. Prior to joining Quocirca he spent 16 years working for US technology vendors including DEC (now HP), Sybase, Gupta, Merant (now Serena), eGain and webMethods (now Software AG). Bob has a BSc in Geology from Manchester University and a PhD in Geochemistry from Leicester University.