Block Web Scrapers from Stealing Your Business
There are 50 billion pages on the web, and most of them look genuine. In reality, though, many of those pages were created by bots programmed to directly copy or selectively extract data from other sites, and they are specifically intended to steal business from legitimate organizations.
Understand how you can prevent your business from becoming a victim of these pirates by downloading The Ultimate Guide to Preventing Web Scraping. You’ll learn:
How web scraping can damage your business
The Open Web Application Security Project (OWASP) describes scraping as "Collect application content and/or other data for use elsewhere". That can mean loss of exclusivity, undercut offers, lowered search engine rankings, even entire new competing businesses. And it costs almost nothing to do.
What you can do to prevent web scraping
You may be attempting to deal with web scrapers manually, or your existing security vendors may be telling you they have a plug-in to block the scrapers. Sadly, it's not that easy (and the legal system doesn't necessarily see scraping as a problem).
Manually dealing with web scrapers means many hours spent on a game of whack-a-mole: trawling through logs and taking action based on tell-tale behavior. All this effort may work for a short while before the web scrapers are back, using new IP addresses or hiding behind new proxies.
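The manual approach typically boils down to spotting IPs whose request volume is far above normal and blocking them. A minimal sketch of that log-trawling step is below; the log format, sample data, and threshold are illustrative assumptions, not a real detection rule.

```python
from collections import Counter

def flag_suspect_ips(log_lines, threshold=100):
    """Return IPs whose request count in the sample exceeds `threshold`."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n > threshold}

# Tiny synthetic sample: one "scraper" IP issuing far more requests than the rest.
sample = ['203.0.113.9 GET /listing/%d' % i for i in range(150)]
sample += ['198.51.100.%d GET /home' % i for i in range(20)]

print(flag_suspect_ips(sample))  # the high-volume IP stands out
```

The weakness is exactly the one described above: the moment the flagged IP is blocked, the scraper returns from a fresh address or proxy and the count starts over.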
Web application firewalls (WAFs) are designed to protect web applications from exploitation of common software vulnerabilities. The problem is that web scrapers are not targeting vulnerabilities; they are mimicking real users. So, other than blocking manually identified IP addresses (see the previous point), WAFs are of little use for controlling web scraping.
Login enforcement is used by some web sites to gate access to their most valuable data. The problem is that it is easy for scraper operators to create their own accounts and program their web scrapers accordingly. Stronger authentication or CAPTCHAs can be used (see the next point), but these add inconvenience for your legitimate users, potentially enough for them to abandon creating an account.
CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are an obvious way to ask a user to prove they are human. They annoy some users who find them hard to interpret and, needless to say, workarounds have been developed. One of the bad-bot activities described by OWASP is CAPTCHA Defeat (OAT-009). There are also CAPTCHA farms, where the test posed by the CAPTCHA is outsourced to teams of low-cost humans via sites on the dark web.
Geo-fencing means web sites are only exposed within the geographic locations in which they conduct business. This will not stop web scraping per se, but it forces the perpetrators to make the extra effort of appearing to run their web scrapers within a specific geographic location. That may be as simple as using a VPN link to a local point of presence (PoP).
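Conceptually, a geo-fence is just a country allow-list applied to each request's source IP. The sketch below illustrates the check; a real deployment would use a GeoIP database lookup, for which a stub dictionary stands in here, and the IPs and country codes are invented for illustration.

```python
# Stub standing in for a real GeoIP database lookup (illustrative data only).
GEOIP_STUB = {
    '203.0.113.9': 'NL',
    '198.51.100.7': 'US',
}

ALLOWED_COUNTRIES = {'NL', 'BE'}  # markets where the site does business

def is_allowed(ip):
    country = GEOIP_STUB.get(ip)          # unknown IPs resolve to None...
    return country in ALLOWED_COUNTRIES   # ...and are therefore rejected

print(is_allowed('203.0.113.9'))   # True: inside the fence
print(is_allowed('198.51.100.7'))  # False: outside the fence
```

As the text notes, this only raises the bar: a scraper behind a VPN exit inside the allowed region passes the check unchallenged.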
Flow enforcement validates the route legitimate users take through your web site, checking them at each step of the way. Web scrapers are often hardwired to go straight after high-value targets and run into problems if forced to follow a typical user's predetermined flow.
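One way to picture flow enforcement: the server checks that the pages a session has visited form a valid prefix of the expected user journey, so a request that jumps straight to the high-value page is rejected. The flow and page names below are hypothetical, and real implementations track this per session server-side.

```python
# Hypothetical user journey for a listings site; a scraper that skips
# straight to /listing never satisfies the prefix check.
EXPECTED_FLOW = ['/search', '/results', '/listing']

def follows_flow(visited_pages):
    """True if the pages visited so far are a prefix of the expected flow."""
    return visited_pages == EXPECTED_FLOW[:len(visited_pages)]

print(follows_flow(['/search', '/results']))  # True: a real user mid-journey
print(follows_flow(['/listing']))             # False: jumped straight to the target
```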
Direct bot detection and mitigation is Distil's approach to protecting your business from web scraping, using behavioral analysis and digital fingerprinting technology specifically designed to separate legitimate traffic from would-be thieves. We use machine learning to identify active web scrapers and share that information across our customer base to the benefit of all. Clients such as EasyJet and the Dutch real estate MLS service Funda can then apply individual rules to allow legitimate partner and search engine activity and to determine whether, and how much, information can be scraped.
Once bot mitigation is in place, it can even be used to turn the tables on the web scrapers. To find out how, and learn more about protecting your business, download the complete guide today.
To help businesses better understand the range of threats posed by non-human traffic, Distil has partnered with research firm Quocirca to produce a series of executive guides covering Account Takeovers, Web Scraping, and more.
About the Author
Peter Zavlaris weighs in on various topics around bot mitigation and bot defense, sharing white papers, videos, and other resources on the topic.