Web Scraping: Everything You Wanted to Know (but were afraid to ask)

July 22, 2015 Courtney Cleaves

Web scraping is the act of taking content from a website with the intent of using it for purposes outside the direct control of the site owner. If your site contains content that competitors could leverage for their own commercial advantage then your business could be at risk – and you wouldn’t even know it.

Web scraping is akin to web indexing, the process by which search engines index web content. The difference is the robots.txt “rule”, which governs where bots may go on a site. Web indexers (“good bots”) follow the rules; web scrapers, on the other hand, simply steal whatever content they’ve been programmed to fetch – prices, promotions, offers, or information that would otherwise only be available to paid subscribers or authorized business partners.
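
To make the distinction concrete, here is a minimal Python sketch of what a well-behaved crawler does before fetching anything: it consults the site's robots.txt using the standard library's urllib.robotparser. The domain, paths, and user-agent string are placeholders, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A well-behaved crawler checks robots.txt before fetching a page.
# "example.com" and "MyFriendlyBot" are placeholders.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

if robots.can_fetch("MyFriendlyBot", "https://example.com/members/pricing"):
    print("robots.txt allows this URL; fetch it politely")
else:
    print("robots.txt disallows this URL; a good bot stops here")
```

A scraper, by contrast, simply skips this check (or ignores its answer) and requests whatever it has been programmed to collect.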

You’d think this kind of blatant theft would be illegal, but the legal landscape is littered with inconsistencies and inconclusive cases, and varies from country to country. Things are beginning to change, but in the meantime, you need to be aware of the potential dangers of web scraping.

Web scraping: early history and tools

Scraping has been around almost as long as the web. The motivation behind commercial web scraping has always been to gain an easy commercial advantage and includes things like undercutting a competitor’s promotional pricing, stealing leads, hijacking marketing campaigns, redirecting APIs, and the outright theft of content and data.

The first aggregators and comparison engines appeared hot on the heels of the first ecommerce boom, and operated largely unchallenged until the legal challenges of the early 2000s (described below). Early scraping tools were pretty basic – manual copying and pasting of anything visible on the site. Once programmers got involved, scraping graduated to using the Unix grep command or regular-expression matching, posting remote HTTP requests using socket programming, and parsing site content using data query languages. Today, however, it's a very different story: web scraping is big business, with high-powered tools and services to match.
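
As a rough illustration of that early programmatic style, here is a short Python sketch: fetch a page's raw HTML with a plain HTTP request and pull out anything that looks like a price with a regular expression. The target URL and the price pattern are purely hypothetical.

```python
import re
import urllib.request

# Hypothetical target page; the price format matched below is illustrative only.
URL = "https://example.com/deals"

# Fetch the raw HTML the way early scrapers did: one plain HTTP request.
with urllib.request.urlopen(URL) as response:
    html = response.read().decode("utf-8", errors="replace")

# Regular-expression matching, the spiritual successor to grep-based scraping.
# This naive pattern grabs anything that looks like "$12.34".
prices = re.findall(r"\$\d+(?:\.\d{2})?", html)
print(prices)
```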

Web scraping grows up, with advanced scraping tools and consultants for hire

Search the web today for scraping tools and you'll be overwhelmed by the choices available. There are relatively simple free tools, like Import.io and Kimono, that automate what programmers used to build by hand in the early days. There are also high-powered professional tools like UiPath and Screen Scraper that go beyond extracting data, offering automated form filling and API manipulation to move data between applications. Still other tools, like Metascraper, excel at peeling off metadata, and some even mimic human behavior. Check out ScraperWiki for the latest in good and bad scraping techniques. Automation Anywhere claims to be able to automate any web-based task, and browser automation tools like Selenium and PhantomJS can imitate human browsing right down to natural pauses, making them almost indistinguishable from real visitors.
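
To give a sense of how easily browser automation can mimic a human visitor, here is a short Selenium sketch in Python (assuming the selenium package and a local Chrome driver are installed; the target URL is a placeholder). It loads a page, scrolls in small increments, and pauses for random intervals, much as a person reading the page would.

```python
import random
import time

from selenium import webdriver

# Placeholder target; assumes Chrome and a matching chromedriver are installed.
driver = webdriver.Chrome()
driver.get("https://example.com/listings")

# Scroll the page in small steps with randomized pauses to imitate
# the uneven rhythm of a human reader.
for _ in range(5):
    driver.execute_script("window.scrollBy(0, 400);")
    time.sleep(random.uniform(1.5, 4.0))

html = driver.page_source  # the fully rendered page, JavaScript included
driver.quit()
```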

If you’re not a programmer, web scraping is still within easy reach. Just enter “web scraping consultant” into a search engine and you’ll get pages and pages of professional service offerings. Anyone can get into the web scraping game.

What industries or types of sites do web scrapers target?

Although any website is susceptible to web scraping, certain industries are prime targets:

  • Digital publishers and directories. Given that much of their intellectual property is right out there in the open, digital publishers are at the top of the “most wanted” list. Yell.com, the multinational directories and Internet services company that grew out of British Telecom’s yellow pages operation, found that they were not only losing unique data and suffering sluggish site performance, but their customers were also being slammed by spammers using scraped form data. Manta is a destination site where about 30 million small and midsize companies market themselves to each other and to consumers, making it attractive to scrapers that use automated bots to steal content and distract valuable IT resources away from business-facing tasks.
  • Travel. It seems there’s a new travel deal locator every day, and it’s no coincidence that many of these companies built their business models on web scraping. Leading online travel agencies like Kayak, Priceline, TripAdvisor, Expedia, Trivago, and Hipmunk all built their multibillion-dollar meta-search businesses around site scraping (though much of that scraping was done legally). Red Label Vacations, the largest independent travel brand in Canada, had bots executing searches on its site, which triggered third-party API calls and incurred significant fees. (Learn how to defend online travel sites from web scraping.)

  • Real estate. A few years back, several MLSs were attacked by a nationally operated scraper; recovery cost the services over ten million dollars and a lot of time in court. Around the same time, Realtor.com was offline for a week because of a wave of attacks, and had to spend a seven-figure sum on offline advertising to keep the business going. Any listing site is a treasure trove of leads for the real-estate ecosystem of bankers, brokers, moving companies, and the like. Realtors operating in super-hot markets like the San Francisco Bay Area, where time really is of the essence in submitting a successful offer, are particularly vulnerable. In fact, MLS rules now require that operators of Virtual Office Websites take steps to ensure their data is not harvested by web scrapers. A recent examination of one Internet Data Exchange (IDX – the MLS search technology) vendor’s site found that bots had made over seven million page requests in a two-month period. Now that IDX rules are changing to allow the sale of data, the target is about to get even larger. Learn why web scraping is costly for real estate agents, brokers, and portals.

Bottom line – if your site contains content that represents revenue for your business, that business is at risk.

Is web scraping illegal?

The legal status of web scraping has been yo-yoing around the legal landscape since the turn of the century. That’s when Bidder’s Edge, an early auction data aggregator, was sued by eBay for scraping data from the online auction site under trespass-to-chattels law; the courts found in eBay’s favor, but Bidder’s Edge appealed and the case was eventually settled out of court. The thrust of that judgment was reversed in 2001, when a travel agency sued a competitor for scraping its prices as a basis for setting its own. In that case, the judge ruled that the site owner’s objection to the scraping was insufficient grounds to make it “unauthorized access” under federal hacking laws, and two years later the position shifted yet again with Intel v. Hamidi.

For the next several years, the courts presided over a terms of use tug-of-war, ruling time and again that simply including “do not scrape us” in your website Ts & Cs did not constitute a legally binding agreement. It seemed like the battle against the scrapers had been lost.

The tide started to turn in 2009, when Facebook won a lawsuit against a web scraper using copyright law, laying the foundation for other lawsuits that could tie web scraping to copyright violations and hence monetary damages. Then, in 2013, the Associated Press won its case against web scraper Meltwater after the court rejected Meltwater’s fair-use defense, and the fate of the scraper was sealed – or so we thought. Unfortunately, shortly before that judgment, Andrew Auernheimer was convicted of a felony for scraping content from public areas of the AT&T website that had been exposed because of faulty programming by AT&T.

Now we seem to be back in legal limbo-land. Data protection laws in Europe have been used successfully to prevent web scrapers from what amounts to invasions of privacy, but in the US, scraping still appears to be considered an acceptable risk in the hypercompetitive world of online business.

Protecting your business against the web scrapers

Given that more than half of all website visitors are now non-human, your site is vulnerable. So you need to know which of those non-human visitors are well-intentioned (i.e., search bots) and which are not. That means knowing what should be happening on your site and taking immediate steps to block any bad actors.
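
One common way to separate a legitimate search-engine crawler from an impostor that merely claims to be one is a reverse-then-forward DNS check: look up the hostname for the visiting IP, confirm it belongs to the search engine’s published crawler domain, then resolve that hostname back and confirm it matches the original IP. Here is a minimal Python sketch; the IP address and domain suffixes are illustrative examples (the suffixes shown are the ones commonly associated with Googlebot), not a complete allow-list.

```python
import socket

def is_verified_searchbot(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    """Reverse-then-forward DNS check for a visitor claiming to be a search bot.

    The suffixes above are examples for Googlebot; other engines publish
    their own crawler domains.
    """
    try:
        hostname = socket.gethostbyaddr(ip)[0]        # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.endswith(allowed_suffixes):
        return False
    try:
        resolved_ip = socket.gethostbyname(hostname)  # forward-confirm the hostname
    except socket.gaierror:
        return False
    return resolved_ip == ip

# Example: check an address pulled from your access logs (placeholder IP).
print(is_verified_searchbot("66.249.66.1"))
```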

Dealing with bots manually really is not an option unless your business wants to dedicate hundreds of man-hours a month to a game of Whack-a-Mole: trawling through server logs, identifying patterns, tracing IP addresses, and rewriting rules on your web application firewall (WAF). That works for a few minutes. Then the bad guys are back, having cycled through another set of IP addresses and anonymous proxies. Blocking specific IP addresses or ranges of addresses also runs the huge risk of blocking legitimate customers using the same service provider as the bot.
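
For a sense of what that manual Whack-a-Mole looks like in practice, here is a small Python sketch that tallies requests per IP from a common/combined-format access log so you can spot addresses hammering the site. The log path and threshold are assumptions you would adjust for your own server.

```python
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # assumed location; adjust for your server
THRESHOLD = 1000                         # arbitrary cut-off for "suspiciously busy"

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if not line.strip():
            continue
        # In common/combined log formats the client IP is the first field.
        ip = line.split(" ", 1)[0]
        hits[ip] += 1

for ip, count in hits.most_common(20):
    flag = "  <-- investigate" if count > THRESHOLD else ""
    print(f"{ip:15s} {count:6d}{flag}")
```

And, as noted above, by the time you have blocked the worst offenders they have usually moved on to a fresh set of addresses.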

What you need is a solution focused specifically on the many-faceted problem of bots, not an add-on to an existing security product. Key aspects of an effective solution include fingerprinting known bots, using machine learning to model expected visitor behavior, analyzing your traffic, and letting you decide who’s welcome on your site and who is not.
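
As a purely illustrative example (not how any particular product works), here is a toy heuristic that combines a few fingerprint-style signals: a telltale or missing user-agent, an absent Accept-Language header, and an implausibly fast request rate. Real solutions draw on far richer signals and machine-learned behavioral models; the signal names, weights, and threshold below are assumptions made up for the sketch.

```python
def bot_suspicion_score(user_agent, accept_language, requests_per_minute):
    """Toy heuristic combining fingerprint-style signals; illustrative only."""
    score = 0
    if not user_agent or "python" in user_agent.lower() or "curl" in user_agent.lower():
        score += 2   # automation frameworks often leave telltale user-agents
    if not accept_language:
        score += 1   # real browsers almost always send Accept-Language
    if requests_per_minute > 120:
        score += 3   # few humans sustain two requests per second
    return score

# Example request pulled from hypothetical traffic data.
print(bot_suspicion_score("python-requests/2.31", accept_language=None,
                          requests_per_minute=300))  # high score -> likely a bot
```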

Plus, you’ve heard it a million times before, but it’s always worth repeating: get – and stay – on top of patches and site security. Fewer than 50% of enterprises patch quickly enough to block the bad guys, and fewer than 30% of corporate websites use SSL. You’ll find more advice from leading website security experts here.

What your vendor needs to deliver to stop web scraping

Here are some good questions to ask any vendor promising their solution will help you block the bad scrapers without interfering with search bot and legitimate user activity:

  • How will your solution block a bot from attempting to re-enter my site multiple times from random IP addresses?

  • How will your solution keep up to date with both bots and the normal traffic patterns and human interactions on my site?

  • How does your solution ensure human visitors are not “collateral damage”?

  • Does your solution require changes to my existing web infrastructure?

  • Can I choose an in-house or a cloud-based implementation?

  • Can we continue to use CAPTCHAs and unblock verification forms that meet our corporate branding guidelines?

If you’re not satisfied with the answers you get, talk to us – or take a risk-free trial to see for yourself how easy it is to stop web scraping.

Stay safe out there!
