Web Scraping 101: The Basics of Bots on Your Website

April 15, 2019 Katherine Oberhofer

What is Web Scraping?

Web scraping is a method used to extract data from websites. Sometimes called screen scraping, web scraping software may access the web directly over the Hypertext Transfer Protocol or through a web browser. Even though such methods can be carried out manually, the term commonly refers to automated processes that use a web crawler or bot. Simply put, it is a form of copying in which specific data is gathered and copied from the web. Some interesting key findings are as follows (a minimal scraping sketch follows the list):

  • Content scraping is the leading use for web scraping
  • Services that offer web scraping run as low as $3.33 per hour
  • The average web scraper makes around $58,000 annually
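
To make the mechanics concrete, here is a minimal sketch of an automated scraper: it fetches a page over HTTP and pulls out the headings. The URL is a placeholder, and the third-party requests and beautifulsoup4 packages are assumed to be installed.

```python
# Minimal illustration of automated scraping: fetch a page over HTTP
# and pull out its headings. Hypothetical URL; requires the third-party
# "requests" and "beautifulsoup4" packages.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/articles"  # placeholder target

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))
```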

What is Content Scraping?

Content scraping is the process of copying unique, original content from other websites and publishing it elsewhere. The practice is illegal, as it is carried out without the consent of the original author or source. Typically, content scrapers copy the content wholesale and pass it off as their own.

Content scraping has an adverse effect on the site that has invested time, money, and resources to produce the original content: its search authority and SEO rankings suffer when duplicate copy appears elsewhere.

Scraping Data and Price Scraping

What is Scraping Data?

Data scraping, also called web scraping, is the process of importing data from a website into a file or document on your computer. Because it is one of the most effective methods of obtaining data or content from the web, data scraping can become a problem in certain industries, leaving duplicate content spread across multiple sites and, in some cases, sparking a price war.
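
As an illustration of importing scraped data into a file, the sketch below pulls rows from a hypothetical table and writes them to a CSV. The URL and CSS selectors are assumptions, not a real site.

```python
# Sketch of "importing data from a site into a file": scrape a
# hypothetical table and save its rows as CSV. Selectors are assumptions.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/listings"  # placeholder target

soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")

with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "value"])
    for row in soup.select("table#listings tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            writer.writerow(cells[:2])
```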

What is Price Scraping?

Price scraping is the process in which bots target the pricing section of a website in order to scrape the pricing data. Typically, price scraping is undertaken by online competitors looking to use your pricing against you to gain a competitive advantage. This is particularly unfavorable as it can trigger a price war.
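
For illustration only, this is roughly what a price-scraping bot might look like under the hood; the competitor URL, the .price element class, and the currency format are all assumptions.

```python
# Sketch of how a price-scraping bot might pull a competitor's prices.
# The URL, CSS class, and currency format are illustrative assumptions.
import re
import requests
from bs4 import BeautifulSoup

URL = "https://example-competitor.com/product/widget"  # placeholder
PRICE_RE = re.compile(r"[\d,]+\.\d{2}")

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for tag in soup.select(".price"):          # assumed price element class
    match = PRICE_RE.search(tag.get_text())
    if match:
        price = float(match.group().replace(",", ""))
        print(f"Competitor price found: {price:.2f}")
```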

Which Industries see Price Scraping?

Airline/Travel

From airline tickets and hotel rooms to user-generated reviews and unique editorial content, whatever the nature of a travel website, any unique content on it can be stolen by bots. If a site is not explicitly protected against web scraping, anybody can duplicate that content for next to nothing, with no investment in research or anything else. Such content can then be sold to a competitor, or even used against you as a means of stealing your organic search traffic. Some price scraping is performed by market intelligence companies that provide the data to competitors.

Ecommerce

Online retail has become extremely competitive and unsafe; it is under constant assault from the internet underbelly of malicious online actors, including big industry competitors. These threat constituencies leverage bad bots in numerous forms that have adverse effects on online retailers. Bad bots scrape prices and product data, carry out click fraud, and endanger the overall security of e-commerce websites, brand reputation, and customer loyalty. Of all bad bot threats, price scraping and product data scraping are the most costly and rampant for online retailers. The five most common forms of data scraping activity online retailers are prone to on their sites are (a simple monitoring sketch follows the list):

  1. Price Scraping - Bots target the pricing section of a website in order to scrape the pricing data to share amongst online competitors. Amazon retail has ‘all sorts of “scraping” software’ in order to find the prices of brands online. They also have a whole team ‘dedicated to scraping’.
  2. Product Matching - Bots collect a huge number of data points from a site in order to make exact matches against a retailer’s wide variety of products.
  3. Product Variation Tracking - Bots are used to scrape product information to a level that accounts for multiple variants.
  4. Product Availability Testing - Bots scrape availability data in order to enable competitive positioning against an online retailer’s products based on availability and inventory levels.
  5. Continuous Data Refresh - Bots are deployed on the same online retail site on a regular basis so that buyers of the scraped data are able to react to modifications made by the targeted site.
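
As a rough illustration of the continuous data refresh pattern (item 5), the sketch below re-scrapes the same hypothetical product page on a schedule and reports price changes; the interval, URL, and selector are assumptions.

```python
# Sketch of "continuous data refresh": re-scrape the same hypothetical
# product page on a schedule and report price changes. Interval, URL,
# and selector are all illustrative assumptions.
import time
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/product/widget"   # placeholder target
CHECK_INTERVAL_SECONDS = 3600                # assumed hourly refresh

last_price = None
while True:
    soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")
    tag = soup.select_one(".price")          # assumed price element
    price = tag.get_text(strip=True) if tag else None
    if price != last_price:
        print(f"Price changed: {last_price!r} -> {price!r}")
        last_price = price
    time.sleep(CHECK_INTERVAL_SECONDS)
```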

Prevent Site Scraping

Content scraped from your site can undermine the time, money, and resources you have invested in producing it. Stopping web scraping helps prevent your search authority and SEO rankings from being negatively affected by scrapers.

There are numerous measures that can be taken to manage web scrapers, some more effective than others:

1. Robot exclusion standard

This will not work on its own, as it relies on etiquette. The robot exclusion standard/protocol, or simply robots.txt, is used by websites to communicate with web crawlers and good bots, indicating which areas of the website should not be processed or scanned. Web scrapers and other bad bots are under no obligation to cooperate, however.
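
The sketch below uses Python's standard-library robots.txt parser to show the check a well-behaved crawler performs before fetching a page; a web scraper can simply skip this step, which is exactly why the standard offers no real protection. The example policy is illustrative.

```python
# A robots.txt directive is only honored by cooperative crawlers.
# This sketch parses an example policy with the standard library and
# shows the check a *well-behaved* bot would perform before fetching;
# a web scraper can simply skip this step.
from urllib.robotparser import RobotFileParser

EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /pricing/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

print(parser.can_fetch("GoodBot", "https://example.com/pricing/"))  # False
print(parser.can_fetch("GoodBot", "https://example.com/blog/"))     # True
```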

2. Manually Block

Dealing with web scrapers manually is a game of whack-a-mole, and a costly one in man hours: trawling through logs, identifying tell-tale behavior, blocking IP addresses, and rewriting firewall rules (see the next point). Such effort may succeed for a short while before the web scrapers return with new IP addresses or hidden behind new proxies. Blocking IP addresses may also affect legitimate users, especially those coming through the same service provider as the web scrapers.
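
To give a sense of the manual workload, here is a rough sketch of the log-trawling step: count requests per IP in an access log and list the heaviest sources as candidates for blocking. The log path, log format, and threshold are assumptions.

```python
# Sketch of the manual approach: trawl an access log, count requests
# per IP, and list the heaviest sources as candidates for blocking.
# The log path and "combined" log format are assumptions.
from collections import Counter

LOG_PATH = "access.log"       # placeholder path
THRESHOLD = 1000              # assumed requests-per-log threshold

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        ip = line.split(" ", 1)[0]   # first field of a combined-format line
        hits[ip] += 1

for ip, count in hits.most_common():
    if count >= THRESHOLD:
        print(f"{ip}\t{count} requests  <- review / block manually")
```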

3. Web application firewalls (WAF)

WAFs are built to protect web applications from being exploited through common software vulnerabilities. A web scraper's main objective is to mimic real users, not to target vulnerabilities. Therefore, beyond being programmed to block manually identified IP addresses (see the previous point), WAFs are of little use in controlling web scraping.

4. Login enforcement

For some sites a login is required to access the most valued data; nevertheless, this is effectively zero protection from web scrapers, because perpetrators can easily create their own accounts and program their web scrapers accordingly.
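
The sketch below illustrates why a login wall is so easily defeated: a scraper registers a throwaway account and authenticates programmatically, after which gated pages are fetched like any other. The endpoint, form field names, and credentials are purely illustrative.

```python
# Why login walls offer little protection: a scraper can register its
# own account and authenticate programmatically. The login endpoint,
# field names, and credentials below are purely illustrative.
import requests

session = requests.Session()
session.post(
    "https://example.com/login",                      # assumed endpoint
    data={"username": "throwaway_account", "password": "secret"},
    timeout=10,
)

# Once the session carries the auth cookie, gated pages are scraped
# exactly like public ones.
page = session.get("https://example.com/members/pricing", timeout=10)
print(page.status_code, len(page.text))
```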

5. Are you a human?

An obvious way to vet web scraping is to ask users to prove they are human. This is the goal of CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). They frustrate some users who find them hard to interpret. One of the bad-bot activities catalogued by OWASP is CAPTCHA Defeat (OAT-009). There are also CAPTCHA farms, where the tests posed by CAPTCHAs are outsourced to teams of low-cost human solvers via sites on the dark web.

6. Geo-fencing

Geo-fencing means exposing a website only within the geographic locations in which the business operates. This will not put a stop to web scraping per se, but it does force perpetrators to go to the extra effort of appearing to run their web scrapers from within an allowed geographic location. That could be as simple as using a VPN link to a local point of presence (PoP).
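
A geo-fence check might look roughly like the sketch below, which maps a client IP to a country and only serves the countries you operate in; it assumes the third-party geoip2 package and a local GeoLite2 country database file.

```python
# Sketch of a geo-fencing check: look up a client IP's country and only
# serve countries you do business in. Assumes the third-party "geoip2"
# package and a local GeoLite2 country database file.
import geoip2.database
import geoip2.errors

ALLOWED_COUNTRIES = {"US", "CA"}                          # assumed service area
reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # assumed path

def is_allowed(client_ip: str) -> bool:
    """Return True if the client's country is inside the geo-fence."""
    try:
        country = reader.country(client_ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False
    return country in ALLOWED_COUNTRIES

print(is_allowed("203.0.113.10"))  # documentation-range IP; result will vary
```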

7. Flow enforcement

Enforcing the path legitimate users take through a website ensures they are validated at each step of the way. Web scrapers are typically hardwired to go straight after high-value targets and encounter difficulties if forced to follow a typical user’s predetermined flow.
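
A minimal sketch of flow enforcement, assuming an in-memory session store and an illustrative set of step names: the server records how far each session has progressed and rejects requests that jump straight to a deep page.

```python
# Sketch of flow enforcement: the server tracks which step of the funnel
# each session has reached and rejects requests that skip ahead. The
# step names and in-memory session store are illustrative assumptions.
FLOW = ["landing", "search", "results", "product", "checkout"]
session_progress: dict[str, int] = {}   # session_id -> index of last step seen

def request_allowed(session_id: str, requested_step: str) -> bool:
    """Allow a step only if the session has completed the one before it."""
    target = FLOW.index(requested_step)
    reached = session_progress.get(session_id, -1)
    if target > reached + 1:
        return False                     # jumped straight to a deep page
    session_progress[session_id] = max(reached, target)
    return True

print(request_allowed("sess-1", "landing"))   # True
print(request_allowed("sess-2", "product"))   # False: skipped the flow
```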

8. Direct bot detection and mitigation

The goal here is the direct detection of scrapers through a range of techniques, including behavioral analysis and digital fingerprinting, using bot detection and mitigation technology designed specifically for the task. Because their view spans numerous customers, suppliers of these technologies can continually improve their understanding of bots, and of web scrapers in particular, via machine learning, to the benefit of all. Once web scrapers are identified, it can then be decided, based on factors such as provenance, whether the activity should be permitted, controlled, or blocked, and the appropriate action taken.
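
As a highly simplified sketch of the behavioral-analysis idea, the snippet below flags a client when its request rate over a sliding window exceeds a threshold or its User-Agent matches common automation tooling; the thresholds and markers are assumptions, and real bot-mitigation products combine many more signals.

```python
# Very simplified behavioral check: flag a client as a likely bot when
# its request rate over a sliding window exceeds a threshold or its
# User-Agent matches known automation tooling. Thresholds are assumptions;
# real bot-mitigation products combine many more signals.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120            # assumed human ceiling
AUTOMATION_MARKERS = ("curl", "python-requests", "scrapy", "headless")

request_history: dict[str, deque] = defaultdict(deque)

def looks_like_bot(client_ip: str, user_agent: str) -> bool:
    now = time.monotonic()
    history = request_history[client_ip]
    history.append(now)
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()                # drop requests outside the window
    if len(history) > MAX_REQUESTS_PER_WINDOW:
        return True
    return any(marker in user_agent.lower() for marker in AUTOMATION_MARKERS)

print(looks_like_bot("198.51.100.7", "python-requests/2.31"))  # True
```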

Distil Networks is a provider of bot detection and mitigation tools to help stop web scraping.

 