The World Wide Web is estimated to hold almost 50 billion indexed pages. These are pages that are accessible via search engines and therefore open to all. Needless to say, the indexing must be automated: a continuous process that ensures the most recent content can be found. The indexing is carried out by web crawlers that continuously check for page changes and additions.
Web crawlers are among the earliest examples of software robots, or web robots (usually referred to simply as bots). If all the information contained on those websites can be found by web crawlers, then it can be accessed by anyone. This means businesses can keep an eye on competitors as never before; in fact, whole new businesses, such as price comparison sites, now exist that rely on nothing more than aggregating information from multiple sites that deal with a particular commodity. This aggregation relies on another type of bot: the web scraper.
It is estimated that 46% of all web activity is now generated by bots. Some, such as web crawlers, are generally considered good bots. Others, for example those used to mount denial-of-service attacks, are definitely bad bots. Web scrapers are grey: in some cases a site owner will want its content lifted and aggregated elsewhere; in other cases, doing so is tantamount to theft. This e-book looks at the issues around web scrapers and what can be done to control their activity.
The history and growth of web scraping
The concepts behind web scraping pre-date the web itself. With the advent of client-server computing in the early 1990s, the use of graphical user interfaces (GUIs) became commonplace (chiefly through Microsoft Windows). However, many organizations had legacy applications, often written in COBOL and running on mainframe computers. The output of these applications was designed for visual display units (VDUs), sometimes referred to as dumb terminals. There was a need to re-present the VDU output for the new GUIs.
This led to the birth of screen scraping; the idea was later adapted to serve up the content of legacy applications for web browsers. It was also realised that the concept could be reversed. Web pages themselves have similarities to VDU screens: regardless of how either is generated, the user sees an area of screen real estate populated by data fields with uninteresting space in between. Web scraping extracts that data in a form that can be understood and reused.
In many cases the target web pages will need input to stimulate the filling of fields, for example the desired destination on a travel site. So, as is the case with many bots, web scrapers need to mimic human activity, which can make them hard to differentiate from real users.
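To make the extraction step concrete, here is a minimal sketch of a scraper that pulls data fields out of a page. The page snippet, its CSS class names and the `ProductScraper` class are all hypothetical illustrations; a real scraper would fetch the HTML over HTTP (and often submit form input first, mimicking a human visitor) rather than use an inline string.

```python
from html.parser import HTMLParser

# Hypothetical fragment of a product-listing page. In practice this
# would be downloaded, e.g. with urllib.request, not hard-coded.
PAGE = """
<ul>
  <li class="product"><span class="name">Widget</span>
      <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span>
      <span class="price">24.50</span></li>
</ul>
"""

class ProductScraper(HTMLParser):
    """Collects (name, price) pairs from span.name / span.price fields."""
    def __init__(self):
        super().__init__()
        self.field = None   # which labelled field we are currently inside
        self.rows = []      # extracted (name, price) tuples
        self._name = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field == "name":
            self._name = data.strip()
        elif self.field == "price":
            self.rows.append((self._name, float(data.strip())))
        self.field = None   # text outside labelled spans is ignored

scraper = ProductScraper()
scraper.feed(PAGE)
print(scraper.rows)  # [('Widget', 9.99), ('Gadget', 24.5)]
```

The same pattern scales to any page where the interesting data sits in predictably labelled fields, which is exactly the VDU-screen analogy described above.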
Web scrapers are one of twenty types of automated threats (bad bots) that are described in a new taxonomy from OWASP (the Open Web Application Security Project). Each is given an OAT (OWASP Automated Threat) code. Web scraping is OAT-011 “Collect application content and/or other data for use elsewhere”.
Web scraping protection is critical to safeguarding unique content, competitive advantage, site availability, and SEO ranking
For many businesses, their website has become their primary shop window, where their offerings are displayed along with pricing information and current offers. This information is open to all: prospects and customers of course, but also competitors and freeloaders who deploy web scrapers to regularly harvest data from sites they are interested in.
Web scraping protection ranges from stopping unwanted activity that can undermine or even destroy your business to managing wanted bots that may supplement it. However, even with the latter, there can be too much of a good thing; server and network infrastructure can become over-stressed, and costs can be incurred from random spikes in traffic caused by aggressive web scraping. The trick is to be able to recognize web scrapers amongst all the other bots and block, limit or allow them according to policy.
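One common way to "limit" rather than block outright is per-client rate limiting. The sketch below uses a classic token-bucket scheme; the class name, rates and the injected clock are illustrative assumptions, not part of any particular product, and a production deployment would key a bucket per client identifier (IP, API key, fingerprint) at the proxy or WAF tier.

```python
import time

class TokenBucket:
    """Allows `rate` requests/second with bursts up to `burst`; sustained
    scraper-like traffic above the rate is rejected until tokens refill."""
    def __init__(self, rate, burst, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens = float(burst)   # start with a full bucket
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill tokens in proportion to elapsed time, capped at burst.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Simulated clock so the example is deterministic.
clock = [0.0]
bucket = TokenBucket(rate=1.0, burst=3, now=lambda: clock[0])

results = [bucket.allow() for _ in range(5)]  # five instantaneous requests
print(results)  # [True, True, True, False, False] -- only the burst passes

clock[0] += 2.0          # two seconds later, two tokens have refilled
later = bucket.allow()
print(later)  # True
```

A policy engine can then map bucket decisions per bot category: generous limits for known-good crawlers, tight limits for suspected scrapers, outright blocks for confirmed bad bots.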
There are other possible unwanted impacts. Search engine optimization (SEO) can suffer. For example, as content is copied around the internet by web scrapers, Google’s search engine assumes that sites hosting the original content are trying to game its search algorithms, and so lowers the ranking of pages carrying both the original and the copied content. The entity that copied the information can sometimes even end up with a higher ranking than the originating site; that may even be the goal.
About the Author
Bob joined Quocirca in 2002. His main area of coverage is route to market for ITC vendors, but he also has an additional focus on IT security, network computing, systems management and managed services. Bob writes regular analytical columns for Computing, Computer Weekly, TechRepublic and Computer Reseller News (CRN), and has written for The Times, Financial Times and The Daily Telegraph. Bob blogs for Computing, Info Security Advisor and IT-Director.com. He also provides general comment for the European IT and business press. Bob has extensive knowledge of the IT industry. Prior to joining Quocirca he spent 16 years working for US technology vendors including DEC (now HP), Sybase, Gupta, Merant (now Serena), eGain and webMethods (now Software AG). Bob has a BSc in Geology from Manchester University and a PhD in Geochemistry from Leicester University.