Over the last few days there’s been a lot of talk about Edward Snowden using an “inexpensive and widely available” piece of software to scrape NSA records. When scraping is in the news, we’re reminded of Aaron Swartz and Chelsea Manning, known for using wget to pull records from JSTOR and American diplomatic cables, respectively. Web scraping is often defined as an “accretive process”: the value comes from the volume of data collected, not from the importance of any individual piece. This is true in the cases of Swartz, Manning, and Snowden, but the notoriety of these specific attacks has overshadowed the day-to-day reality of web scraping activity.
Clunky metaphors are often helpful when talking about technology that many aren’t familiar with. The term “scraper” itself is misleading (data is being copied, not scraped off or taken away permanently), but it is used to describe the end result of the process. To give a clearer picture of scraping methods, we can compare a backhoe and a trowel.
Wget is basically the backhoe of the web scraping world. It can be used to grab a lot of information quickly, but ultimately you’re left with just one giant pile of copied data. After the data has been scraped, you’re either stuck with something raw and unprocessed or you have to go through the time-consuming process of sifting through it to find out whether there’s anything you actually want amongst all the dirt. It’s an effective tool for the types of scraping that Snowden, Manning, and Swartz performed because no level of precision is required. Just getting the data copied is enough. It doesn’t matter if it’s pretty or well-formatted, and often an exact mirror of the data is preferred.
At Distil Networks, though, most of the scraping we see is far more targeted. Rather than a backhoe that grabs everything, our customers are dealing with scraping tools that are employed more like a trowel. The scraper has a specific target and uses a precise tool to extract valuable data. Instead of collecting a massive data set that includes tons of junk and irrelevant information, people who scrape are building tools that pinpoint and retrieve only a given target. A great example of this is price scraping. Scrapers aren’t going to go through and copy an entire product list hourly, but they will write a precise bot that accesses the product pages and pulls current prices. When data is returned by a specialized bot, it’s instantly actionable. A bot written to scrape a website and pull only prices could easily turn around and lower (or raise) prices on the scraper’s website. No manual interaction required.
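To make the trowel idea concrete, here is a minimal sketch of targeted extraction using only Python's standard library. The page markup, the `price` class name, and the `extract_prices` helper are all hypothetical, chosen for illustration; a real bot would be written against whatever markup the target site actually uses.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of elements whose class list includes 'price'.

    Simplification: any closing tag ends the capture, so nested markup
    inside a price element is not handled.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return only the price strings found in a product page."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# Hypothetical product page: everything except the price is ignored.
SAMPLE_PAGE = """
<html><body>
  <h1>Example Widget</h1>
  <p class="description">A long description the scraper ignores.</p>
  <span class="price">$19.99</span>
  <ul class="reviews"><li>Great widget</li></ul>
</body></html>
"""

print(extract_prices(SAMPLE_PAGE))  # ['$19.99']
```

The result is a short list of values rather than a pile of mirrored pages, which is what makes the output immediately usable by whatever pricing logic sits downstream.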
The other problem with this more precise form of scraping is that it’s much harder to detect. Much like a backhoe, wget isn’t going to sneak up on a website owner. It’s a fairly blunt instrument that leaves pretty obvious signs of where it is and where it has been. For this reason, it’s also easy to catch via rate limiting on requests per minute, requests per IP, or in some cases just the user agent. This is also the type of protection that companies attempt to build themselves, with homegrown tools that can be very effective at stopping something as primitive as wget.
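The homegrown protections described above might look something like the following sketch: a per-IP sliding-window rate limit combined with a naive user-agent check. The class name, the thresholds, and the denylist are assumptions made for illustration, not a description of any particular product.

```python
from collections import defaultdict, deque

# Assumption: a naive denylist keyed on the start of the User-Agent header.
# Wget identifies itself with a string beginning "Wget/" by default.
BLOCKED_AGENT_PREFIXES = ("wget",)


class SlidingWindowLimiter:
    """Allows at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, user_agent, now):
        """Return True if the request should be served, False if blocked."""
        if user_agent.lower().startswith(BLOCKED_AGENT_PREFIXES):
            return False
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical policy: 5 requests per IP per 60 seconds.
limiter = SlidingWindowLimiter(max_requests=5, window_seconds=60)

# Ten requests in ten seconds from one address: the last five are refused.
burst = [limiter.allow("203.0.113.5", "Mozilla/5.0", t) for t in range(10)]
print(burst.count(True), burst.count(False))  # 5 5

# A default wget user agent is refused outright.
print(limiter.allow("203.0.113.6", "Wget/1.21 (linux-gnu)", 0))  # False
```

Both checks key on signals that a blunt tool makes no effort to hide, which is why they work well against a backhoe.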
A more precise bot is better at evading detection. When someone is using a precise bot to scrape a website, they’re often not going to come anywhere near the traditional rate limiting protections. Rather than pulling all of your data, these bots will hit one or two pages, grab the information they need, and then come back later to do it all over again. If they do need large chunks of data, they can split the work across multiple hours, staying below any limits you may have.
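A rough way to see why spreading requests defeats a per-minute threshold is to compare the peak request rate of a burst against the same number of requests paced over a couple of hours. The limit, page count, and timings below are hypothetical numbers picked only to illustrate the arithmetic.

```python
from collections import Counter


def peak_per_minute(timestamps):
    """Largest number of requests that fall in any single one-minute bucket."""
    buckets = Counter(int(t // 60) for t in timestamps)
    return max(buckets.values(), default=0)


LIMIT_PER_MINUTE = 60  # hypothetical threshold a site owner might set
PAGES = 600            # hypothetical number of pages the scraper wants

# Backhoe: ten requests per second, finished in about a minute.
burst = [i * 0.1 for i in range(PAGES)]

# Trowel: one request every 12 seconds, finished in about two hours.
spread = [i * 12 for i in range(PAGES)]

for name, timestamps in (("burst", burst), ("spread", spread)):
    peak = peak_per_minute(timestamps)
    print(name, peak, "flagged" if peak > LIMIT_PER_MINUTE else "not flagged")
# burst 600 flagged
# spread 5 not flagged
```

Both runs fetch the same 600 pages; only the pacing differs, and a threshold on request volume alone sees nothing unusual in the second one.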
Though web scraping is becoming a far more discussed topic than ever before, all of these conversations are centered on one tool initially released in 1996: wget. Though it was the main tool used in some of the most notable scraping stories, it’s not the be-all and end-all of web scraping, nor is it the tool we see most often in the wild. The focus so far has been exclusively on the backhoe, grab-everything approach. As more articles come out about web scraping, there is an opportunity to broaden the discussion to include all forms of web scraping, particularly the precision attacks that many websites face daily.
About the Author
Andrew Stein is Distil's Co-Founder and Chief Scientist. Getting his start running a large online kids’ game, Andrew took his passion for web development to NC State where he became the Senior Web Developer for the Department of Electrical and Computer Engineering. Working on everything from identity management programs to digital signage systems, Andrew has run into a little bit of everything and is always eager for new challenges.