How to Scrape a Website of All Its Content in Seconds

October 12, 2016 Bobby Power

Recently, we’ve seen a number of high-profile content theft cases involving automated commands and content scraping tools. For example, LinkedIn made public that nefarious actors used malicious bots to actively scrape user profile data from its site for almost a year. But while LinkedIn's extended data theft headache was likely the result of advanced persistent bots—a bot category that is still on the rise—the incident is yet another illustration of scraping tools in action.

Scraping tools can be used for recruiting and job searches, market research, price comparisons, and other relatively innocuous purposes. But these tools are more often used maliciously. Paid scraping services and freelance scrapers build homegrown bots and automation tools that steal competitors’ content and exploit it for their own gain. Stolen content often spans the range of online commerce, including travel agency ticket prices, concert ticket prices, online retailer product pages, and more.

These tools are incredibly easy to access; free web apps are seemingly a dime a dozen. And while we’ve seen how easy it is to build and automate commands to rummage through sites while behaving like legitimate human users, we wanted to show how quickly any user can jump online and start their own scraping campaign.

All a malicious user needs to do is:

  1. Access any online scraping tool, e.g., Import.io, Extracty, Portia by Scrapinghub.
  2. Enter a competitor’s URL and run the scraper. Some tools offer full site scraping, while others let the user selectively choose which content or elements to scrape.
  3. Export the scraped results.
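To make those three steps concrete, here is a minimal sketch of what such a tool does under the hood: parse a page, pull out selected elements, and export the results. It is an illustration under stated assumptions, not how any of the tools named above actually work. The sample markup, the `product`, `name`, and `price` class names, and the `ProductParser`, `scrape`, and `export_csv` helpers are all hypothetical, and the page is an in-memory string rather than a live site.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched product page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Concert Ticket</span><span class="price">$45.00</span></li>
  <li class="product"><span class="name">Flight to Denver</span><span class="price">$129.99</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects name/price pairs from <li class="product"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed records
        self._current = None  # record being built, or None outside a product
        self._field = None    # field currently being read, or None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self._current = {}
        elif tag == "span" and self._current is not None and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside a selected <span>.
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.rows.append(self._current)
            self._current = None


def scrape(html):
    """Step 2: run the scraper over the page and return the selected content."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def export_csv(rows):
    """Step 3: export the scraped results as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(export_csv(scrape(SAMPLE_HTML)))
```

The whole thing fits in a few dozen lines of standard-library code, which is part of why hosted scraping tools are so plentiful.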

Sound easy? Check out our demo video to see the process in action:

Piece of cake, right? Given the motivation, a malicious user browsing a competitor’s site can open a new tab and start scraping within seconds. But the process only works on sites that lack bot protection. Compare that result with the CAPTCHA test served when Distil detected the scraping tool in action.
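As a rough illustration of the idea behind bot protection, the sketch below flags a client that sends more requests in a short window than a human plausibly would, which is the point at which a site might serve a challenge. This is a deliberately simple heuristic chosen for illustration. It is not a description of how Distil detects bots, and the `RateChecker` class, its thresholds, and the client identifier are all hypothetical.

```python
from collections import defaultdict, deque


class RateChecker:
    """Sliding-window request counter (illustrative heuristic only)."""

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent request times

    def should_challenge(self, client_id, now):
        """Record a request at time `now` and report whether to challenge."""
        hits = self._hits[client_id]
        # Discard requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests


if __name__ == "__main__":
    checker = RateChecker(max_requests=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3):
        print(t, checker.should_challenge("client-a", t))
```

A rate check like this is easy to evade by slowing down or rotating addresses, so treat it as a starting point for understanding the problem rather than as a defense on its own.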

 

About the Author

Bobby Power

Bobby comes to Distil Networks as a technical writer with previous software documentation experience in both the public and private sectors. He is responsible for working with Distil’s Product Marketing team to develop detailed documentation and online help, including Knowledge Base articles, in-app help, user guides, and more. He spends his free time with his wife, son, daughter, and dog, and writes for a few music outlets, including AdHoc, Decoder Magazine, Thump/Vice, and Creative Loafing.
