The Dirty Secret About Robots.txt

July 18, 2012 Andrew Stein

UPDATED: Per many requests, we’ve added a graph of one client’s legitimate and malicious web traffic.

UPDATED: Many people in the IT community know the limits of robots.txt. But because so many of the companies (big and small) we’ve spoken with thought robots.txt protected them from malicious robots, we decided to write this post. Spread the word and leave a comment; we’d love to hear what you’re doing to protect your content.

I lock my car doors. I think everyone does and the reason is simple – no one wants his or her stuff stolen. If you leave your car unlocked, anyone could walk by, see your cool new GPS, open your door and take it.

But what if you left a note saying “I know the doors are open, but please don’t take my GPS”? It might work for all good, law-abiding citizens – but the bad guys of the world? Chances are, it’s not going to stop them at all.

Welcome to robots.txt

Since 1994, webmasters have been creating “robots.txt” files and using them as that proverbial “please don’t steal” note. The idea behind robots.txt is simple: the file follows the Robots Exclusion Protocol, which is supposed to keep bots, web crawlers and search engines from crawling and indexing areas of your website you don’t want showing up on search engines. It’s a great concept, and the good bots like Googlebot, Bingbot, and Yahoo!’s Slurp all follow the rules you specify in your robots.txt file. They’re considered the law-abiding citizens of the Internet world and treat what your robots.txt file says as though it were law.
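To make the "note on the car door" idea concrete, here is a minimal sketch of what a well-behaved bot does before it requests a page. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` name, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a polite crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A polite bot asks first and skips anything the file disallows.
for url in ("https://example.com/blog/post", "https://example.com/private/report"):
    verdict = "fetch" if may_fetch(parser, "ExampleBot", url) else "skip"
    print(url, "->", verdict)
```

Notice that the check runs entirely inside the crawler. Nothing on the server enforces it, so a bot that never calls anything like `may_fetch` is not slowed down at all.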

Robots.txt is Not Enough

The problem is that these aren’t the only bots on the web. There’s a whole other set of bots that will never even look at those robots.txt rules and just burn right through your website, stealing your content, data, user info, etc. Often these bots are just people seeing something they want and taking it. In real life, it’s the GPS you spent your hard-earned money on. Online, it’s the content of your website that drives people to you and makes your site stand out.

Unfortunately, there are still thousands of websites and webmasters who believe robots.txt is enough to stop bots from crawling and stealing their content, data, and user info.
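One way to see this on your own site is to compare your access logs against your robots.txt rules. The sketch below counts, per client, how many requests landed on disallowed paths. The log entries, IP addresses, and bot names are invented for illustration, and a real log would need parsing first. A hit on a disallowed path is only a signal, not proof, since a human following a link would be counted too.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical, already-parsed log entries: (client IP, user agent, path).
REQUESTS = [
    ("203.0.113.5", "PoliteBot", "/blog/post-1"),
    ("198.51.100.7", "ScraperX", "/private/pricing.csv"),
    ("198.51.100.7", "ScraperX", "/private/users.csv"),
    ("203.0.113.5", "PoliteBot", "/index.html"),
]


def disallowed_hits(robots_txt, requests):
    """Count requests per client IP that robots.txt asked bots not to make."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for ip, agent, path in requests:
        # can_fetch() answers "would a rule-following bot request this?"
        if not parser.can_fetch(agent, path):
            hits[ip] += 1
    return hits


hits = disallowed_hits(ROBOTS_TXT, REQUESTS)
for ip, count in hits.most_common():
    print(ip, "requested", count, "disallowed path(s)")
```

Clients that show up here repeatedly are the ones treating your "please don't" note as an invitation.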

Find Something That Works

What if you could find a solution or process to mitigate those malicious bots as well? Here’s a quick graph of one client’s traffic, before and after using our service to block bots and web scrapers.
NOTE: They had experienced three straight years of decline in legitimate traffic prior to the data on this graph.

[Graph: one client’s traffic before and after blocking bots and web scrapers]

About the Author

Andrew Stein

Andrew Stein is Distil's Co-Founder and Chief Scientist. Getting his start running a large online kids’ game, Andrew took his passion for web development to NC State where he became the Senior Web Developer for the Department of Electrical and Computer Engineering. Working on everything from identity management programs to digital signage systems, Andrew has run into a little bit of everything and is always eager for new challenges.
