Building A Better Mouse Trap: How We Detect and Block Bot Traffic

Bots are one of the most vexing technical problems web applications face today. They tax server resources, scrape and steal content, and relentlessly adapt to the countermeasures development teams deploy. A framework for building a bot exists for nearly every web language, and most advanced frameworks and automated browsers will properly execute JavaScript. As a result, bot detection often happens only after the fact, through log analysis or when a server-down alert arrives.

I’ve spent countless hours on calls and in support emails going over our specialty in bot detection, so I’m finally taking the time to sit down and outline how we fight the endless war against web bots so our customers won’t have to. This post contains four sections that cover how we identify and detect bot traffic and how we end up blocking it:

Our Approach and Philosophy
Uniquely Identifying Every Connection
The Cylon Detector: Separating Bots from Humans
Improving Detection Through Data Analysis

Our Approach and Philosophy

Our core philosophy in approaching bot protection is to assume bot designers will always find a way around any single, standalone approach we can implement within our system. Like a virus, they can and do adapt quickly to the remedies we come up with, and no single method of detection and blocking stops all, or even a small majority, of bots. Instead, we’ve built a multiphase system with enough variance at each phase to make that adaptation hard.

We kept our design scope as simple as we could:

Identify incoming requests as either bot or human traffic based on a series of assumptions and tests.
Filter out requests we believe are from Bots.
Analyze connection data asynchronously to update our assumptions and come up with new data.

Uniquely Identifying Every Connection

The single most important piece in our bot detection design, and in a broader sense one of the most important pieces in general security, is identification. Without a reliable methodology for identifying a specific connection, there can be no dependable way of tracking that connection for asynchronous analysis. Without being able to do background analysis on a specific connection, we’d be limiting our bot blocking ability to general coded traps and link injections, methods that most bot designers know to look for and can quickly adapt to.

Web servers identify incoming connections by the IP address the request comes from, and most analytics software suites follow the same methodology. Unfortunately, the IP address is also probably one of the least reliable ways of identifying and tracking an HTTP client. It’s not inherently difficult, or even time-consuming, to use a cloud provider’s API to build a mechanical turk that cycles through instances and IP addresses.

Worse yet, if the bot is scraping through a NAT, the probability of false positives goes up, because the IP is cycled to normal users after it’s been tainted by the activities of the bot.

What we needed was a way to uniquely identify not the connection or request itself, but rather the machine making the request. That way, our system would become IP agnostic and tie activities to an individual machine. This approach is especially appropriate for mechanical turks that cycle the same machine image or set of images through multiple instances. By assigning a unique identifier to the image itself, we could, in theory, continuously identify the bot as it cycled through multiple instances and IP addresses.

While this theory is sound, it’s still remarkably easy to circumvent. Most unique ID generators are designed to create very specific IDs. If the bot designer targeting us altered any of the connection and request properties we used to generate our UID, he or she would essentially secure a clean slate for their army of bots to continue accessing the website. To counter this, we created a three-level ID system that uses locality-sensitive hashing to build a UID for the various types of bots we encounter.
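To make the idea of locality-sensitive hashing concrete, here is a minimal sketch using SimHash, one well-known locality-sensitive scheme. It is an illustration only: the choice of SimHash, the 64-bit width, and the property names are all assumptions, not a description of the production UID. The point it shows is that a fingerprint built this way changes by only some of its bits when one property is altered, so nearby fingerprints can still be matched by Hamming distance.

```python
import hashlib


def simhash(features, bits=64):
    """Return a SimHash fingerprint; similar feature sets give nearby hashes."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set when the positive votes outnumber the negative ones.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def connection_features(props):
    """Turn a dict of (hypothetical) connection properties into feature strings."""
    return [f"{key}={value}" for key, value in sorted(props.items())]


# Hypothetical properties; a real system would choose its own.
original = {
    "user_agent": "ExampleBrowser/1.0",
    "accept_language": "en-US",
    "accept_encoding": "gzip, deflate",
    "header_order": "host,user-agent,accept",
}
altered = dict(original, user_agent="ExampleBrowser/1.1")

distance = hamming_distance(
    simhash(connection_features(original)),
    simhash(connection_features(altered)),
)
```

With an exact hash, the altered client would look brand new. Here, two fingerprints can be treated as the same machine when `distance` falls under a tuned cutoff, which is itself a parameter you would have to choose from real traffic.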

At this point, I need to define what I mean by ‘types’ of bots. Bots can range in complexity from libcurl and CasperJS implementations to distributed Selenium 2 deployments and compromised home computers with zombie browsers. Because we build our UID based on connection properties, we gain more information the more advanced the scraping vector is. This may seem counterintuitive, but it essentially boils down to this: at each level of identification, we present more challenges to the browsing client, and those challenges return additional bits of information if the client can manage to process them.

A Perl script, for instance, wouldn’t be able to handle a JavaScript challenge and would be stuck with an ID at level 1. A fully automated browser, on the other hand, would be able to pass through all ID-related challenges and would have an ID at level 3.
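As a rough sketch of how the level could follow from the challenges a client manages to answer, consider the snippet below. The two challenge names and the rule that level 2 means "answered the first challenge but not the second" are assumptions made for illustration; the text above only fixes the two ends of the scale.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChallengeResults:
    """Answers collected from a client; None means the challenge went unanswered."""
    js_answer: Optional[str] = None       # hypothetical JavaScript challenge
    browser_probe: Optional[str] = None   # hypothetical deeper browser probe


def id_level(results):
    """Map answered challenges to an ID level between 1 and 3."""
    if results.js_answer is None:
        return 1  # e.g. a plain script that cannot execute JavaScript
    if results.browser_probe is None:
        return 2  # ran the JavaScript but failed the deeper probe
    return 3      # a real browser, or a fully automated one
```

Each answered challenge also contributes more properties to the UID, which is why the more capable clients end up with the richer identifiers.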

The Cylon Detector: Separating Bots from Humans

Once we’ve given an incoming connection a specific ID, we can begin looking at it objectively to determine if the traffic is originating from a bot or not. This is the core phase in blocking bots in real-time and we had several design considerations here as well:

Real users should not be impacted. It’s better to err on the side of caution and let a bot slip through rather than block a percentage of legitimate users.
We need a ‘score’ so that we can both set a configurable threshold after which a block occurs and allow modular violations.
Previously caught bots need to be stopped before system resources and connection time are wasted.

Creating a rudimentary scoring system is simple enough. Each violation is assigned a bitwise value, and that value is compared to the user-defined settings for the given domain. This system lends itself to easy translation into human-readable actions: block, don’t block or captcha. Where the system starts to become a bit more complicated is in points 1 and 3.
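A bitwise scheme like this can be sketched in a few lines. The violation names and the per-domain masks below are hypothetical; the sketch only shows how flags combine into one value and how that value is checked against a domain’s settings to pick an action.

```python
from enum import IntFlag


class Violation(IntFlag):
    """Hypothetical violations, each occupying its own bit."""
    NONE = 0
    HONEYPOT_LINK = 1
    FAILED_JS_CHALLENGE = 2
    RATE_LIMIT_EXCEEDED = 4
    KNOWN_BAD_UID = 8


def decide(violations, block_mask, captcha_mask):
    """Compare accumulated violations with a domain's configured masks."""
    if violations & block_mask:
        return "block"
    if violations & captcha_mask:
        return "captcha"
    return "allow"


# Hypothetical settings for one domain.
BLOCK_ON = Violation.HONEYPOT_LINK | Violation.KNOWN_BAD_UID
CAPTCHA_ON = Violation.RATE_LIMIT_EXCEEDED | Violation.FAILED_JS_CHALLENGE

seen = Violation.NONE
seen |= Violation.RATE_LIMIT_EXCEEDED   # violations accumulate with bitwise OR
action = decide(seen, BLOCK_ON, CAPTCHA_ON)
```

Because each violation is its own bit, a new check can be added as a new flag without touching the existing ones, and each domain can opt in or out by changing its masks.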

Keeping a running tally of ‘caught’ bots in memory on a single machine is straightforward. Keeping that same running tally synchronized across a hundred geographically dispersed machines is not quite as trivial. We had to devise a custom solution, one that deserves its own blog post, to keep our global network and our deployed appliances in sync.

Harder still, and most important to our customers, is ensuring real users aren’t incorrectly identified as bots and blocked. This is where our identification system continues to pay dividends. By building a challenge system directly into the identification method we use, we’ve already done quite a bit of the heavy lifting with regard to making a determination about a given connection:

If the connection has a level 3 ID, it is either a user or an advanced bot.
If the connection has a level 2 ID, it might be a user with some funky browser, but probably a bot.
Level 1: Cylon. Block.

That’s all good, but what kind of bot filtering do we actually do? We have a pretty wide variety of traps and JavaScript challenges that are generated on the fly and randomly placed within elements on a given page.
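One simple kind of trap, shown here as a hedged sketch rather than as one of the actual traps described above, is a hidden link with a one-time token dropped at a random spot in the page. A human never sees it; a crawler that follows every link requests it and gives itself away. The `/trap/` path, the markup, and the choice of paragraph endings as insertion points are all assumptions.

```python
import random
import secrets


def inject_trap(html, rng=None):
    """Insert a hidden link with a fresh token at a random paragraph end."""
    rng = rng or random.SystemRandom()
    token = secrets.token_urlsafe(12)
    trap = f'<a href="/trap/{token}" style="display:none" tabindex="-1"></a>'
    marker = "</p>"
    positions = [i for i in range(len(html)) if html.startswith(marker, i)]
    if not positions:
        return html + trap, token   # no paragraph found: append at the end
    at = rng.choice(positions)
    return html[:at] + trap + html[at:], token


def is_trap_hit(path, issued_tokens):
    """True when a requested path matches a token handed out earlier."""
    prefix = "/trap/"
    return path.startswith(prefix) and path[len(prefix):] in issued_tokens
```

Because both the token and the position change on every response, a bot author cannot simply hardcode the trap’s URL or location and skip it.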

Additionally, because of our identification system, we can employ advanced rate limiting algorithms on a per-machine basis rather than per IP. This lets us apply rate limiting to individual requesting clients rather than entire IPs, transforming it from a rather blunt instrument against request floods into an effective way to limit aggressive crawling as well as long-polling of a given website.
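The difference between per-IP and per-machine limiting comes down to the key used to look up the limiter. The sketch below uses a plain token bucket, a standard and much simpler algorithm than the advanced ones mentioned above, keyed by UID. The rate and capacity values are placeholders, and the `now` parameter exists only so the example can be driven without a real clock.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerMachineLimiter:
    """Keeps one bucket per UID instead of one per IP address."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, uid, now=None):
        bucket = self.buckets.get(uid)
        if bucket is None:
            bucket = self.buckets[uid] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

Two clients behind the same NAT get separate buckets, so one aggressive crawler does not exhaust the allowance of the real users sharing its IP, and a bot that hops IPs keeps draining the same bucket.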

And of course, once a bot is identified, that information is propagated across our entire network as well as our appliances.

Improving Detection Through Data Analysis

The final phase we employ is data analysis, and this occurs asynchronously. What we’ve built and continue to expand and elaborate on is a classification engine that correlates additional data that isn’t available instantly when a connection is made and a webpage requested. Specifically, we trawl through our data and look at the browsing behavior, targets, duration and more of a specific UID in relation to what we consider normal for the domain that UID is crawling.

What this allows us to do is determine the probability of that UID being a bot given the data we’ve observed about that UID and the domain it was browsing. This is accomplished by training our classifiers on all the data we’ve gathered across our entire network based on the type of domain (e.g. e-commerce site, social media site) as determined by a human.

The last bit is key, as we’ve found that different types of sites have different types of browsing behavior: social media users tend to browse in small networks or clusters, while e-commerce users tend to browse in much larger clusters. Classifiers trained on one type of site wouldn’t be that effective on another.

Once our classification engine has made a determination on the probability of a type of UID being a bot, it compares that probability to a threshold we set. If the threshold is exceeded, the UID is pushed out to the global network and the now-labeled scraper is stopped.
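The last step, scoring a UID with the model for its site type and comparing the result to a threshold, can be sketched as follows. Everything specific here is an assumption: the two features, the hand-picked logistic weights that stand in for a trained classifier, and the 0.9 threshold. The sketch shows only the shape of the decision, not the classification engine itself.

```python
import math

# Hypothetical per-site-type models. Real weights would come from training
# on labeled traffic for that type of site, not from hand-picked numbers.
MODELS = {
    "ecommerce": {"bias": -4.0, "pages_per_minute": 0.15, "distinct_sections": 0.05},
    "social": {"bias": -4.0, "pages_per_minute": 0.10, "distinct_sections": 0.20},
}


def bot_probability(site_type, features):
    """Logistic score for one UID, using the model for its site type."""
    model = MODELS[site_type]
    z = model["bias"] + sum(model[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def uids_to_block(site_type, observations, threshold=0.9):
    """Return the UIDs whose bot probability meets or exceeds the threshold."""
    return [
        uid
        for uid, features in observations.items()
        if bot_probability(site_type, features) >= threshold
    ]


# Made-up observations for two UIDs on an e-commerce domain.
observed = {
    "uid-fast": {"pages_per_minute": 60, "distinct_sections": 10},
    "uid-slow": {"pages_per_minute": 3, "distinct_sections": 2},
}
flagged = uids_to_block("ecommerce", observed)
```

Keeping one model per site type mirrors the observation above that behavior which is normal on one kind of site can look suspicious on another; the list returned by `uids_to_block` is what would then be pushed out to the rest of the network.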

Final Thoughts

The core of our bot blocking technology is being able to identify incoming connections, and then track and observe those IDs as they continue browsing. Because we no longer have to worry that a bot will slip through our net by changing its IP address, the field opens up. Our hands are freed to concentrate on newer, better traps, and more powerful, tailored classifiers.

You can apply this advanced approach to your own website in just a couple of hours. Request to see how easy it is to block malicious bot traffic, malware and competitors without impacting legitimate users.