The Rise of Advanced Persistent Bots

October 3, 2016 Peter Zavlaris

Gone are the days of the simple, dumb bot; bots have reached a new level of sophistication. They can connect to sites via a browser and execute JavaScript. To simulate humans, these brainier bots are programmed to limit their request rates, vary the time spent on any given page, and even simulate mouse movement. All of this makes them more evasive and harder to block. And they've gone from being dispatched from a single device to operating across large distributed networks of virtual machines or "zombie" computers, all remotely controlled without their owners' awareness.

Meanwhile, web applications are increasingly vulnerable to automated threats and are subjected to click fraud, scraping, account takeover, comment spam, and more. These and other malicious or illicit activities are described in detail in the OWASP Automated Threat Handbook for Web Applications.

For years, dumb bots scripted in Python or driven by cURL have been able to download the entire contents of a website. They connect from a single ISP address and make web requests from a command prompt. Standard bot mitigation practice has been to combine rate limiting with log analysis: identify the IP addresses of the bad bots, then rewrite firewall rules to block any future requests from those addresses.
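To make that traditional defense concrete, here is a minimal sketch, assuming an nginx- or Apache-style access log in which the source IP is the first field; the log path and threshold are illustrative placeholders, not Distil recommendations.

```python
# Count requests per source IP in an access log and flag heavy hitters
# for a firewall (e.g., iptables/ACL) block. Path and threshold are
# hypothetical values chosen for illustration.
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical log location
THRESHOLD = 1000                        # requests before an IP is flagged

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        parts = line.split()
        if parts:                       # common/combined log format: IP is the first field
            hits[parts[0]] += 1

for ip, count in hits.most_common():
    if count > THRESHOLD:
        print(f"block candidate: {ip} ({count} requests)")
```

Against a simple bot hammering a site from one address, this approach works; the rest of this post explains why it breaks down against distributed attackers.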

[Figure: Simple bot]

Marty Boos, Sr. Director of Technology Operations at StubHub, explains in his video testimonial: “Until recently, bots were fairly unsophisticated. People used cURL or some other non-browser based tools to mine a lot of data off of our site. That’s morphed into browser-based plugins—Selenium, DejaClick—things like that.”

Evolving bot sophistication poses a challenging problem for defenders. Bot builders are all too aware that yesterday's web defenses hinged primarily on IP recognition and blocking, an approach that works against simple bots. But the game has changed.

The Rise of Advanced Persistent Bots (APBs)

Today's sophisticated bad bots are both advanced, meaning they can load JavaScript, hold onto cookies, and load external resources, and persistent, in that they can randomize their IP addresses, headers, and user agents. Almost 90% of malicious or illicit bots are now considered APBs.

An APB is controlled from a central computer. Using orchestration software (e.g., SaltStack), the bot herder can spin up multiple instances through cloud providers such as Amazon AWS and DigitalOcean to extract the desired data from a target site. Each instance can have numerous IPs, and all instances can attack simultaneously or be dispatched in waves.

APBs are the perfect weapon for circumventing log analysis and basic IP blocking. They attack from as many addresses as it takes to bypass the weak security controls at the target, and they stay under rate-limiting thresholds by reducing the number of requests made per IP.
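To see why per-IP thresholds fall short, consider a bare-bones rate limiter like the sketch below; the window and limit are hypothetical values. With a cap of 60 requests per minute per IP, a botnet of 500 instances can still pull up to 30,000 pages a minute without any single address ever being flagged.

```python
# A minimal per-IP sliding-window rate limiter, the kind of threshold an APB
# sidesteps simply by spreading requests across many addresses.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 60            # hypothetical per-IP threshold

recent = defaultdict(deque)             # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return True if this request stays under the per-IP threshold."""
    now = time.time() if now is None else now
    window = recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                # drop timestamps outside the window
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False                    # this single IP is rate limited
    window.append(now)
    return True
```

Each APB instance simply stays below the per-IP cap, so no individual rule ever fires even as the aggregate attack volume climbs.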

This is the new adversary. Organizations looking to protect page content, block cyber thieves from taking over customer accounts, or prevent hackers from performing vulnerability scans must have a far more robust solution to defend against such distributed attacks.

How Do You Stop APBs?

The key to preventing bad bot infiltration is to positively ID every visitor—human or bot. One highly effective method is to determine whether the requester is actually what it claims to be. A requester's header information offers clues.

For example, a requester might claim, “I’m Internet Explorer running on a laptop.” If that is true, then its headers should be formatted a certain way, its TCP packets should correspond to the operating system it's ostensibly running, and its JavaScript engine should behave accordingly. This method can eliminate as many as 80% of bad bots, but what about the remaining 20% that still get through?
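As a simplified illustration of that consistency check, and not of Distil's actual logic, the sketch below flags requests whose headers don't match their browser claim; the specific heuristics and header values are assumptions chosen for clarity.

```python
# Flag requests whose headers are inconsistent with the browser they claim to be.
# Real browsers typically send Accept-Language and a rich Accept header; bare HTTP
# clients often send "Accept: */*" and little else. Heuristics are illustrative only.
def looks_inconsistent(headers):
    ua = headers.get("User-Agent", "")
    claims_browser = any(token in ua for token in ("MSIE", "Trident", "Chrome", "Firefox", "Safari"))
    if not claims_browser:
        return True                                   # no recognizable browser claim
    if "Accept-Language" not in headers:
        return True                                   # real browsers normally send a language
    if headers.get("Accept", "*/*").strip() == "*/*":
        return True                                   # generic Accept header is a tell
    return False

# Example: a cURL-style request wearing an Internet Explorer User-Agent.
suspect = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
    "Accept": "*/*",
}
print(looks_inconsistent(suspect))  # True
```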

Machine Learning-Based Blocking

Some bad bots have become so advanced that they automate an actual browser. In such cases it's necessary to evaluate requester behavior using machine learning, because a bot's browsing pattern looks different from legitimate traffic.

A metrics profile can reveal anomalies that distinguish bots. For example, how did the requester enter the site? What time of day did it come in? Is it connecting from an ISP or a datacenter? What pages did it go through? How did it navigate through the site?

Bots tend to be either highly random or rigidly systematic; either way, their patterns help separate legitimate traffic from illegitimate traffic.
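As a rough illustration of behavior-based detection, and not of Distil's model, the sketch below fits an off-the-shelf anomaly detector (scikit-learn's IsolationForest) on hypothetical per-session features like the ones listed above.

```python
# Fit an anomaly detector on per-session behavior features and score new sessions.
# Feature choices and values are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [pages per minute, mean seconds per page, hour of day, from_datacenter]
human_sessions = np.array([
    [2.1, 34.0, 14, 0],
    [1.4, 51.0, 20, 0],
    [3.0, 22.0,  9, 0],
    [1.8, 40.0, 17, 0],
])

model = IsolationForest(random_state=0).fit(human_sessions)

# A scripted session: very fast, uniform timing, off-hours, datacenter IP.
bot_session = np.array([[45.0, 1.2, 3, 1]])
print(model.predict(bot_session))  # -1 means "anomalous" in scikit-learn
```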

Advanced Digital Fingerprinting

Most security solutions today use access control lists (ACLs) or other blocking mechanisms based on the positive ID of a given IP address. By contrast, Distil creates a digital fingerprint by peering deeply into the requesting device's browser, using 200 unique markers to identify it, and can then block bad actors based on that fingerprint. If a bad actor shifts its IP, the fingerprint stays the same and the requester can still be identified.
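The sketch below illustrates the general idea, hashing a handful of hypothetical browser markers into an identifier that survives IP rotation; it is not Distil's implementation, which draws on roughly 200 markers.

```python
# Hash a set of browser/device markers into a stable identifier. The markers and
# their values are illustrative placeholders, not a real fingerprinting schema.
import hashlib
import json

def fingerprint(markers):
    canonical = json.dumps(markers, sort_keys=True)           # order-independent
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

device = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
    "screen": "1366x768",
    "timezone_offset": -300,
    "canvas_hash": "a41b...",                                  # placeholder value
    "webgl_renderer": "ANGLE (Intel HD Graphics)",             # placeholder value
}

print(fingerprint(device))  # same output no matter which IP the request came from
```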

Conclusion

Modern bad bots are advanced and persistent. Attacks are distributed across vast networks so that while security teams play whack-a-mole, threat actors steal data. Performing log analysis and using ACLs to isolate and block malicious or illicit IPs is no longer sufficient to solve the bot problem. The key to keeping bad bots off your site is positive identification using a variety of methods: header evaluation, machine learning, and fingerprinting. The rise of the advanced persistent bot calls for an advanced persistent defense.

Supplying bad bot armies to steal content or perform other illicit activities has become a booming industry, and now anyone can cheaply and easily target your site. Learn more about the players, technologies, and services in our 2016 Economics of Web Scraping Report.
 