Overview

One of the world’s leading hotel booking websites assists anyone looking to book hotel accommodation.

Through web scraping, the hotel booking website’s competitors kept close track of its prices. After extracting the company’s data, they adjusted their own pricing to slightly undercut it.

Challenges

Early on, the web scrapers used a fairly standard, repetitive pattern that the team could identify easily. Over an 18-month period, the hotel booking website watched scraping methods grow steadily more sophisticated. Manual IP address-based rate limiting proved too basic and was no longer effective. Competitors’ bots could easily masquerade as human site visitors and became far more aggressive.
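To illustrate why IP address-based rate limiting stops working once scrapers spread their requests across many addresses, here is a minimal sketch of a sliding-window limiter. The class name, the limit of 5 requests per 60 seconds, and the IP addresses are all hypothetical; nothing here reflects the company’s actual configuration.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter keyed by client IP (illustrative only)."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter: PerIpRateLimiter, requests) -> int:
    """Count how many (ip, timestamp) requests the limiter lets through."""
    return sum(1 for ip, ts in requests if limiter.allow(ip, ts))


# A naive scraper: 100 requests from one address within a single window.
single_ip = [("203.0.113.7", i * 0.1) for i in range(100)]
# A distributed scraper: the same 100 requests spread over 50 addresses.
many_ips = [(f"198.51.100.{i % 50}", i * 0.1) for i in range(100)]

blocked_naive = 100 - count_allowed(PerIpRateLimiter(5, 60.0), single_ip)
blocked_distributed = 100 - count_allowed(PerIpRateLimiter(5, 60.0), many_ips)
print(blocked_naive, blocked_distributed)
```

In this toy setup the single-address scraper is throttled almost immediately, while the distributed one never trips the limit because each address makes only two requests.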

“We started seeing randomized requests from JavaScript-enabled browsers,” the architect observed. “The scrapers were consuming so much of our wholesaler query allocation that we weren’t able to deliver our pricing to legitimate users—a sort of application denial of service.”

“Ordinarily, 1,000 search requests might yield a single booking, a look-to-book ratio the wholesalers place on us. If competitors’ bots are themselves making those search requests, we quickly exceed that metric. A few wholesalers were lenient and simply called to ask that we investigate, but others were imposing a strict rate limit that we were exceeding.”
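The look-to-book arithmetic can be sketched as follows. Only the 1,000-to-1 ratio comes from the quote above; the search, booking, and bot volumes are hypothetical numbers chosen to show how bot searches inflate the ratio without adding bookings.

```python
def look_to_book_ratio(searches: int, bookings: int) -> float:
    """Searches sent to a wholesaler per completed booking."""
    if bookings <= 0:
        raise ValueError("bookings must be positive")
    return searches / bookings


ALLOWED_RATIO = 1000      # from the quote: about 1,000 searches per booking

human_searches = 1_000_000  # hypothetical legitimate search volume
bookings = 1_000            # hypothetical bookings from those searches
bot_searches = 500_000      # hypothetical scraper searches, zero bookings

clean_ratio = look_to_book_ratio(human_searches, bookings)
ratio_with_bots = look_to_book_ratio(human_searches + bot_searches, bookings)

print(f"without bots: {clean_ratio:.0f}:1")
print(f"with bots:    {ratio_with_bots:.0f}:1")
print("over wholesaler limit:", ratio_with_bots > ALLOWED_RATIO)
```

Under these assumed volumes, scraper traffic alone pushes the ratio from 1,000:1 to 1,500:1, which is the kind of overrun that would trigger a wholesaler’s rate limit.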

“It quickly became clear that we weren’t going to be able to block the scraper bots without devoting a large chunk of in-house resources to that goal. We initially had 2-1/2 people working on the problem full-time, which conservatively could have cost us $250,000 (USD) in annual staffing costs alone, never mind the other negative impacts it was having on our business. That’s when we learned about Imperva Bot Management (formerly Distil Networks).”

The Results

Deploying Imperva Bot Management with Akamai, HAProxy, and Cisco IPS

The hotel booking website uses Akamai for its content delivery network, coupled with its Dynamic Site Accelerator.

In this typical deployment, requests arrive at the hotel booking website’s load balancer and are passed to its Imperva appliance for inspection. The requests are then looped back into a separate VIP on the load balancer, where balancing evaluation occurs, and forwarded to the origin server application pool. On the return path, the flow is reversed so the Imperva appliance can carry out HTTP stream injection before the responses are returned. All of this happens in about 3 to 7 milliseconds.

Investment justification

Before implementing Imperva Bot Management, the hotel booking website suffered traffic spikes during very aggressive web scraping periods, resulting in application denial of service. With Imperva, site availability remains constant. The daily scraping activity that once made up a significant portion of searches sent to wholesalers has also been eliminated, so the hotel booking website can expand its presence without additional investment.

Using the Imperva Portal, the architect’s analysis showed that the company saved 20% on both infrastructure and wholesaler API bandwidth. “The metrics we look at are the proportion of bad bots versus total traffic. But then I also use the CAPTCHA report to reassure us that we’re not mistakenly serving incorrect data. Yesterday we served 283,000 CAPTCHA forms, and only had 166 attempts supplied by humans.”
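As a quick check on the CAPTCHA figures quoted above, the share of served forms that drew a human attempt can be computed directly. This sketch assumes the 166 attempts are human responses to the 283,000 forms served that day; the variable names are illustrative.

```python
captcha_forms_served = 283_000  # from the architect's quote
human_attempts = 166            # from the architect's quote

# Share of served CAPTCHA forms that drew a human attempt, as a percentage.
human_attempt_pct = human_attempts / captcha_forms_served * 100

# Roughly how many forms were served for each human attempt.
forms_per_human_attempt = captcha_forms_served / human_attempts

print(f"human attempts: {human_attempt_pct:.3f}% of forms served")
print(f"about 1 human attempt per {forms_per_human_attempt:.0f} forms")
```

On that reading, well under a tenth of a percent of challenged requests came from people, which is why the architect treats the report as reassurance that legitimate visitors are rarely caught.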

Coupled with the hotel booking website’s own logs, the architect likes to use the Imperva Portal’s Threat by Organization Report. After initial deployment, “We saw a lot of activity coming from Google account services. We still get a lot from AWS, so it has been useful to tell Amazon, ‘Hey, someone is misusing your servers to breach our terms of service. Can you stop them?’ And they do.”

Remember the 2-1/2 full-time staff members who had their fingers in the dike? They’re now free to perform more strategic duties. In deploying Imperva, the hotel booking website realized an immediate reduction in bot traffic, bandwidth, and overhead costs. As the architect says, “The thing that all of us enjoy so much is that it just works, where so many times other solutions don’t.”