How do scraping bots affect your business?
OWASP AUTOMATED THREATS EXPLAINED
Price Scraping | Denial of Service | Skewing
Online retail has become intensely competitive, and retail websites are under constant assault by nefarious online actors, including major industry competitors. These threat groups leverage bad bots in numerous ways that hurt online retailers.
Bad bots scrape prices and product data, perform click fraud, and endanger the overall security of e-commerce websites, customer loyalty, and brand reputation. Of all bad bot threats, price scraping and product data scraping are the most rampant and costly to online retailers.
An industry built to scrape pricing and product data
Online retailers have spent years and millions of dollars establishing their brand presence online and garnering a loyal following of customers. These customers represent the lifeblood of the business. Yet, a large and growing pool of online competitors are working hard each day to steal away these customers and permanently win their business.
These bad actors seek to scrape information from legitimate online retail sites to gain product and pricing intelligence that can be used to undercut their pricing or position against their offerings. Whether termed ‘price scrapers’, ‘pricing bots’, or ‘pricing intelligence solutions’, an entire industry has grown around the use of automated bots dedicated to scraping as much data as possible from online retailers’ websites.
These are the five most common types of data scraping activities online retailers can expect to find on their websites:
- Price Scraping - Bots target the pricing section of a site and scrape pricing information to share with online competitors.
- Product Matching - Bots collect and aggregate hundreds, or thousands, of data points from a retail site in order to make exact matches against a retailer’s wide variety of products.
- Product Variation Tracking - Bots scrape product data to a level that accounts for multiple variants within a product or product line, such as color, cut and size.
- Product Availability Targeting - Bots scrape product availability data to enable competitive positioning against an online retailer’s products based on inventory level and availability.
- Continuous Data Refresh - Bots visit the same online retail site on a regular basis so that buyers of the scraped data can react to changes made by the targeted retail site.
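At their core, all five activities reduce to the same extraction step. The following is a minimal sketch of how a scraping bot might pull price, product, and availability data out of product-page markup; the HTML structure, class names, and SKUs are hypothetical, and a real bot would fetch live pages and tailor its selectors to each target site.

```python
# Minimal sketch of a price-scraping bot's extraction step.
# The markup, class names, and SKUs below are hypothetical examples;
# real bots fetch live product pages and adapt selectors per site.
import re

SAMPLE_PRODUCT_PAGE = """
<div class="product" data-sku="SKU-1001">
  <span class="name">Garden Chair</span>
  <span class="price">$129.99</span>
  <span class="stock">In Stock</span>
</div>
<div class="product" data-sku="SKU-1002">
  <span class="name">Patio Table</span>
  <span class="price">$249.00</span>
  <span class="stock">Out of Stock</span>
</div>
"""

def scrape_products(html):
    """Extract sku, name, price, and availability from product markup."""
    pattern = re.compile(
        r'data-sku="([^"]+)".*?class="name">([^<]+)<.*?'
        r'class="price">\$([\d.]+)<.*?class="stock">([^<]+)<',
        re.DOTALL,
    )
    return [
        {"sku": sku, "name": name, "price": float(price),
         "in_stock": stock == "In Stock"}
        for sku, name, price, stock in pattern.findall(html)
    ]

products = scrape_products(SAMPLE_PRODUCT_PAGE)
```

Run against every product page on a site once or more per day, a loop this simple is enough to feed the price-matching and availability-targeting activities described above.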
Preventing price scraping bots can be difficult
There are several reasons for this. First, the nature of the industry makes identifying and prosecuting those who launch bots extremely difficult, time-consuming and expensive. Bots can originate from practically any location in the world, and most often come from well-known hosting providers and networks that organizations trust. Second, bots engaged in outright illegal activity frequently originate from international locations, where US laws provide little or no recourse even once the bot operators are identified. Third, most bot technologies are not even considered criminal. In fact, vendors that offer price scraping solutions publicly tout the names of large, legitimate businesses as their customers. Consider a vendor named Upstream Commerce, which markets price scraping and product data scraping solutions.
Their homepage streams the names of retail industry customers, including Toys-R-Us, Petco, Lowe's and eBags. In addition, they make no apologies as they promote their bot-based scraping technologies this way: “Upstream Commerce™ transforms the way retailers grow sales and boost margins through real-time pricing and product assortment optimization, using state of the art Predictive and Prescriptive Analytics and competitive intelligence tools.” Scraping vendor Kapow Software even more brazenly describes their price monitoring product: “Increase market share by collecting real-time competitive information via automated data feeds from competitive websites.”
As more vendors compete to offer increasingly sophisticated price scraping and product data scraping products, an arms race of capabilities is taking place. For online retailers, this means a drastic rise in the need for a holistic approach to bot defense as sophisticated and adaptable as the threat itself.
The how and why of price scraping from ‘pricing intelligence solutions’
Termed ‘pricing intelligence solutions’, these products are bringing advanced features and higher levels of sophistication to attacks on retailer websites. Built atop traditional price scraping functionality, they incorporate additional elements to steal and use victims’ retail site data in the most scalable and damaging manner possible. Here’s how their price scraping techniques work and what they’re after:
- Crawling: Ever-changing and custom built bots crawl a retailer’s site. Many hit the site each day, targeting the product pages specifically and scanning every product.
- Morphing: Bots take the form of data extractors that morph in line with changes to a retailer’s site. This way, bots can attack a site, even if the retailer adds complexity.
- Matching: Bots collect thousands of pieces of information to make exact matches against a retailer’s products. They lift pricing, inventory and availability data from the shopping cart and other site locations.
- Analytics: Leveraging semantic analysis and data mining, analytics engines sift through the data scraped by bots and make matches to prices and products, even if the scraped data is not an exact match. This, and other advanced analytics, enables the recipient of the scraped data to make multiple competitive positioning moves that steal away competitive advantage.
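The matching and analytics steps above can be sketched with a simple string-similarity approach. This is an illustrative sketch only, assuming a hypothetical catalog and an arbitrary 0.6 similarity threshold; commercial pricing-intelligence products use far heavier semantic analysis and data mining than `difflib` offers.

```python
# Sketch of the fuzzy product-matching step: pairing a scraped product
# title with a competitor's catalog even when the titles differ.
# The catalog entries and the 0.6 threshold are illustrative assumptions.
from difflib import SequenceMatcher

catalog = {
    "SKU-1001": "Outdoor Garden Chair - Teak",
    "SKU-1002": "Patio Dining Table, 6-Seat",
}

def best_match(scraped_name, catalog, threshold=0.6):
    """Return (sku, score) of the closest catalog entry, or None."""
    def score(title):
        return SequenceMatcher(None, scraped_name.lower(), title.lower()).ratio()
    sku, title = max(catalog.items(), key=lambda kv: score(kv[1]))
    s = score(title)
    return (sku, s) if s >= threshold else None

# A scraped title with reordered words still resolves to the right SKU:
match = best_match("Teak Outdoor Garden Chair", catalog)
```

Once scraped listings are matched to SKUs this way, the buyer of the data can line up its own prices against the victim's, product by product.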
The ramifications of this type of advanced approach to retail site scraping are dire for many online retailers. The fact is, online retailers must stop bots from scraping their data in the first place, before the real harm occurs downstream. Otherwise, the retailers face a losing battle once competitors apply advanced analytics against the data. This final step can take apart an online retailer’s business — product by product.
Price scraping and data scraping case studies
Nefarious competitors using bots to aggressively scrape prices
Like most leading online retailers, Hayneedle has been engaged in a daily struggle against competitors using bots to scrape its pricing. Brian Gress, Hayneedle’s Director of IT Systems & Governance, tells us, “We’ve worked through various in-house bot mitigation methods with limited success. Initially, we would segment bad bot traffic off to a couple of underpowered servers. But as we became more aggressive with our pricing we drew traffic away from competitors, so they’d scrape us to see what we were selling and at what price.”
“Some bot operators interject scripts into their code to mimic human behavior. Others will circumvent the web product page and go directly to the API behind it. Before Distil, bot blocking was eating up 20% of team resources, was only 30% effective, and was impacting the team’s quality of life. Distil lets us granularly control questionable traffic or simply block it all,” says Gress.
Build.com’s Dan Davis, VP of Technology, explains, “Googlebot and Bingbot are critical to our business—we want to make sure they aren’t impeded. But rogue bots have irregular browsing patterns that can affect site performance. In a few severe, DDoS-style cases, bot cart activity consumed all available threads, ultimately causing outages. Multiple carts were simultaneously being loaded up with thousands of items, causing infrastructure and software slowdowns. At the time, the application we were using to thwart cart bots would get exponentially slower as more items were added to a cart.”
Minimum Advertised Pricing Policy (MAPP) dictates public product price listings, but many companies work around that by offering cart discounts. A product page may show a $100 price, but once the item is added to a cart, there might be an automatic 10% discount. “Bot operators were probably trying to learn shipping charges and fully-loaded costs, such as sales tax. Most likely their goal was to collect data and sell it to a competitor. There will always be a way to learn what we sell our product for. But if we can make it burdensome, it lessens competitors’ ability to compete on pricing.”
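The gap between the advertised price and the price a buyer actually pays is exactly what cart-scraping bots are after. A minimal sketch of that “fully loaded” calculation follows; the 10% cart discount, 7.25% tax rate, and flat shipping charge are hypothetical values chosen for illustration.

```python
# Sketch of the "fully loaded" price a cart-scraping bot wants to learn.
# The 10% cart discount, 7.25% tax rate, and flat shipping fee are
# hypothetical illustration values, not any retailer's actual terms.
def fully_loaded_price(list_price, cart_discount=0.10, tax_rate=0.0725,
                       shipping=9.95):
    """Price a buyer actually pays, versus the advertised list price."""
    discounted = list_price * (1 - cart_discount)
    return round(discounted * (1 + tax_rate) + shipping, 2)

# A $100.00 product-page listing resolves to a different real cost
# once the cart discount, tax, and shipping are applied:
total = fully_loaded_price(100.00)
```

Because none of these numbers appear on the product page, a bot must actually carry out the “Add To Cart” flow to capture them, which is why cart scraping is so much more computationally expensive for the target site than simple HTML scraping.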
“Distil’s hands-off, proactive approach saves Build.com two to three man-days per month, and is much more effective than the ‘whack-a-mole’ we were playing writing F5 iRules. We’ve also prevented at least three outages and freed up 25% of our F5 LTM capacity,” says Davis.
According to Tom Frenchu, CIO of TabCom, “Most of the bots targeting our websites were competitive crawlers looking for full-load pricing information. Today there are all these crawlers that are smart enough to search, perform an ‘Add To Cart’ action, and scrape for price. All of those search operations represent a higher computational hit—it's not simple HTML scraping. Such bots tax our backend servers and can really impact site performance,” he explains.
“I might be concurrently running 20 different promotions for 14 different sites. Pricing is dynamic between channel and email offers. The bots have become smart and fast enough these days to respond to the complexity of Internet marketing. They can hit, run through their scripts extremely quickly, and pull down our systems. If given free rein, they would be running a lot of queries against multiple scenarios for many different sites.”
Frenchu says their infrastructure was getting overburdened with such unwanted, continual search requests and ultimately killing response times for legitimate customers. Having all of those bots carting items—a computationally intensive operation—put a lot of pressure on internal servers and databases. “In the e-commerce world, it’s a given that if customers can't get to our sites, then there are no sales happening. Any such abrupt halt in the revenue stream—even a temporary one—presents a massive business problem,” Frenchu says.
“Everyone's after a price competitive edge. Amazon has put a lot of pressure on every reseller—not just small or midsize, but every reseller. They change their prices numerous times throughout the day. If we drop our price to $1.99, an hour later it could be $1.98 at another site. Then they appear higher in search engine price sorts. Boom—we just lost a sale. Shopping loyalty is a challenge and keeping customers satisfied is difficult, if you just rely on price,” explains Frenchu.
“Before deploying Distil, I estimate that 20% of our traffic was bots, but it varied tremendously. One of our sites was almost 80% bots, while some of the premium domains were around 10%. Now we don’t have to deal with them at all. We’re saving about 10% in cloud infrastructure costs, as well as extending the life of internal servers by not overburdening them as they were, pre-Distil,” says Frenchu.
About the Author
Elias Terman is VP of Marketing and is responsible for all aspects of the global marketing and communications strategy. Elias started his career as an entrepreneur, and now enjoys helping grow Silicon Valley startups into industry leaders. He built out the marketing and business development organizations at OneLogin leading to explosive growth, helped establish SnapLogic as the leading independent integration company, and led MindFire Studio to the Inc. 500.