You may have seen this Wired article go viral yesterday: How a Math Genius Hacked OkCupid to Find True Love. It was a techie, data-driven love story. A mathematician named Chris McKinlay scraped over 20,000 user profiles from the dating site OkCupid and used them to build a statistical model to predict the compatibility of potential mates. Then he scraped another 5,000 profiles to ensure the model worked and set off on his (statistically optimal) quest to find ‘the one’.
The story had a fairy-tale ending, with Chris meeting the girl of his dreams and a wedding date under ‘analysis’. Good for Chris!
It’s worth noting, however, that the same techniques this UCLA PhD student employed to scrape OkCupid are also used every day for more nefarious purposes. Content scraping is big business: attackers do everything from stealing and republishing original content to performing competitive pricing analysis. It can have a material impact on a business’s bottom line.
That’s where Distil comes in – we block bots and keep your content safe. So what if OkCupid had been using Distil to protect its content? Let’s see how Chris’ scraping attempts would have fared with our content protection in place:
Attempt 1: Program a Python Script
Quoting from the story:
[Chris] set up 12 fake OkCupid accounts and wrote a Python script to manage them. The script would search his target demographic (heterosexual and bisexual women between the ages of 25 and 45), visit their pages, and scrape their profiles for every scrap of available information…
Client identification is the first step of Distil’s content protection system. By sitting inline with HTTP traffic, Distil is able to fingerprint not just the source IP but the unique client making a request. This fingerprint includes many bits of information, including whether the request is coming from a desktop browser, a mobile browser, or an automated script. If Distil recognizes the fingerprint as an automated script, we can block the request outright before it can scrape any content.
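To make the idea concrete, here is a minimal sketch of header-based client identification. It is not Distil’s actual fingerprinting logic, which this post does not describe in detail. The function names `classify_client` and `fingerprint`, the marker lists, and the choice of headers are all assumptions made for illustration.

```python
import hashlib

# Hypothetical substrings that often appear in the default User-Agent
# of common HTTP libraries and command-line tools.
SCRIPT_MARKERS = ("python-requests", "python-urllib", "curl", "wget", "scrapy")
# Hypothetical substrings that suggest a mobile browser.
MOBILE_MARKERS = ("mobile", "android", "iphone")


def _normalize(headers):
    """Lower-case header names so lookups are case-insensitive."""
    return {name.lower(): value for name, value in headers.items()}


def classify_client(headers):
    """Return 'script', 'mobile' or 'desktop' for a dict of request headers."""
    normalized = _normalize(headers)
    user_agent = normalized.get("user-agent", "").lower()
    # A missing User-Agent or a known library name points to automation.
    if not user_agent or any(m in user_agent for m in SCRIPT_MARKERS):
        return "script"
    # Browsers normally send these headers; bare scripts often omit them.
    if "accept" not in normalized or "accept-language" not in normalized:
        return "script"
    if any(m in user_agent for m in MOBILE_MARKERS):
        return "mobile"
    return "desktop"


def fingerprint(ip, headers):
    """Derive a stable client ID from the IP plus selected header values."""
    normalized = _normalize(headers)
    names = ("user-agent", "accept", "accept-language", "accept-encoding")
    parts = [ip] + [normalized.get(name, "") for name in names]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
```

A script that sends only a library’s default headers is classified as `script` here, while two different clients behind the same IP get different fingerprints. A determined scraper can forge headers, so a real system would need many more signals than this sketch uses.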
Result: Chris’ Python script would have been identified as an automated agent and blocked at the outset.
Attempt 2: Build a Botnet
After harvesting thousands of profiles, Chris set off OkCupid’s intrusion detection system. His account was throttled, so he did what any good scraper would do – he adapted:
[Chris] turned to his friend Sam Torrisi, a neuroscientist who’d recently taught McKinlay music theory in exchange for advanced math lessons. Torrisi was also on OkCupid, and he agreed to install spyware on his computer to monitor his use of the site.
A growing number of scraping and DDoS attacks are launched from zombie nodes infected with spyware similar to the software Chris installed on his friend’s machine. Effectively, Chris built a mini-botnet.
With the inline fingerprinting mechanism described above, Distil is able to differentiate between normal and malicious requests originating from not just the same IP but the same machine. This means Chris’ spyware would have been blocked but Sam’s normal browsing would have continued unimpeded.
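The difference between blocking by IP and blocking by client can be sketched in a few lines. This is an illustrative toy, not Distil’s implementation: the `Request` record, its `client_id` and `automated` fields, and the `filter_requests` helper are all hypothetical, and the sketch assumes some earlier step has already assigned a fingerprint and an automation verdict to each request.

```python
from collections import namedtuple

# Hypothetical request record: client_id stands in for a per-client
# fingerprint, automated for the verdict of an earlier detection step.
Request = namedtuple("Request", ["ip", "client_id", "automated"])


def filter_requests(requests):
    """Split requests into (allowed, blocked), keyed on client_id, not IP."""
    flagged = {r.client_id for r in requests if r.automated}
    allowed = [r for r in requests if r.client_id not in flagged]
    blocked = [r for r in requests if r.client_id in flagged]
    return allowed, blocked


# Sam's browser and the scraping software share one machine and one IP.
traffic = [
    Request("203.0.113.7", "sam-browser", False),
    Request("203.0.113.7", "scraper", True),
    Request("203.0.113.7", "sam-browser", False),
]
allowed, blocked = filter_requests(traffic)
```

Because the decision is keyed on the client rather than the address, the two browser requests pass and only the scraper’s request is blocked, even though all three come from the same IP.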
Result: Distil can detect and block malicious requests made from spyware-infected machines.
Attempt 3: Throw Hardware at It
Ramping up his efforts Chris threw more hardware at the problem:
He brought in a second computer from home and plugged it into the math department’s broadband line so it could run uninterrupted 24 hours a day. After three weeks he’d harvested 6 million questions and answers from 20,000 women all over the country.
This is where things get interesting. In order to scrape the site in a timely fashion, Chris needed to run his botnet 24 hours a day. Unfortunately for him, Distil’s behavioral analysis would have detected this abnormality. The behavioral engine takes into account many factors, including the velocity of access, pages accessed per session, and total session length. These are compared to network-wide and domain-specific statistics. Sessions that fall outside the appropriate range are blocked.
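A range check of this kind might look like the sketch below. The three metrics come from the paragraph above, but everything else is an assumption made for illustration: the `SessionStats` type, the `failed_checks` and `should_block` helpers, and the `BASELINE` thresholds are invented, and a real engine would derive its ranges from observed network-wide and per-domain traffic.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    requests_per_minute: float  # velocity of access
    pages_per_session: int      # pages accessed in the session
    session_minutes: float      # total session length


# Hypothetical (min, max) ranges; the numbers are invented for this sketch.
BASELINE = {
    "requests_per_minute": (0.0, 30.0),
    "pages_per_session": (1, 200),
    "session_minutes": (0.0, 240.0),
}


def failed_checks(stats, baseline=BASELINE):
    """Return the names of the metrics that fall outside their range."""
    failures = []
    for name, (low, high) in baseline.items():
        value = getattr(stats, name)
        if not (low <= value <= high):
            failures.append(name)
    return failures


def should_block(stats, baseline=BASELINE):
    """Block the session if any behavioral check fails."""
    return bool(failed_checks(stats, baseline))


# A person browsing for 20 minutes versus a scraper running all day.
human = SessionStats(requests_per_minute=2.0, pages_per_session=15,
                     session_minutes=20.0)
scraper = SessionStats(requests_per_minute=5.0, pages_per_session=50000,
                       session_minutes=1440.0)
```

Note that the scraper in this example keeps its request rate modest, yet it still fails on pages per session and session length. Checking several independent metrics means that throttling a single one is not enough to look like a normal visitor.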
Result: ‘Uninterrupted’ access would have failed multiple behavioral tests and been blocked.
In conclusion – we’re happy Chris’ model helped him find true love. Congrats you two! But every day we see these same techniques used with less innocent intentions. Luckily with Distil in the mix, you can rest easy that bad traffic is differentiated from good and your content is kept safe.
Whether you agree with Chris’ tactics or not, you probably could benefit from increased insight and control over your own human, good bot and bad bot website traffic. Keep your content safe with Distil Networks and let us prevent bad bots from ever scraping your site. Sign up for your free trial today!
About the Author
John Bullard, Distil Networks’ VP of Engineering, is a technical entrepreneur focused on enterprise software. At Distil, John helps scale the core platform and DevOps teams.