In recent years, software companies have been clamoring for automated tests for their products. We've seen the benefits of testing and continuous integration, and they are significant. The ability to push new code into production with (hopeful) impunity is a far cry from the days when an army of manual testers was required to run through a scripted series of smoke tests before a new product was released upon the masses. Suffice it to say, it's a much better proposition to have a robot perform your regression testing without handholding than to have everyone lose a Friday night smoke testing before a release.
One of the more popular ways to functionally test a web application is with an automated browser driven by a web driver (Selenium, Watir, etc.). Because of this popularity, many developers and QA engineers have honed their skills at creating suites of regression tests that verify the functionality behaves the way it SHOULD.
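To make that concrete, here is a minimal sketch of such a regression check using Selenium's Python bindings. The URL, field names, and "redirect to /dashboard on success" convention are all hypothetical stand-ins for a real application under test, and running the browser portion requires a local chromedriver.

```python
# A sketch of a webdriver-based functional test. The site, selectors,
# and credentials below are illustrative assumptions, not a real app.

def login_succeeded(current_url: str) -> bool:
    """Heuristic success check: assume the app redirects to /dashboard."""
    return current_url.rstrip("/").endswith("/dashboard")

def run_login_check():
    # Requires `pip install selenium` and a chromedriver on PATH.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/login")
        driver.find_element(By.NAME, "username").send_keys("qa_user")
        driver.find_element(By.NAME, "password").send_keys("secret")
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        # The regression assertion: did we land where a real user would?
        assert login_succeeded(driver.current_url), "login flow broke!"
    finally:
        driver.quit()

if __name__ == "__main__":
    run_login_check()
```

The key habit this encodes is asserting on user-visible outcomes (the URL after login) rather than on implementation details.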
Automated tests, however, are not the only use for these tools. On the more "nefarious" side, they can be used to create intelligent bots that access sites and scrape the precious data from them. For instance, say someone wants to create a site for the gluten-free craze that's going around. This site would aggregate data on restaurants that offer gluten-free options, along with their locations, prices, and hours.
Someone with experience in the QA automation world would begin by creating a browser instance pointed at Yelp and, with some logic, pulling down and storing the information. But that would only exercise the fundamentals learned from coding tests against a known application in a controlled environment. What if other variables were put into place? What if you had to get around (more than basic) security? What if someone was actively trying to detect bots?
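The "pull down and store" step can be sketched with nothing but the standard library. The markup below is a hypothetical listing page where each restaurant name sits in a `<span class="biz-name">`; real sites (Yelp included) use different markup and restrict scraping in their terms of service, so treat this purely as an illustration of the parsing technique a webdriver-driven bot would apply to the page source it retrieves.

```python
# Extract restaurant names from (hypothetical) listing-page HTML.
from html.parser import HTMLParser

class BizNameParser(HTMLParser):
    """Collect the text of every <span class="biz-name"> element."""
    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "biz-name") in attrs:
            self._in_name = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name and data.strip():
            self.names.append(data.strip())

# In a scraper, `sample` would be driver.page_source from the browser.
sample = ('<span class="biz-name">Flourless Kitchen</span>'
          '<span class="biz-name">The GF Grill</span>')
parser = BizNameParser()
parser.feed(sample)
print(parser.names)  # ['Flourless Kitchen', 'The GF Grill']
```

In practice the same parser would also pick up location, price, and hours fields, then write each record to a datastore.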
As an Automation Engineer, these variables demanded a new way of thinking about a tool I was already extremely familiar with. Instead of going heads-down and knocking out strict happy-path or negative test cases, other conditions had to be taken into account. Running into CAPTCHAs as a detection device is a common occurrence, but once past those, what else should I be aware of? How is this site detecting that I am not a human? Should I build some sort of rudimentary AI behavior into my browser? These are all questions I faced as an experienced Automation Engineer diving into the realm of web scraping.
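One rudimentary answer to the "behave like a human" question is simply pacing: a test suite fires actions back-to-back at machine speed, while a human pauses unevenly between clicks and keystrokes. The sketch below generates jittered inter-action delays; the base and jitter values are guesses for illustration, not tuned against any real detector.

```python
# Generate human-ish pauses to insert between scripted browser actions.
import random

def human_delays(n_actions, base=1.0, jitter=0.8):
    """Return one pause per action, each base +/- uniform jitter seconds."""
    return [base + random.uniform(-jitter, jitter) for _ in range(n_actions)]

# A bot would sleep for the next value before each click or keystroke,
# e.g. time.sleep(pause), instead of acting at a fixed machine cadence.
pauses = human_delays(5)
print(pauses)
```

Thinking about what a detector measures (timing, mouse paths, request patterns) is exactly the inversion that later sharpens test design: you start modeling how real users actually behave.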
Obviously, people have found solutions to these questions, but bot detection is constantly evolving. Because of this, I believe that web scraping and attempting to bypass these systems has changed the way I write and structure my functional test suites. Trying to slip past security systems while disguised as a human has expanded my thinking about how users actually interact with systems. That perspective is absolutely invaluable when creating automated test plans that focus on edge cases and "what ifs." I encourage anyone with an interest in automated testing to give scraping a try. Not only will it expand your horizons in the automated testing world, but it may spark a newfound interest in architecting new ways to block harmful bots.
About the Author
Stephen Atkinson, Distil's QA Automation Engineer, started his software engineering career as a developer at East Carolina University, where he took part in research and development of video games used in a medical environment for diagnosis and rehabilitation. From there, he expanded his knowledge into automated software testing at iContact and later, 6fusion.