At Distil Networks, we process server logs (lots of logs), and as an intern, I get to do the grunt work, pretty much exclusively grunt work. So naturally, after my first several weeks of writing test cases, these two realities collided. Processing server logs in real time to detect bots presents a whole host of technical challenges and is an interesting problem for anybody who has the privilege to work at Internet scale. After topping off my test framework with a module that validates our SSL certificates, verifies our HTTPS functionality, and that scared the bejesus out of the founders, I got to evaluate an alternative to our current background processing system (Microsoft Azure, if you’re curious).
Since our resident computer scientist had set up a Hadoop cluster a few years ago, Hadoop was an obvious choice for further investigation. The landscape around Hadoop has matured quite a lot in the last few years, and open-source tools now exist that solve many of the problems our own implementation would otherwise have to solve. Hadoop’s core components, its distributed file system and map-reduce framework, are ideal technologies for log processing. Large dumps of log files can be chopped up into blocks of data and processed by distributed machines, then compressed and written into long-term, low-cost storage by Hadoop with theoretical ease.
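To make the map-reduce idea a bit more concrete, here is a toy sketch in plain Python. It is not Hadoop's API and not our production code; the function names, the block size, and the sample log lines are all made up for illustration, and I am assuming the client IP is the first whitespace-separated field of each line. It chops the log into blocks, maps each block to (IP, 1) pairs, and reduces those pairs into a request count per IP.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Chop a list of log lines into fixed-size blocks (a stand-in for file splitting)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit (client_ip, 1) for every non-blank line in one block."""
    pairs = []
    for line in block:
        fields = line.split()
        if fields:  # skip blank lines
            pairs.append((fields[0], 1))  # assumption: IP is the first field
    return pairs


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each client IP."""
    totals = defaultdict(int)
    for ip, count in pairs:
        totals[ip] += count
    return dict(totals)


def count_requests_per_ip(lines, block_size=2):
    """Run the map step over every block, then reduce all emitted pairs."""
    blocks = split_into_blocks(lines, block_size)
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(mapped)


# Invented sample data, not real traffic.
sample_log = [
    '10.0.0.1 - - "GET / HTTP/1.1" 200',
    '10.0.0.2 - - "GET /about HTTP/1.1" 200',
    '10.0.0.1 - - "GET /pricing HTTP/1.1" 200',
    '10.0.0.3 - - "GET / HTTP/1.1" 404',
    '10.0.0.1 - - "GET /login HTTP/1.1" 200',
]

if __name__ == "__main__":
    print(count_requests_per_ip(sample_log))
```

In a real cluster each block would be mapped on a different machine and the reduce step would run per key, but the shape of the computation is the same. On the five sample lines above, 10.0.0.1 ends up with a count of 3 and the other two addresses with 1 each.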
I say “theoretical” because in practice the setup and management of Hadoop is no easy task. It really is grunt work. Fortunately, there are some new and, more importantly, free technologies that make the initial cluster setup straightforward. If you use Ambari and Amazon’s EC2 as I did, you can literally get a Hadoop cluster with some slick monitoring up and running in an afternoon. Ambari is a new project in the Apache Software Foundation’s incubator, which aims to provide an open-source framework for the configuration, management, and monitoring of a Hadoop cluster. The first release of Ambari was 10 months ago, so you will not be surprised to hear that there is a catch to my claim that it is easy to set up.
Like any new piece of software, Ambari has a whole bunch of kinks that still need to be worked out, and you find them when you start deviating from the initial instructions. The project is just too young to have worked out all of the details of automating the management of a distributed system, and as a result it adds extra complexity when you want to do simple things like restart your cluster. The architecture behind Ambari is incredibly cool, as you discover when you become so frustrated with it that you end up reading about its design goals. It seems clear to me that the contributors to that project have a lot of hard work cut out for them, but then again so do I.
So really that’s no surprise. Looking at the Ambari project and our own in-house development, it has become clear to me that any large software project involves a large amount of high-effort, low-fun work that has to be done before anybody gets to play with the shiny new toy.
Nick Skelsey is a summer intern at Distil Networks. He is currently studying Computer Science and Politics at the University of Virginia. At Distil, Nick writes test cases against the bot detection system in addition to other suitable intern work.