Support Vector Machines and Hadoop: Theory vs. Practice

September 18, 2013, by Mike Freenor

Knowing theory is one thing.  Being able to apply that theory to Big Data is a different beast.  For a developer thoroughly used to the von Neumann model of computing, moving over to Hadoop (or any mapreduce framework) can feel a bit like facing Leviathan.  As with other big problems, however, Leviathan doesn’t look so big and bad when he’s been broken into pieces.  This week’s entry is about one particular technique we have adapted to deal with the large amounts of data we see at Distil Networks, and how it breaks nicely into smaller pieces in the mapreduce framework.

Support vector machines (SVMs; see [2]) form a class of supervised learning models, an alternative to neural networks for pattern recognition.  There are many reasons to prefer SVMs.  I personally prefer them because they are conceptually easy to understand: training an SVM amounts to solving a convex optimization problem.  A trained SVM is just a set of vectors, which makes it much easier to eyeball and understand than a trained neural network, which is essentially a black box.  SVMs can be applied to the same class of problems as neural networks, including character and image recognition.  At Distil Networks, we’re starting to use an SVM as one tool among many to detect bots based on their behavior and other statistical features.
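To make that concrete, here is a minimal sketch (using scikit-learn’s SVC purely for illustration; the toy data and parameters are made up for this example).  After training, the model is nothing more than a handful of the original training points plus their coefficients, which you can print and inspect directly.

```python
# Minimal illustration: a trained SVM is just a small set of training points
# (the support vectors) plus coefficients, easy to print and eyeball.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])  # toy features
y = np.array([0, 0, 1, 1])                                      # toy labels

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)  # the training points that define the boundary
print(clf.dual_coef_)        # their weights in the decision function
```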

One big downside to SVMs is their computational complexity, which grows non-linearly with the size of the data set used for training.  (Simply checking whether or not you have the solution to an SVM is O(n²), where n is the size of the training data set.)  With a massive dataset like Distil Networks’ web logs, it would take days (or worse) to train an SVM.  Developing, testing, and deploying such a system would clearly be prohibitive.

There’s a silver lining, however.  Training an SVM can easily be broken into parts (for full details, see [1] and [3]), making it a class of problem well-suited to mapreduce.  One can simply split the training data set into subsets and train an SVM on each of these subsets.  Each trained SVM yields a set of “support vectors”: the training points that lie closest to the decision boundary and actually determine it.  Once these support vectors have been found, the support vectors from every training subset can be pooled into a global list of support vectors.
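As a rough in-memory sketch of that first pass (with scikit-learn’s SVC standing in for whatever solver runs on each worker, and the simplifying assumption that every chunk contains examples of both classes):

```python
# Sketch of the first pass: split the training data into chunks, solve an SVM
# on each chunk, and pool the resulting support vectors into one global list.
import numpy as np
from sklearn.svm import SVC

def local_support_vectors(X_chunk, y_chunk):
    """Train an SVM on one chunk and return only its support vectors."""
    clf = SVC(kernel="linear", C=1.0).fit(X_chunk, y_chunk)
    return X_chunk[clf.support_], y_chunk[clf.support_]

def first_pass(X, y, n_chunks=4):
    """Map-step analogue: independent SVMs per chunk, then pool the results."""
    sv_X, sv_y = [], []
    for X_chunk, y_chunk in zip(np.array_split(X, n_chunks),
                                np.array_split(y, n_chunks)):
        Xs, ys = local_support_vectors(X_chunk, y_chunk)
        sv_X.append(Xs)
        sv_y.append(ys)
    return np.vstack(sv_X), np.concatenate(sv_y)  # the global list
```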

This initial global list is unlikely to be the optimal solution, however.  Luckily, getting to the optimal solution is a matter of repetition.  The original training set can be split into pieces again, this time with the global list of support vectors added to each training subset.  A new set of support vectors is derived from this pass, just as in the first one, and the process can be repeated as many times as needed.  Along the way, these support vectors provably converge to the global optimum, the same solution you would get by training on the full data set with no splitting or parallelism at all.  So without sacrificing correctness, the SVM can be trained on Big Data without prohibitive time costs, simply by adapting the algorithm to exploit mapreduce.
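Building on the first_pass sketch above, the refinement loop might look like the following.  The stopping rule used here, comparing the pooled support vectors between passes, is a simplified stand-in for a real convergence check.

```python
# Iterative refinement: re-split the raw data, append the current global
# support vectors to every chunk, retrain, and repeat until the pooled
# support vectors stop changing. Reuses first_pass() and
# local_support_vectors() from the previous snippet.
import numpy as np

def refine(X, y, n_chunks=4, max_iter=10):
    sv_X, sv_y = first_pass(X, y, n_chunks)
    for _ in range(max_iter):
        new_X, new_y = [], []
        for X_chunk, y_chunk in zip(np.array_split(X, n_chunks),
                                    np.array_split(y, n_chunks)):
            # Each new subset = its slice of the raw data + the global SVs.
            Xs, ys = local_support_vectors(np.vstack([X_chunk, sv_X]),
                                           np.concatenate([y_chunk, sv_y]))
            new_X.append(Xs)
            new_y.append(ys)
        new_X, new_y = np.vstack(new_X), np.concatenate(new_y)
        if set(map(tuple, new_X)) == set(map(tuple, sv_X)):
            break  # fixed point: the support vectors no longer change
        sv_X, sv_y = new_X, new_y
    return sv_X, sv_y
```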

This has two main advantages.  First, we get the obvious improvement from parallelism: since more computers can work concurrently on the problem, we see a speedup.  Second, because the cost of solving an SVM grows faster than linearly, working on subsets helps out of proportion to their size.  If the cost were roughly quadratic, for instance, each of k subsets of size n/k would cost only about 1/k² as much as the full problem, so even the total work summed across all subsets shrinks.  Divide and conquer makes the problem less intimidating, to be sure, and much more manageable.

References [1] and [3] below discuss this general scheme in full detail.  [3] describes a scheme in which the stages of splitting and recombining the training set form a binary tree, with splits being recombined with one another pairwise.  The entire process can be started over again many times (fed back into itself) until the solution converges.  Luckily for those of us working in mapreduce, the binary structure of the tree does not matter.  That is, the training data can be split, shuffled, and recombined with the global support vectors in a variety of ways, including the somewhat uncontrolled default behavior of Hadoop, without any black magic to route particular outputs to particular machines.

The simplest solution to get working is to use Hadoop to split the problem into pieces, solving an SVM in each map task.  The reduce phase then trains a second SVM, using whatever support vectors it received from the map tasks as training data.  Hadoop’s partitioning and shuffling will not produce a binary tree as in [3], but this does not affect the correctness or convergence properties of the algorithm.  It may, however, change the observed rate of convergence (these sorts of numbers are being investigated as we continue to augment our behavioral analytics platform at Distil).  Given that the SVM problem has a global optimum, checking for convergence is as simple as checking for a fixed point: training an SVM on the optimal set of support vectors returns that same set of support vectors.
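As a rough sketch of how this maps onto Hadoop Streaming, the script below acts as both mapper and reducer.  The CSV record format, the random bucketing key, and the use of scikit-learn are assumptions made for this example, not a description of our production job; a driver would append the reducer output (the pooled support vectors) back onto the raw training data and resubmit the job until those support vectors stop changing.

```python
# Hypothetical Hadoop Streaming sketch: each mapper trains an SVM on the CSV
# records ("label,f1,f2,...") it receives and emits only their support
# vectors; each reducer trains a second SVM on the support vectors it collects
# and writes them back out as CSV.
import random
import sys
import numpy as np
from sklearn.svm import SVC

def parse_csv(record):
    fields = record.split(",")
    return int(fields[0]), [float(v) for v in fields[1:]]

def to_csv(label, features):
    return "%d,%s" % (label, ",".join("%g" % v for v in features))

def support_vectors(X, y):
    """Solve a local SVM and keep only the support vectors."""
    clf = SVC(kernel="linear", C=1.0).fit(np.array(X), np.array(y))
    return [(y[i], X[i]) for i in clf.support_]

def mapper(stream):
    y, X = [], []
    for line in stream:
        label, feats = parse_csv(line.rstrip("\n"))
        y.append(label)
        X.append(feats)
    for label, feats in support_vectors(X, y):
        # A small random key spreads support vectors across reducers.
        print("%d\t%s" % (random.randrange(8), to_csv(label, feats)))

def reducer(stream):
    y, X = [], []
    for line in stream:
        _, record = line.rstrip("\n").split("\t", 1)
        label, feats = parse_csv(record)
        y.append(label)
        X.append(feats)
    for label, feats in support_vectors(X, y):
        print(to_csv(label, feats))

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```

Such a script would typically be submitted through the Hadoop Streaming jar, with the same file passed as both the mapper and the reducer command.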

When fighting a dragon, one needn’t charge straight away like St. George.  Unlike George, Hadoop programmers have the power to reduce the dragon into wings, scales, claws, and teeth first, none of which are quite so scary on their own.  All it takes is some research and investigation into how statistical algorithms can be split into independent sub-parts, each of which can be solved on its own terms.

Interested in the insights applied big data can give you on your company’s website? Sign up for a no-obligation free trial of Distil Networks today.

 

[1] F. Ozgur Catak, M. Erdal Balaban. “CloudSVM: Training an SVM Classifier in Cloud Computing Systems.”

[2] Corinna Cortes, Vladimir Vapnik. “Support-Vector Networks.”

[3] Zhanquan Sun, Geoffrey Fox. “Study on Parallel SVM Based on MapReduce.” 

About the Author

Mike Freenor

Mike Freenor is all about the numbers. As Senior Data Scientist at Distil Networks, he is responsible for designing and implementing a suite of statistical and behavioral analyses built in Hadoop mapreduce. Mike is currently wrapping up his doctoral dissertation at Carnegie Mellon University, a critique of modern macroeconomic methodology.
