Data Science Terms

October 2, 2014 William Cox

  

If you’re talking to someone who speaks a different language, it’s usually easy to spot – different words and different meanings. The tricky part comes when you’re using the same words but mean different things. This happens quite often on interdisciplinary teams. With the rise of the “data science” movement, your company doesn’t just have marketing, sales, support, and software development personnel; you may have also thrown social scientists, biologists, economists, or engineers into the mix on your data science team. It pays to clarify your terminology, and I mean that literally: I once spent the better part of two months talking past another team of electrical engineers because they were steeped in signal processing while I was coming from an applied engineering background.

The solution to the problem is, in theory, simple – define your terms. Since at Distil we use multiple tiers of automated bot detection algorithms, a few of my colleagues asked for definitions of some of the things we were talking about and building on the data science team. I decided it would be helpful to write down the definitions of some important data science terms. So, in no particular order, here is a glossary of words your data scientist may be saying and what they mean.

Classifier - A classifier can refer to a few things. In theoretical terms it covers a range of mathematical ways of separating data into groups, or “classes”, or of predicting an output number (height, weight, cost, etc.). These range in complexity, but the practical ones have ways of automatically learning how to predict new data given old data to learn from. The other use of the term is a piece of software that does the classification – actual code that takes in new data and spits out which group(s) it thinks that new data belongs to, or what output value it thinks is appropriate.
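
To make that concrete, here is a minimal sketch using scikit-learn (my choice of library for illustration; the numbers are completely made up):

```python
from sklearn.linear_model import LogisticRegression

# Toy "old data": each row is one example, and the label says which class it belongs to.
X_old = [[25, 1], [60, 0], [45, 1], [80, 1], [30, 0], [70, 0]]
y_old = [0, 1, 0, 1, 0, 1]

clf = LogisticRegression()   # the mathematical model
clf.fit(X_old, y_old)        # automatically learn from the old data

# The trained object is also the "piece of software" sense of the word:
# hand it new data and it spits out the class it thinks that data belongs to.
print(clf.predict([[55, 1]]))
```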

Features - These are the signals or inputs to a classifier. In most cases they are difficult to come by and require time and effort to perfect. For example, a classifier predicting whether a building will burn down in the next 10 years might have features such as the age of the building, the type of construction, and whether the occupants are smokers. Not all features are good; some are counter-productive and simply add complexity to the classifier where none is needed.
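
For the building example, the features might be arranged something like this (a sketch with invented numbers; each column becomes one input to the classifier):

```python
import numpy as np

# One row per building; each column is one feature.
# Columns: [age in years, construction type (0 = wood, 1 = brick), occupants smoke (0/1)]
features = np.array([
    [12, 1, 0],
    [85, 0, 1],
    [40, 1, 1],
])

# The matching "true outputs": 1 = burned down within 10 years, 0 = did not.
labels = np.array([0, 1, 0])
```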

Training - Just as a Jedi requires time and effort to perfect their skills, so too does a classifier need piles of data and time to automatically learn how to make new, similar predictions.

Testing - Classifier all trained? Feed it new data and see how it does.
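
In practice, training and testing together often look something like this sketch (scikit-learn again, with generated stand-in data rather than anything real):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generated data standing in for the "piles of data" a real classifier needs.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Hold some data back so testing happens on examples the classifier has never seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)          # training: learn from the old data
print(clf.score(X_test, y_test))   # testing: fraction it gets right on the new data
```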

Accuracy - A measure of how closely the classifier’s output matches the true output. This is complicated by the fact that “true outputs” are often very difficult to come by (think of building a classifier to tell if someone is lying). If someone uses this term, pay special attention and ask how the number was arrived at. Even in a simple case with True/False outcomes, accuracy is measuring how many true Trues were marked True and how many true Falses were marked False – complicated, right? Also, going from 99% to 99.9% accuracy is usually much, much more difficult than going from 70% to 80% accuracy.
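
Spelled out for the True/False case (with invented labels), accuracy is just the fraction of examples where the prediction agrees with the true output:

```python
from sklearn.metrics import accuracy_score

y_true = [True, True, False, False, False, True]   # the hard-to-get "true outputs"
y_pred = [True, False, False, False, True, True]   # what the classifier said

# (true Trues marked True + true Falses marked False) / total examples
print(accuracy_score(y_true, y_pred))   # 4 of the 6 agree, so 0.666...
```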

False Positive / False Negative - Sometimes the cost of an error is asymmetrical. Consider the example where your classifier predicts how long a bridge will last before failure. Saying a bridge will last 100 years when it only lasts 50 is a Very Bad Thing. Saying it’ll last 100 years when it will really last 200 isn’t so bad. Accuracy incorporates both sides of the error coin; false positive and false negative rates zero in on one side or the other.
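
Sticking with the toy True/False labels from the accuracy example, a confusion matrix breaks the errors apart so each side can be examined on its own:

```python
from sklearn.metrics import confusion_matrix

y_true = [True, True, False, False, False, True]
y_pred = [True, False, False, False, True, True]

# ravel() unpacks the 2x2 matrix into the four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("false positive rate:", fp / float(fp + tn))   # Falses wrongly marked True
print("false negative rate:", fn / float(fn + tp))   # Trues wrongly marked False
```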

Decision Threshold - Many classifiers produce some confidence measure along with their output. Deciding how confident you must be in order to use the output is a tricky business. Set the threshold too high and you will never make a decision; set it too low and you might as well be guessing. Accuracy is measured after a threshold has been chosen, and changing the threshold can change the accuracy wildly.
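
Most scikit-learn classifiers expose that confidence through `predict_proba`, and the threshold is just a number you compare it against. A sketch, again on generated data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
confidence = clf.predict_proba(X_test)[:, 1]   # confidence that each example is class 1

# The same classifier, three different thresholds, three different accuracies.
for threshold in (0.3, 0.5, 0.9):
    predictions = confidence >= threshold      # only call it class 1 when confident enough
    print(threshold, np.mean(predictions == y_test))
```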

AUC (Area Under the Curve) - This number ranges from 0.5 (no better than guessing) to 1 (perfect). The closer to 1, the better the classifier can perform. It’s a similar measure to accuracy, but it takes into account the delicate nature of selecting a decision threshold.
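
Computing it is a one-liner in scikit-learn; the inputs are the true outputs and the classifier’s confidence scores (both invented here):

```python
from sklearn.metrics import roc_auc_score

y_true     = [0, 0, 1, 1, 1, 0]                # true outputs (invented)
confidence = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]   # classifier's confidence in class 1

print(roc_auc_score(y_true, confidence))       # 0.5 is guessing, 1.0 is perfect separation
```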

ROC Curve - A chart used to show how well a classifier is performing. It shows the trade-off between the false positive rate (FPR) and the true positive rate (TPR) as the decision threshold is varied. The area under this curve is the AUC described above. A chart like this can be used to select an appropriate decision threshold.
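
Drawing one takes the same inputs as the AUC example; each point on the curve is the (FPR, TPR) pair you would get at one particular decision threshold (matplotlib is my choice for the plotting here):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true     = [0, 0, 1, 1, 1, 0]
confidence = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

# One (FPR, TPR) point per candidate decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, confidence)

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```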

Kablorsky - This is not a word. If your data scientist says this, you should reconsider their employment.

About the Author

William Cox

William Cox, Distil's Data Scientist, is a former electrical engineer turned data scientist. He finds the term “data janitor” a bit more realistic. He formerly worked in the defense industry, building systems to track and destroy torpedoes, classify radar signatures, and communicate underwater using lasers. He now uses his not-so-super powers to detect and track bad bots using machine learning and predictive modeling techniques.
