Can Comic Sans Detect Cyber Attacks?

September 30, 2016 Tim Hopper

Evaluating Fingerprints with Information Theory

TL;DR

Information theory, the mathematical theory developed to understand the limits of data compression and data transmission, proved to be a powerful tool for understanding how well the fonts installed on a computer distinguish that computer from others. Using information theory, we evaluated a list of 495 fonts installed on 1.2 million machines. In our sample, we found that 95% of the identifying power of fonts is contained in the top 50 fonts.

Introduction

Part of our Bot Protection product is a proprietary method for distinguishing clients accessing websites. This client identification goes far beyond detecting IP addresses; we investigate deep properties of the users’ browsers and the structure of their HTTP requests.

As browsers and devices evolve, we continue to develop new methods of device identification. Over the last few months, we conducted a study of how well the fonts installed on a client’s machine can help identify that client. When a browser executes JavaScript, it is able to introspect whether each font in a given list is installed on the machine.

Data Collection

We ran an experiment on a sample of our customers where we inspected 495 different fonts on each device making HTTP requests. Here’s a toy example showing the kind of data we collected:

Unique ID   Bangers  Comic Sans  Times New Roman
8189-4F5D   False    True        True
EF4F-4925   False    False       False
2421-4DA4   True     False       False
7A76-4ACC   True     False       True
4BC0-4482   True     True        True
0C65-4C82   True     False       True
ED94-4377   False    False       False
E4DD-4E22   True     False       False

We wanted to answer two questions about this data. First, how much “information” do the fonts capture about the Unique ID? Put another way, how well can we distinguish users simply by knowing which fonts they have installed? Second, if we could only collect a subset of the fonts, which subset would be best at distinguishing users?

Information Metric 1: Distinct Rows

Initially, we naively measured the information contained in the font data by computing the ratio of distinct font sets to the total number of rows. In the example above, we have five distinct font sets:

Bangers  Comic Sans  Times New Roman
False    True        True
False    False       False
True     False       False
True     False       True
True     True        True
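
In pandas terms, this naive metric is just the share of distinct rows. A minimal sketch, assuming the toy table above is loaded into a DataFrame named df (an illustrative name):

```python
import pandas as pd

# Toy table from above: one row per Unique ID, one boolean column per font.
df = pd.DataFrame({
    "Bangers":         [False, False, True, True, True, True, False, True],
    "Comic Sans":      [True,  False, False, False, True, False, False, False],
    "Times New Roman": [True,  False, False, True,  True, True,  False, False],
})

# Naive information metric: distinct font sets divided by total rows.
len(df.drop_duplicates()) / len(df)   # 5 / 8 = 0.625
```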

However, counting distinct rows as a metric for information content proves inadequate. Consider this example using the first five rows and first two fonts from above:

Unique ID   Bangers  Comic Sans
8189-4F5D   False    True
EF4F-4925   False    False
2421-4DA4   True     False
7A76-4ACC   True     False
4BC0-4482   True     True

The Bangers column contains 2 distinct values, as does the Comic Sans column. The two columns together contain 4 distinct rows. That is, the number of distinct rows when using both columns equals the sum of the distinct counts of the two columns taken independently. Because the information appears to be additive, we might intuitively conclude that the columns contain independent, non-redundant information. That turns out not to be true.

If we know a client has Bangers, there is a 67% chance (versus 60% unconditionally) that the client doesn’t have Comic Sans (see this by looking at the three rows where Bangers = True). In other words, knowing that a client has one of the fonts gives us (probabilistic) information about whether or not it has the other. Because the fonts encode partially redundant information about the clients, we would want a measure for which the information contained in both is less than the sum of the parts.

Information Metric 2: Shannon Entropy

The mathematical field of Information Theory provides more robust tools for evaluating the information content of data. A mathematical function called Shannon entropy measures how many bits would be required to optimally encode a dataset; this can also be interpreted as the number of bits of information in data. Shannon entropy also maintains the property that information only adds when the components contain independent information.

To compute Shannon entropy, we need to count the number of times each font set occurs.

Bangers  Comic Sans  Times New Roman  # Occurrences (cᵢ)
False    True        True             1
True     True        True             1
False    False       False            2
True     False       False            2
True     False       True             2

Given this, Shannon entropy is defined as:

H = −Σᵢ (cᵢ / N) · log₂(cᵢ / N), where N = Σᵢ cᵢ is the total number of rows

Here’s the same computation in Python (assuming we have our original table in a Pandas dataframe):
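
A minimal sketch of that computation, reusing the toy df from above (the shannon_entropy helper is an illustrative name, not our production code):

```python
import numpy as np

def shannon_entropy(frame, columns):
    """Shannon entropy (in bits) of the joint distribution of the given columns."""
    counts = frame.groupby(list(columns)).size()    # occurrences c_i of each distinct row
    probabilities = counts / counts.sum()           # empirical probability of each distinct row
    return -(probabilities * np.log2(probabilities)).sum()

shannon_entropy(df, ["Bangers", "Comic Sans", "Times New Roman"])   # 2.25
```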

In our example, the entropy is:

H = −((1/8)·log₂(1/8) + (1/8)·log₂(1/8) + (2/8)·log₂(2/8) + (2/8)·log₂(2/8) + (2/8)·log₂(2/8)) = 2.25 bits

We can interpret this as saying our fonts encode just over 2 bits of information. Because we have eight observations, it would require log₂(8) = 3 bits to uniquely identify each of the users. (To see this, note that a set of 3 items has exactly 2³ = 8 subsets, so 3 binary fonts can take at most 8 distinct values.) Observing that the entropy of our example (2.25) is less than 3, we can conclude that these three fonts don’t encode all the information present in the Unique ID.

Properties of Shannon Entropy

This concept of entropy becomes especially powerful when we realize we can compute the entropy of an arbitrary number of font columns. Better still, we can compare entropy values computed with different numbers of columns to one another. The entropy of the Bangers column is 0.95. The entropy of the Bangers column together with Comic Sans is 1.75. By adding Comic Sans we almost (but not quite) double the amount of information.

The Shannon entropy function has other properties that make sense in the context of this problem.

  • If two fonts provide no information about one another, entropy becomes additive: shannon_entropy(Font 1, Font 2) = shannon_entropy(Font 1) + shannon_entropy(Font 2) if Font 1 and Font 2 are mutually independent.
  • shannon_entropy(Font 1)= 0 if Font 1 is the same for every Unique ID.
  • shannon_entropy(Font 1, Font 2, Font 3, ...)=log2(# rows) if each row is distinct for every Unique ID.
  • shannon_entropy(Font 1)=1 if Font 1=True for half the clients, and False for the other half. (In other words, an evenly distributed binary column contains exactly 1 bit of information.)
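
As a quick sanity check, these properties can be verified numerically on the toy table with the illustrative shannon_entropy sketch from above:

```python
# An evenly split binary column carries exactly 1 bit.
shannon_entropy(df, ["Times New Roman"])                                 # 1.0

# A constant column carries 0 bits (using a made-up, always-installed font).
constant = pd.DataFrame({"Arial": [True] * len(df)})
shannon_entropy(constant, ["Arial"])                                     # 0.0

# Dependent fonts are sub-additive: the joint entropy falls short of the sum of the parts.
shannon_entropy(df, ["Bangers"]) + shannon_entropy(df, ["Comic Sans"])   # 0.95 + 0.81 ≈ 1.77
shannon_entropy(df, ["Bangers", "Comic Sans"])                           # 1.75
```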

The idea of entropy also generalizes to non-binary features, and even in the presence of non-binary features it still describes the number of bits (binary features) required to optimally encode our data in binary form.

For example, suppose we also collect information about which browser is making HTTP requests.

Unique ID   Bangers  Comic Sans  Times New Roman  Browser
8189-4F5D   False    True        True             Chrome
EF4F-4925   False    False       False            Firefox
2421-4DA4   True     False       False            Firefox
7A76-4ACC   True     False       True             Chrome
4BC0-4482   True     True        True             Safari
0C65-4C82   True     False       True             Safari
ED94-4377   False    False       False            Firefox
E4DD-4E22   True     False       False            Safari

We again count how many times each distinct row occurs:

Times New Roman  Bangers  Comic Sans  Browser  # Occurrences (cᵢ)
0                0        0           Firefox  2
0                1        0           Firefox  1
0                1        0           Safari   1
1                0        1           Chrome   1
1                1        0           Chrome   1
1                1        0           Safari   1
1                1        1           Safari   1

We can use these counts in the entropy formula to get an entropy of 2.75. That is, these 4 features (3 fonts and the browser) encode just under 3 bits of information. If all the counts had been 1, the entropy would have been 3; in that case, the 4 features would uniquely identify the 8 users (3 bits = 2³ = 8 users). If we consider the Browser column on its own, we see that it has an entropy of 1.56. That is, it encodes more than 1 bit of information. (A single binary column can never contain more than 1 bit of information.)
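
Because the groupby-based sketch above counts distinct rows rather than assuming binary columns, it handles the categorical Browser column without modification (the hard-coded browser values simply mirror the table above):

```python
# Attach the Browser column from the toy example (same row order as the table).
df["Browser"] = ["Chrome", "Firefox", "Firefox", "Chrome",
                 "Safari", "Safari", "Firefox", "Safari"]

shannon_entropy(df, ["Browser"])                                              # ≈ 1.56
shannon_entropy(df, ["Bangers", "Comic Sans", "Times New Roman", "Browser"])  # 2.75
```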

Font Selection Heuristic

We now return to the second part of our original problem. We had data on 495 fonts from a large number of clients. Because there is a fixed cost (in time) for detecting each font on a client, we want to collect no more than 50 fonts in the future.

Shannon entropy provides a way of scoring the possible subsets of 50 fonts. Unfortunately, there are about 1.4 × 10⁶⁹ possible ways to choose 50 fonts from a set of 495! Given the combinatorial complexity of this problem and the time constraints on our research, we devised a heuristic for selecting the 50 fonts.

We started by computing the individual entropy of all 495 fonts. Because the font columns are binary valued, the highest entropy font is the one closest to having an equal number of Trues and Falses. (Remember, if a font is always True or always False, its entropy is 0: it provides no information towards identifying a user.) Once we had selected our first font, we selected the remaining 49 by an iterative process: at each step, add the font that provides the largest marginal increase in the entropy of all fonts selected thus far.

Here’s Python code for selecting the best column and for generating the best subset:
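
A minimal greedy sketch, reusing the illustrative shannon_entropy helper from above (the names best_next_column and best_subset are hypothetical, not our production code):

```python
def best_next_column(frame, selected, candidates):
    """Pick the candidate column that maximizes joint entropy with the columns already selected."""
    return max(candidates, key=lambda col: shannon_entropy(frame, selected + [col]))

def best_subset(frame, n):
    """Greedily grow a subset of n columns, adding the best next column at each step."""
    selected, candidates = [], list(frame.columns)
    for _ in range(n):
        col = best_next_column(frame, selected, candidates)
        selected.append(col)
        candidates.remove(col)
    return selected
```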

We can use this algorithm to select two columns in our toy example. Here is the entropy of the individual columns:

Font             Entropy
Bangers          0.95
Comic Sans       0.81
Times New Roman  1.0

Thus, we will select Times New Roman as our first font.

For our second iteration, we look at the entropy of Times New Roman paired with each of the other two columns.

Fonts                        Entropy
Times New Roman, Bangers     1.91
Times New Roman, Comic Sans  1.5

Because Bangers maximizes the entropy now, we will select Times New Roman and Bangers as our two-font subset. (Recall that the entropy of all 3 fonts is 2.25; we get about 85% of the information with two of the three fonts.)
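
Running the greedy sketch above on the three font columns of the toy df reproduces this choice:

```python
best_subset(df[["Bangers", "Comic Sans", "Times New Roman"]], 2)
# ['Times New Roman', 'Bangers']
```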

Real Data

In our experiment, we collected information about the 495 fonts on about 1.2 million devices. 495 bits could theoretically identify 2⁴⁹⁵ (roughly 10¹⁴⁹) distinct devices, far more than will ever access a single network! The maximum possible entropy of 1.2 million data points is log₂(1.2 × 10⁶) ≈ 20; in other words, it would require about 20 optimally designed bits/fonts to uniquely identify these clients. Sadly, fonts aren’t installed on machines to serve as an optimally encoded identifier, and the entropy of our dataset is about 7.7.

Using our heuristic, the best 50 fonts have an entropy of 7.2. That is, 10% of the fonts contain about 95% of the information. The graph below shows how the entropy grows from the first font selected by our heuristic (which contains about 1 bit of information) to all 50 fonts (containing 7.2 bits of information).

Conclusion

The foundations for all of this were laid by Claude Shannon’s seminal paper A Mathematical Theory of Communication, published in 1948. According to David MacKay, Shannon “both created the field of information theory and solved most of its fundamental problems” in this paper. He likewise laid much of the theoretical groundwork for the digital communication networks we have today.

Nearly 70 years later, Shannon’s same framework proves valuable for measuring the information content of clients on our global information network. We were able to use an information theoretic model to measure how much information a collection of fonts contains about a client. Given our sample of 495 fonts on 1.2 million clients, we found that 95% of the identifying power of the fonts is contained in the 50 top fonts.

About the Author

Tim Hopper

Tim has been with Distil since October 2015. He has a masters degree in operations research and a bachelors degree in mathematics and has worked at several tech startups. He loves to use advanced mathematics and statistics to solve real world problems. When not sitting in front of a computer, he enjoys exploring the outdoors through hiking and running.
