Evaluating Fingerprints with Information Theory
The mathematical theory of information theory, developed to understand the limits of data compression and data transmission, proved to be a powerful tool for understanding how well the fonts installed on a computer distinguish that computer from others. Using Information Theory, evaluated a list of 495 fonts installed on 1.2 million machines. In our sample, we found that 95% of the identifying power of fonts is contained in the top 50 fonts.
Part of our Bot Protection product is a proprietary method for distinguishing clients accessing websites. This client identification goes far beyond detecting IP addresses; we investigate deep properties of the users’ browsers and the structure of their HTTP requests.
We ran an experiment on a sample of our customers where we inspected 495 different fonts on a each device making HTTP requests. Here’s a toy example showing the kind of data we collected:
|Unique ID||Bangers||Comic Sans||Times New Roman|
We wanted to answer a couple of questions about this data. First, how much “information” do the fonts capture about the Unique ID; put another way, how well can we distinguish users simply by knowing which fonts they have installed. Second, if we were only to select a subset of fonts to collect, which subset is the best at distinguishing users.
Information Metric 1: Distinct Rows
Initially, we naively measured the information contained in the font data by computing the ratio of unique font sets divided by the total number of font sets. In the example above, we have five distinct font sets:
|Unique ID||Bangers||Comic Sans||Times New Roman|
However, counting distinct rows as a metric for information content proves inadequate. Consider this example using the first five rows and first two fonts from above:
|Unique ID||Bangers||Comic Sans|
The Bangers column contains 2 distinct values as does the Comic Sans column. The two columns together contain 4 distinct rows. That is, the number of distinct rows when using both columns is the sum of the two columns independently. Because the information is additive, we intuitively might conclude that the columns contain independent, non-redundant information. That does not prove to be true.
If we know a client has Bangers, there is a 67% chance (greater than random) that the client doesn’t have Comic Sans (see this by looking at the three rows where Bangers=True). In other words, knowing that a client has one of the fonts provides us with (probabilistic) information about whether or not it has the other. Because the fonts encode partially redundant information about the clients, we would hope our measure of the information contained in both to be less than the sum of the parts.
Information Metric 2: Shannon Entropy
The mathematical field of Information Theory provides more robust tools for evaluating the information content of data. A mathematical function called Shannon entropy measures how many bits would be required to optimally encode a dataset; this can also be interpreted as the number of bits of information in data. Shannon entropy also maintains the property that information only adds when the components contain independent information.
To compute Shannon entropy, we need to count the number of times each font set occurs.
|Bangers||Comic Sans||Times New Roman||# Occurrences (ci)|
Given this, Shannon entropy is defined as:
Here’s the same computation in Python (assuming we have our original table in a Pandas dataframe):
In our example, the entropy is:
We can interpret this as saying our fonts encode just over 2 bits of information. Because we have eight observations, it would require 2(23)=3 bits to identify each of the users. (To see this, note that there are exactly 8 subsets to make out of a set of 3 items.) Observing that the entropy of our example (2.25) is less than 3, we can conclude that these three fonts don’t encode all the information present in the Unique ID.
Properties of Shannon Entropy
This concept of entropy becomes especially powerful when we realize we can compute the entropy of an arbitrary number of font columns. Better still, we can compare entropy values computed with different number of columns to one another. The entropy of the Bangers column is 0.95. The entropy of the Bangers column together with Comic Sans is 1.75. By adding Comic Sans we almost (but not quite) double the amount of information.
The Shannon entropy function has other properties that make sense in the context of this problem.
- If two fonts provide no information about one another, entropy becomes additive: shannon_entropy(Font 1, Font 2) = shannon_entropy(Font 1) + shannon_entropy(Font 2) if Font 1 and Font 2 are mutually independent.
- shannon_entropy(Font 1)= 0 if Font 1 is the same for every Unique ID.
- shannon_entropy(Font 1, Font 2, Font 3, ...)=log2(# rows) if each row is distinct for every Unique ID.
- shannon_entropy(Font 1)=1 if Font 1=True for half the clients, and False for the other half. (In other words, an evenly distributed binary column contains exactly 1 bit of information.)
The idea of entropy will also generalize to non-binary features, and even in the presence of non-binary features, it still describes the number of bits (binary features) required to optimally encode our data in binary form.
For example, suppose we also collect information about which browser is making HTTP requests.
|Unique ID||Bangers||Comic Sans||Times New Roman||Browser|
We again count how many times each distinct row occurs:
|Times New Roman||Bangers||Comic Sans||Browser||# Occurrences (ci)|
We can use these counts in the entropy formula to get an entropy of 2.75. That is, these 4 features (3 fonts and the browser) encode just under 3 bits of information. If all the counts had been 1, the entropy would have been 3; in that case, 4 features uniquely identify the 8 users (3 bits = 23 users). If we consider the Browser column on it’s own, we see that it has an entropy of 1.56. That is, it encodes more than 1 bit of information. (A single binary column can never contain 1 bit of information).
Font Selection Heuristic
We now return to the second part of our original problem. We had data on 495 fonts from a large number of clients. Because there is a fixed cost (in time) for detecting each font on a client, we want to collect no more than 50 fonts in the future.
Shannon entropy provides a way scoring the possible subsets of 50 fonts. Unfortunately, there are 1.41069possible ways to chose 50 fonts from a set of 495 fonts! Given the combinatorial complexity of this problem and time constraints our research, we devised a heuristic for selecting the 50 fonts.
We started by computing the individual entropy of all 495 fonts. Because the font columns are binary valued, the highest entropy font would be that which is closest to having an equal number of Trues and Falses. (Remember, if a font is always True or always False, its entropy is 0: it provides 0 information towards identifying a user.) Once we have selected our first font, we selected the remaining 49 by an iterative process. At each step, add the font that provides the largest marginal increase to the entropy of all fonts selected thus far.
Here’s Python code for selecting the best column and for generating the best subset:
We can use this algorithm to select two columns in our toy example. Here is the entropy of the individual columns:
|Times New Roman||1.0|
Thus, we will select Times New Roman as our first font.
For our second iteration, we look at the entropy of Times New Roman plus the other two columns.
|Times New Roman, Bangers||1.91|
|Times New Roman, Comic Sans||1.5|
Because Bangers maximizes the entropy now, we will select Times New Roman and Bangers as our two font subset. (Recall that the entropy of all 3 fonts is 2.25; we get about 85% of the information with 66% of the fonts.)
In our experiment, we collected information about the 495 fonts on about 1.2 million devices. 495 bits could theoretically uniquely identify 2245(56 trevigintillion) distinct devices, more than will ever access a single network! The maximum possible entropy of 1.2 million data points is 2(1.2106)20; in other words, it would require about 20 optimally designed bits/fonts to identify these clients. Sadly, fonts aren’t installed on machines to be an optimally encoded identifier, and the entropy of our dataset is about 7.7.
Using our heuristic, the best 50 fonts have an entropy of 7.2. That is, 10% fonts contain about 95% of the information. The graph below shows how the entropy grows from the first font selected by our heuristic (which contains about 1 bit of information) to using all 50 fonts (containing 7.2 bits of information).
The foundations for all of this was laid by Claude Shannon’s seminal paper A Mathematical Theory of Communication published in 1948. According to David MacKay, Shannon “both created the field of information theory and solved most of its fundamental problems” in this paper. He likewise laid much of the theoretical groundwork for the digital communication networks we have today.
Nearly 70 years later, Shannon’s same framework proves valuable for measuring the information content of clients on our global information network. We were able to use an information theoretic model to measure how much information a collection of fonts contains about a client. Given our sample of 495 fonts on 1.2 million clients, we found that 95% of the identifying power of the fonts is contained in the 50 top fonts.
About the Author
Tim has been with Distil since October 2015. He has a masters degree in operations research and a bachelors degree in mathematics and has worked at several tech startups. He loves to use advanced mathematics and statistics to solve real world problems. When not sitting in front of a computer, he enjoys exploring the outdoors through hiking and running.More Content by Tim Hopper