2018-07-04 // Andreas Dewes

Differential privacy is a concept that has become quite popular in recent years, especially since Apple announced that it uses it for anonymizing user data. Differential privacy allows us to measure the quality of data anonymization by quantifying how much information a given anonymization method "leaks" about an individual whose data is added to a dataset using that method. More formally, the mathematical definition is given as

$$ \begin{equation} \mathrm{Pr}[\mathcal{K}(D_1) \in S] \le \exp(\epsilon) \times \mathrm{Pr}[\mathcal{K}(D_2) \in S] \label{eq:dp} \end{equation} $$

Here, $\mathrm{Pr}[\mathcal{K}(D_1) \in S]$ is the probability that a randomized function $\mathcal{K}$ yields one of the values in the set $S$ when evaluated on a given dataset $D_1$. The right side is identical to the left except that the function is now evaluated on a dataset $D_2$ that differs from $D_1$ in at most one element. And finally, $\epsilon$ is a parameter that describes how much information is leaked (or generated) by the function.
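To make the inequality a bit more tangible, here is a minimal Python sketch (the function name and structure are my own, chosen only for illustration) that checks whether a pair of output probabilities for a single event $S$ satisfies the bound in both directions, which the definition requires since $D_1$ and $D_2$ can be swapped:

```python
import math

def satisfies_dp_bound(p1, p2, epsilon):
    """Check whether Pr[K(D1) in S] <= exp(epsilon) * Pr[K(D2) in S]
    holds in both directions for a single event S, where p1 and p2 are
    the probabilities of that event under D1 and D2 respectively."""
    bound = math.exp(epsilon)
    return p1 <= bound * p2 and p2 <= bound * p1

# Example: output probabilities 1.0 and 0.5 satisfy the bound only for epsilon >= ln(2)
print(satisfies_dp_bound(1.0, 0.5, math.log(2)))  # True
print(satisfies_dp_bound(1.0, 0.5, 0.5))          # False, since 0.5 < ln(2)
```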

Now all this sounds awfully complicated, so let's have a look at a very simple example that illustrates the different components: Let's assume we want to generate an anonymous dataset in which we record whether a person has a given disease or not. Each entry in our database will correspond to one person and will contain either $yes$ or $no$, depending on whether that person has the given disease or not. Needless to say, we want to ensure the anonymity of the people in our database. This sounds simple at first, because if we have a large number of entries in the database it will be quite difficult for an adversary to figure out anything about a given person, even if he/she knows that this person must be in the dataset.

However, what would happen if the adversary gained access to the database before and after we add a given person? If we added the information about that person unaltered, it would be enough to compare the number of $yes$ and $no$ entries in the database before and after inserting that person's information in order to learn whether that person has the disease or not. Clearly that's not good. So in order to protect the privacy of a person when adding his/her information to the database, we're going to add some statistical noise to it. We do this using the following mechanism: For each person, we randomly do one of the following: With a given probability $p$ (e.g. 50 %), we insert the true value of that person's data. With probability $1-p$, we insert a random value instead.

Now, an adversary can no longer be certain whether a person really has the disease, because it might be that a random value was inserted into the database for that person. This is much better than before, as we have made it more difficult for an adversary to extract information about a person from our dataset. Of course, by adding noise we've also made it more difficult for ourselves to extract meaningful information from our dataset. So how should we decide how much noise to add, and how can differential privacy help us to quantify the information leakage? Let's see!
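As a small illustration of the mechanism just described, here is a minimal Python sketch (the function name is mine, and the database is modelled as a plain list of strings):

```python
import random

def noisy_insert(database, true_value, p=0.5):
    """Insert a person's attribute value into the database:
    with probability p we store the true value ('yes' or 'no'),
    with probability 1 - p we store a random value instead."""
    if random.random() < p:
        database.append(true_value)
    else:
        database.append(random.choice(['yes', 'no']))

db = []
noisy_insert(db, 'yes', p=0.5)
print(db)  # either ['yes'] or ['no']; the adversary can't tell which case occurred
```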

First of all, we need to map our problem above to the terminology of equation ($\ref{eq:dp}$). Let's start with the function $\mathcal{K}$: In the scenario above, an adversary would compare the number of people with a given attribute value (e.g. $yes$) before and after inserting a new value (and adding noise to it). We therefore choose $\mathcal{K_v}(D)$ as the function that returns the number of people in $D$ with an attribute value equal to $v \in [yes,no]$.

So let's assume that $\mathcal{K_{yes}}(D_1) = n$, i.e. there are $n$ people in database $D_1$ with attribute value $yes$. If we add another person to the dataset, producing a new database version $D_2$, the new value of our function will be either $n$ or $n+1$: $\mathcal{K_{yes}}(D_2) \in [n,n+1]$, depending on the attribute value of the added person and our randomization algorithm.
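In code, the counting query $\mathcal{K_v}$ is just a filter-and-count (a sketch, with the database again modelled as a list of strings):

```python
def K(D, v):
    """Counting query: the number of entries in D with attribute value v."""
    return sum(1 for x in D if x == v)

D1 = ['yes', 'no', 'yes']
print(K(D1, 'yes'))  # n = 2; after adding one more person, K(D2, 'yes') will be n or n + 1
```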

Now, how can we use this to argue about privacy? Let's look at the two extreme cases:

  • If we do not add any noise at all when adding a person to the database, the value of $\mathcal{K_v}(D_2)$ will be fully determined by the attribute value of the person we add. Therefore, if the person has the attribute value $yes$, $\mathcal{K_{yes}}(D_2) = \mathcal{K_{yes}}(D_1)+1$ and $\mathcal{K_{no}}(D_2) = \mathcal{K_{no}}(D_1)$. Calculating the probabilities from equation ($\ref{eq:dp}$) for $S = [n]$ then yields $\mathrm{Pr}[\mathcal{K_{yes}}(D_1) \in [n]] = 1$ and $\mathrm{Pr}[\mathcal{K_{yes}}(D_2) \in [n]] = 0$. Therefore no finite $\epsilon$ exists for which $$ \begin{equation} \mathrm{Pr}[\mathcal{K_{yes}}(D_1) \in [n]] \le \exp(\epsilon) \times \mathrm{Pr}[\mathcal{K_{yes}}(D_2) \in [n]] \end{equation} $$ which means that our (non-random) function $\mathcal{K}$ is not $\epsilon$-differentially private. This is of course what we expect, since we didn't add any noise and an adversary with access to the two database versions $D_1$ and $D_2$ can learn the attribute value of the added person with absolute certainty.
  • If on the other hand we always return the same value $m$ for $\mathcal{K_{yes}}(D)$, then
    $$ \begin{equation} \mathrm{Pr}[\mathcal{K_{yes}}(D_1) \in [m]] = \mathrm{Pr}[\mathcal{K_{yes}}(D_2) \in [m]] = 1 \end{equation} $$ and we find that $\epsilon = 0$, which is what we'd expect since the value $m$ doesn't reveal any information about the added person's attribute. Of course, it would also be entirely useless in practice, as it wouldn't release any information to legitimate database users either.
  • A variation of the former method is to always add $1$ to the value of $\mathcal{K_{yes}}$ when adding a person's data to the database. Surprisingly (or not), this also yields an infinitely large $\epsilon$, as an adversary would still know with certainty that a person was added to the database, even though he would not be able to learn anything about that person's attribute value. This is important to keep in mind and is actually a very neat property of differential privacy: it treats any information that an adversary can gain from a query function $\mathcal{K}$ as a privacy breach, regardless of whether we would consider it a breach by our own definition. Even learning that a person was added to a database can be valuable information to an adversary, something that is often not properly considered in risk models but is easily taken into account when using differential privacy.
  • Finally, another interesting extreme case consists of picking a random attribute value with a uniform distribution (i.e. the result of a fair coin flip) when adding a person to the database. In this scenario, the count $\mathcal{K_{yes}}(D_2)$ will either stay at $n$ or increase to $n+1$, each with probability 50 %, regardless of the added person's true attribute value. For this case we obtain $\epsilon = \ln{2} \approx 0.693$, as $\mathrm{Pr}[\mathcal{K_{yes}}(D_2) \in [n]] = 0.5$ and $$ \begin{equation} \mathrm{Pr}[\mathcal{K_{yes}}(D_1) \in [n]] = 1 = \exp{(\ln{2})}\cdot \mathrm{Pr}[\mathcal{K_{yes}}(D_2) \in [n]], \end{equation} $$ so the bound from equation ($\ref{eq:dp}$) holds with equality. Again, this might seem surprising at first since we just add random values to the database when adding a person, but just as before an adversary can still learn that a person was added to the database, though no longer with absolute certainty (see the short simulation below).
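To double-check the last point numerically, here is a short Monte Carlo sketch of the fair-coin case (the code and its naming are mine, not part of any DP library): we repeatedly "add" a person whose recorded value is a fair coin flip and estimate the probability that the $yes$ count stays at $n$.

```python
import math
import random

def prob_count_unchanged(trials=100_000):
    """Estimate Pr[K_yes(D2) = n] for the fair-coin mechanism:
    the yes-count stays unchanged exactly when the coin records a 'no'."""
    unchanged = sum(1 for _ in range(trials) if random.random() < 0.5)
    return unchanged / trials

p_unchanged = prob_count_unchanged()
# Pr[K_yes(D1) = n] = 1, so the smallest epsilon satisfying the bound is ln(1 / p_unchanged)
print(p_unchanged)                # ~0.5
print(math.log(1 / p_unchanged))  # ~0.693, i.e. ln(2)
```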

In the following sections, we'll investigate two schemes for building up our differentially private database:

  • In scheme $I$, we will draw a sample from a random boolean process that returns 0 with probability $p$ ($P(\hat{x}=0)=p$) and 1 with probability $1-p$. If we draw a 1, we add the person's attribute value to the database. If we draw a 0, we don't include that person in our dataset.
  • In scheme $II$, we will again draw a sample from the same random process. If we draw a 1, we will add the person's real attribute value to the database. If we draw a 0, we will draw a number from a second random boolean process with $P(\hat{x}=0) = k$ and add that number as the attribute value to the database instead, i.e. we add a $no$ with probability $k$ and a $yes$ with probability $1-k$.

Now, these schemes look very similar on the surface, but they have different implications, as we will see. In both schemes, the probability that we add the person's true attribute value to the database is equal to $1-p$. However, scheme $I$ leaks more information about a person, as it will only add a $yes$ to the database if that really is the person's attribute value. Hence, if an adversary observes that a $yes$ was added to the database between two versions, he knows with certainty that this must be the true attribute value of that person. In scheme $II$, there is a probability of $p\cdot (1-k)$ that a $yes$ will be added to the database even if the person's attribute value is $no$, so it provides plausible deniability.
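Here is a minimal Python sketch of both insertion procedures as described above (the function names and the list-based database are my own choices for illustration):

```python
import random

def insert_scheme_I(database, true_value, p):
    """Scheme I: with probability p ("drawing a 0") the person is skipped entirely,
    with probability 1 - p ("drawing a 1") the true value is stored."""
    if random.random() >= p:       # corresponds to drawing a 1
        database.append(true_value)

def insert_scheme_II(database, true_value, p, k):
    """Scheme II: with probability 1 - p the true value is stored; with probability p
    a value drawn from the second process is stored instead, i.e. a 'no' with
    probability k and a 'yes' with probability 1 - k."""
    if random.random() >= p:       # corresponds to drawing a 1
        database.append(true_value)
    else:                          # corresponds to drawing a 0
        database.append('no' if random.random() < k else 'yes')

db1, db2 = [], []
insert_scheme_I(db1, 'yes', p=0.5)
insert_scheme_II(db2, 'yes', p=0.5, k=0.5)
print(db1, db2)  # db1 is [] or ['yes']; db2 always contains exactly one entry
```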

Let's see how these two processes evaluate under differential privacy: $\mathcal{K_{yes}}$ will again return the number of $yes$ values in our database. Between two versions $D_1$ and $D_2$, the probability that $\mathcal{K_{yes}}(D_1) \in [n]$ and $\mathcal{K_{yes}}(D_2) \in [n]$, given that the added person's attribute value is $yes$, is equal to $p$ for scheme $I$; hence this scheme is $\epsilon = -\ln{p}$ differentially private. For scheme $II$, the same probability is equal to $p+(1-p)\cdot k$, which yields $\epsilon = -\ln{(p+(1-p)\cdot k)}$. As you can see, including the randomization of the added attribute values in scheme $II$ reduces the $\epsilon$. The "price" that we pay for this improved security is a reduction of usable information, though. To see how much information is conveyed by each scheme, we can use simple Bayesian reasoning:
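Plugging some example numbers into the two expressions above gives a feel for the difference (the parameter values are arbitrary, chosen only for illustration):

```python
import math

p, k = 0.5, 0.5                      # example parameters
eps_I = -math.log(p)                 # scheme I
eps_II = -math.log(p + (1 - p) * k)  # scheme II, using the expression above
print(eps_I, eps_II)                 # ~0.693 vs ~0.288
```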

Let's say we're interested in the probability that the attribute of a person is $yes$ given that we observed the state of the database before and after the person's data was added. Now, the total count in the database can change by at most 1 when adding the data. We write this change as $\Delta_{I,II} \in [0,1]$, where the index indicates the scheme that was used to anonymize the data. The probability that the attribute value $x_i$ of a person is $yes$ given the observed value of $\Delta$ is then given as

$$ \begin{equation} P(x_i=\mathrm{yes} | \Delta_{I,II} = 1) = P(\Delta_{I,II}=1 | x_i = \mathrm{yes})\cdot \frac{P(x_i=\mathrm{yes})}{P(\Delta_{I,II}=1)} \end{equation} $$

Here we simply used Bayes' theorem to rewrite the conditional probability on the left. Now, for scheme $I$ we find $P(\Delta_I = 1 | x_i = \mathrm{yes}) = 1-p$ and $P(\Delta_I = 1) = (1-p)\cdot P(x_i = \mathrm{yes})$, which yields

$$ \begin{equation} P(x_i = \mathrm{yes} | \Delta_I = 1) = 1 \end{equation} $$

This means that if we see an increase in the count between two adjacent database versions we can be sure that the attribute value of the person that was added must be $\mathrm{yes}$. This is not good of course as it compromises the privacy of the person. Let's see how scheme $II$ fares. For this one, we have

$$ \begin{equation} P(\Delta_{II} = 1 | x_i = \mathrm{yes}) = (1-p) + p\cdot(1-k) = 1-pk \end{equation} $$

and

$$ \begin{equation} P(\Delta_{II} = 1) = (1-p)\cdot P(x_i = \mathrm{yes}) + p\cdot(1-k) \end{equation} $$

so we obtain

$$ \begin{equation} P(x_i=\mathrm{yes} | \Delta_{II} = 1) = \frac{(1-pk)\cdot P(x_i = \mathrm{yes})}{(1-p)\cdot P(x_i = \mathrm{yes})+p\cdot(1-k)} \end{equation} $$

If we set $k \to 1$ (i.e. we always insert a $no$ when inserting a "random" value), the $yes$ count behaves exactly as in scheme $I$ and the posterior probability is again 100 percent; the same happens if we set $p \to 0$ (i.e. we always insert the true value).
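The posterior above can also be verified with a short Monte Carlo sketch (the code is my own; the prior and the parameters are arbitrary example values): we repeatedly draw a person's true attribute from the prior, run the scheme $II$ insertion, and look at how often the true value is $yes$ among the cases where the count increased.

```python
import random

def posterior_yes_given_increase(p, k, prior_yes, trials=200_000):
    """Monte Carlo estimate of P(x_i = yes | Delta_II = 1) for scheme II."""
    increased, increased_and_yes = 0, 0
    for _ in range(trials):
        true_yes = random.random() < prior_yes
        if random.random() >= p:                 # insert the true value
            recorded_yes = true_yes
        else:                                    # insert a random value
            recorded_yes = random.random() >= k  # 'yes' with probability 1 - k
        if recorded_yes:                         # the yes-count increased by 1
            increased += 1
            increased_and_yes += true_yes
    return increased_and_yes / increased

p, k, prior = 0.5, 0.5, 0.3
estimate = posterior_yes_given_increase(p, k, prior)
analytic = (1 - p * k) * prior / ((1 - p) * prior + p * (1 - k))
print(estimate, analytic)  # both ~0.56
```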

So how much information does an adversary gain if he observes a $\Delta_{II} = 1$? His relative gain in confidence is given as

$$ \begin{equation} P(x_i=\mathrm{yes} | \Delta_{II} = 1)/P(x_i = \mathrm{yes}) = \frac{1-pk}{(1-p)\cdot P(x_i = \mathrm{yes})+p\cdot(1-k)} \end{equation} $$

The value depends on the prior probability $P(x_i = \mathrm{yes})$ as well as on the anonymization parameters $p$ and $k$. We've seen above that $\epsilon_{II} = -\ln{(p+(1-p)\cdot k)}$.
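To get a feel for the numbers, here is the relative gain evaluated for a few example priors (a small sketch; the parameter values are again arbitrary):

```python
def relative_gain(p, k, prior_yes):
    """Relative gain in confidence P(x_i = yes | Delta_II = 1) / P(x_i = yes) for scheme II."""
    return (1 - p * k) / ((1 - p) * prior_yes + p * (1 - k))

for prior in (0.1, 0.3, 0.5):
    print(prior, relative_gain(p=0.5, k=0.5, prior_yes=prior))
# -> roughly 2.5, 1.9 and 1.5: the rarer the attribute, the larger the relative gain
```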