This document explains the most important concepts underlying Dwork.

Differential Privacy

Differential Privacy (DP) is a mathematical concept developed by Cynthia Dwork et al. DP applies to randomized anonymization mechanisms and enables us to estimate how well such a mechanism protects the privacy of an individual. But let's take a step back and see what that means:

When analyzing a dataset (e.g. the "absenteeism data" we use throughout this documentation), we might for example be interested in calculating the mean weight of all persons in it. Let's assume the exact mean weight is 75.5 kg for our dataset. Since this is an aggregated value and not related to any specific person in our dataset, we could just publish it directly.

Now imagine we repeat this analysis one month later and publish the new mean, only this time one more person has been added to the dataset. An adversary¹ who knows that exactly one person was added, who has observed both results, and who knows how many rows were in the dataset could now calculate the weight of the person that was added. Clearly this would violate the privacy of that individual.

How can we protect the data better? Well, we could for example change the result by a small, random amount before publishing it. Doing this makes it harder for the adversary to reconstruct the original value, because now they have to account for the random value that was added, which they cannot predict. And this is exactly how a randomized anonymization mechanism protects data! Not every randomized mechanism is secure, though: how much protection it offers depends on the amount of noise we add. And that's where differential privacy comes into play: it provides us with a mathematical way to reason about the security of randomized anonymization mechanisms.
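The scenario above can be made concrete with a minimal sketch in plain Python. It does not use Dwork itself. The row count, the added person's weight, the helper names, and the choice of Laplace noise with a scale of 0.5 are all assumptions made for illustration; the text only speaks of "a small, random amount".

```python
import random


def reconstruct_added_weight(mean_before, n_before, mean_after):
    """Differencing attack: recover the weight of the single added person."""
    return mean_after * (n_before + 1) - mean_before * n_before


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


# Hypothetical numbers, chosen only for illustration.
n_before = 100        # rows in the first analysis
mean_before = 75.5    # exact mean published the first time
new_weight = 82.0     # weight of the person added afterwards
mean_after = (mean_before * n_before + new_weight) / (n_before + 1)

# With exact means, the adversary recovers the new person's weight.
exact_guess = reconstruct_added_weight(mean_before, n_before, mean_after)

# With noise added to each published mean, the reconstruction is off.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
noisy_before = mean_before + laplace_noise(0.5, rng)
noisy_after = mean_after + laplace_noise(0.5, rng)
noisy_guess = reconstruct_added_weight(noisy_before, n_before, noisy_after)

print(f"guess from exact means: {exact_guess:.2f} kg")
print(f"guess from noisy means: {noisy_guess:.2f} kg")
```

Because the attack multiplies each published mean by the row count, even small noise on the means translates into a large error in the adversary's guess.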

Sensitivity

In the context of differential privacy (DP), the sensitivity of an expression tells us how much the value of the expression might change if we add or remove a single row of data. Dwork uses the sensitivity to calculate how much random noise it needs to inject into a result to ensure the anonymity of all individuals whose data contributed to the result. Examples:

If we calculate, for example, the sum of all `Weight` values in our example dataset, `ds['Weight'].sum()`, adding or removing a single row of data will change the result by at most `±200` (as the weight has a value range of `[0,200]`). Hence, the sensitivity of the sum is `200`.

If we calculate the number of rows in our dataset, `ds.len()`, adding or removing a single row of data will change the result by exactly `±1`. Hence, the sensitivity of the `len()` function is `1`.

If we calculate the mean of all `Weight` values, `ds['Weight'].sum()/ds.len()`, adding or removing a single row of data will change the dividend by at most `±200` and the divisor by `±1`. Here the sensitivity is no longer constant but depends on the actual value of `ds.len()`: the larger the value, the smaller the sensitivity.
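The three examples above can be checked empirically with a small sketch in plain Python (not Dwork's API). The helper `max_change_on_removal` and the weights are hypothetical. Note that the sketch measures the change observed on one particular dataset, whereas the sensitivity is the worst case over all possible datasets, so the observed values can only be at or below the sensitivities stated above.

```python
def max_change_on_removal(rows, query):
    """Largest absolute change in query(rows) when any single row is removed."""
    full = query(rows)
    return max(
        abs(full - query(rows[:i] + rows[i + 1:]))
        for i in range(len(rows))
    )


def mean(rows):
    return sum(rows) / len(rows)


# Hypothetical weights within the value range [0, 200].
weights = [62.0, 75.5, 88.0, 200.0, 0.0, 71.0]

sum_change = max_change_on_removal(weights, sum)    # never exceeds 200
count_change = max_change_on_removal(weights, len)  # always exactly 1

# The mean reacts less to a single row as the dataset grows.
mean_change_small = max_change_on_removal(weights, mean)
mean_change_large = max_change_on_removal(weights * 10, mean)

print(sum_change, count_change)
print(mean_change_small, mean_change_large)
```

In this sketch the sum reaches its bound only because the dataset happens to contain a row with the maximum weight of 200.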

Dwork knows how to calculate the sensitivity of basic expressions like `sum()` or `len()`, and it knows how the sensitivity changes if we combine these expressions using arithmetic operators like `+`, `-`, `*` and `/`. Therefore, it can often automatically calculate the sensitivity of a given expression. The calculation is done in such a way that Dwork might sometimes overestimate the sensitivity, but will never underestimate it.
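To illustrate why such a calculation can overestimate but never underestimate, here is a toy sketch of propagating upper bounds through addition, subtraction, and multiplication by a constant. The class `SensitivityBound` and its methods are hypothetical and are not Dwork's implementation; the rules Dwork actually applies, in particular for `*` and `/` between expressions, are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensitivityBound:
    """Upper bound on how much an expression changes per added/removed row."""

    bound: float

    def __add__(self, other):
        # |(a + b) - (a' + b')| <= |a - a'| + |b - b'|
        return SensitivityBound(self.bound + other.bound)

    def __sub__(self, other):
        # Worst case: the two operands move in opposite directions.
        return SensitivityBound(self.bound + other.bound)

    def scale(self, factor):
        # Multiplying by a constant scales the possible change accordingly.
        return SensitivityBound(abs(factor) * self.bound)


weight_sum = SensitivityBound(200.0)  # sum of Weight, range [0, 200]
row_count = SensitivityBound(1.0)     # number of rows

# sum + 10 * count: the bounds simply add up.
combined = weight_sum + row_count.scale(10)

# sum - sum is always 0, so its true sensitivity is 0, yet the rule
# above reports 400: a safe overestimate, never an underestimate.
cancelled = weight_sum - weight_sum

print(combined.bound, cancelled.bound)
```

The second case shows the trade-off: rules that only look at the operands' bounds stay safe without inspecting the data, at the cost of sometimes reporting more sensitivity than is really there.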

¹ Someone with bad intentions who tries to uncover personal information from an anonymized dataset.
