Risk Assessment

Dwork provides functionality to assess the re-identification and inference risk of datasets. This helps you determine whether publishing a dataset that contains personal data poses a risk to the individuals whose data it contains. Dwork uses statistical tests to assess these risks.

Re-Identification Risk Analysis

Dwork can estimate the re-identification risk for a specific row of a dataset, a random sample of rows, or the entire dataset. To do so, Dwork assumes that an adversary knows specific attribute values of an individual. With this so-called background or context knowledge, the adversary can look for matching rows in the dataset. If only a small number of rows match, the adversary may be able to re-identify the individual's record. Examples:

ds = PandasDataset(...)

# will return an estimate of the re-identification risk based on all attributes
ds.reidentification_risk()

# will return an estimate of the re-identification risk based on knowledge of
# the 'income', 'age' and 'zip_code' attributes
ds.reidentification_risk(attributes=['income', 'age', 'zip_code'])

# same as above, but for a specific row of the dataset
ds.reidentification_risk(row=1, attributes=['income', 'age', 'zip_code'])

# same as above, but for a random sample of 10 % of the rows
ds.reidentification_risk(sample=0.1, attributes=['income', 'age', 'zip_code'])
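
The matching step described above can be illustrated with plain pandas. This is a minimal sketch and not Dwork's implementation: the helper `matching_rows` and the toy data are hypothetical, and the risk is approximated as one over the number of matching rows, which is a simplification rather than the statistical tests Dwork uses.

```python
import pandas as pd

# Toy data; all values are invented for illustration only.
df = pd.DataFrame({
    "age": [34, 34, 51, 51, 51],
    "zip_code": ["10115", "10115", "10117", "10117", "10119"],
    "income": [42000, 58000, 39000, 39000, 75000],
})


def matching_rows(data, knowledge):
    """Return the rows that agree with the adversary's background knowledge.

    `knowledge` maps attribute names to the values the adversary knows.
    Hypothetical helper, not part of Dwork.
    """
    mask = pd.Series(True, index=data.index)
    for attribute, value in knowledge.items():
        mask &= (data[attribute] == value)
    return data[mask]


# The adversary knows the target's age and zip code.
candidates = matching_rows(df, {"age": 51, "zip_code": "10117"})

# Naive risk estimate: the fewer candidates remain, the higher the risk.
naive_risk = 1 / len(candidates)
```

Here two rows match the adversary's knowledge, so the target cannot be singled out with certainty. With a zip code of "10119" instead, only one row would match and the record would be uniquely identifiable.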

Restricting the analysis to specific attributes often makes sense, since not all attribute values are equally easy for an adversary to learn. For example, an individual's exact income might be more difficult to uncover than their age or the zip code of their residence.
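
To see why the choice of attributes matters, the following sketch compares the share of rows that become unique under two different attribute sets. It uses plain pandas only; the function `unique_fraction` and the data are hypothetical and serve as a rough proxy, not as the estimate that `reidentification_risk` computes.

```python
import pandas as pd

# Toy data; all values are invented for illustration only.
df = pd.DataFrame({
    "age": [34, 34, 51, 51, 51],
    "zip_code": ["10115", "10115", "10117", "10117", "10119"],
    "income": [42000, 58000, 39000, 39000, 75000],
})


def unique_fraction(data, attributes):
    """Fraction of rows whose combination of `attributes` occurs exactly once.

    Hypothetical helper, not part of Dwork.
    """
    # Number of rows sharing each distinct combination of attribute values.
    group_sizes = data.groupby(attributes).size()
    # Each group of size one corresponds to exactly one unique row.
    return float((group_sizes == 1).sum() / len(data))


# Adversary knows only the attributes that are easy to learn.
easy = unique_fraction(df, ["age", "zip_code"])

# Adversary additionally knows the exact income.
full = unique_fraction(df, ["age", "zip_code", "income"])
```

In this toy dataset, one of five rows is unique when only age and zip code are known, while three of five are unique once income is added. Assuming knowledge the adversary is unlikely to have therefore tends to overstate the risk.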