Getting Started With Dwork
Dwork helps you analyze personal data without compromising privacy. It is written in Python and has an intuitive interface that makes it easy to work with data. To get started, we first need to install Dwork, which we can do via pip. Please note that Dwork requires Python version 3.5 or higher.
pip3 install dwork
Then, we load our dataset into Dwork:
import pandas as pd

from dwork.dataset.pandas import PandasDataset
from dwork.dataschema import DataSchema
from dwork.ast.types import Integer

class AbsenteeismSchema(DataSchema):
    Weight = Integer(min=0, max=200)
    Height = Integer(min=0, max=200)

filename = "absenteeism_at_work.csv"
df = pd.read_csv(filename, sep=";")
ds = PandasDataset(AbsenteeismSchema, df)
Here, we first imported what we need from pandas and Dwork. Then, we defined a data schema for our dataset, which tells Dwork the types and ranges of the individual attributes. Dwork can also try to infer this information automatically from the dataset, but this is often not a good idea because it can reveal personal information (for example, knowing the largest or smallest value of a given attribute can already reveal information about individuals in the dataset). For our example dataset we defined just two attributes, Weight and Height, which both have integer values in the range 0 to 200.
The PandasDataset instance that we've created can now be used almost like a normal pandas DataFrame. For example, if we want to calculate the mean weight of all persons in the dataset, we can simply write
result = ds["Weight"].sum()/ds.len()
Note that result is not a numerical variable, but an instance of a Dwork Expression. We can get the true value of the expression by calling result.true(), or a differentially private value by calling result.dp(epsilon=0.5). Dwork will automatically calculate the sensitivity for us and add the proper amount of noise. Neat, isn't it?
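To make the idea of sensitivity and noise more concrete, here is a minimal sketch of how a differentially private mean could be computed by hand with the Laplace mechanism. This is not Dwork's implementation: the text above does not say which mechanism Dwork uses or how it accounts for the privacy budget. The helper names laplace_noise and dp_mean, the even split of epsilon between sum and count, and the sample weights are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Hypothetical helper: noisy mean computed as noisy sum / noisy count."""
    if rng is None:
        rng = random.Random()
    # Clip to the declared range so that adding or removing one record
    # changes the sum by a bounded amount.
    clipped = [min(max(v, lower), upper) for v in values]
    sum_sensitivity = max(abs(lower), abs(upper))
    count_sensitivity = 1
    # Assumption: split the privacy budget evenly between sum and count.
    eps_each = epsilon / 2.0
    noisy_sum = sum(clipped) + laplace_noise(sum_sensitivity / eps_each, rng)
    noisy_count = len(clipped) + laplace_noise(count_sensitivity / eps_each, rng)
    # Guard against a zero or negative noisy count on tiny datasets.
    return noisy_sum / max(noisy_count, 1.0)


weights = [62, 85, 90, 101, 77]  # made-up values inside the 0..200 schema range
print(dp_mean(weights, lower=0, upper=200, epsilon=0.5))
```

The noise scale grows with the declared range and shrinks as epsilon grows, which is one reason a tight, honestly declared schema matters: a wider range means more noise for the same privacy guarantee. With only five records and epsilon=0.5, the printed value will typically be far from the true mean of 83.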
Dwork allows you to filter a dataset by specifying a conditional expression.
# return only rows with Weight > 100
dsf = ds[ds["Weight"] > 100]
Dwork also allows you to group the data by one or more attributes. This is useful, for example, to generate statistics for a number of subgroups of your dataset.
# group the dataset by weight, using 10 kg intervals, as well as by height using
# 10 cm intervals
dsg = ds.group_by(ds['Weight'].discretize(10), ds['Height'].discretize(10))
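
To illustrate what grouping by discretized attributes means, the following plain-Python sketch bins made-up records into 10 kg and 10 cm intervals and counts the size of each group. It does not use Dwork: the discretize function here is a hypothetical stand-in, and labelling each bin by its lower edge is an assumption, since the text above does not specify how Dwork labels intervals.

```python
from collections import Counter


def discretize(value, width):
    """Map a value to the lower edge of its bin, e.g. 87 -> 80 for width 10."""
    return (value // width) * width


# Made-up (weight in kg, height in cm) records for illustration only.
records = [(62, 168), (68, 164), (85, 179), (90, 182), (101, 185)]

# Each group is identified by the pair (weight bin, height bin).
group_sizes = Counter(
    (discretize(weight, 10), discretize(height, 10)) for weight, height in records
)

for (weight_bin, height_bin), size in sorted(group_sizes.items()):
    print(
        f"weight {weight_bin}-{weight_bin + 9} kg, "
        f"height {height_bin}-{height_bin + 9} cm: {size}"
    )
```

Each group could then be summarized with its own statistic. Small groups are the sensitive case, since a group containing only one or two people says a lot about those individuals.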