Getting Started With Dwork

Dwork helps you analyze personal data without compromising privacy. It is written in Python and has an intuitive interface that makes it easy to work with data. To get started, we first need to install Dwork, which we can do via pip. Please note that Dwork requires Python 3.5 or higher.

pip3 install dwork

Then, we load our dataset into Dwork:

import pandas as pd

from dwork.dataset.pandas import PandasDataset
from dwork.dataschema import DataSchema
from dwork.ast.types import Integer

class AbsenteeismSchema(DataSchema):
    Weight = Integer(min=0, max=200)
    Height = Integer(min=0, max=200)

filename = "absenteeism_at_work.csv"
df = pd.read_csv(filename, sep=";")
ds = PandasDataset(AbsenteeismSchema, df)

Here, we first imported pandas and the necessary classes from Dwork. Then, we defined a data schema for our dataset. The schema tells Dwork about the types and ranges of the individual attributes. Dwork can also try to infer this information automatically from the dataset, but this is often not a good idea because it can reveal personal information (for example, knowing the largest or smallest value of a given attribute can already reveal information about individuals in the dataset). For our example dataset we defined just two attributes, Weight and Height, which both have integer values in the range [0, 200].
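
To make the risk of inferring ranges from the data more concrete, here is a small plain-Python illustration. It does not use Dwork, and the weights are made-up values rather than rows from the absenteeism dataset. The inferred maximum is simply the weight of the heaviest person, so publishing it discloses that person's value:

```python
# Made-up weights in kg, for illustration only (not taken from the dataset).
weights = [62, 71, 68, 80, 75, 143]

def infer_range(values):
    """Infer an attribute range directly from the data (the risky approach)."""
    return min(values), max(values)

inferred = infer_range(weights)
# The same inference on a dataset that lacks the heaviest person.
without_heaviest = infer_range([w for w in weights if w != 143])

print(inferred)          # (62, 143)
print(without_heaviest)  # (62, 80)
```

The upper bound changes from 143 to 80 depending on whether a single person is present. A fixed range declared in the schema, such as [0, 200], does not depend on the data and therefore does not leak anything.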

The PandasDataset instance that we've created can now be used almost like a normal pandas DataFrame. For example, if we want to calculate the mean weight of all persons in the dataset, we can simply write:

result = ds["Weight"].sum()/ds.len()

Note that result is not a number but an instance of a Dwork Expression. We can get the true value of the expression by calling result.true(), or we can get a differentially private value by calling result.dp(epsilon=0.5). Dwork will automatically calculate the sensitivity for us and add the proper amount of noise. Neat, isn't it?
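
To give an idea of what "calculating the sensitivity and adding noise" can mean, here is a minimal sketch of the classic Laplace mechanism for a bounded sum, written in plain Python. It is not Dwork's implementation: the function names `laplace_noise` and `dp_sum` are hypothetical, and the sketch assumes that neighboring datasets differ by adding or removing one row, so that the sensitivity of the sum is the largest absolute value a single row can contribute.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def dp_sum(values, lo, hi, epsilon, rng=random):
    """Return a noisy sum of values clamped to [lo, hi] (hypothetical helper).

    Assumes add/remove-one-row neighbors, so one row changes the sum by at
    most max(|lo|, |hi|). This is an illustration, not Dwork's actual code.
    """
    clamped = [min(max(v, lo), hi) for v in values]
    sensitivity = max(abs(lo), abs(hi))
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return sum(clamped) + laplace_noise(scale, rng)

# Made-up weights in kg; the range [0, 200] matches the schema above.
weights = [62, 71, 68, 80, 75]
noisy_total = dp_sum(weights, lo=0, hi=200, epsilon=0.5)
print(noisy_total)  # the true sum is 356; the printed value varies per run
```

With a range of [0, 200] and epsilon=0.5, the noise scale in this sketch is 200 / 0.5 = 400, which shows why tight, realistic ranges in the schema matter. A differentially private mean additionally needs a noisy count; how Dwork splits the privacy budget between the two is not covered here.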


The filtering functionality is still under construction and not yet merged into the `master` branch.

Dwork allows you to filter a dataset by specifying a conditional expression.

# return only rows with Weight > 100
dsf = ds[ds["Weight"] > 100]


The grouping functionality is still under construction and not yet merged into the `master` branch.

Dwork also allows you to group the data by one or more attributes. This is useful, for example, to generate statistics for subgroups of your dataset.

# group the dataset by weight, using 10 kg intervals, as well as by height using
# 10 cm intervals
dsg = ds.group_by(ds['Weight'].discretize(10), ds['Height'].discretize(10))
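
To illustrate what grouping by 10-unit intervals means, here is a plain-Python sketch that does not use Dwork. The `discretize` helper below is hypothetical, and labeling each interval by its lower bound is an assumption; Dwork's own `discretize` may label intervals differently. The rows are made-up values.

```python
from collections import Counter

def discretize(value, width):
    """Map a value to the lower bound of its interval of the given width."""
    return (value // width) * width

# Made-up (weight in kg, height in cm) pairs, for illustration only.
rows = [(68, 172), (63, 178), (90, 172), (98, 170), (65, 171)]

# Count how many rows fall into each (weight interval, height interval) group.
group_sizes = Counter(
    (discretize(weight, 10), discretize(height, 10)) for weight, height in rows
)

for key, count in sorted(group_sizes.items()):
    print(key, count)
# (60, 170) 3
# (90, 170) 2
```

Each key identifies one subgroup, for example all persons weighing 60 to 69 kg with a height of 170 to 179 cm, for which statistics can then be computed separately.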