2018-06-12 // Katharine Jarmul

As a data scientist or machine learning practitioner, you likely already know a bit about the new European General Data Protection Regulation (GDPR); but after chatting about the subject with a variety of folks in data science, we here at KIProtect wanted to publish a more detailed guide than the ones we found available on the internet. Presented below are our data science guidelines for thinking about and implementing solutions that are GDPR compliant and privacy aware.

Privacy... Why Care?

When we think about privacy, one option some sites have taken is simply to drop or block European users. Maybe this is an option for you; maybe not. Regardless of whether you are worried about the hefty fines, thinking about and valuing the privacy and data rights of your users is commendable with or without the new regulation. When a company says “Figuring out data rights is too hard, I’ll just skip that and go back to business as usual”, the underlying message is clear: privacy is not in the business model. And the implication for the users: just what are they doing with my data?!?

Unless you are actively doing sinister things with user data (which, by the way, isn’t as far-fetched as it sounds), implementing procedures for GDPR should actually be fairly straightforward. And if it isn’t, this is likely because you’ve accumulated some technical debt which should be addressed regardless of the regulation.

Care about GDPR because:

  • Compliance should be fairly straightforward if you are using standard data processing tools and practices
  • It communicates to your users that you are a trustworthy data guardian
  • If you hit hurdles, this is likely a chance to clear some technical debt, allowing for easier data management, data acquisition and team member onboarding in the future

Step One: Track your Data, Not your Users

We’ve written about this before, but we’ll say it again: Track your data! This is actually a powerful thing to do as a data scientist because it has quite a lot of additional benefits besides consensual data collection. You know those issues you have with debugging your pipeline or determining data quality or reproducing a model you already deployed? These problems would be a bit easier if you were focused on tracking data provenance and processing. If you aren’t already doing that, you should start.

So, how do I track data provenance?

Utilize your current workflow tools and tag data with provenance information

Possible Questions:

  • Where did the data originate?
  • Are there any user consent or usage restrictions tied to the data?
  • What time was this data produced or first seen?
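To make those questions answerable later, you can attach the answers as metadata at the moment data enters your pipeline. Below is a minimal sketch in Python; the field names and the tag_record helper are purely illustrative, not any kind of standard:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class Provenance:
    """Illustrative provenance tag attached to each incoming record or batch."""
    source: str                                          # where the data originated, e.g. "web-signup-form"
    consent_scopes: list = field(default_factory=list)   # e.g. ["analytics", "personalization"]
    usage_restrictions: list = field(default_factory=list)
    first_seen: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def tag_record(record: dict, provenance: Provenance) -> dict:
    """Return the record with its provenance stored under a reserved key."""
    return {**record, "_provenance": asdict(provenance)}

tagged = tag_record(
    {"user_id": 42, "event": "purchase"},
    Provenance(source="web-shop", consent_scopes=["analytics"]),
)
```

However you store it (an extra column, a metadata table, NiFi attributes), the point is that provenance travels with the data instead of living in someone’s head.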

Track processing of data, including failures

Possible Questions:

  • What jobs or workflows ran on this data? At what time?
  • What was the success rate?
  • What happened to failures? Were there errors processing the data?
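One lightweight way to start, without adopting a new framework, is to wrap each pipeline step so every run, success and failure gets recorded. A rough sketch (the standard logger here is just a stand-in for whatever job store or metadata system you already use):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def tracked_job(job_name):
    """Decorator that records duration and success/failure of a pipeline step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                log.info("job=%s status=success duration=%.2fs", job_name, time.time() - start)
                return result
            except Exception:
                log.exception("job=%s status=failed duration=%.2fs", job_name, time.time() - start)
                raise
        return wrapper
    return decorator

@tracked_job("clean_purchases")
def clean_purchases(records):
    # your existing transformation logic goes here
    return [r for r in records if r.get("amount", 0) > 0]
```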

Flag consent and errors in an accessible way

Just as you would want to flag low-confidence data or outliers, flag data with special consent requirements in a way that is easy to query and route. Use this as an opportunity to also clearly flag murky data (i.e. I don’t know the provenance, consent or quality of this data) and to reroute data which should not be used for certain processes (i.e. this user has opted out of personalization).
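Once the consent flags exist (here reusing the hypothetical _provenance field from the tagging sketch above), routing can be as simple as a filter in front of each downstream use:

```python
def route_by_consent(records, required_scope):
    """Split records into those cleared for a given use and those that are not."""
    allowed, held_back = [], []
    for record in records:
        scopes = record.get("_provenance", {}).get("consent_scopes")
        if scopes is None:
            held_back.append(record)   # murky data: provenance or consent unknown
        elif required_scope in scopes:
            allowed.append(record)
        else:
            held_back.append(record)   # e.g. the user opted out of this use
    return allowed, held_back

# e.g. personalization_ready, excluded = route_by_consent(tagged_records, "personalization")
```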

There are several frameworks already tackling these issues, such as Apache NiFi, but it is likely you can adapt whatever tools you are currently working with by adding some fields or metadata. Remember, perfect is a long way away -- get something that is practical and time-boxed but still fits the requirements. Use the initial implementation for a bit and then do an improvement sprint. This is a process, and iteration will help you integrate feedback from other teams (i.e. requirements from data engineers, the IT department, BI tools, etc.).

Step Two: Remove or Pseudonymize (or possibly Anonymize) Murky Consent Data

Just like you wouldn’t set up your friend on a blind date without asking them, using users’ data when you didn’t ask them about it is icky. Because you are now tracking and routing data based on consent, this should be fairly easy to do moving forward, yay! But what to do about the piles of data that are now possibly non-consensual?

First, make sure you talk with other teams about whether consent was properly requested when privacy policies were updated -- or whether you can send some more opt-in messaging (i.e. “We use your purchase history to help recommend items you might enjoy. This means we take your historical purchase data from our site and process it automatically to generate recommendations personalized for you. Would you like to continue to get personalized recommendations via this service?”). Now, that’s not so scary, is it? 😎

What if the user revokes consent or asks that their data not be used? First and foremost, you need a plan for deleting any personally identifiable data or data attached to or produced by a single user. Hopefully you have been tracking this properly and it is stored in a linkable way. If not, it’s time to get rid of some technical debt and refactor your data architecture. You might also need to think about logging, or parsing your logs, to track and re-label this unknown data.
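What that plan looks like depends entirely on your storage, but as a rough sketch (the table and column names below are made up for illustration), a revocation handler might delete everything directly linkable to the user and flag derived data for reprocessing:

```python
import sqlite3

def handle_revocation(conn: sqlite3.Connection, user_id: int) -> None:
    """Illustrative handler: remove a user's raw data and mark derived data for review."""
    with conn:  # one transaction: either everything applies or nothing does
        conn.execute("DELETE FROM raw_events WHERE user_id = ?", (user_id,))
        conn.execute("DELETE FROM user_profiles WHERE user_id = ?", (user_id,))
        # Aggregates or model features that cannot simply be deleted get flagged
        conn.execute(
            "UPDATE derived_features SET needs_review = 1 WHERE user_id = ?",
            (user_id,),
        )
```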

Pseudonymization: Utility with Increased Privacy

But what if you can’t get consent via a simple email, or the user did not consent to the particular processing you are going to implement? Or let’s say deletion of that data is not possible or desirable. Under the regulation, pseudonymization (as well as anonymization) can alleviate this problem: pseudonymization allows you to process user data more liberally, as the risk to the user in case of data loss or theft is reduced. And if the data is anonymized so that no individual users are identifiable anymore, you can even use and retain it without explicit consent.

Now, we are not lawyers, and it is yet to be determined how pseudonymization plays out in practice (you can find a longer description of the regulation’s language and incentives around pseudonymization on IAPP); however, making efforts to secure your data as much as possible using pseudonymization and anonymization shows you take data protection seriously and will be your best protection against fines and user complaints.

Pseudonymization transforms data in a way that often lets you still utilize it in your aggregate analytics or machine learning training just as you currently use the raw data. Pseudonymized data can also make for great test datasets or datasets you can share with partners, as it essentially has all person-related identifiers removed (remember, however, that it is not anonymous!). There are many approaches to pseudonymization, but we believe we offer one of the best because it is extremely performant and uses a novel, structure-preserving method that retains much of the data utility while still strongly protecting it.
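To make the idea concrete, here is a deliberately simple sketch using keyed hashing (HMAC) of direct identifiers. This is not the structure-preserving method we use at KIProtect, just a common baseline; note that whoever holds the key can re-link the pseudonyms, so the output is still personal data under GDPR:

```python
import hmac
import hashlib

# Keep this key out of your codebase (e.g. in a secrets manager); anyone holding it
# can re-identify users.
SECRET_KEY = b"load-me-from-a-secrets-manager"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed, deterministic pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "purchase_total": 42.50}
record["email"] = pseudonymize(record["email"])  # same input -> same pseudonym, so joins still work
```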

Now, those of you who know a bit about de-anonymization attacks are frowning right now, because you know, both mathematically and from real-world examples, that identifying or de-anonymizing a person given external information can be quite straightforward. For example, if I know where you live, or where you work, or the last 3 movies you watched, I might be able to link this knowledge to even pseudonymized data. I can use the information I know plus the information exposed in the data itself to guess with high likelihood that those data points are indeed YOU. 🕵🏼 So please keep in mind that pseudonymization is not anonymization. Given the right context and a committed adversary, pseudonymized data can often still be linked back to a unique user.

Anonymization and GDPR

Data held past consent expiration, or for which consent was not explicitly given, needs to be anonymized if you want to retain it, and when doing so you will need to prove you have taken measures to ensure anonymity. The gold standard of anonymity is differential privacy, which, broken down into its simplest form, means an attacker querying an anonymized database using differentially private methods should not be able to learn much about a person whose data was just added to it. This applies even if they can run these queries on the database just before and just after the person is added.

As you can imagine, this is often hard to guarantee, especially over a long period of time and with a continuously changing dataset. If you are doing data science the way most of us are today, an attacker could query the data day by day and possibly learn something about the person in question based on changes in the data (especially if they also have outside information). This is problematic, obviously, and the way we are dealing with datasets and modeling our data needs to address privacy at a more general level in order to ensure differential privacy (i.e. differentially private data collection or differentially private machine learning models).

Given these constraints, what can you do? Moving towards differential privacy guarantees (even if you can’t guarantee differential privacy over a long period of time or a long series of queries) is a good way to provide some anonymity assurances for your users. Common differential privacy techniques such as additive noise, elimination of singular or cardinal data points and limiting the information exposed in queries are useful best practices. If you want to hear more about how to guarantee differential privacy, please refer first to Cynthia Dwork’s lecture on the topic.
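To make “additive noise” concrete, here is a toy sketch of the Laplace mechanism applied to a single counting query. The epsilon value is illustrative, and a real deployment would also need to track the privacy budget spent across all queries:

```python
import numpy as np

def dp_count(values, predicate, epsilon=0.5):
    """Differentially private count: the true count plus Laplace noise.

    A count changes by at most 1 when one person's data is added or removed
    (sensitivity = 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 52, 61]
print(dp_count(ages, lambda age: age > 40, epsilon=0.5))
```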

At KIProtect, we are developing real-world data science APIs and services to help make pseudonymization and anonymization easier to integrate into your workflows. Our goal is to allow you to do data science as usual and leave the data security and privacy layer to us. Interested to know more? Check out our API documentation or feel free to reach out to chat.

Step Three: Privacy First

GDPR calls for Privacy by Design -- first espoused by Ann Cavoukian in her 7 foundational principles. This means, when you are handling and managing data, you need to think of privacy first. Okayyyy, you might say, sounds great…. But what does it really mean to put privacy first?

Putting privacy first means you operate to preserve privacy whenever possible. It also means you think of related concerns such as data security. In our daily operations, this means that we start thinking about several important topics as data scientists and machine learning practitioners. They are as follows:

Utilize Role-based Access

Who has access to what data and why? If you don’t yet have role-based access for your data, you need to set this up. Where possible, protect particular fields or datasets with the privacy of the user in mind. If I am building a model on customer behavior, do I actually need their personal identifiers? If not, then why do I have access to them?
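Even before a full access-control system is in place, you can enforce the same idea at the level of the tables and dataframes your team actually loads. A hypothetical sketch (the roles and policy here are made up and would normally live in your access-control or data-catalog system, not in code):

```python
import pandas as pd

# Hypothetical role -> allowed-columns policy
COLUMN_POLICY = {
    "data_scientist": ["user_pseudonym", "purchase_total", "country"],
    "billing": ["user_id", "email", "purchase_total"],
}

def load_for_role(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Return only the columns the given role is allowed to see."""
    allowed = set(COLUMN_POLICY.get(role, []))
    return df[[col for col in df.columns if col in allowed]]
```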

Data Security Policies and Culture

It’s possible you already have some stringent demands on your data due to IT security concerns, but are they really a part of your team culture? If not, it’s time to think about how we treat data. Do we store it on unencrypted hard drives? Do we put it in open S3 buckets? How are we protecting the data from potential data loss, both in the cloud and on our personal computers? There are many resources to get started on security best practices (more articles on this to come!), but the important thing is to build data security into the culture of your data team. (And, why not utilize pseudonymized data whenever possible, to decrease the risk of exposing real customer data when doing simple tasks like training a new model?)

Privacy-Preserving Data Science

When possible, err on the side of privatizing your data when doing data science. Incorporate pseudonymization (or anonymization) where possible, and utilize thoughtful data collection (like Apple’s differential privacy or Google’s RAPPOR) or privacy-preserving models (such as PATE and Privacy Preserving Deep Learning). When asked to build a model which may quickly violate privacy, such as hypersegmenting a customer base using age, zip code and gender, ask questions as to the utility and ethics of the task. Privacy and ethics often go hand in hand -- feel free to debate these ideas as a team and company and find a happy medium where you can do meaningful work that you are also proud to contribute to.
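As a small taste of what thoughtful, privacy-preserving collection can look like, here is classic randomized response, the idea RAPPOR builds on, sketched for a single yes/no attribute (the flip probability is illustrative):

```python
import random

def randomized_response(truth: bool, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise a fair coin flip.

    The collector never learns any individual's answer for certain, but can still
    estimate the population rate by correcting for the known noise.
    """
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_rate(reports, p_truth: float = 0.75) -> float:
    """Unbiased estimate of the true 'yes' rate from the noisy reports."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```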

Privacy first as a company mantra goes beyond the tasks listed above, of course; but it never hurts to start the conversation and form a culture around privacy. In the end, consumers will be happy to know you are taking your role as data guardian seriously; privacy is good for ethics and for business.

Step Four: Data Portability

One final provision you might be interested in as a data science or engineering team is data portability. In the regulation, data portability gives users the right to request that their data be ported from one service to another in a machine-readable way. Sounds pretty neat, right? Indeed it is! The freedom to move your data from one place to the next means the status quo of “Oh, I must stay with this service forever because they have all of my data or likes or favorites or preferences” is no more. Users can move and try out new services and offerings -- and startups can get ahead by preparing for data portability now.

Maybe that’s you! So what do you need to do to get ready for data portability?

Outline questions you will ask when you receive new data.

What do you need to track from the original source? What is the provenance and consent of the data? Is there additional data you need to ask for to integrate into your system? Who can you talk to at the previous company if you have questions about the data?

Design a process for incoming data.

As you can imagine, there is no standard or schema set (yet) for this portability process, so you’ll need to be flexible. Do have an email address, API endpoint or upload service which allows you to take in new data of questionable or unknown schema and structure (you might want to also check out our partner project DP-Kit, which provides these tools in a secure way). Set aside some time as a team to determine who will monitor or address incoming data.
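A rough sketch of such an intake step: accept records with an unknown schema, keep the fields you recognize and quarantine the rest for a human to review (the expected field names here are, of course, placeholders):

```python
import json

EXPECTED_FIELDS = {"user_id", "email", "preferences", "history"}  # illustrative

def intake(raw_payload: str):
    """Parse an incoming portability payload, separating known from unknown fields."""
    records = json.loads(raw_payload)
    accepted, quarantined = [], []
    for record in records:
        known = {k: v for k, v in record.items() if k in EXPECTED_FIELDS}
        unknown = {k: v for k, v in record.items() if k not in EXPECTED_FIELDS}
        accepted.append(known)
        if unknown:
            quarantined.append({"original": record, "unmapped_fields": sorted(unknown)})
    return accepted, quarantined
```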

Coordinate with your customer-facing teams.

Should data portability readiness be advertised to new users? How will you perform outreach or onboarding for your incoming cohort? Do they need any new onboarding documentation catered to their previous platform? If billing them for the service is required, how will that be managed?

Cell phone providers went through many of these hurdles in the 90s when number portability became a thing. Those who adapted quickly ended up receiving large benefits (at least in the short term) from increased usage and a growing user base. Be that company! Promote your portability success and gain new motivated users thanks to data rights. 💪🏻

In Summary: Be a Thoughtful Data Guardian

The message GDPR sends is clear: data rights belong to the individual. If we are lucky enough to have folks trust us with their data, we need to be thoughtful data guardians. We need to be trustworthy, not just for those who know how to set up their privacy preferences, but for everyone.

Taking measures to put privacy at the center of your data science is a useful endeavor with added benefits: more ethical data processing and sourcing. It also benefits your team: better debugging, increased security, more thoughtful processing workflows. In light of the increased scrutiny of data breaches and consumer privacy, investing in privacy-first data science now will likely only benefit your company’s value and place in the industry.

We here at KIProtect are building the data privacy and security layer for real-world data science. Should you have any questions about how to use our products, please let us know. Now, go forth and do privacy-preserving data science!