2019-03-19 // Katharine Jarmul

It might seem like an obvious question, but actually defining private or sensitive data is a difficult task -- even for experts in the field. Let's dive into some of the reasons why that is, and how to approach identification of private or sensitive data for your use case.

Personally Identifiable Data is Private Data

This seems like an obvious place to start -- and it is! PII or Personally Identifiable Information covers a list of data points, well outlined by Piwik, that usually relate directly to a person. This includes a person's name, address, email, national identity number, birthdate, phone number, credit card numbers and username. PII usually carries some of the strictest regulatory protections because it is tied directly to our person, and if this data were released or breached, it might put us at risk of identity theft, phishing attacks and other crimes.

Health-Related Personal Data is Sensitive Data

For areas such as healthcare, there are further data points closely linked to PII, which in the United States are called PHI or Protected Health Information. Under HIPAA regulations in the United States, PHI relates to particular hospital visits or other electronic health records, and covers 18 identifiers, well outlined by HIPAA Journal. Besides the usual PII, this definition also covers things like URLs related to health records, admittance dates or dates of treatment, biometric identifiers and device identifiers.

You might wonder -- why so many extra pieces of information that seem unrelated to your particular person? What does your admittance date or a copy of your retina scan have to do with your privacy?

These extra identifiers can be used to identify you, even in the absence of other personal data. This is why removing or pseudonymizing things like admittance dates is required before health data can be used for a purpose other than the one originally defined. Let's dive into that further.

Quasi-Identifiers are Private Data

Related to personally identifiable health data, such as admittance dates, are pieces of data called quasi-identifiers. What exactly are quasi-identifiers?

A quasi-identifier is something that partially (i.e. "quasi") identifies someone. For example, if I know where you live and your gender, I know two quasi-identifiers about you. If I combine them with one other piece of information -- like the fact that you were released from the hospital on Thursday -- then I could potentially identify your health records if they also listed your gender and/or your location.

So is there a list of quasi-identifiers we can use so that we can remove or pseudonymize this information before using a dataset?

Unfortunately, a list of quasi-identifiers depends on where you live, what data you have and who you ask. There are some useful lists, from Canadian eHealth information for example. And you can always read some of the initial research focused on identifying individuals using only quasi-identifiers, such as Latanya Sweeney's work on re-identifying health records using only birth date, postal code and gender. The point is, although we can make extensive lists of quasi-identifiers for many datasets, we might still miss something that is a quasi-identifier for a specific use case.
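
One rough way to see how identifying a combination of quasi-identifiers is in your own data is to count how many records share each combination. Below is a minimal Python sketch of that idea. The records, field names and the helpers `group_sizes` and `unique_rows` are all invented for illustration, and it uses birth year rather than a full birth date to keep the toy data small.

```python
from collections import Counter

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"birth_year": 1985, "postal_code": "10115", "gender": "F"},
    {"birth_year": 1985, "postal_code": "10115", "gender": "F"},
    {"birth_year": 1990, "postal_code": "10117", "gender": "M"},
    {"birth_year": 1972, "postal_code": "10115", "gender": "F"},
]


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def unique_rows(rows, quasi_identifiers):
    """Return the rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] == 1
    ]


at_risk = unique_rows(records, ["birth_year", "postal_code", "gender"])
print(len(at_risk), "of", len(records), "rows are unique on these fields")
```

A row that is unique on these fields could be singled out by anyone who knows those three facts about the person. A count like this only covers the fields you thought to check, so it can show a problem but cannot prove a dataset is safe.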

Let's take this image as an example. What quasi-identifiers do we have for the person captured in this image (from Google street view)?

Google Streetview Image of man with blurred face walking down the street

Yes, his face is blurred -- but what else do we know? We know what year the image was taken. We know the location it was taken in. The area is mainly a business district of the city, so we can potentially guess that he works nearby. We might be able to guess the time of day. We know what he is wearing (which he might wear again or we might recognize). Even without the face, we have a lot of potential quasi-identifiers.

Let's take a non-image example. Here is a review written on Google Maps.

Google Maps review of a restaurant -- recommending the food and reservation

Normally, this review would also have an attached Google profile, so depending on the amount of information listed there, we might be able to directly identify the person; but let's say a pseudonym is used. What other clues can I use to identify the person?

I can see what date (or range) the person visited the restaurant. I can determine what types of food they enjoy. I might be able to tell what region they are from based on the types of language they use. I know they visited another country and could potentially link this review with other reviews from their trip. And they didn't even include personal data in the text -- something that happens more often than you'd think!

Removing all potential quasi-identifiers is a good idea if you are passing the dataset to another person or company -- and this means spending time determining what they might be and applying anonymization techniques to the dataset itself. Otherwise, it is useful to simply know that removing all quasi-identifiers is difficult and takes time -- so any data collected from persons should be assumed to contain quasi-identifiers and be treated as private.
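
As one example of what such a technique can look like, here is a minimal Python sketch of generalization: replacing exact values with coarser ones so that more people share the same combination. The field names, the example record and the chosen granularity (month, two-character postal prefix, ten-year age band) are assumptions for illustration, not a recommendation for any particular dataset.

```python
from datetime import date


def generalize_record(record):
    """Coarsen quasi-identifiers so that more people share the same values."""
    # Lower bound of the ten-year band the age falls into, e.g. 34 -> 30.
    band_start = record["age"] // 10 * 10
    return {
        # Keep only the year and month of the visit, not the exact day.
        "visit_month": record["visit_date"].strftime("%Y-%m"),
        # Keep only the first two characters of the postal code.
        "postal_prefix": record["postal_code"][:2],
        # Replace the exact age with a ten-year band.
        "age_band": f"{band_start}-{band_start + 9}",
    }


# Hypothetical record, invented for illustration.
example = {"visit_date": date(2019, 3, 14), "postal_code": "10115", "age": 34}
print(generalize_record(example))
```

How coarse the values need to be depends on the dataset. Generalizing a few known fields also does nothing for quasi-identifiers you have not spotted, which is why the cautious default above still applies.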

User-Input Data is Private Data

When users are given forms, who knows what comes over them? And by them, I mean all of us! I regularly have to stop myself from treating chat prompts and contact forms as places where I can talk directly to another human. I mean, why not just put our name, phone number and account number directly into the form?

Well, the problem is that this data is often then stored somewhere -- sometimes without any security precautions -- or is used without any sanitization for applications like machine learning or natural language processing. So when you enter your phone number or account number into a form or chat message, it might be stored or used without that information first being removed to protect your privacy.

So, when we develop applications that allow users to input text, to record their voices or videos, or to talk with chatbots or customer service representatives, we should treat their user input as private until proven otherwise.
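
To make the sanitization step concrete, here is a minimal Python sketch that replaces email addresses and long digit runs (such as phone or account numbers) in free text with placeholders before the text is stored or reused. The two regular expressions and the `redact` helper are deliberately simple assumptions for illustration; they will miss many real-world formats and can also catch harmless numbers.

```python
import re

# Deliberately simple patterns, assumed for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A digit, then at least six digits or common separators, then a final digit.
NUMBER = re.compile(r"\+?\d[\d\s()./-]{6,}\d")


def redact(text):
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = NUMBER.sub("[NUMBER]", text)
    return text


# Hypothetical chat message, invented for illustration.
message = "Hi, call me at +49 30 1234567 or write to ana@example.com about my order."
print(redact(message))
# prints: Hi, call me at [NUMBER] or write to [EMAIL] about my order.
```

Pattern matching like this only catches what it was written to catch; names, addresses and other quasi-identifiers in the same message pass straight through. That is one more reason to treat the input as private by default.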

Proprietary Data is Sensitive Data

Beyond person-related data, there are plenty of other types of sensitive data. For example, if you are a bank, your proprietary information on how you determine new markets to expand into is a company secret; if you are a day trader, your proprietary trade information is a secret of your job. If either were released, it could cause real harm.

A lot of the data that we use for data science and data mining is proprietary data: how much money our company spends on advertisements, our company Intranet materials, or operations and logistics data on shipments or devices worldwide. If any of this data were to leak, we would need to determine the extent of the damage -- what if our competitor got ahold of it? What would our board or stockholders think? Would it cause commotion for ourselves or our industry if these secrets were sold?

Use the Robber- and Mom-Tests

If you need to identify sensitive or private data in your dataset -- use a test. The first test, the "Robber-Test", is to ask yourself, "Would it be okay if this data were stolen?". If you can safely say yes, it would be okay (perhaps not preferred, but if it happened it wouldn't demonstrably affect your day-to-day operations or internal security), then it is likely that the data doesn't contain sensitive or private information.

The second test is what I call the "Mom-Test". This test asks: if the data at hand were your mom's, would you be okay handling it the way you currently do? Would you use it exactly the same way, share it the same way, store it the same way? If so, you are likely treating this person-related data with care, but if not, you may want to review ways to identify and protect private data before handling it.

Conclusion: Sensitive and Private Data are Everywhere

Hopefully you learned at least one new way to identify sensitive and private data in this article, helping to make your data use more secure. In the end, you are the expert when it comes to your own data, and you should trust your domain knowledge and expertise to help determine what could be a quasi-identifier, what you should remove or protect and how to make sure your data is handled securely.

If you are ever unsure, or something doesn't pass the Robber- or Mom-Test, it is always better to remain vigilant and cautious. Although perfection is the enemy of good for most things, when managing private or sensitive data it's far better to strive for truly secure data management.