Don't Roll Your Own Privacy Tools
2019-03-02 Andreas Dewes
Don't Roll Your Own Crypto
"Don't roll your own crypto" is a strong warning that experienced cryptographers often give when advising people on how to implement (or better, not implement) cryptographic methods in practice. In a nutshell, it means that implementing your own custom cryptographic methods is probably not a good idea, at least if you're not an experienced cryptographer. The reason is that there are countless pitfalls that can reduce a seemingly secure crypto-system to little more than a placebo. Sadly, exactly this has happened many times in the past, when people or companies thought they could just build their own crypto, only to find they had made subtle but fatal mistakes. The cost of these mistakes is often paid by the users, who lose their data and, in the case of the recent crypto craze, their hard-earned money when custom-built cryptographic systems fail.
Don't Roll Your Own Privacy Tools Either
Cryptographic systems are mostly built to ensure the privacy of data, so they are part of what we call privacy-enhancing technologies. Anonymization and pseudonymization are other technologies from this family, as they are used to ensure that sensitive information remains private while non-sensitive information can still be used, e.g. for analytics. Surprisingly, many companies these days opt to implement these techniques themselves, "rolling their own" anonymization or pseudonymization methods. To me as a data security practitioner this is surprising, as I think many pseudonymization and anonymization problems are no less challenging and error-prone than most cryptographic problems. So why are engineers who would never try to implement their own version of RSA eager to implement their own k-anonymity algorithm? I think there are several factors:
- Until recently we didn't even have a good formal definition of what anonymity means or how to measure it. This changed with the advent of differential privacy and similar measures, which today make it at least possible to compare different anonymization methods with regard to their privacy protection (though we still don't have a perfect way to measure privacy).
- There has been little effort to define specific, detailed standards for how anonymization or pseudonymization should work on different types of data and in different contexts (though this is improving significantly).
- There has been little effort to analyze and break existing anonymization schemes (though this, too, is changing rapidly).
- There are very few open implementations of anonymization or pseudonymization methods that can be easily integrated into existing software.
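To see why even a "simple" notion like k-anonymity invites subtle mistakes, here is a minimal sketch of a k-anonymity check in Python. The field names and dataset are hypothetical; a real implementation also needs generalization and suppression strategies to *achieve* k-anonymity, which is exactly where hand-rolled schemes tend to go wrong:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether a dataset satisfies k-anonymity: every combination
    of quasi-identifier values must occur in at least k records."""
    groups = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical example: ZIP code and age act as quasi-identifiers.
records = [
    {"zip": "10115", "age": 34, "diagnosis": "A"},
    {"zip": "10115", "age": 34, "diagnosis": "B"},
    {"zip": "10117", "age": 41, "diagnosis": "C"},
]
print(is_k_anonymous(records, ["zip", "age"], 2))  # False: the (10117, 41) group has only one record
```

Note that passing this check is not the same as being anonymous: famous re-identification attacks work precisely by combining quasi-identifiers the scheme's designer didn't anticipate.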
Combined, these factors might have led many engineers tasked with implementing a data anonymization or pseudonymization scheme to "roll their own" instead of looking for a standard solution. If we compare this situation with the development of cryptography, I'd say it resembles the state of the crypto ecosystem almost 30 years ago: few standards, many interesting ideas, and several commercial and open players implementing their own vision of anonymization or pseudonymization.
What Crypto History Can Teach Us
So, can the history of cryptography teach us anything about the future of privacy-enhancing technologies? I think so. If we compare the state of anonymization with the state of cryptography in the 1980s, we see many parallels: today there are still few companies and people who really understand what anonymization and pseudonymization are and how to apply them to real-world problems. There is also still not much regulation that clarifies how to use these techniques in practice. Finally, most users don't really understand or value these techniques yet. I predict, though, that in the not-so-distant future these methods will grow in importance, and that companies as well as users will gain a better understanding of them. And just as most users today won't trust a cloud storage service that doesn't encrypt their sensitive data, they might stop trusting service providers that don't protect their data using pseudonymization and anonymization. What seems certain is that we will see more regulation and standardization in this area, as we already do with laws like the GDPR in Europe. Of course, it's also possible that techniques like traditional pseudonymization will be superseded by more powerful methods like homomorphic encryption, though the real potential of these novel techniques is still unclear.
What This Means For Us
Not relying on your own, possibly flawed implementation of anonymization and pseudonymization algorithms won't help you much if there are no good, open implementations you can use instead. Relying on commercial, closed-source tools might be an option, but the lack of openness again makes it hard to assess the actual security of a given algorithm. As a company that develops privacy-enhancing technologies, it is very important to us that our users can fully trust our approach and audit our methods. This is why we have decided to open-source our entire data processing stack and core algorithms under a permissive license. By doing this we hope to achieve three things:
- Make it easy for researchers and practitioners to analyze and audit our methods. We hope that this will help build trust in our approaches and allow potential security issues to be found and fixed as quickly as possible.
- Make it easy to extend and use our methods in a wide variety of contexts. Data protection and data security are of paramount importance today, and as a small team of experts we don't think that we can build a fitting solution for all use cases (at least not this year :D). By open-sourcing our algorithms we want to encourage others to use them in their software, which in turn helps to make data security available to a larger audience.
- Ensure that your data remains usable even if something happens to us: access to your business-critical data shouldn't depend on our continued existence as a company.
What we will open-source:
- Our data processing framework, KIProtect.
- Our core pseudonymization and anonymization methods.
- Our core data integrations.
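As an illustration of what a basic pseudonymization primitive can look like, here is a generic keyed-hash sketch in Python (this is not KIProtect's actual method; the key value and the truncation to 16 hex characters are arbitrary assumptions for the example):

```python
import hmac
import hashlib

def pseudonymize(value: str, key: bytes) -> str:
    """Deterministically map an identifier to a pseudonym using HMAC-SHA256.
    The same input always yields the same pseudonym, so joins across
    datasets remain possible, while recovering the original requires
    brute-forcing inputs under the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

key = b"store-me-in-a-secrets-manager"  # hypothetical key, for illustration only
print(pseudonymize("alice@example.com", key))
```

Unlike a plain hash, the secret key prevents simple dictionary attacks against common identifiers such as email addresses; but note that determinism itself leaks information (equal pseudonyms reveal equal inputs), which is one of the trade-offs a serious pseudonymization scheme has to address.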
We would of course like to open-source all our software but we can't do this for two reasons:
- We're a profit-oriented startup and we need to be able to defend our USP(s). Opening our core algorithms is already a large risk for us, as it makes it easy for competitors to copy our technology. We will therefore keep some aspects of our software closed-source in order to retain a partial implementation advantage. We strongly believe that open-source business models are the future and see open-sourcing as a core asset and USP; nevertheless, it is not without risks.
- Some of our software stack is highly specific to our infrastructure and to the use cases we develop with our clients. We wouldn't be able to package most of this in a way that is genuinely useful to a wide community of users, so we focus on open-sourcing the parts of the software that can be easily reused, as we think those will provide the highest value to our users.
What do you think about open-sourcing privacy-enhancing technologies? Feel free to discuss with us on HN: