De-identification techniques are often at the forefront of companies’ concerns when it comes to the processing of big data. In addition, anonymization and pseudonymization techniques have been a heavily debated topic in the ongoing reform of EU data protection law. This makes last year’s Article 29 Working Party (WP29) Opinion on Anonymization Techniques1 even more important, as it examines the effectiveness and limits of anonymization techniques and places them in the context of data protection law. This article details the WP29 Opinion on Anonymization Techniques and considers the opinion in relation to the upcoming EU General Data Protection Regulation.
Personal Data, Anonymous Data, and Pseudonymous Data
Under EU data protection law, there are three broad categories of data:
- Personal data. The concept of personal data is extremely wide. Personal data is defined as any information relating to an identified or identifiable natural person. An identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity.2
- Anonymous data. Anonymous data is any information from which the person to whom the data relates cannot be identified, whether by the company processing the data or by any other person. The threshold for anonymization under EU data protection law is very high: data can only be considered anonymous if re-identification is impossible, i.e., if no party can re-identify the individual by any means likely reasonably to be used. This is an absolute threshold, and the company’s intent is irrelevant. Anonymized data is no longer considered personal data and thus falls outside the scope of EU data protection law.3
- Pseudonymous data. This concept is not formally defined in the current EU data protection legal framework.4 Pseudonymization is a form of de-identification in which the information remains personal data. That categorization is the legal distinction between anonymized and pseudonymized data: pseudonymous data still allows for some form of re-identification (however indirect and remote), while anonymous data cannot be re-identified.
Anonymization and Pseudonymization Techniques
Anonymizing personal data is itself a processing step subject to EU data protection law, and it is the last moment at which the data falls within that law’s scope as personal data. The WP29 opinion considers several anonymization techniques:
- Noise addition. Imprecision is added to the original data. For example, a doctor may measure a patient’s weight correctly, but after noise addition the recorded weight is only accurate to within +/- 10 lb (illustrated in the first sketch after this list).
- Substitution. Values in the original data are replaced with other parameters. For example, instead of indicating a patient’s height as 5’7″, this value is substituted with the word “blue”; a height of 5’8” is registered as “yellow.” Substitution is often combined with noise addition.
- Aggregation. To prevent singling out, an individual is grouped with several other individuals who share some or all attributes, e.g., place of residence and age. For example, a data set does not record the inhabitants of San Francisco with certain characteristics, but the inhabitants of Northern California. K-anonymity is a form of aggregation. The process impedes re-identification by removing some information while leaving the data intact for future use. A scrubbed data set is considered k-anonymous if, once released, the information for each person it contains cannot be distinguished from at least k-1 other individuals. One method of achieving k-anonymity is data suppression: a value is replaced with a placeholder, e.g., “X” instead of “age 29.” Another method is generalization: instead of “age 29,” the entry reads “between 25 and 35.” L-diversity is an extension of k-anonymity. K-anonymity can be defeated by an inference attack, which allows an attacker to deduce the real value behind the visible one; l-diversity protects against this by requiring the sensitive attribute to take at least l different values within each group (see the k-anonymity sketch after this list).
- Differential privacy. This comes into play when a company gives a third party access to an anonymized data set. A copy of the original data remains with the company, and the third-party recipient only receives an anonymous view of it; additional techniques such as noise addition are applied before the data is released. Differential privacy is applied at the time an authorized third party requests the data (see the differential-privacy sketch below).
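To make the first two techniques concrete, here is a minimal Python sketch (the field names, values, and code words are illustrative assumptions, not taken from the opinion) that adds up to +/- 10 lb of noise to a weight and substitutes heights with arbitrary code words:

```python
import random

# Hypothetical substitution table; the code words are arbitrary,
# mirroring the "blue"/"yellow" example above.
HEIGHT_CODES = {"5'7\"": "blue", "5'8\"": "yellow"}

def add_noise(weight_lb: float, spread: float = 10.0) -> float:
    """Noise addition: perturb the true value by up to +/- `spread` lb."""
    return weight_lb + random.uniform(-spread, spread)

def substitute_height(height: str) -> str:
    """Substitution: replace the real value with an unrelated code word."""
    return HEIGHT_CODES.get(height, "unknown")

record = {"weight_lb": 154.0, "height": "5'7\""}
anonymized = {
    "weight_lb": round(add_noise(record["weight_lb"]), 1),
    "height": substitute_height(record["height"]),
}
print(anonymized)  # e.g. {'weight_lb': 148.3, 'height': 'blue'}
```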
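The suppression and generalization steps behind k-anonymity can be sketched in the same spirit; the ten-year age bands and the check below are illustrative assumptions rather than anything the opinion prescribes:

```python
from collections import Counter

def generalize_age(age: int) -> str:
    """Generalization: replace an exact age with a coarse band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def suppress(_value: str) -> str:
    """Suppression: replace a value with a placeholder, e.g. 'X' for 'age 29'."""
    return "X"

def is_k_anonymous(records: list, k: int) -> bool:
    """k-anonymity: every combination of quasi-identifiers must occur at
    least k times, so no record is distinguishable from k-1 others."""
    return all(count >= k for count in Counter(records).values())

rows = [(29, "Northern California"), (31, "Northern California"),
        (27, "Northern California"), (33, "Northern California")]
generalized = [(generalize_age(age), region) for age, region in rows]
print(generalized)                       # [('20-29', ...), ('30-39', ...), ...]
print(is_k_anonymous(generalized, k=2))  # True: each group occurs twice
```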
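Because differential privacy operates on query answers rather than on the stored records, its sketch looks different from the ones above. The Laplace mechanism used here is one common choice, an assumption on our part since the opinion does not prescribe a specific mechanism:

```python
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, sampled as the difference of two
    exponential draws (a standard identity)."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon: float = 1.0) -> float:
    """Answer a count query with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# The company keeps the raw data; the third party only ever sees
# noisy answers to the queries it is authorized to run.
weights = [154, 198, 172, 181, 166, 203, 149]
print(private_count(weights, lambda w: w > 170, epsilon=0.5))
```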
Pseudonymization techniques are different from anonymization techniques. With anonymization, the data is scrubbed of any information that may serve as an identifier of a data subject. Pseudonymization does not remove all identifying information from the data but merely reduces the linkability of a dataset with the original identity of an individual (e.g., via an encryption scheme). The WP29 opinion provides the following selected examples of pseudonymization techniques:
- Hash functions. Hashes are a popular tool because they can be computed quickly. They are used to map data of any size to codes of a fixed size. For example, the names Cédric Burton, Sára Gabriella Hoffman, and John M. Smith can be hashed to “01,” “02,” and “03.” However long the name, the hash value will always be two digits.
- Tokenization. Tokenization is a process by which certain data components are substituted with a non-sensitive equivalent, called the token. The token has no exploitable value of its own, but it serves as an identifier: a reference that traces back to the original data (both techniques are illustrated in the sketch below).
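A minimal Python sketch of both techniques (the salting, digest truncation, and two-digit token format are illustrative assumptions): the hash maps names of any length to fixed-size values, while the token table keeps the reference back to the original data:

```python
import hashlib
import itertools
import secrets

SALT = secrets.token_bytes(16)  # kept secret by the data controller

def hash_name(name: str) -> str:
    """Hash function: map input of any length to a fixed-size digest
    (truncated here for readability)."""
    return hashlib.sha256(SALT + name.encode("utf-8")).hexdigest()[:8]

# Tokenization: substitute values with non-sensitive tokens, keeping the
# mapping so each token can be traced back to the original data.
_tokens: dict = {}
_counter = itertools.count(1)

def tokenize(name: str) -> str:
    if name not in _tokens:
        _tokens[name] = f"{next(_counter):02d}"
    return _tokens[name]

for name in ["Cédric Burton", "Sára Gabriella Hoffman", "John M. Smith"]:
    print(f"{name}: token={tokenize(name)}, hash={hash_name(name)}")
# Tokens come out as 01, 02, 03; every hash has the same fixed length.
```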
The WP29 opinion examines the above techniques and categorizes them as follows:5
| Technique | Is singling out still a risk? | Is linkability still a risk? | Is inference still a risk? |
| --- | --- | --- | --- |
| Noise addition | Yes | May not | May not |
| Substitution | Yes | Yes | May not |
| Aggregation | No | Yes | Yes |
| L-diversity | No | Yes | May not |
| Differential privacy | May not | May not | May not |
| Hashing/Tokenization | Yes | Yes | May not |
On average, differential privacy scores the highest as an anonymization technique under EU data protection law. However, depending on the concrete risk to be mitigated, one technique may prevail over another.
Ongoing Discussions and Likely Trends
In the context of the ongoing reform of EU data protection law, the concepts of personal data, anonymous data, and pseudonymized data have been strongly debated. While a formal agreement still needs to be reached, the following trends have emerged:
- The definitions of personal data and anonymous data will remain substantially similar. EU data protection principles will continue to apply to any information concerning an identified or identifiable person. As is the case today, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the company processing the data or by any other person to identify the individual. Likewise, EU data protection law will continue not to apply to data rendered anonymous in such a way that the data subject is no longer identifiable.
- The concept of personal data will become broader and more specific. For example, it seems likely that the legislator will create a presumption that unique identifiers qualify as personal data and will explicitly specify in the EU General Data Protection Regulation that online identifiers, location data, and IP addresses are personal data unless companies can demonstrate that the data does not allow individuals to be identified (which in practice may be difficult to prove).6
- The concept of pseudonymized data or of pseudonymization will be formally introduced. Companies de-identifying personal data and using pseudonymized data or pseudonymization techniques will likely benefit from some flexibility under EU data protection law, even though the data will still be considered personal data and fall within the scope of that law. The EU institutions involved in the legislative process currently diverge on whether the new framework will simply obligate companies to pseudonymize data without clear business incentives, or whether it will grant substantial flexibility to companies using pseudonymization techniques. Hopefully, the latter approach will be followed, but this remains uncertain.
Conclusions and Next Steps
The WP29 Opinion on Anonymization Techniques is a bridge between a legal criterion and the technical solutions available to satisfy it. It is a valuable piece of legal interpretation of practical relevance, as this topic remains at the forefront of companies’ concerns. With technology and state-of-the-art methods in steady flux, the opinion also leaves room for interpretation.
The ongoing EU data protection reform will most likely be based on the same core principles and key concepts, including the definitions of personal data and anonymous data. The new framework should also formally define pseudonymized data or pseudonymization, and hopefully will provide strong incentives for companies to de-identify personal data. To anticipate the upcoming change of law, companies should consider identifying the various types of information that they process, and in particular review whether their existing de-identification techniques qualify as anonymization or pseudonymization and whether new processes to pseudonymize or anonymize data can be implemented to benefit from more flexibility in the future. Ongoing monitoring of the legal and privacy-engineering environment is necessary to stay within the boundaries of current and upcoming EU data protection law.
1 Article 29 Working Party, Opinion No. 05/2014 on Anonymization Techniques, http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf.
2 Certain personal data receives a higher level of protection under EU data protection law because of its sensitivity. As of today, sensitive data includes personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, and trade-union membership, as well as data concerning health or sex life.
3 Recital 26 of the EU Data Protection Directive excludes anonymized data from EU data protection law. It reads: “Whereas the principles of protection must apply to any information concerning an identified or identifiable person; (…); whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable (…)”.
4 With the exception of a few national data protection laws, such as German law.
5 Article 29 Working Party, Opinion No. 05/2014 on Anonymization Techniques, Table 6. Strengths and Weaknesses of the Techniques considered.
6 The EU General Data Protection Regulation will also likely broaden the concept of sensitive data and create new categories of data, such as biometric or genetic data.