Anonymization Techniques Defined: Transforming Real Data into Realistic Test Data
Andrew Colombi, PhD
June 23, 2021
Data anonymization is the process of taking a real dataset and transforming or truncating the data in such a way that the resulting data can no longer be used to re-identify a real-world individual. Anonymized data can take many forms, depending on the specific technique applied. The five approaches below represent the most common pathways to anonymizing (aka obfuscating, aka de-identifying) real data. As you'll see, each comes with its own pros and cons.
1. Data Redaction
Redaction may be the simplest data anonymization technique. This simply means removing or replacing part of a field’s contents. For example, PII in a dataset might be deleted or written over with dummy characters (such as Xs or asterisks) in order to “mask” the data.
Redaction is a good option when dealing with sensitive data that is no longer actively used by your organization. The rest of the dataset, which doesn’t involve personal information, can still be used for other purposes.
However, data redaction is also a drastic measure. Redacted data normally can’t be restored or recovered, and it loses most of its utility for real-world use cases.
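To make the idea concrete, here is a minimal Python sketch of redaction by overwriting characters. The function name `redact`, its parameters, and the sample values are hypothetical and chosen only for illustration; real redaction tools vary in what they mask and what they keep.

```python
def redact(value: str, keep_last: int = 0, mask_char: str = "X") -> str:
    """Overwrite letters and digits with a mask character.

    Punctuation is left in place so the field keeps its shape, and the
    last `keep_last` alphanumeric characters can optionally stay visible.
    """
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


# Made-up sample values, not real PII.
print(redact("123-45-6789", keep_last=4))  # XXX-XX-6789
print(redact("jane@example.com"))          # XXXX@XXXXXXX.XXX
```

Note that nothing in the output lets you recover the original value, which is exactly why redacted data loses most of its utility.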
2. Format-preserving Encryption
Data encryption is the process of encoding information using a secret called the encryption key. More specifically, format-preserving encryption encodes the information in a way that preserves the format of the original input in the output. E.g. a 16-digit credit card number will be encoded to another 16-digit number.
Data encryption is a strong layer of defense against security breaches, although not a foolproof one. If a malicious actor manages to exfiltrate the encrypted data from your internal network, this information will be useless without the corresponding decryption key. Of course, this means that the key itself needs to be fiercely safeguarded from prying eyes.
Encrypted data generally retains some of its utility, at least for those with the decryption key, which allows the encryption process to be reversed. This is in contrast with one-way hash functions, which also convert information into an indecipherable format but which are computationally infeasible to invert. Whereas data encryption is more often used to protect data in transit or at rest, hash functions are used to create a digital fingerprint of data.
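The following Python sketch illustrates the format-preserving property with a toy Feistel network over decimal digits. It is an assumption-laden teaching example, not a standardized scheme and not something to use for real data: the function names, the HMAC-SHA256 round function, the round count, and the even-length restriction are all choices made here for brevity. Production systems typically rely on a vetted implementation of a standardized mode instead.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed pseudo-random round function (hypothetical helper)."""
    msg = bytes([round_no]) + half.encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> int:
    """Validate input and return the length of one half."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expected an even-length string of ASCII digits")
    return len(digits) // 2


def fpe_encrypt(digits: str, key: bytes, rounds: int = 10) -> str:
    h = _check(digits)
    mod = 10 ** h
    left, right = digits[:h], digits[h:]
    for r in range(rounds):
        # Feistel step: (L, R) -> (R, (L + F(R)) mod 10^h)
        new_right = (int(left) + _round_value(key, r, right, mod)) % mod
        left, right = right, str(new_right).zfill(h)
    return left + right


def fpe_decrypt(digits: str, key: bytes, rounds: int = 10) -> str:
    h = _check(digits)
    mod = 10 ** h
    left, right = digits[:h], digits[h:]
    for r in reversed(range(rounds)):
        # Undo one step: the previous R is the current L.
        prev_right = left
        prev_left = (int(right) - _round_value(key, r, prev_right, mod)) % mod
        left, right = str(prev_left).zfill(h), prev_right
    return left + right


key = b"demo-key-do-not-reuse"          # placeholder key for the example
card = "4111111111111111"               # well-known test card number
token = fpe_encrypt(card, key)
print(len(token), token.isdigit())      # 16 True
print(fpe_decrypt(token, key) == card)  # True
```

The output is still a 16-digit string, so it fits the same column type and validation rules as the input, and anyone holding the key can reverse it.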
3. Data Scrambling
Like data encryption, data scrambling seeks to convert an input dataset into an entirely new format to preserve the information’s confidentiality. However, data scrambling and data encryption are not the same thing: whereas encryption is reversible if you possess the decryption key, data scrambling is permanent and irreversible.
Encryption and scrambling are also different in terms of their methodologies. Data encryption usually makes use of the Advanced Encryption Standard (AES), a specification for the encryption of electronic data established by the U.S. National Institute of Standards and Technology in 2001. Data scrambling, on the other hand, depends on each organization’s implementation: it has no predefined rules or guarantees about how the data will be scrambled. For example, data scrambling might replace all but the leading characters in a person’s name (e.g. “John Smith” becomes “Jxxx Sxxxx”), or it might randomize the order of the interior characters (e.g. “John Smith” becomes “Jhon Stmih”).
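The two scrambling styles mentioned above can be sketched in a few lines of Python. The function names are hypothetical, and because scrambling has no standard definition, this is just one possible implementation of each style.

```python
import random


def mask_after_first(name: str, fill: str = "x") -> str:
    """Keep the leading character of each word and mask the rest."""
    return " ".join(w[:1] + fill * (len(w) - 1) for w in name.split(" "))


def shuffle_interior(name: str, rng: random.Random) -> str:
    """Randomly reorder the interior characters of each word."""
    def scramble(word: str) -> str:
        if len(word) <= 3:
            return word  # nothing to reorder
        middle = list(word[1:-1])
        rng.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    return " ".join(scramble(w) for w in name.split(" "))


print(mask_after_first("John Smith"))                 # Jxxx Sxxxx
print(shuffle_interior("John Smith", random.Random()))
```

The second function only permutes characters, so short or distinctive names may remain easy to guess. That is one reason scrambling offers no formal guarantees.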
Data scrambling is often used when cloning a database from one environment to another. Databases that contain sensitive business information, such as data on human resources, payroll, or customers, must be treated with care during the cloning process. Once scrambled, the cloned database can be used for purposes such as stress tests and integration tests, which don’t depend on the accuracy of the information within the database.
4. Data Pseudonymization
Data pseudonymization is a less comprehensive data anonymization technique that focuses specifically on personally identifiable information (PII), replacing a person’s information with an alias. The GDPR formally defines pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information.” However, pseudonymized data may retain less identifiable information (e.g. the dates of a student’s attendance) in order to maintain the data’s statistical utility.
Like encryption, data pseudonymization can be reversed when necessary, which is one of its key features. However, great care must be taken with pseudonymization: data elements that may not seem to be personally identifiable in isolation can become identifiable when viewed together. As such, pseudonymizing data is one of the less safe methods of anonymization, and one that can lead to unintentional re-identification.
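As a rough illustration of reversible aliasing, the Python sketch below replaces each identifier with a random alias and keeps the mapping in a separate lookup table. This table plays the role of the “additional information” in the GDPR definition. The class name, method names, and alias format are hypothetical, and a real system would store the table securely and apart from the pseudonymized data.

```python
import secrets


class Pseudonymizer:
    """Maps identifiers to random aliases and back (illustrative only)."""

    def __init__(self) -> None:
        self._alias_by_value: dict[str, str] = {}
        self._value_by_alias: dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        # Reuse the existing alias so the same person maps to the same token.
        if value not in self._alias_by_value:
            alias = "user_" + secrets.token_hex(4)
            while alias in self._value_by_alias:  # avoid rare collisions
                alias = "user_" + secrets.token_hex(4)
            self._alias_by_value[value] = alias
            self._value_by_alias[alias] = value
        return self._alias_by_value[value]

    def reidentify(self, alias: str) -> str:
        # Only possible for whoever holds the lookup table.
        return self._value_by_alias[alias]


p = Pseudonymizer()
alias = p.pseudonymize("Jane Doe")       # made-up name
print(alias != "Jane Doe")               # True
print(p.pseudonymize("Jane Doe") == alias)  # True
print(p.reidentify(alias))               # Jane Doe
```

Because the alias is consistent across records, joins and aggregate statistics still work, but so do linkage attacks that combine the remaining columns.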
Using pseudonymized data is risky because surprisingly few pieces of information are often necessary in order to uniquely identify a person. In many cases, even two or three identifiers are sufficient to narrow the search pool. According to a study from Harvard’s Data Privacy Lab, for example, an estimated 87 percent of Americans can be uniquely identified from three pieces of information: their gender, their date of birth, and their 5-digit ZIP code.
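One simple way to gauge this risk on your own data is to count how many records are unique on a given combination of quasi-identifiers. The Python sketch below does this for a handful of made-up records; the function name, field names, and sample values are hypothetical and exist only to show the calculation.

```python
from collections import Counter


def unique_fraction(records: list[dict], quasi_identifiers: list[str]) -> float:
    """Fraction of records whose quasi-identifier combination is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Fabricated toy records: names are already replaced with aliases.
records = [
    {"alias": "user_1", "gender": "F", "dob": "1990-01-01", "zip": "02138"},
    {"alias": "user_2", "gender": "F", "dob": "1990-01-01", "zip": "02138"},
    {"alias": "user_3", "gender": "M", "dob": "1985-06-15", "zip": "02139"},
    {"alias": "user_4", "gender": "F", "dob": "1972-11-30", "zip": "02139"},
]

print(unique_fraction(records, ["gender"]))                # 0.25
print(unique_fraction(records, ["gender", "dob", "zip"]))  # 0.5
```

A high fraction suggests that the aliases alone provide little protection, since an attacker who knows those attributes from another source can single out individuals.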
5. Statistical Data Replacement
Commonly known as data masking or obfuscation, this is perhaps the strongest and most sophisticated option when it comes to preventing re-identification with data anonymization. The approach involves learning statistics about the underlying data and then using those statistics to replace the existing data realistically. The information is still obfuscated or obscured, as with data redaction; however, the true records still exist in the underlying database, and can be hidden or revealed as necessary.
Statistical data replacement generates a new dataset by replacing the original with altered content via one-to-one data transformation. This new content often preserves the utility of the original data (e.g. real names are replaced with fictitious names), but it can also simply scramble or null out the original data with random characters when preserving utility isn’t required.
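A very small Python sketch of the "learn statistics, then replace" idea follows. It assumes independent columns, a normal distribution for the numeric column, and simple frequency sampling for the categorical one. These are simplifying assumptions made for illustration, and the function names and sample values are hypothetical. Real tools model far richer structure, such as correlations between columns and referential integrity.

```python
import random
import statistics
from collections import Counter


def replace_numeric(values: list[float], rng: random.Random) -> list[float]:
    """Replace each value with a draw from a normal fit to the column."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [round(rng.gauss(mu, sigma), 2) for _ in values]


def replace_categorical(values: list[str], rng: random.Random) -> list[str]:
    """Replace each value with a draw from the observed frequencies."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=len(values))


rng = random.Random()
salaries = [52000.0, 61000.0, 58500.0, 75000.0, 49000.0]       # made-up values
departments = ["eng", "eng", "sales", "hr", "eng"]             # made-up values

print(replace_numeric(salaries, rng))
print(replace_categorical(departments, rng))
```

The replaced columns keep roughly the same shape as the originals, so tests and analytics behave realistically, while no output row is a copy of a specific input row by construction.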
Provided that a dataset is masked well, it’s typically difficult to reverse engineer or re-identify the underlying data involved, making this a good choice for safeguarding privacy along with utility. That said, not all anonymization is performed equally.
Sometimes it looks a lot more like pseudonymization, so much so that the approach has made headlines recently for the ease with which individuals in “anonymized” datasets have been re-identified. To ensure privacy, it’s important to use a powerful masking tool that makes it easy to detect PII, alerts users to schema changes that require new masking, and provides comprehensive reports on the PII found and the masking performed.
Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.