
Reverse Engineering Your Test Data: It’s Not as Safe as You Think It Is

Author
Omed Habib
April 21, 2021

    “Is it safe?” Laurence Olivier’s Nazi Dentist from Hell quietly asked this innocent question in the 1976 film Marathon Man. Unprepared civilian Dustin Hoffman could only respond with smart aleck answers to a question he didn’t understand.

    The day went downhill rapidly for Hoffman. 

    After applying salve to the wounds he had just inflicted, Olivier’s White Angel mused, “Life can be that simple. Discomfort/Relief.”

    I’m bringing up such a disturbing, emotionally-charged image because the impact of data re-identification can be instantaneous and devastating for the real people exposed by it. This is not the kind of problem where if you get it wrong, the site goes down and you get a call at 4 a.m. from a furious stakeholder. This is the kind of problem where lives and livelihoods are destroyed if you unwittingly let the wrong ones in.

    So, yeah. No pressure.

    Let’s define some terms before we get too deep into this. 

    The Data Privacy Cast (In Order of Appearance):

    Personally Identifiable Information (PII) - The US Department of Labor defines PII as “any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.”

    It’s clear from that definition that not protecting this private data well enough is the legal equivalent of posting it on your homepage next to a flashing “Download now!” button.

    It’s also worth noting that this page adds, “Department of Labor (DOL) contractors are reminded that safeguarding sensitive information is a critical responsibility that must be taken seriously at all times.” 

    The DOL has to remind its contractors to be careful, too. You're not alone.

    Protected Health Information (PHI) - On the US Dept. of Health and Human Services site, HIPAA defines PHI as individually identifiable health information, including demographic data, relating to (a) any individual’s past, present or future physical or mental health or condition; (b) the provision of health care to that individual; or (c) the past, present, or future payment for the provision of health care by that individual.

    It also includes any data “for which there is a reasonable basis to believe it can be used to identify the individual.” More on this in a bit. 

    Partially anonymized data - This includes data masked by redaction, such as the common practice of keeping just the last four digits of a Social Security number, and pseudonymization, where some fields in a record are replaced by randomized identifiers. For a deeper analysis of the risks involved here, read our recent blog on Prop 24’s amendments to the California Consumer Privacy Act.
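
    To make the distinction concrete, here’s a quick Python sketch of both techniques. The record, field names, and token scheme are made up purely for illustration, not a recommended recipe:

```python
import secrets

# Illustrative record; the field names are hypothetical.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "asthma"}

# Redaction: keep only the last four digits of the SSN.
redacted_ssn = "***-**-" + record["ssn"][-4:]

# Pseudonymization: swap the direct identifier for a random token,
# keeping a mapping so the same person gets the same token every time.
pseudonym_map = {}

def pseudonymize(value):
    if value not in pseudonym_map:
        pseudonym_map[value] = "user_" + secrets.token_hex(4)
    return pseudonym_map[value]

masked = {
    "name": pseudonymize(record["name"]),
    "ssn": redacted_ssn,
    "diagnosis": record["diagnosis"],  # quasi-identifiers left untouched
}
print(masked)  # e.g. {'name': 'user_3fa92c1b', 'ssn': '***-**-6789', ...}
```

    Notice what’s left untouched: the diagnosis, and in a real table the zip code, age, gender, and so on. Those quasi-identifiers are exactly what the next cast member exploits.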

    Inferential data - Here we have come to the heart of the problem. This refers to identities that can be inferred from anonymized data by various techniques, such as cross-referencing it with publicly accessible datasets. In 2019, a research study published in Nature Communications laid out a specific method to correctly re-identify 99.98% of individuals in supposedly anonymized datasets using just 15 demographic attributes.

    One of the authors of that study, Yves-Alexandre de Montjoye of Imperial College London, elsewhere showed that he could identify 95% of people in easily acquired mobile location data using just four location timestamps.

    And data hunters are avid readers. They have grown much more sophisticated recently while oceans of big data continue to gush out into the publicly available web. 

    Vulnerability in anonymization techniques

    The European Data Protection Board (EDPB), an independent oversight board created to ensure that General Data Protection Regulation (GDPR) rules are applied consistently, unequivocally states that “the natural person is ‘identifiable’ when, although the person has not been identified yet, it is possible to do it.” Which translates to: your organization’s legal culpability benchmark should not be the exposure of data but the possibility of exposure.

    In the White Angel’s terms, partially redacted or masked data and pseudonymized data should make you feel “Discomfort.”

    Techniques like these fail to prevent re-identification when data is pooled, shared, or combined with other datasets. For example, not masking notable outliers, such as a millionaire in an income table that otherwise captures mostly middle-class employees at a given company, is like slapping a “Hello, my name is” label on data associated with the company leader.

    Another giveaway could be as simple as a zip code, even a partial one, where a sparse population combined with publicly available data can pinpoint an individual. Noise must be added to the dataset in cases like these and many others in order to safeguard against re-identification.
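
    To see why sparse values are so dangerous, here’s a rough Python sketch (the column names and threshold are illustrative) that counts how many records share each combination of quasi-identifiers. Any group smaller than k is effectively unique and needs suppression, generalization, or noise before the data goes anywhere:

```python
import pandas as pd

# Hypothetical extract with direct identifiers already removed.
df = pd.DataFrame({
    "zip3":   ["940", "940", "597", "940"],   # truncated ZIP codes
    "age":    [34, 34, 71, 34],
    "income": [82_000, 79_000, 1_400_000, 85_000],
})

K = 3  # smallest group size we're willing to release

# How many records share each quasi-identifier combination?
group_sizes = df.groupby(["zip3", "age"])["income"].transform("size")

# Rows in groups smaller than K are easy to single out.
risky = df[group_sizes < K]
print(risky)  # the lone 71-year-old in ZIP 597xx, with the outlier income, stands out
```

    This is the intuition behind k-anonymity. It’s a necessary check, but as the research above shows, nowhere near sufficient on its own.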

    To look at some real-world examples, check out our aforementioned Prop 24 blog, which touches on the re-identification of users by cross-referencing anonymized Netflix viewing histories with public IMDb reviews; or the Swiss researchers who identified individuals in 84% of the 120,000 lawsuits they scanned in less than an hour, based solely on publicly available Supreme Court documents.

    Potential fallout from data re-identification

    What “Discomfort” looks like in instances such as these includes:

    Leaked PII or PHI - Consider the case of Adult Friend Finder, which really lived up to its name. More than 400 million accounts were exposed, revealing names, email addresses and passwords. The result? Marriages, careers, family connections—severed immediately and permanently. Honesty is indeed the best policy, but this case is also the very definition of TMI—or TMPII.

    Non-compliance/regulatory fines - In 2021 so far, there has been a 19% increase in daily data breach notifications. The EU has issued more than €275 million in GDPR fines since the very first fine was issued in the summer of 2019, including a €50 million fine for Google alone. In the States, Equifax blew past that record with a $575 million settlement for its failure to protect consumers’ personal credit data.

    Crisis management / damage control - The longest-lasting yet most subjective cost is probably the loss of consumer confidence, which will make or break a brand in the end. Security Magazine reported that 78% of consumers surveyed would stop engaging with a brand online after a data breach. About a third said they would stop engaging completely. 

    What can you do to protect against re-identification?

    The 1972 world chess champion Bobby Fischer once admitted that he could easily see 20 moves ahead if the line of attack was simple. When his opponent had many possible counter-attacks, his foresight shortened to only one move ahead.

    That’s precisely the situation you are in here. The only difference is that as a developer working with private data, you are up against millions of potential opponents who have unlimited counter-attacks.

    What you have on your side is a set of best practices in data de-identification, listed below, and the foresight to prepare, with the most advanced privacy guarantees available, well before your data is breached.

    In order to move from “Discomfort” to “Relief,” anyone who cares about protecting private data has to assure themselves that every possible leak has a thumb in it. 

    1. Minimize data exposure. Set up safeguards so your teams only work with the data they truly need for the task at hand. Truncate tables, subset databases, scale data down to the smallest useful size to perform your work.
    2. Perform advanced data transformation. Once you’ve verified that you are working with a minimized dataset, make use of the most comprehensive data de-identification techniques available. For developers, the goal is to de-identify in a way that minimizes the risk of re-identification, without losing the utility of the data you need for development and testing. Your approach should include locating PII throughout your database, ensuring consistency in transformations across tables (ideally, across DBs), and linking data across your schema to mirror relationships and preserve referential integrity. A minimal sketch of this kind of consistent mapping appears just after this list.
    3. Apply differential privacy. The devil is in the details, and large datasets contain all kinds of devils that can expose individuals counterintuitively. A short sketch of the core idea also appears after this list. Learn more about differential privacy here.
    4. Regularly refresh datasets. This is meant to account for schema changes and protect against production data leaking into lower environments.
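
    For step 2, one common way to keep transformations consistent across tables is deterministic, keyed hashing: the same input always maps to the same pseudonym, so foreign keys still join. The sketch below is a simplified illustration with made-up tables, columns, and key handling, not a full de-identification pipeline:

```python
import hmac
import hashlib

SECRET_KEY = b"rotate-me-and-keep-me-out-of-source-control"  # placeholder

def pseudonym(value: str) -> str:
    """Deterministically map an identifier to a stable pseudonym."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return "cust_" + digest[:12]

# Two hypothetical tables that reference the same customer ID.
customers = [{"customer_id": "C-1001", "name": "Jane Doe"}]
orders    = [{"order_id": "O-7", "customer_id": "C-1001", "total": 42.50}]

masked_customers = [
    {"customer_id": pseudonym(c["customer_id"]), "name": "REDACTED"}
    for c in customers
]
masked_orders = [
    {**o, "customer_id": pseudonym(o["customer_id"])} for o in orders
]

# Referential integrity survives: the order still joins to its customer.
assert masked_orders[0]["customer_id"] == masked_customers[0]["customer_id"]
```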
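
    And for step 3, the core idea of differential privacy is adding calibrated random noise to aggregate answers so that no single person’s presence measurably changes the result. Here’s a minimal sketch of the Laplace mechanism for a counting query; the epsilon value and the records are illustrative, and in production you’d reach for a vetted library rather than hand-rolled noise:

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(records, predicate, epsilon=0.5):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (one person changes the true
    count by at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records: did each patient report a given condition?
patients = [{"has_condition": c} for c in [True, False, True, True, False]]
print(dp_count(patients, lambda p: p["has_condition"]))  # true count 3, plus noise
```

    Smaller epsilon means more noise and stronger privacy; tuning that trade-off against the utility your testers need is part of the job.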

    The Re-identification Takeaway

    The point is that there’s a wide range of approaches to data de-identification and anonymization. Many will leave your organization exposed to the very real risk of re-identification, because de-identifying data well enough to fool the highly motivated is not a simple task. At all.

    The amount and complexity of data collected on individuals today make it increasingly difficult to de-identify data in a way that is comprehensive and secure, while still outputting data useful enough for software development and testing. At the same time, it has grown increasingly easier to combine anonymized datasets from independent sources, cross reference that data and re-identify individuals with an impressive degree of accuracy. It’s up to you to prevent that.

    One final question

    Why am I going on about all this stuff from the 70s? That was half a century ago. 

    The 70s was the height of the political thriller, when any average person could overcome a vast, international criminal cabal just through curiosity, perseverance and attention to detail. That’s the spirit of today’s lesson in protecting data privacy. And the need for that spirit has never been more urgent. 

    If all of this feels like too much to handle at the moment, we’re here to help. Drop us a line.

    Omed Habib
    VP Marketing
    Omed is a fake data evangelist at Tonic. When not faking data or marketing, Omed is busy geeking out on all things software development, photography and cooking (the cooking stuff is still a work in progress). Omed formerly led Product Marketing teams at AppDynamics and Harness.io, and has helped launch startups from inception to unicorn status.