Following up on our article defining data anonymization techniques, today we’ll explore the key concepts of data synthesis, and how Tonic is pairing the strongest capabilities of synthesis with the strongest of anonymization to pioneer a new approach: data mimicking.
Here are the two primary methods of data synthesis used today, along with a property to heighten the degree of privacy their outputs provide.
In rule-based data generation, users define the schema of the dataset that they want to create, and then let the system generate it (usually by randomly generating values that fit the schema). Many libraries and tools can synthesize realistic test data this way, such as the web-based Mockaroo and the open-source Python library pydbgen.
If you want to generate a synthetic dataset of students at a university, for example, you might begin by defining the various fields that you want the dataset to incorporate. This might include student names, genders, birthdates, age, addresses (school and home), email addresses, passwords, GPA, coursework, activities and memberships, etc. Each one of these fields would have an associated type: the name field would be composed of text, the age field would be a number, etc. The synthetic data generation system would then draw upon its existing resources and knowledge (such as a list of common first and last names) to create a realistic-looking dataset that obeys these rules.
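The student example above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how Mockaroo or pydbgen work internally; the name pools, the `SCHEMA` layout, the fixed `TODAY` date, and the `generate_rows` helper are all assumptions made up for this sketch.

```python
import random
from datetime import date, timedelta

# Hypothetical sample pools; a real tool would ship much larger lists.
FIRST_NAMES = ["Ava", "Liam", "Noah", "Mia", "Zoe", "Omar"]
LAST_NAMES = ["Nguyen", "Garcia", "Smith", "Khan", "Okafor", "Rossi"]
TODAY = date(2024, 9, 1)  # fixed so that runs are reproducible


def random_birthdate(rng, today):
    """Pick a birthdate that makes the student roughly 17 to 24 years old."""
    age_in_days = rng.randint(17 * 366, 25 * 365)
    return today - timedelta(days=age_in_days)


def age_on(birthdate, today):
    """Whole years elapsed between birthdate and today."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


# The schema: each field maps to a rule that produces a value of the right type.
# A rule receives the row built so far, so later fields can depend on earlier ones.
SCHEMA = {
    "name": lambda rng, row: f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
    "birthdate": lambda rng, row: random_birthdate(rng, TODAY),
    "age": lambda rng, row: age_on(row["birthdate"], TODAY),
    "email": lambda rng, row: row["name"].lower().replace(" ", ".") + "@example.edu",
    "gpa": lambda rng, row: round(rng.uniform(2.0, 4.0), 2),
}


def generate_rows(schema, n, seed=0):
    """Generate n rows by applying every rule in schema order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for field, rule in schema.items():  # dicts preserve insertion order
            row[field] = rule(rng, row)
        rows.append(row)
    return rows


students = generate_rows(SCHEMA, 5)
for student in students:
    print(student)
```

Note that the output obeys the rules but nothing more: GPA is drawn uniformly and is unrelated to any other field, which is the usual limitation of purely rule-based data.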
Machine learning architectures such as generative adversarial networks (GANs) are also capable of generating extremely high-quality synthetic data. GANs are a class of deep learning model that consists of two competing networks: a “generator” which tries to create realistic synthetic data, and a “discriminator” which tries to distinguish between the real data points and the synthetic ones.
In recent years, GANs have made tremendous strides in their capabilities and accuracy. If you need to generate synthetic photographs of people’s faces, for example, you can use a tool such as thispersondoesnotexist.com, while SDV’s open-source CTGAN model can synthesize tabular databases. Of course, training your own GAN will require a good deal of machine learning knowledge, plus a dataset that’s large enough so that the network doesn’t simply replicate the original information.
An important property that can be applied to either of the above two approaches to synthesizing data is differential privacy. Differential privacy is typically achieved by adding calibrated statistical noise to computations over a dataset, so that malicious actors cannot reliably detect the presence of a given record in the dataset. It provides a mathematical guarantee that limits what an output can reveal about any individual record, strengthening the privacy of a synthetic output dataset.
For example, given a dataset, an algorithm might compute some statistics on it (such as the dataset’s mean, variance, or mode). The algorithm is then said to be “differentially private” if, looking at the output, an observer cannot reliably tell whether a given record was included in the calculation, regardless of how typical (or atypical) that record is. Essentially, differential privacy is a way of guaranteeing an individual’s privacy when analyzing a larger dataset in aggregate.
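One common way to make a statistic such as the mean differentially private is the Laplace mechanism: clip each value to a known range, then add Laplace noise scaled to how much a single record could change the result. The sketch below is an illustration under stated assumptions, not a production implementation: it assumes the dataset size is public, that neighboring datasets differ by replacing one record, and that the `lower` and `upper` bounds are chosen without looking at the data. The function names are hypothetical, and the floating-point noise sampling is not hardened against known attacks on naive implementations.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng):
    """Epsilon-differentially private mean of values clipped to [lower, upper].

    Assumes len(values) is public and that neighboring datasets differ by
    replacing a single record.
    """
    if not values or epsilon <= 0 or upper <= lower:
        raise ValueError("need data, epsilon > 0, and lower < upper")
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)
gpas = [3.9, 2.4, 3.1, 3.6, 2.8]
true_mean = sum(gpas) / len(gpas)
private_mean = dp_mean(gpas, lower=0.0, upper=4.0, epsilon=1.0, rng=rng)
print(f"true mean: {true_mean:.2f}, private mean: {private_mean:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy. The noise scale also shrinks as the dataset grows, which is why a five-row example like this one produces a very noisy answer while a large table would not.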
Anonymization and synthesis are based on different conceptions of how to work with data safely, and each comes with its own advantages and challenges.
Mimicked data is a new concept pioneered by Tonic that combines the best aspects of data anonymization and synthesis into an integrated set of capabilities.
The goal of data mimicking is to allow developers to fine-tune the dials and achieve the balance they need between utility and privacy. This is done by providing developers with a single platform in which all of the required features work together, while focusing as heavily on realism as on privacy in the data generation techniques employed.
Data mimicking is designed specifically for creating realistic data from existing data. Moreover, it is an approach built for today’s complex data ecosystems—tens of thousands of rows and hundreds of tables, spread across multiple databases of different types. To truly mimic today’s data, you can’t work one table at a time, or even one database. You need to be able to capture relationships throughout your entire ecosystem.
Machine learning may be employed to preserve ratios, relationships, and dependencies within certain data. Differential privacy can be applied during transformations, to muffle the impact of outliers and provide mathematical guarantees of privacy. Columns can be linked and partitioned across tables or databases to mirror the data’s complexity, and consistency can be set to ensure that certain inputs always map to the same outputs regardless of their transformations.
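To make the idea of consistency concrete, here is one possible way to guarantee that the same input always maps to the same output across tables: derive the replacement value from a keyed hash of the original. This is a sketch of the general technique, not a description of how Tonic implements it; the `consistent_email` helper, the `SECRET_KEY` value, and the sample tables are hypothetical.

```python
import hashlib
import hmac


def consistent_email(real_email, secret_key):
    """Map an email to a stable fake address derived from a keyed hash."""
    normalized = real_email.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


# Hypothetical key; in practice it would be kept out of source control.
SECRET_KEY = b"replace-with-a-real-secret"

# The same person appears in two tables, with different casing.
users = [{"id": 1, "email": "Ada@University.edu"}]
tickets = [{"user_email": "ada@university.edu", "subject": "Password reset"}]

masked_users = [
    {**row, "email": consistent_email(row["email"], SECRET_KEY)} for row in users
]
masked_tickets = [
    {**row, "user_email": consistent_email(row["user_email"], SECRET_KEY)}
    for row in tickets
]

print(masked_users)
print(masked_tickets)
```

Because both tables receive the same replacement value, joins on the email column keep working in the masked data. The trade-off is that anyone holding the key can recompute the mapping for a guessed input, so the key needs the same protection as the original data.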
The best mimicking infrastructure will also flag PII as sensitive and in need of protection, and alert you to schema changes before any data leaks through to your lower environments. Your data is constantly changing; your test data should too, through the generation of refreshed datasets on demand.
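As a rough illustration of schema-change alerting, the sketch below compares the columns currently present in a database with the columns that already have a protection rule, and reports anything new. The dictionary layout, the rule names, and the `find_unprotected_columns` helper are assumptions made up for this example rather than any product's actual interface.

```python
def find_unprotected_columns(current_schema, generator_config):
    """Return sorted (table, column) pairs that have no rule configured."""
    missing = []
    for table, columns in current_schema.items():
        configured = generator_config.get(table, {})
        for column in columns:
            if column not in configured:
                missing.append((table, column))
    return sorted(missing)


# Hypothetical snapshot of the source database: table -> column -> type.
current_schema = {
    "users": {"id": "integer", "email": "text", "phone": "text"},
    "orders": {"id": "integer", "user_id": "integer", "total": "numeric"},
}

# Hypothetical rules written before the "phone" column was added.
generator_config = {
    "users": {"id": "passthrough", "email": "fake_email"},
    "orders": {"id": "passthrough", "user_id": "passthrough", "total": "noise"},
}

unprotected = find_unprotected_columns(current_schema, generator_config)
if unprotected:
    print("Blocking data generation; unreviewed columns:", unprotected)
```

Running a check like this before every refresh means a newly added column is held back until someone decides how it should be handled, instead of being copied through unchanged.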
A data mimicking platform’s focus isn’t to build one model to rule them all, but to enable its users to create a heterogeneous super model containing a multitude of smaller models within it, specific to the subset of data at hand. In this way, it allows for handling different parts of a database in different ways. Differential privacy here, consistency there, machine learning over there, etc. The system can pull in what’s needed where it’s needed or most appropriate.
By adapting concepts and methods from each of the earlier approaches, mimicked data represents an advancement of both anonymization and synthesis. It’s useful and it’s safe, which, for the purposes of testing and development, is the holy grail.
This article is an excerpt from our ebook Test Data 101. To learn more about generating quality, safe data for testing and development, download your free copy today.