What Is Data Synthesis, and Why Are We Calling It Data Mimicking?
Andrew Colombi, PhD
July 7, 2021
Following up on our article defining data anonymization techniques, today we’ll explore the key concepts of data synthesis, and how Tonic is pairing the strongest capabilities of synthesis with the strongest of anonymization to pioneer a new approach: data mimicking.
Data Synthesis, Defined
Here are the two primary methods of data synthesis used today, along with a property to heighten the degree of privacy their outputs provide.
1. Rule-Based Data Generation
In rule-based data generation, users define the schema of the dataset that they want to create and then give the system free rein to generate it (usually by randomly generating values). A variety of tools, both open source and commercial, can synthesize realistic test data this way, including Mockaroo and pydbgen.
If you want to generate a synthetic dataset of students at a university, for example, you might begin by defining the fields that you want the dataset to include: student names, genders, birthdates, ages, addresses (school and home), email addresses, passwords, GPAs, coursework, activities and memberships, and so on. Each of these fields would have an associated type: the name field would hold text, the age field a number, and so on. The synthetic data generation system would then draw on its existing resources and knowledge (such as a list of common first and last names) to create a realistic-looking dataset that obeys these rules.
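As a minimal sketch of the idea (not any particular tool’s API), a rule-based generator can be built with nothing but the standard library. The field names and value pools below are invented for illustration; a real tool ships far richer data sources:

```python
import random
import string

# Hypothetical value pools; real tools ship much larger lists.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Dijkstra", "Liskov"]

def random_student(rng: random.Random) -> dict:
    """Generate one synthetic student record from simple per-field rules."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",                        # text field
        "age": rng.randint(17, 25),                       # number field
        "email": f"{first.lower()}.{last.lower()}@example.edu",
        "gpa": round(rng.uniform(0.0, 4.0), 2),
        "password": "".join(
            rng.choices(string.ascii_letters + string.digits, k=12)
        ),
    }

rng = random.Random(42)  # seeded so the output is reproducible
students = [random_student(rng) for _ in range(3)]
for s in students:
    print(s)
```

Each rule is local to its field; the realism comes entirely from how good the value pools and per-field rules are, which is both the strength and the limitation of this approach.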
2. Deep Generative Models
Machine learning architectures such as generative adversarial networks (GANs) are also capable of generating extremely high-quality synthetic data. GANs are a class of deep learning models consisting of two competing networks: a “generator,” which tries to create realistic synthetic data, and a “discriminator,” which tries to distinguish the real data points from the synthetic ones.
In recent years, GANs have made tremendous strides in their capabilities and accuracy. If you need to generate synthetic photographs of people’s faces, for example, you can use a tool such as thispersondoesnotexist.com, while SDV’s open-source CTGAN model can synthesize tabular databases. Of course, training your own GAN will require a good deal of machine learning knowledge, plus a dataset that’s large enough so that the network doesn’t simply replicate the original information.
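To make the generator-versus-discriminator competition concrete, here is a deliberately tiny, hand-rolled sketch on a 1-D toy problem, written in plain Python with manual gradients. This is an illustration of the adversarial training loop only, not CTGAN or any production architecture; all parameters and hyperparameters are invented:

```python
import math
import random

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

rng = random.Random(0)
BATCH, LR, STEPS = 32, 0.05, 2000

# Real data comes from N(4, 1). The generator G(z) = a*z + b must learn to
# turn z ~ N(0, 1) into samples the discriminator D(x) = sigmoid(w*x + c)
# can no longer tell apart from the real ones.
a, b = 1.0, 0.0  # generator parameters
w, c = 0.0, 0.0  # discriminator parameters

for _ in range(STEPS):
    reals = [rng.gauss(4.0, 1.0) for _ in range(BATCH)]
    zs = [rng.gauss(0.0, 1.0) for _ in range(BATCH)]
    fakes = [a * z + b for z in zs]

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    gw = (sum((1 - sigmoid(w * x + c)) * x for x in reals) / BATCH
          - sum(sigmoid(w * x + c) * x for x in fakes) / BATCH)
    gc = (sum(1 - sigmoid(w * x + c) for x in reals) / BATCH
          - sum(sigmoid(w * x + c) for x in fakes) / BATCH)
    w, c = w + LR * gw, c + LR * gc

    # Generator step: ascend log D(fake) (the non-saturating objective).
    ga = sum((1 - sigmoid(w * (a * z + b) + c)) * w * z for z in zs) / BATCH
    gb = sum((1 - sigmoid(w * (a * z + b) + c)) * w for z in zs) / BATCH
    a, b = a + LR * ga, b + LR * gb

fake_mean = sum(a * rng.gauss(0.0, 1.0) + b for _ in range(1000)) / 1000
print(f"mean of generated samples: {fake_mean:.2f}")
```

With these settings the generated mean tends to drift toward the real mean of 4, because the only way the generator can fool the discriminator is to make its output distribution overlap the real one. Real GANs replace the scalar parameters with deep networks and backpropagation, but the loop is the same.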
An important property that can be applied to either of the above approaches is differential privacy. The term refers to the practice of adding carefully calibrated statistical noise so that malicious actors cannot detect the presence of any given record in the dataset. It provides a mathematical guarantee against data re-identification, hardening the security of a synthetic output dataset.
For example, an algorithm might compute some statistic over a dataset (its mean, variance, or mode, say). The algorithm is said to be “differentially private” if, looking at the output, one cannot confidently determine whether any given record was included in the calculation, regardless of how typical (or atypical) that record is. Essentially, differential privacy is a way of guaranteeing an individual’s privacy when analyzing a larger dataset in aggregate.
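The textbook mechanism behind this guarantee is simple to sketch: clamp each value to a known range, then add Laplace noise scaled to how much any one record could move the result. The dataset and epsilon below are invented for illustration; this is the standard Laplace mechanism, not Tonic’s implementation:

```python
import random

def dp_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean via the Laplace mechanism.

    Values are clamped to [lower, upper] so that replacing one record can
    shift the mean by at most (upper - lower) / n: the query's sensitivity.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # A Laplace(0, scale) sample is the difference of two exponentials.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clamped) / n + noise

rng = random.Random(7)
ages = [19, 21, 22, 20, 24, 18, 23, 22, 21, 20]  # toy dataset; true mean 21
print(dp_mean(ages, lower=17, upper=30, epsilon=1.0, rng=rng))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate answer. Outliers are exactly the records the clamping and noise are there to hide.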
Tonic’s Approach to Data Synthesis + Anonymization, and Why We’re Calling It Data Mimicking
Anonymization and synthesis are based on different conceptions of how to work with data safely, each with its own advantages and challenges:
The goal of best-in-class Anonymization is to come as close as possible to production data for high-quality testing, while also protecting private information. The process de-identifies the individuals in an existing dataset by eliminating or transforming PII and anything that could be used to reconstruct it.
In Synthesis, the priority is to generate a realistic dataset that is not so closely tied to existing individual records, with a great deal of planning and evaluation put into predefined generation rules, using the requirements of the data’s use case as a guide.
Mimicked Data: The Crossroads of Anonymization and Synthesis
Mimicked data is a new concept pioneered by Tonic that combines the best aspects of data anonymization and synthesis into an integrated set of capabilities.
1. Preserving production behavior
The goal of data mimicking is to let developers fine-tune the dials and strike the balance they need between utility and privacy. This is done by providing a single platform in which all of the required features work together, with data generation techniques that focus as heavily on realism as on privacy.
2. Hundreds of databases, thousands of tables, PBs of data
Data mimicking is designed specifically for creating realistic data from existing data. Moreover, it is an approach built for today’s complex data ecosystems: thousands of tables spread across hundreds of databases of different types, holding petabytes of data. To truly mimic today’s data, you can’t work one table, or even one database, at a time. You need to be able to capture relationships throughout your entire ecosystem.
3. Infinite scenarios
Machine learning may be employed to preserve ratios, relationships, and dependencies within certain data. Differential privacy can be applied during transformations, to muffle the impact of outliers and provide mathematical guarantees of privacy. Columns can be linked and partitioned across tables or databases to mirror the data’s complexity, and consistency can be set to ensure that certain inputs always map to the same outputs regardless of their transformations.
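Consistency in particular is straightforward to sketch: derive each output deterministically from the input plus a secret key, so that the same input always maps to the same output across tables, databases, and runs. The HMAC-based pseudonym function below is an illustrative pattern, not Tonic’s actual transformation; the key and value pools are invented:

```python
import hmac
import hashlib

SECRET_KEY = b"rotate-me-per-environment"  # hypothetical per-run secret

def consistent_pseudonym(value: str, choices: list) -> str:
    """Map a value to one of `choices`, deterministically under the key.

    Equal inputs always produce equal outputs, so joins across tables and
    databases keep working; rotating the key changes every mapping at once.
    """
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(choices)
    return choices[index]

cities = ["Springfield", "Riverton", "Fairview", "Brookside"]
print(consistent_pseudonym("Chicago", cities))
print(consistent_pseudonym("Chicago", cities))  # same output: consistent
print(consistent_pseudonym("Boston", cities))
```

Because the mapping flows through a keyed hash rather than a lookup table, no state needs to be shared between the databases being transformed, yet referential integrity is preserved everywhere the same value appears.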
The best mimicking infrastructure will also flag PII as sensitive and in need of protection, and alert you to schema changes before any data leaks through to your lower environments. Your data is constantly changing; your test data should too, through the generation of refreshed datasets on demand.
A data mimicking platform’s focus isn’t to build one model to rule them all, but to enable its users to create a heterogeneous super model containing a multitude of smaller models within it, specific to the subset of data at hand. In this way, it allows for handling different parts of a database in different ways. Differential privacy here, consistency there, machine learning over there, etc. The system can pull in what’s needed where it’s needed or most appropriate.
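One hypothetical way to picture that per-column dispatch is a plain mapping from column names to transformation strategies. The column names and strategy functions here are invented stand-ins for real generators:

```python
import random

rng = random.Random(1)

# Invented strategies standing in for real transformation techniques.
def redact(value):
    return "***"

def shuffle_digits(value):
    """Format-preserving scramble: same characters, new order."""
    chars = list(str(value))
    return "".join(rng.sample(chars, len(chars)))

def passthrough(value):
    return value

# Hypothetical per-column plan: a different technique for each column.
PLAN = {
    "ssn": redact,               # outright removal for direct identifiers
    "phone": shuffle_digits,     # scrambled but shaped like the original
    "signup_date": passthrough,  # non-sensitive, kept as-is
}

def mimic_row(row: dict) -> dict:
    """Apply each column's chosen strategy, defaulting to passthrough."""
    return {col: PLAN.get(col, passthrough)(val) for col, val in row.items()}

row = {"ssn": "123-45-6789", "phone": "5551234", "signup_date": "2021-07-07"}
print(mimic_row(row))
```

The point of the sketch is the shape, not the strategies: each column gets the model or transformation appropriate to it, and the dispatcher composes them into one output dataset.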
By adapting concepts and methods from each of the earlier approaches, mimicked data represents an advancement of both anonymization and synthesis. It’s useful and it’s safe, which, for the purposes of testing and development, is the holy grail.
This article is an excerpt from our ebook Test Data 101. To learn more about generating quality, safe data for testing and development, download your free copy today.
Andrew Colombi, PhD
Co-Founder & CTO
Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later started Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and to supply development teams with the resource-saving tools they most need.