Strong data governance is not only about security; it is also about accessibility and accuracy. While your customers trust you to keep their information secure, your data science teams need this valuable yet sensitive information to build and validate new models.
Tonic Data Science Mode (DSM) uses the latest in generative AI to produce data that is statistically similar enough to your production data for your Data Scientists and ML Engineers to build with and push your company’s goals forward. At the same time, synthetic data produced in DSM is distinct enough from your customers’ identifiable information to meet your data governance standards and protect your customers’ privacy.
DSM’s generative models are built on deep neural networks that train on your source data to accurately approximate its distributions. The variational autoencoders used to build these generative models are regularized to prevent memorization of the dataset. What you get is a completely new dataset distinct from your source data, but with the same statistical properties.
More training iterations will generally result in data that is more similar to the real data, compromising the security of the synthetic data. This poses a natural challenge: the higher the capacity of your synthetic data to behave like your production data, the lower the capacity of the synthetic dataset to protect the identities of your customers.
In this post, we show how you can produce data in DSM that performs comparably to the original data on a classification task while safeguarding the privacy of the original dataset.
Investigating the Privacy and Utility of Synthetic Data
We conduct an empirical analysis of the privacy-utility trade-off using the UCI Adult Dataset. The data contains 15 columns of information on people’s employment, education, socioeconomic status, finances, and demographics such as marital status, sex, and race. It comes split into a training set of 32,561 records and a held-out test set of 16,281 records. This data is commonly used for benchmarks of synthetic tabular data.
To investigate the effect of increasing training iterations on the utility and the privacy of the generated data we train several models in DSM for varying numbers of epochs.
To measure the utility of the synthetic data generated in DSM, we examine the performance of classification models trained to predict whether a person’s annual income exceeds $50,000 from their demographic characteristics. We fit the models to both the original dataset and the synthetic datasets, then calculate ROC AUC scores on the held-out test set to evaluate these binary classifiers. To assess utility, we compare the score of the model built with data from each DSM configuration against the score of the model built with the original dataset.
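As a rough illustration of this comparison, the sketch below computes ROC AUC from held-out labels and predicted scores using the rank-sum identity, then reports the gap between a model trained on real data and one trained on synthetic data. The function name and all label and score values are hypothetical placeholders, not outputs of DSM or of the models used in this analysis.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 ground-truth values from the held-out test set.
    scores: predicted scores for the positive class, in the same order.
    Tied scores receive the average rank of their tie group.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical held-out labels and scores from two classifiers:
# one fit on the real training set, one fit on a synthetic dataset.
test_labels = [0, 0, 1, 1, 0, 1]
scores_real_model = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9]
scores_synth_model = [0.2, 0.6, 0.7, 0.5, 0.1, 0.9]

auc_real = roc_auc(test_labels, scores_real_model)
auc_synth = roc_auc(test_labels, scores_synth_model)
print(f"real: {auc_real:.3f}  synthetic: {auc_synth:.3f}  gap: {auc_real - auc_synth:.3f}")
```

Both models are scored on the same real held-out test set, so the gap reflects only the difference in training data.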
When we refer to the “privacy” of synthetic data, what we are really concerned about is the ability of an attacker to re-identify individuals from our synthetic data. If our synthetic data points are too similar to the real data, we are giving this hypothetical attacker an easy job.
To quantify how similar our synthetic data is to the original data, we use a distance to closest record (DCR) calculation. First, for each data point in the original dataset, we calculate the Euclidean distance to its closest other real record. Next, we calculate the distance from each synthetic data point to its closest real data point, as illustrated below.
The greater the synthetic-real DCR, the more distinct the synthetic data point is from its closest real data point, the more difficult re-identification would be, and the more “private” the data point is. Conversely, a synthetic-real DCR at or close to 0 means that the synthetic data point is effectively a copy of the real data point and offers little to no privacy protection.
The challenge with interpreting synthetic-real DCRs is that, by design, synthetic data points lie within the same range as the original data, making it more likely for them to be relatively close to a real record. We therefore compare the distribution of distances from synthetic records to real records with the distribution of distances between real records themselves. If the synthetic data is further from the real data than the real data is from itself, the synthetic data is more private than a copy or a simple perturbation of the real data.
Below are two histograms showing the DCR distributions of data generated in DSM by models trained for 1,500 epochs and for one epoch. The data from the one-epoch model has about 11,000 fewer data points within 0-0.05 of the closest real record than the real-real DCR distribution does. Further, this model produces more records at longer distances from real records than the real records are from each other. The 1,500-epoch model, by comparison, has more synthetic records on the low end of the DCR scale and tracks the DCR distribution of the real data more closely, so we conclude that the data from the one-epoch model is more private.
The function to calculate this metric is found in the Tonic reporting library. For the purposes of our empirical analysis, we summarize this metric for each dataset by taking the median DCR.
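The Tonic reporting library function is not reproduced here. As a minimal, independent sketch of the calculation described above, the code below computes real-real and synthetic-real DCRs by brute force and summarizes each with its median. It assumes every record has already been encoded as a numeric vector on a comparable scale (the encoding and scaling are assumptions, not details from this post), and the function names and toy records are hypothetical.

```python
import math
from statistics import median


def dcr_real_to_real(real):
    """For each real record, the Euclidean distance to its closest *other* real record."""
    return [
        min(math.dist(r, other) for j, other in enumerate(real) if j != i)
        for i, r in enumerate(real)
    ]


def dcr_synth_to_real(synth, real):
    """For each synthetic record, the Euclidean distance to its closest real record."""
    return [min(math.dist(s, r) for r in real) for s in synth]


# Toy, already-scaled records (placeholders, not the Adult dataset).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synth = [(0.0, 0.0), (0.5, 0.5)]  # the first synthetic record copies a real one

real_dcr = dcr_real_to_real(real)
synth_dcr = dcr_synth_to_real(synth, real)

print("median real-real DCR:", median(real_dcr))
print("median synthetic-real DCR:", median(synth_dcr))
```

The brute-force search compares every pair of records, which is fine for a toy example; a dataset the size of the Adult training set would likely call for a nearest-neighbor index instead.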
We sample data from 50 DSM models trained for 1-50 epochs. Each model is trained 20 times to control for randomness in the neural networks.
We find an initial drop-off in synthetic-real median DCRs within the first ten epochs. As the DSM models train from one to five epochs, the distance between the generated data points and their closest real data points decreases drastically; the decline slows between five and ten epochs and then levels out. The IQR of these values at each epoch is very narrow, indicating little variation from run to run.
The ROC AUC scores respond sharply to increasing epochs even earlier than the median DCRs do, with an initial spike in scores around three epochs. After about eight epochs, the scores level out as well. We find more variability in scores from run to run than in median DCRs, as represented by the IQR cloud in the above graph.
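The per-epoch summaries used for both metrics, a median across repeated runs plus an IQR to show run-to-run spread, can be sketched as follows. The function name and the values are made-up placeholders for illustration, not results from this analysis, and the choice of quantile method is an assumption.

```python
from statistics import median, quantiles


def summarize_runs(values_by_epoch):
    """Collapse repeated runs into a median and an interquartile range per epoch.

    values_by_epoch maps an epoch count to the list of metric values
    (for example median DCRs or ROC AUC scores) from repeated training runs.
    """
    summary = {}
    for epoch, values in sorted(values_by_epoch.items()):
        # "inclusive" treats the runs as the full population of observed values.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[epoch] = {"median": median(values), "iqr": q3 - q1}
    return summary


# Placeholder values only: five pretend runs at two epoch settings.
placeholder_runs = {
    1: [1.0, 2.0, 3.0, 4.0, 5.0],
    5: [2.0, 2.0, 2.5, 3.0, 3.0],
}

for epoch, stats in summarize_runs(placeholder_runs).items():
    print(epoch, stats)
```

A narrow IQR at a given epoch count means the repeated runs agree closely, which is how the run-to-run variation of median DCR and ROC AUC is compared above.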
Because median DCR and ROC AUC values seem to level out at such a low number of epochs, we test if this trend holds with larger numbers of iterations. To check out the trends on the upper end of the epoch scale we train DSM models on the census data for 1, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, and 1500 epochs.
After 150 epochs, the increase in data utility slows down, as does the decrease in data privacy. The ROC AUC of the synthetic data approaches its real baseline, but the leveling off of this metric suggests that changes to the model architecture, such as adding layers or features, would be necessary to improve utility further. The median DCR remains above the baseline median DCR no matter how many epochs the model trains for, although increasing the model capacity as suggested above could narrow this gap.
Allowing the DSM models to train for longer on source data produces higher-utility data, while the privacy of the data, measured by the median synthetic-real DCR, decreases but, critically, never reaches the median real-real DCR.
The picture we paint here suggests that our hypothesis is not necessarily true. Given a fixed model architecture, we expected that the longer a DSM model trains, the more similar its output would be to the original dataset as measured by median DCR (and thus the less private). Instead, the median DCR levels off above the real-real baseline. We do demonstrate evidence of a utility-privacy tradeoff here, one that would be more interesting to explore with changing model architectures as well.
How Tonic Data Science Mode can optimize the utility-privacy tradeoff of synthetic data
Using synthetic data can be a great way to democratize the data in your org without compromising your values or responsibilities to your customers when it comes to maintaining the privacy of their data.
DSM in Tonic uses deep learning technology to learn the unique statistical structure of your data and generate synthetic datasets that perform comparably to your production data. Allowing DSM models to train for longer on source data produces higher-utility data, while the privacy of the data, measured by the median synthetic-real DCR, decreases but, critically, never reaches the median real-real DCR. These results are promising as companies continue to look for solutions to the increasing need for stricter data governance policies. Further investigation into the impact of model architecture on this tradeoff would bring more insight into how DSM models balance the privacy of the original data with the utility of the generated dataset.
Sign up for a free trial of Tonic to see how easy it is to democratize your data yourself.