A Comprehensive Guide to Data Validation and ML Model Training with AI Generated Data from Tonic
April 18, 2023
TLDR: Tonic's Data Science Mode unlocks data access for data science teams struggling to build within the bounds of data governance policies. We use generative AI technology to replicate the nuanced patterns in your real datasets, allowing you to build machine learning models that perform just like models trained on your real data. But don’t take our word for it—start your Tonic free trial and select "Machine Learning / Data Science" as your use case to start generating data and follow the steps laid out in this blog post to validate the data. You’ll walk away confident that the data you generate in Tonic Data Science Mode can train machine learning models ready to be pushed to production.
Democratize your data with Tonic
You’re a Senior Data Scientist at a government agency and want to assign a Junior Data Scientist on your team a project looking at predicting citizens’ income brackets given their demographic characteristics. There’s just one problem—due to the data governance policies of the org, your Junior Data Scientist doesn’t have clearance to work with that data.
You don’t have time to go through the long process of getting approval for them to obtain access, and what if another member of your team soon needs access to the data as well? What do you do?
Enter synthetic data from Tonic Data Science Mode (DSM).
We designed DSM using generative AI technology to break down these barriers to productivity, so that data scientists can seamlessly build machine learning models on high-fidelity synthetic data. Synthetic data can be used to address data scarcity, work within privacy-protecting restrictions on data access, and even de-bias datasets.
But can you really replace your real training data with synthetic data and get a useful model?
In this post, we show how to use DSM to generate data, how to validate the quality of data from DSM, and how easy it is to use this data to build rigorous machine learning models.
Generate data in Tonic Data Science Mode
We will walk through the DSM data generation workflow in Tonic step-by-step using data from the UCI Adult Census dataset. This dataset contains demographic information on adults from the 1994 U.S. census.
The following steps allow you to build AI-powered generative models in DSM and produce high-quality data, all within the Tonic UI:
1. Create a workspace for the project.
2. Configure the workspace, being sure to enable "Data Science Modeling", and connect your database or upload a CSV.
3. Create a new model and query the schema for the specific data you want to generate.
4. Set the parameters for the generative model.
5. Train the model on the queried data.
6. Pull the synthetic data into your Jupyter notebook (or whatever development environment you use) to build with.
It really is this simple!
Validate synthetic data from Tonic Data Science Mode
We are, of course, confident in the quality of the synthetic data generated in DSM, but you shouldn’t take our word for it. Validating the fidelity of synthetic data is an essential step before replacing your real training data with it.
To validate the UCI Adult Census data we generated in DSM, we use the following process:
1. Compare the real and synthetic feature distributions.
2. Compare the real and synthetic inter-feature relationships.
3. Compare the performance of a classification model trained on the synthetic data with a classification model trained on the real data.
4. Tune the DSM model.
5. Iterate by tuning the parameters of the DSM model until we are happy with our synthetic data’s performance.
Voila! We are ready to build with our new, validated synthetic data.
The best part is the longevity of the generative model we build in DSM: once we’re done tuning and validating, we can sample as much synthetic data as we need for future projects. As with all of Tonic's features, DSM will keep up with all database and schema changes, so we don’t have to worry about updating the training data.
Steps 1 & 2: Distributions and relationships
First we pull in data from DSM:
We request more rows than we know the training set (the real data) contains to make sure we get every row. Then we pull in exactly as many synthetic data points as real data points.
We’ve made it easy to assess the distributions of categorical and numeric data with the Tonic Reporting library. Here, we plot bar graphs comparing the distributions of the real and synthetic categorical features with two simple lines of code:
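For readers who want to reproduce this kind of check without the reporting library, here is a minimal sketch in plain pandas. It does not use the Tonic Reporting API: the helper `compare_categorical`, the `workclass` column, and the randomly generated stand-in data are all assumptions made for illustration. It tabulates the share of each category in the real and synthetic columns and summarizes the gap as a total variation distance.

```python
import numpy as np
import pandas as pd


def compare_categorical(real: pd.Series, synth: pd.Series) -> pd.DataFrame:
    """Side-by-side category proportions for one feature (hypothetical helper)."""
    table = pd.DataFrame({
        "real": real.value_counts(normalize=True),
        "synthetic": synth.value_counts(normalize=True),
    }).fillna(0.0)  # a category missing from one side counts as 0
    table["abs_diff"] = (table["real"] - table["synthetic"]).abs()
    return table.sort_index()


# Made-up stand-in data; in practice these would be the real column and
# the matching column sampled from the generative model.
rng = np.random.default_rng(0)
categories = ["Gov", "Private", "Self-emp"]
real = pd.Series(rng.choice(categories, size=5000, p=[0.10, 0.70, 0.20]), name="workclass")
synth = pd.Series(rng.choice(categories, size=5000, p=[0.11, 0.68, 0.21]), name="workclass")

report = compare_categorical(real, synth)
print(report)

# Total variation distance: 0 means identical distributions, 1 means disjoint.
tvd = 0.5 * report["abs_diff"].sum()
print(f"total variation distance: {tvd:.3f}")
```

Calling `report[["real", "synthetic"]].plot.bar()` would turn the same table into a grouped bar chart if matplotlib is installed.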
To validate that the relationships between categorical features in the synthetic data reflect those in the real data, we plot stratified frequency distributions of each categorical feature pair. As these plots can get large depending on the number of categories in each feature, we have chosen to just plot sex and race stratified by income below.
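A stratified frequency table of this kind can be sketched with `pandas.crosstab`. The helper `stratified_freq`, the `fake_census` generator, and its probabilities are hypothetical stand-ins rather than the census data or the Tonic tooling. The sketch computes the share of each sex within each income bracket for both datasets, then reports the largest gap between the two tables.

```python
import numpy as np
import pandas as pd


def stratified_freq(df: pd.DataFrame, feature: str, by: str) -> pd.DataFrame:
    """Share of each `feature` category within each level of `by`."""
    return pd.crosstab(df[by], df[feature], normalize="index")


rng = np.random.default_rng(1)


def fake_census(n: int, p_high_male: float, p_high_female: float) -> pd.DataFrame:
    """Made-up data with a sex/income dependence (illustration only)."""
    sex = rng.choice(["Male", "Female"], size=n, p=[0.67, 0.33])
    p_high = np.where(sex == "Male", p_high_male, p_high_female)
    income = np.where(rng.random(n) < p_high, ">50K", "<=50K")
    return pd.DataFrame({"sex": sex, "income": income})


real_df = fake_census(8000, 0.30, 0.11)
synth_df = fake_census(8000, 0.29, 0.12)

real_tab = stratified_freq(real_df, "sex", "income")
synth_tab = stratified_freq(synth_df, "sex", "income")

# Largest absolute difference across all (income, sex) cells.
max_gap = float((real_tab - synth_tab).abs().max().max())
print(real_tab, synth_tab, sep="\n\n")
print(f"largest cell gap: {max_gap:.3f}")
```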
The synthetic categorical feature distributions closely mirror those of the real data, differing only by slight margins. The stratified frequencies of sex by income and of race by income also track closely between the real and synthetic data. The other categorical feature pair plots show the same agreement; we omit them here because of their size.
We have demonstrated that DSM is able to replicate both the distributions and interrelationships of categorical features in the real dataset.
For the numeric features, we again use the reporting library to quickly compare histograms of the distributions of numeric features between the real and synthetic data.
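The same comparison can be sketched with NumPy alone. The key detail is to bin both samples on shared bin edges so the bars are directly comparable. The helper `histogram_gap` and the simulated `age` values are assumptions for illustration, not the reporting library's implementation.

```python
import numpy as np


def histogram_gap(real, synth, bins=20):
    """Bin two samples on shared edges; return (edges, real_share, synth_share)."""
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    real_hist, _ = np.histogram(real, bins=edges)
    synth_hist, _ = np.histogram(synth, bins=edges)
    # Convert counts to shares so datasets of different sizes stay comparable.
    return edges, real_hist / real_hist.sum(), synth_hist / synth_hist.sum()


# Made-up stand-in for a numeric column such as age.
rng = np.random.default_rng(2)
real_age = rng.normal(38.5, 13.5, size=10000).clip(17, 90)
synth_age = rng.normal(38.0, 13.8, size=10000).clip(17, 90)

edges, real_share, synth_share = histogram_gap(real_age, synth_age)
max_gap = float(np.abs(real_share - synth_share).max())
print(f"largest per-bin gap: {max_gap:.3f}")
```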
The Tonic Reporting library also has a function to easily analyze how well DSM captures the interrelationships between numeric features.
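As a library-agnostic sketch of that analysis, the correlation matrices of the real and synthetic numeric columns can be subtracted directly in pandas. The helper `correlation_gap`, the column names, and the simulated data are hypothetical and stand in for the reporting library and the census data.

```python
import numpy as np
import pandas as pd


def correlation_gap(real_df: pd.DataFrame, synth_df: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between the two Pearson correlation matrices."""
    cols = real_df.select_dtypes("number").columns
    return (real_df[cols].corr() - synth_df[cols].corr()).abs()


rng = np.random.default_rng(3)


def fake_numeric(n: int, rho: float) -> pd.DataFrame:
    """Two made-up numeric features with correlation close to `rho`."""
    cov = [[1.0, rho], [rho, 1.0]]
    data = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return pd.DataFrame(data, columns=["age", "hours_per_week"])


real_df = fake_numeric(5000, 0.30)
synth_df = fake_numeric(5000, 0.27)

gap = correlation_gap(real_df, synth_df)
print(gap.round(3))
```

A heatmap of `gap` that is near zero everywhere says the same thing as two nearly identical correlation heatmaps placed side by side.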
Not only was DSM able to replicate the distributions of the categorical features, but it was also able to do so for the numeric features with only slim margins of error. The correlation heatmaps also show almost identical correlations between numeric features. DSM's models did a great job of capturing the distributions and relationships between numeric features.
This is all well and good, but what about the complex relationships between all features—categorical and numeric? Did DSM do a good enough job at capturing the small eccentricities of the dataset for the synthetic data to perform the same as the real data when training a classifier?
Step 3: Model performance
Since our end goal is to replace the real data with synthetic data to train a machine learning model, a key step in validation is to compare the performance of a model trained on the real data to the performance of a model trained on the synthetic data.
We build a model using the UCI Adult Census dataset to predict whether a person makes more than or less than $50,000 a year based on their demographic characteristics. In addition to the UCI dataset, we look at two other datasets:
Telco - This dataset from Kaggle contains numeric and categorical data points from customers of a fictional telecom company. With this dataset we predict whether or not a customer churns.
Titanic - This dataset, also from Kaggle, contains numeric and categorical data points from passengers on the Titanic to predict which passengers survive.
To observe the margin of error associated with variability from the random effects of splitting the data and the randomness inherent to DSM’s neural networks, we execute 10 different 70:30 train-test splits of the original data. We then train 10 different DSM models on the 10 train sets, and generate 10 synthetic sets with the same number of rows as the 10 training sets.
Using the 10 sets of real data and the 10 synthetic datasets, we train four different classifier architectures: K-Nearest Neighbors, Logistic Regression, CatBoost, and Support Vector Classifier. We compare the distribution of performance of these models trained on the real and synthetic datasets by calculating the area under the ROC curve (also known as the ROC AUC score) on the corresponding 10 real held-out test sets. We use the ROC AUC score as the measure of the models’ performance because it is independent of the choice of any decision threshold for the classification models, which makes it a robust metric for a binary classification problem.
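The protocol above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the code behind our results: `fake_synthesizer` is a hypothetical stand-in for training a DSM model and sampling from it (it simply resamples rows and adds noise), the dataset comes from `make_classification`, and only two of the four classifiers are included to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fake_synthesizer(X, y, rng):
    """HYPOTHETICAL stand-in for a generative model: bootstrap rows, add noise."""
    idx = rng.integers(0, len(X), size=len(X))  # same row count as the train set
    return X[idx] + rng.normal(0.0, 0.1, size=X.shape), y[idx]


# Made-up binary classification data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)

models = {
    "knn": lambda: KNeighborsClassifier(),
    "logreg": lambda: LogisticRegression(max_iter=1000),
}
scores = {(name, kind): [] for name in models for kind in ("real", "synthetic")}

for seed in range(10):  # 10 different 70:30 train-test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    X_syn, y_syn = fake_synthesizer(X_tr, y_tr, np.random.default_rng(seed))
    train_sets = {"real": (X_tr, y_tr), "synthetic": (X_syn, y_syn)}
    for name, make_model in models.items():
        for kind, (X_fit, y_fit) in train_sets.items():
            clf = make_model().fit(X_fit, y_fit)
            # Always score on the REAL held-out test set.
            proba = clf.predict_proba(X_te)[:, 1]
            scores[(name, kind)].append(roc_auc_score(y_te, proba))

medians = {key: float(np.median(vals)) for key, vals in scores.items()}
for (name, kind), value in sorted(medians.items()):
    print(f"{name:7s} {kind:10s} median ROC AUC = {value:.3f}")
```

Scoring both variants on the same real test set is the important design choice: it measures how well a model trained on synthetic data would do on the data it will actually see.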
Models trained on data generated in DSM perform similarly to models trained on the real data for each classifier and each dataset. The variance in scores differs by dataset, possibly due to the size of each dataset. The Titanic dataset is the smallest with 891 rows and the Telco dataset the second smallest with 5,274 rows, while the UCI Adult Census dataset has 32,561 rows. When there are more data points to choose from, resampling the train-test split leads to less variation and thus narrower scoring margins.
We expect that the synthetic data won’t perform quite as well as the real data for these classification tasks. However, the margins between the median ROC AUC scores of the real and synthetic data are quite narrow for each model on each dataset. Given how small these margins are, we can be confident in replacing our real data with data generated using DSM.
Modeling with synthetic data
Now that the synthetic data is validated, we’re ready to replace the real data with the synthetic and build our model.
Here, we use data generated in DSM trained on the UCI Adult Census dataset to build a CatBoost model.
Your Junior Data Scientist can now build with safe and secure synthetic data to deliver a model that performs nearly as well as one trained on real data.
You can trust data from Tonic's Data Science Mode to train your models
Synthetic data produced in DSM trains machine learning models with nearly the same level of performance as your real data. Adjusting the parameters of your models in DSM under the advanced settings can further improve the fidelity of the synthetic data.
So what’s the takeaway? Staying skeptical of synthetic data and using only data you have validated is what gets you reliable results. Once you have run these checks for your own use case, you can be confident in the synthetic data you generate in Tonic's Data Science Mode.
Driven by a passion for promoting game changing technologies, Madelyn creates mission-focused content as a Product Marketing Associate at Tonic. With a background in Data Science, she recognizes the importance of community for developers and creates content to galvanize that community.