We recently released a new workflow to streamline data synthesis for data scientists: Data Science Mode (DSM), which you can access by creating a free Tonic account. The newest feature in DSM is conditional sampling, which gives you fine-grained control over the distributions in your synthetic data, so you can de-bias or augment your existing data.
As a running example, let’s consider financial transactions for users at a bank (e.g., users represented in the Czech banking dataset). The dynamics of these transactions are informed by many variables (general trends, user characteristics, animal spirits), and as we’ve seen in previous posts, our models accurately reproduce the dynamics and statistics of these transactional data.
In this post, we show how the user characteristics influence the data, and how this can be adjusted using conditional sampling within DSM.
One of the user characteristics captured in this database is whether or not the account has a credit card. This attribute influences the characteristics of the transactions: the typical amount, balance, frequency, etc.
For example, we can see that the distributions of amount and balance are clearly distinct for the subpopulations of accounts with and without credit cards, yet it is difficult to infer whether an account has a credit card from these values alone.
The first step is to specify our data, which we do via the following SQL query:
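The exact query depends on your schema. As a purely illustrative sketch, the snippet below runs a query of this general shape against a tiny in-memory SQLite database. The table names (`transactions`, `cards`), the column names, and the rows are all assumptions made up for this example, not the schema or data used in this post.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (account_id INTEGER, date TEXT, amount REAL, balance REAL);
    CREATE TABLE cards (account_id INTEGER);
    INSERT INTO transactions VALUES
        (1, '1997-01-05', 120.0, 1500.0),
        (1, '1997-01-19', -40.0, 1460.0),
        (2, '1997-01-07', 300.0, 900.0);
    INSERT INTO cards VALUES (1);
""")

# One row per transaction, plus an account-level flag that can later be
# used as a conditional column.
QUERY = """
    SELECT t.account_id,
           t.date,
           t.amount,
           t.balance,
           EXISTS (SELECT 1 FROM cards c WHERE c.account_id = t.account_id) AS has_card
    FROM transactions t
    ORDER BY t.account_id, t.date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is only that the account-level attribute (`has_card` here) is repeated on every transaction row for that account, so it is available as a column to condition on.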
In general, you can condition on any number of columns. For instance, you may want to condition on demographic characteristics of users captured in your data, so that you can generate synthetic populations drawn from different distributions.
Selecting columns to condition on changes the model architecture of our VAE, essentially disentangling this condition from the model’s latent space. Crucially, we can now specify the condition when generating samples. Using a conditional model for conditional sampling has two advantages over an algorithm like rejection sampling: it is significantly faster, and it is guaranteed to generate synthetic samples with any specified values in the conditional columns. Rejection sampling, on the other hand, can be very slow and may not generate samples for rare specified values in the conditional columns.
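To make the contrast with rejection sampling concrete, here is a toy sketch. It is not the VAE described above: the generator is a made-up stand-in in which 5% of accounts have a credit card, and every name and number is an assumption chosen for illustration. The conditional sampler takes the condition as an input and needs one draw per sample, while rejection sampling has to keep drawing from the unconditional model until enough samples happen to match.

```python
import random

P_HAS_CARD = 0.05  # assumed: card holders are rare in this toy population


def sample_amount(has_card, rng):
    # Toy stand-in for a decoder: the amount distribution depends on the condition.
    mean = 500.0 if has_card else 200.0
    return rng.gauss(mean, 50.0)


def sample_unconditional(rng):
    # The condition is drawn by the model itself and cannot be chosen.
    has_card = rng.random() < P_HAS_CARD
    return has_card, sample_amount(has_card, rng)


def sample_conditional(has_card, n, rng):
    # The condition is an input, so every draw is usable.
    return [(has_card, sample_amount(has_card, rng)) for _ in range(n)]


def rejection_sample(has_card, n, rng):
    # Draw unconditionally and keep only the samples that match the condition.
    accepted, draws = [], 0
    while len(accepted) < n:
        draws += 1
        candidate = sample_unconditional(rng)
        if candidate[0] == has_card:
            accepted.append(candidate)
    return accepted, draws


rng = random.Random(0)
conditional = sample_conditional(True, 100, rng)
rejected, draws = rejection_sample(True, 100, rng)
print(f"conditional draws: {len(conditional)}, rejection draws: {draws}")
```

With a 5% match rate, rejection sampling needs on the order of 20 draws per accepted sample, and the cost grows as the requested condition becomes rarer.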
After the model is trained, we access it and generate samples through our Python SDK, tonic-api. We access the model via the following simple code snippet.
Let’s see how to do conditional sampling for the model trained on our example dataset. To sample conditionally from the model, we specify a conditions dataframe with the values of the conditional columns we are interested in. For instance, let’s say we want to generate a sequence of transactions for 3 accounts where 2 of the accounts have a credit card and the other account does not. To do this, we execute the following code snippet:
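As a minimal sketch of what such a conditions dataframe could look like, the pandas snippet below builds one row per account to generate. The column name `has_card` is a hypothetical placeholder; use the name of whichever column you selected as a condition. The sampling call on the trained model is left out here, since its signature is not given in this text.

```python
import pandas as pd

# "has_card" is a hypothetical name for the conditional column.
# Each row describes one account for which a transaction sequence is requested.
conditions = pd.DataFrame({"has_card": [True, True, False]})

print(conditions)
print(len(conditions), "accounts,", int(conditions["has_card"].sum()), "with a credit card")
```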
Let’s say we want to compare data generated by the model to the real data output by the SQL query. The following code snippet gets 100,000 rows of the real data, isolates the conditions for each account in the real data, and then uses conditional sampling to generate a sequence of transactions for each set of conditions in the real data.
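The "isolate the conditions for each account" step can be sketched in pandas as follows. The toy `real_df` and its column names (`account_id`, `amount`, `has_card`) are assumptions for illustration; the real data would come from the SQL query, and the sampling call itself is not shown.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are assumptions.
real_df = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2, 3],
    "amount": [120.0, -40.0, 15.0, 300.0, -20.0, 75.0],
    "has_card": [True, True, True, False, False, True],
})

conditional_columns = ["has_card"]

# The condition is an account-level attribute, so it is constant within each
# account; taking the first row per account yields one condition row per account.
conditions = (
    real_df.groupby("account_id", sort=True)[conditional_columns]
    .first()
    .reset_index(drop=True)
)
print(conditions)
```

The resulting dataframe has one row per real account, so passing it as the conditions reproduces the real mix of accounts with and without credit cards.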
In this code snippet, we get a dataframe of real data, real_df, and a dataframe of synthetic data, synth_df, where each dataframe has the same number of accounts and the same distribution of accounts with credit cards.
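One way to sanity-check that property is to summarize both dataframes at the account level. The sketch below uses toy data and assumes that both dataframes carry an account identifier column (`account_id`) and the conditional column (`has_card`); both names, and the idea that the synthetic output includes an account identifier, are assumptions for illustration.

```python
import pandas as pd


def account_summary(df, id_col="account_id", cond_col="has_card"):
    # Collapse to one value per account, then report the number of accounts
    # and the share of accounts that have a credit card.
    per_account = df.groupby(id_col)[cond_col].first()
    return len(per_account), float(per_account.mean())


# Toy stand-ins; the account ids need not match between real and synthetic data.
real_df = pd.DataFrame({
    "account_id": [1, 1, 2, 3, 3],
    "has_card": [True, True, False, True, True],
})
synth_df = pd.DataFrame({
    "account_id": [0, 0, 0, 1, 2, 2],
    "has_card": [True, True, True, False, True, True],
})

print("real: ", account_summary(real_df))
print("synth:", account_summary(synth_df))
```

The two dataframes can differ in row count, because the generated sequences need not have the same lengths as the real ones; what should match is the account-level summary.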
And there you have it! This has been a breakdown of how to use conditional sampling in Tonic.ai’s new tool for the data science use case, DSM. If you need a powerful way to generate synthetic data with control over the distribution of a subset of columns of your data, look no further!
Interested in trying it out? Create a free Tonic account and try DSM out on your own datasets.