
Improving Synthetic Data Distributions with Conditional Sampling | Data Synthesis for Data Science

Joe Ferrara, PhD
November 14, 2022
We recently released a new workflow to streamline data synthesis for data scientists: Data Science Mode (DSM), which you can access by creating a free Tonic account. Its newest feature, conditional sampling, gives you fine-grained control over the distributions in your synthetic data, allowing you to de-bias or augment your existing data.

As a running example, let's consider financial transactions for users at a bank (e.g., the users represented in the Czech banking dataset). The dynamics of these transactions are shaped by many variables: general trends, user characteristics, animal spirits. As we've seen in previous posts, our models accurately reproduce the dynamics and statistics of these transactional data.

    In this post, we show how the user characteristics influence the data, and how this can be adjusted using conditional sampling within DSM.

    Example: Transactions history and credit cards

    One of the user characteristics captured in this database is whether or not the account has a credit card. This attribute influences the characteristics of the transactions: the typical amount, balance, frequency, etc. 

For example, the distributions of amount and balance are clearly distinct for the subpopulations of accounts with and without credit cards, but at the same time it is difficult to estimate the account characteristic solely from these values.

    Fig: Distributions of transaction balance and amounts, grouped by has_cred_card

    In order to assess the relationship between the has_cred_card attribute and overall sequence properties, we use an LSTM to predict the value of has_cred_card from the sequence values. As we see from the high ROC-AUC of 0.8772 on our test set, the has_cred_card flag can be successfully predicted from the transaction sequence data, indicating a strong relationship between this user characteristic and transaction data.
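The post does not include the classifier code; a minimal sketch of this kind of sequence classifier, assuming a PyTorch LSTM over (amount, balance) pairs (the authors' exact architecture and features are not specified), could look like:

```python
import torch
import torch.nn as nn

# Illustrative only, not the post's exact model: an LSTM that reads a
# sequence of (amount, balance) pairs per account and outputs the
# probability that the account has a credit card.
class CardClassifier(nn.Module):
    def __init__(self, n_features=2, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x has shape (batch, seq_len, n_features)
        _, (h, _) = self.lstm(x)          # final hidden state per sequence
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)

model = CardClassifier()
probs = model(torch.randn(4, 30, 2))      # 4 accounts, 30 transactions each
```

Training such a model on real sequences and measuring ROC-AUC on a held-out set is the experiment described above.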

    Fig: Test ROC for LSTM classifier trained to predict has_cred_card

    Setting Up The Model in Data Science Mode

    The first step is to specify our data, which we do via the following SQL query:
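The query itself was not preserved in this copy of the post. Against the Czech banking dataset's schema, a query of roughly this shape (table and column names here are assumptions, not the post's original) would produce one row per transaction with a has_cred_card flag:

```sql
SELECT t.account_id,
       t.date AS transaction_day,
       t.amount,
       t.balance,
       CASE WHEN c.card_id IS NOT NULL THEN 1 ELSE 0 END AS has_cred_card
FROM trans t
LEFT JOIN disp d ON d.account_id = t.account_id AND d.type = 'OWNER'
LEFT JOIN card c ON c.disp_id = d.disp_id;
```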

    Next, we configure Djinn to treat this as event data, selecting account_id as the Primary Entity and transaction_day as the Order. This treats our rows as sequences, grouped by account_id and sorted by transaction_day.

    Finally, we specify that we want to condition our synthetic data on the value of has_cred_card:

In general, you can condition on any number of columns. For instance, you may want to condition on demographic characteristics of users captured in your data, so that you can synthesize populations drawn from different distributions.

Selecting columns to condition on changes the model architecture of our VAE, essentially disentangling this condition from the model’s latent space. Crucially, we can now specify the condition when generating samples. Using a conditional model for conditional sampling is advantageous over an algorithm like rejection sampling because it is significantly faster and guaranteed to generate synthetic samples with any specified values in the conditional columns. Rejection sampling, on the other hand, can be very slow and may fail to generate samples for rare specified values in the conditional columns.
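A toy illustration of why rejection sampling struggles here (this example is ours, not from the post): to hit a condition by rejection you draw unconditionally and discard mismatches, so when the condition is rare almost all of the work is wasted.

```python
import numpy as np

# Simulate 10,000 unconditional draws where the desired condition
# holds only ~1% of the time; rejection sampling keeps just those.
rng = np.random.default_rng(0)
n_draws = 10_000
matches_condition = rng.random(n_draws) < 0.01
accepted = int(matches_condition.sum())
acceptance_rate = accepted / n_draws   # roughly 0.01: ~99% of draws discarded
```

A conditional model instead generates every sample with the requested condition directly, with no discarded work.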

    Generating Samples

After the model is trained, we access it and generate samples through our Python SDK, tonic-api. The following simple snippet retrieves the trained model.
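The post's snippet was not preserved in this copy. The following sketch shows the general access pattern; the URL, API key, workspace ID, and method names are placeholders and assumptions rather than verified SDK calls, so consult the tonic-api documentation for the exact signatures:

```python
from tonic_api.api import TonicApi

# Placeholder credentials and IDs; substitute your own. The method
# names below are assumptions: check the tonic-api docs for the
# exact API surface.
api = TonicApi("https://app.tonic.ai", "your-api-key")
workspace = api.get_workspace("your-workspace-id")
model = workspace.get_most_recent_trained_model()  # hypothetical helper
```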

    We then generate as many rows of synthetic data as we’d like by calling model.sample, which returns a dataframe of synthetic samples. The following snippet generates 100 rows of synthetic data.
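Since the original snippet is missing here, the call pattern can be shown with a local stand-in that exposes the same sample() interface as the trained model object (the stand-in and its columns are illustrative, not Djinn's output schema):

```python
import numpy as np
import pandas as pd

# Stand-in with the same sample() interface as the trained Djinn model
# returned by tonic-api, so the call pattern runs without a live
# Tonic connection. Column names are illustrative.
class FakeModel:
    def sample(self, num_rows):
        rng = np.random.default_rng(0)
        return pd.DataFrame({
            "account_id": rng.integers(1, 50, num_rows),
            "transaction_day": rng.integers(0, 365, num_rows),
            "amount": rng.normal(5000, 2000, num_rows).round(2),
            "balance": rng.normal(30000, 10000, num_rows).round(2),
            "has_cred_card": rng.integers(0, 2, num_rows),
        })

model = FakeModel()
synth_df = model.sample(100)   # DataFrame of 100 synthetic rows
```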

    Let’s see how to do conditional sampling for the model trained on our example dataset. To conditionally sample the model, we specify a conditions dataframe with the values of the conditional columns we are interested in. For instance, let’s say we want to generate a sequence of transactions for 3 accounts where 2 of the accounts have a credit card and the other account does not. To do this, we execute the following code snippet:
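The conditions dataframe for this three-account example can be built as follows; the commented-out sample call is our best guess at how the conditions are passed to the SDK, so check the tonic-api documentation for the exact signature:

```python
import pandas as pd

# One row per account to synthesize. The column name must match the
# conditional column configured in Djinn: two accounts with a credit
# card, one without.
conditions = pd.DataFrame({"has_cred_card": [1, 1, 0]})

# Hedged SDK call (signature is an assumption):
# synth_df = model.sample(num_rows, conditions)
```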

    Let’s say we want to compare data generated by the model to the real data output by the SQL query. The following code snippet gets 100,000 rows of the real data, isolates the conditions for each account in the real data, and then uses conditional sampling to generate a sequence of transactions for each set of conditions in the real data.
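The condition-isolation step can be sketched in pandas on a toy stand-in for the query output (the stand-in data and the commented-out SDK call are ours, not the post's original snippet):

```python
import pandas as pd

# Toy stand-in for real query output (the post uses the first
# 100,000 rows of the SQL query above).
real_df = pd.DataFrame({
    "account_id":    [1, 1, 2, 2, 2, 3],
    "amount":        [100.0, 250.0, 80.0, 40.0, 60.0, 500.0],
    "has_cred_card": [1, 1, 0, 0, 0, 1],
})

# Isolate one set of conditions per account: has_cred_card is constant
# within an account, so the first value per group suffices.
conditions = (
    real_df.groupby("account_id")["has_cred_card"]
           .first()
           .reset_index(drop=True)
           .to_frame()
)

# Hedged SDK call (signature is an assumption):
# synth_df = model.sample(len(real_df), conditions)
```

This yields one condition row per real account, so the synthetic data matches the real data's distribution of credit-card accounts.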

    In this code snippet, we get a dataframe of real data, real_df, and a dataframe of synthetic data, synth_df, where each dataframe has the same number of accounts and the same distribution of accounts with credit cards.

    Evaluating a Conditional Model

With our synthetic data in hand, we now repeat the experiment of predicting has_cred_card from an account’s sequence of transactions¹. That is, we train a second LSTM on purely synthetic data, using the same parameters as the model trained on real data, and evaluate both models on the held-out test set.

And voilà: up to some random variation, the two models have very similar ROC curves! Conditional sampling has succeeded in preserving the relationship between the non-conditional transaction columns and the conditional has_cred_card column.

    How to Use Conditional Sampling in Data Science Mode

And there you have it: a breakdown of how to use conditional sampling in DSM, Tonic.ai’s new tool for the data science use case. If you need a powerful way to generate synthetic data with control over the distribution of a subset of your columns, look no further!

    Interested in trying it out? Create a free Tonic account and try DSM out on your own datasets.

¹ In order to ensure a fair comparison between the real and synthetic classifiers, we modified the above queries slightly to ensure that Djinn only synthesizes data from our classifier train set. Specifically, we use an 80/20 train/test split of the real data, and train Djinn on the train split, conditioning on the has_cred_card column as above.

    Joe Ferrara, PhD
    Senior AI Scientist
    Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.