A Visual Guide to Using Tonic's Data Science Mode: Synthesize Data to Augment Imbalanced Classes

Author
Madelyn Goodman
December 5, 2022

    TL;DR: In this post we demonstrate how to use Tonic's Data Science Mode to balance an imbalanced target class with synthetic data to improve model performance. Data Science Mode can take any structured dataset, learn its unique traits and relationships, and generate high-performing, ready-to-model synthetic data.

    Already excited and want to test Data Science Mode out for yourself? Start a free trial!

    How to Use Tonic's Data Science Mode to Augment your Imbalanced Classes

    Classification models generally improve as they see more data, lowering their error as they learn to distinguish each class. When the classes are imbalanced, however, the model sees far more examples of the majority class and learns to classify it more accurately than the minority class. This is especially problematic when building models to detect rare events. Random upsampling and downsampling are the most common ways to address this problem. However, downsampling discards data and can weaken the statistical properties of the dataset, while upsampling duplicates existing rows and can lead to overfit and biased models.
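
    To make that trade-off concrete, here is a minimal sketch of random upsampling and downsampling in plain Python. The function names, the `Churn` key, and the toy rows are illustrative assumptions; none of them come from DSM or from the experiment in this post.

```python
import random
from collections import Counter


def group_by_class(rows, label_key):
    """Group rows (plain dicts) by the value of their label column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    return groups


def random_upsample(rows, label_key, seed=0):
    """Repeat randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_key)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sampling with replacement: every extra row is an exact copy of an existing one.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_downsample(rows, label_key, seed=0):
    """Keep a random subset of each class so all classes match the smallest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_key)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        # Sampling without replacement: the rows left out are simply discarded.
        balanced.extend(rng.sample(group, k=target))
    return balanced


# Toy data, made up for illustration: 6 retained customers and 2 churned ones.
toy = [{"id": i, "Churn": "No"} for i in range(6)]
toy += [{"id": i, "Churn": "Yes"} for i in range(6, 8)]

up = random_upsample(toy, "Churn")
down = random_downsample(toy, "Churn")
print(Counter(row["Churn"] for row in up))
print(Counter(row["Churn"] for row in down))
```

    After upsampling, both classes have 6 rows, but the churned rows are copies of the same 2 customers. After downsampling, both classes have 2 rows, and 4 retained customers have been thrown away.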

    Data Science Mode (DSM) provides an intuitive and seamless solution for data scientists building classification models on imbalanced datasets. 

    What is Data Science Mode?

    DSM is an AI-powered synthetic data platform that learns the unique patterns and properties of your proprietary datasets and generates completely new data. Here at Tonic.ai, we recognized that data scientists face challenges distinct from those faced by developers: they need data of impeccable quality to build smarter models faster and drive better business insights. We built DSM to address these challenges, allowing teams to optimize their data science workflows and achieve better results. All you have to do is connect your database, write a SQL query to pull in the data you want to work with, and start generating fake data.

    In this post, we'll show you how to use DSM to balance target classes using AI-generated synthetic data. This makes it possible to build classification models with reduced bias and increased accuracy. Check out our blog post "How to Solve the Problem of Imbalanced Datasets: Meet Tonic's Data Science Mode" to see how classification models trained on DSM-balanced data outperform those trained on imbalanced datasets and on datasets balanced with SMOTE and SMOTE-NC. If you'd like to follow along here, you can create a free account.

    The Dataset

    We use a dataset from Kaggle to predict customer churn - whether or not a customer has left the service in the last month - from a fictional telecom company, Telco. There are 7,032 rows in this dataset with each representing a customer and their characteristics captured in the 21 columns. These columns include a mix of numeric and categorical features. 

    Predicting customer churn allows the telecom company to deploy targeted customer retention measures to prevent potential losses in revenue. 

    This dataset is clearly imbalanced, with churned customers represented only one-third as frequently as non-churned customers. We use DSM to balance the proportion of churned to not-churned customers 50/50 by generating new data points of churned customers. 
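
    The number of synthetic rows needed for a 50/50 balance is the gap between the majority and minority counts. A minimal sketch of that arithmetic follows; the helper name and the toy labels are assumptions for illustration, with the labels chosen only to mirror the roughly 3:1 ratio described above.

```python
from collections import Counter


def synthetic_rows_needed(labels, minority_label):
    """Return how many minority rows must be added to match the largest other class."""
    counts = Counter(labels)
    majority_count = max(n for label, n in counts.items() if label != minority_label)
    return max(majority_count - counts[minority_label], 0)


# Made-up labels: 300 retained customers and 100 churned ones.
labels = ["No"] * 300 + ["Yes"] * 100
needed = synthetic_rows_needed(labels, "Yes")
print(needed)  # 200
```

    Adding 200 churned rows to this toy set gives 300 rows in each class. To keep the evaluation honest, the count is normally computed on the training split only.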

    Using Tonic's Data Science Mode to Augment the Minority Class

    First things first, we need to create an account. Follow the instructions to create an account either using your Google profile or any other email address.

    Create a Workspace and Feed in Your Data

    DSM can read your data directly from a CSV file or by connecting to a relational database or data warehouse. In this tutorial, we show how to use DSM with data from a CSV file. We prepare our data to be read into DSM by first cleaning it, defining our inputs and outputs, and splitting the data into train and test sets at random. We are now ready to create our first workspace in DSM:  

    First, you must create and configure your workspace to source data from a CSV. 

    Training your Data Science Mode model

    After creating your workspace, you will be brought to the "Getting Started" checklist, which walks you through creating and training your first model, then exporting the data to a CSV.

    1. Create and name your model.

    2. A SQL query will be automatically generated. Review the data preview to ensure data types are properly encoded.

    4. Beneath the "Model Description" box, navigate to Advanced to specify parameters for your DSM model. For this experiment, we'll train for 160 epochs and allow early stopping.

    5. Select Save and Back to return to the "Getting Started" checklist. 

    6. You are now ready to train the DSM generative model on your data - click Train.

    7. Once training is complete, you can export however many rows of data you want to a CSV: on the "Getting Started" checklist, select Export Data, specify the number of rows you want, then select Generate CSV.

    Connecting to Data Science Mode in Jupyter 

    To balance our target class, we can also pull our data directly into our Jupyter Notebook from DSM by connecting to the Tonic API.

    1. Install the Tonic API package in your virtual environment.
    2. Create your Tonic API key.
    3. Navigate back to your workspace and select Leave Getting Started to return to the "Models" page on the left-hand panel. Export your trained model by selecting Export, then copy and paste the code snippet into your Jupyter Notebook. You can also find diagnostics for your generated data by selecting Synthesis Report.

    Snippet

    The code snippet has the Workspace ID and Job ID pre-populated; you just need to fill in the API Token with your unique Tonic API key.

    Sampling from your Data Science Mode Generative Model

    Now that our Jupyter Notebook is connected to our generative model in DSM, we can sample synthetic rows of data to balance our dataset. We need 2,652 synthetic rows of churned customers to balance the Churn column. Because the generative model produces both churned and non-churned customers, we sample many more rows than we need, which ensures the sample contains enough minority-class rows. We then select the churned rows and add them to our dataset.
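
    The oversample-then-filter step can be sketched as follows. This is not the Tonic API: `fake_sample` is a hypothetical stand-in for whatever sampling call your exported snippet provides, and the helper name, the `oversample_factor` parameter, and the toy columns are assumptions too. Rows are plain dicts to keep the sketch dependency-free; the same logic carries over to a pandas DataFrame.

```python
import random


def collect_minority_rows(sample_fn, label_key, minority_label, n_needed, oversample_factor=10):
    """Draw more rows than needed, then keep exactly n_needed rows of the minority class."""
    drawn = sample_fn(n_needed * oversample_factor)
    minority = [row for row in drawn if row[label_key] == minority_label]
    if len(minority) < n_needed:
        raise ValueError(
            f"only {len(minority)} minority rows in {len(drawn)} samples; "
            "increase oversample_factor"
        )
    return minority[:n_needed]


def fake_sample(n, seed=0):
    """Hypothetical stand-in for a generative model: about 30% of rows are churned."""
    rng = random.Random(seed)
    return [
        {"tenure": rng.randint(1, 72), "Churn": "Yes" if rng.random() < 0.3 else "No"}
        for _ in range(n)
    ]


# Ask for 50 churned rows; the helper draws 500 rows and filters them.
synthetic_churned = collect_minority_rows(fake_sample, "Churn", "Yes", n_needed=50)
print(len(synthetic_churned))  # 50
```

    In the experiment above, `n_needed` would be 2,652. The selected rows are then appended to the training data only, so the test set still reflects the real class proportions.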

    We have now pulled in our synthetic data from DSM to balance the proportion of churned customers to non-churned customers 50/50. Woohoo!

    Running our model with DSM-augmented data

    Now that we've balanced our target variable, we can use this data to train a CatBoost model to predict customer churn. We test this model with the receiver operating characteristic area under the curve (ROC AUC) and F1 scores. 

    *1 in the confusion matrix signifies customer churn
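
    For readers less familiar with these metrics, here is a minimal pure-Python sketch of how F1 and ROC AUC are computed, with 1 as the positive (churn) class. The function names and the toy labels and scores are made up for illustration and are not the results of this experiment. In practice, `sklearn.metrics` provides `f1_score` and `roc_auc_score`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred, positive=1):
    """F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    # Pairwise comparison is quadratic, which is fine for a small illustration.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = churned, 0 = not churned.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]          # predicted churn probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
print(round(f1_score(y_true, y_pred), 3))  # 0.667
print(round(roc_auc(y_true, scores), 3))   # 0.667
```

    F1 depends on the chosen threshold because it uses hard predictions, while ROC AUC uses the raw scores and is threshold-free. That difference is one reason to report both.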

    Without any hyperparameter tuning, our CatBoost model trained on DSM-augmented data displays strong ROC AUC and F1 scores, demonstrating that DSM's generative model can produce high-quality data that reduces model bias in imbalanced classification problems.

    Create your free account to augment your own data!

    Here you've seen how to get started with DSM: creating a workspace, loading in data, training a generative model, and pulling synthetic samples into a Jupyter Notebook for modeling. Head over to GitHub to follow along with the full experiment.

    Now that you know what you are doing, there are no more excuses! Head over to Tonic's Data Science Mode to create a free account and try it out with your own data. 

    Note: DSM's generative model has some inherent randomness, so experiments may not produce identical results every time.

    Madelyn Goodman
    Data Science
    Driven by a passion for promoting game changing technologies, Madelyn creates mission-focused content as a Product Marketing Associate at Tonic. With a background in Data Science, she recognizes the importance of community for developers and creates content to galvanize that community.