How to Solve the Problem of Imbalanced Datasets: Meet Tonic's Data Science Mode

Madelyn Goodman
November 21, 2022

    TL;DR: In this post we’ll examine how Tonic’s new workflow for data scientists, Data Science Mode (DSM), helps address biases in classification models caused by imbalanced datasets. We compare the performance of Logistic Regression, XGBoost, and CatBoost models trained on datasets balanced with data from DSM, SMOTE, and SMOTE-NC.

    We find that DSM-augmented data outperforms the other augmentation methods for CatBoost and XGBoost. Already excited about these results and want to try DSM yourself? Start your free trial of Tonic and select the data science use case when onboarding to get started.

    Tonic's Data Science Mode: A Powerful Tool for Solving the Class Imbalance Problem

    Data Science Mode (DSM) addresses the data imbalance problem that plagues many data scientists. Using our powerful proprietary generative models, you can produce synthetic data that mimics your own. Rather than subjecting your models to the biases associated with upsampling or downsampling, DSM uses advanced AI to learn the unique patterns in your datasets and augment your minority classes.

    Here, we’ll explore how augmenting a dataset with data from DSM performs compared to two other data augmentation methods: the Synthetic Minority Over-Sampling Technique (SMOTE) and SMOTE for Nominal and Continuous data (SMOTE-NC). We’ll tackle a classification problem using Logistic Regression, XGBoost, and CatBoost models.

    Our Dataset

    We will use a dataset from Kaggle to predict customer churn (whether or not a customer has left the service in the last month) for a fictional telecom company, Telco. The dataset has 7,032 rows, each representing a customer, with their characteristics captured in 21 columns that mix numeric and categorical features.

    Being able to predict which customers will churn allows the telecom company to deploy targeted customer retention measures to prevent resulting losses in revenue. 

    This dataset is clearly imbalanced, with churned customers represented only one-third as frequently as non-churned customers. Let’s do some initial modeling to see how well models respond to the imbalanced data to set a baseline.
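    Checking the class balance takes one line with pandas. The sketch below uses a tiny stand-in DataFrame with the same roughly one-to-three churn ratio; with the real Kaggle CSV you would call `value_counts` on its `Churn` column (the file path shown in the comment is an assumption):

```python
import pandas as pd

# Hypothetical: load the Kaggle Telco churn CSV (path is an assumption)
# df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
# For illustration, a tiny stand-in with a similar ~27% churn rate:
df = pd.DataFrame({"Churn": ["No"] * 73 + ["Yes"] * 27})

# normalize=True returns class proportions instead of raw counts
counts = df["Churn"].value_counts(normalize=True)
print(counts)  # "Yes" (churn) is roughly one-third as common as "No"
```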

    Comparing Models Trained with Imbalanced Data and Data Balanced Using Tonic's Data Science Mode, SMOTE, and SMOTE-NC

    We evaluate three models (Logistic Regression, XGBoost, and CatBoost), performing the following procedure:

    1. Train-test split the data 75-25.* 
    2. One-hot encode the categorical variables for Logistic Regression and XGBoost.
    3. Evaluate model performance with a variety of metrics, including ROC AUC, F1, and confusion matrices.
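    The procedure above can be sketched with scikit-learn. This is a minimal illustration using Logistic Regression on a small synthetic stand-in for the Telco data (the real notebook runs XGBoost and CatBoost as well, which are omitted here to keep the snippet self-contained):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the Telco data: one numeric and one categorical feature
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
})
y = (df["tenure"] < 24).astype(int)  # synthetic "Churn" label

# 1. One-hot encode the categorical variable
X = pd.get_dummies(df, columns=["Contract"])

# 2. 75-25 train-test split, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 3. Fit and evaluate with ROC AUC, F1, and a confusion matrix
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
preds = model.predict(X_test)
print("ROC AUC:", roc_auc_score(y_test, proba))
print("F1:", f1_score(y_test, preds))
print(confusion_matrix(y_test, preds))  # rows: true class, cols: predicted
```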

    Imbalanced Data

    The positive class, Churn, is label 1

    All three models have similar ROC AUC scores and much higher true negative rates than true positive rates, a sign that training on the imbalanced data has biased them toward predicting the negative class. For an imbalanced dataset, however, their ROC AUC scores are relatively strong, all around 83%.

    Using Synthetic Data to Balance Customer Churn

    Since churned customers make up only about a quarter of the dataset, our models have a greater opportunity to learn patterns associated with the majority class, making it more difficult for them to correctly classify the minority class. We address this class imbalance by augmenting our training data with synthetic minority samples.

    DSM produces synthetic data that mimics the patterns and distributions of real data, allowing us to balance our training data with highly realistic synthetic samples. We compare augmenting the dataset with DSM-generated data against the SMOTE and SMOTE-NC methods.

    Note: Both DSM and SMOTE have some inherent randomness in how they generate data. To fully test the performance of these augmentation methods, we sample data with each technique 100 times and look at the distributions of the resulting ROC AUC and F1 scores. This gives us a more accurate picture of how the methods perform than fishing for the best random state.

    How Do the Scores Stack Up?

    First, let's take a look at the distributions of the ROC AUC scores for each model by augmentation method.

    *best performing method

    Wow! The ROC AUC distribution from the CatBoost model trained on DSM-augmented data beat not only the unaugmented model's score, but also the score distributions for SMOTE and SMOTE-NC data. We can confidently say that DSM models produce synthetic data that improves CatBoost’s model accuracy for this imbalanced dataset. The XGBoost model also performed very well with the DSM-generated data with a median ROC AUC score lying above the interquartile range of the SMOTE model.

    For Logistic Regression, none of the augmentation methods improved on the ROC AUC score obtained by the model trained on the original imbalanced dataset. Other experiments exploring data augmentation methods for balancing minority classes have found similar results, suggesting that data augmentation should be used with caution for Logistic Regression.

    Now let's take a look at the F1 scores of the models.

    *best performing method

    Data augmentation drastically improves the accuracy with which our models classify customer churn. DSM outperforms the other augmentation methods for the CatBoost model, with its entire F1 score distribution sitting above those of SMOTE and SMOTE-NC.

    DSM's killer performance at improving CatBoost models is very encouraging. Recall from the confusion matrices for the imbalanced data that the CatBoost model had similar true positive and false negative rates; let's see how DSM augmentation improves these metrics.

    Our DSM model is clearly much better at correctly classifying customer churn than the baseline model, showing a 23.7% improvement in the true positive rate and an 8% increase in F1 score. These improvements could have a dramatic impact on Telco's bottom line by making customer retention efforts more accurately targeted.
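    For reference, the true positive rate is read directly off a confusion matrix as TP / (TP + FN). A minimal sketch with hypothetical matrices (the actual values come from the experiment above, not from these numbers):

```python
import numpy as np

# Hypothetical confusion matrices in scikit-learn layout [[TN, FP], [FN, TP]]
baseline = np.array([[900, 100], [180, 160]])
augmented = np.array([[850, 150], [100, 240]])

def tpr(cm):
    """True positive rate (recall on the positive class)."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fn)

print(f"baseline TPR={tpr(baseline):.3f}, augmented TPR={tpr(augmented):.3f}")
```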

    Never Struggle With An Imbalanced Dataset Again

    DSM is powered by generative AI models specifically designed to mimic your proprietary data to solve data science problems. Here we've shown DSM's ability to generate synthetic data that improves model performance on an imbalanced dataset better than the SMOTE and SMOTE-NC methods.

    If you would like to recreate this experiment, the complete Jupyter notebook can be found on GitHub.

    Play With Tonic's Latest Offering Yourself 

    Run, don't walk, to sign up for a Tonic free trial and pit your status-quo augmentation method against DSM on your own data!

    *Note that in this experiment, DSM and the other data augmentation methods are applied to the training data only.

    Madelyn Goodman
    Data Science
    Driven by a passion for promoting game changing technologies, Madelyn creates mission-focused content as a Product Marketing Associate at Tonic. With a background in Data Science, she recognizes the importance of community for developers and creates content to galvanize that community.