TL;DR: In this post we’ll examine how Tonic’s new offering for data scientists, Djinn, helps address biases in classification models caused by imbalanced datasets. We compare the performance of Logistic Regression, XGBoost, and CatBoost models trained on datasets balanced with Djinn, SMOTE, and SMOTE-NC.

We find that Djinn-augmented data outperforms the other augmentation methods for CatBoost and XGBoost. Already excited about these results and want to try Djinn out for yourself? Head to djinn.tonic.ai to create your account and start your free trial!

Djinn: A Powerful Tool for Solving the Class Imbalance Problem

Djinn by Tonic can be used to address the data imbalance problem that plagues many data scientists. Using our powerful proprietary generative models, it is now possible to produce synthetic data that mimics your own. Rather than subjecting your models to the biases associated with upsampling or downsampling your data, Djinn uses advanced AI to learn the unique patterns in your datasets and augment your minority classes.

Here, we’ll explore how Djinn-augmented data performs compared to two other data augmentation methods: the Synthetic Minority Over-Sampling Technique (SMOTE) and SMOTE for Nominal and Continuous data (SMOTE-NC). We’ll tackle a classification problem using Logistic Regression, XGBoost, and CatBoost models.

Our Dataset

We will use a dataset from Kaggle to predict customer churn - whether or not a customer left the service in the last month - for a fictional telecom company, Telco. The dataset has 7,032 rows, each representing a customer, with their characteristics captured in 21 columns that mix numeric and categorical features.

Being able to predict which customers will churn allows the telecom company to deploy targeted customer retention measures to prevent resulting losses in revenue. 

This dataset is clearly imbalanced, with churned customers represented only one-third as frequently as non-churned customers. Let’s do some initial modeling to see how models respond to the imbalanced data and set a baseline.
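As a quick sanity check, here is a minimal sketch of loading the data and measuring the class balance with pandas. The CSV file name is an assumption about how the Kaggle download was saved locally, and the small TotalCharges cleanup is what brings the frame down to the 7,032 rows mentioned above.

```python
import pandas as pd

# Assumed local file name for the Kaggle "Telco Customer Churn" dataset.
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges is read in as a string; coerce it to numeric and drop the
# handful of rows with missing values to get the 7,032-row frame.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna().reset_index(drop=True)

print(df.shape)                                  # (7032, 21)
print(df["Churn"].value_counts(normalize=True))  # roughly 73% No / 27% Yes
```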

Comparing Models Trained with Imbalanced Data and Data Balanced Using Djinn, SMOTE, and SMOTE-NC

We evaluate three models - Logistic Regression, XGBoost, and CatBoost - using the following procedure (sketched in code after the list):

  1. Train-test split the data 75-25.* 
  2. One-hot encode the categorical variables for Logistic Regression and XGBoost.
  3. Evaluate model performance with a variety of metrics, including ROC AUC, F1, and confusion matrices.
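Below is a minimal sketch of this baseline procedure, reusing the cleaned DataFrame `df` from the earlier snippet. The model settings are illustrative defaults, not necessarily the exact configuration behind the reported scores.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

y = (df["Churn"] == "Yes").astype(int)           # positive class: churn = 1
X = df.drop(columns=["customerID", "Churn"])
cat_cols = X.select_dtypes(include="object").columns.tolist()

# 1. Train-test split the data 75-25, stratified to preserve the imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 2. One-hot encode for Logistic Regression and XGBoost; CatBoost consumes
#    the categorical columns directly.
X_train_ohe = pd.get_dummies(X_train, columns=cat_cols)
X_test_ohe = pd.get_dummies(X_test, columns=cat_cols).reindex(
    columns=X_train_ohe.columns, fill_value=0
)

models = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), X_train_ohe, X_test_ohe),
    "XGBoost": (XGBClassifier(eval_metric="logloss"), X_train_ohe, X_test_ohe),
    "CatBoost": (CatBoostClassifier(cat_features=cat_cols, verbose=0), X_train, X_test),
}

# 3. Evaluate each model with ROC AUC, F1, and a row-normalized confusion matrix.
for name, (model, Xtr, Xte) in models.items():
    model.fit(Xtr, y_train)
    proba, preds = model.predict_proba(Xte)[:, 1], model.predict(Xte)
    print(name, roc_auc_score(y_test, proba), f1_score(y_test, preds))
    print(confusion_matrix(y_test, preds, normalize="true"))
```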

Imbalanced Data

The positive class, Churn, is labeled 1.

All of these models have similar ROC AUC scores and much higher true negative rates than true positive rates. This is a sign of bias toward the negative class that comes from training the models on imbalanced data. For an imbalanced dataset, however, their ROC AUC scores are relatively strong - all around 83%.

Using Synthetic Data to Balance Customer Churn

Since customers who churn make up only about a quarter of the dataset, our models have a greater opportunity to learn patterns associated with the majority class, making it harder for them to correctly classify the minority class. We address this class imbalance by augmenting our training data with synthetic minority samples.

Djinn produces synthetic data that mimics the patterns and distributions of the real data, allowing us to balance our training data with highly realistic synthetic samples. We compare Djinn against SMOTE and SMOTE-NC.

Note: Both Djinn and SMOTE have some inherent randomness in how they generate data. To fairly test the performance of these augmentation methods, we sample data with each augmentation technique 100 times and look at the distributions of the ROC AUC and F1 scores. This gives us a more accurate picture of how these augmentation methods perform, rather than just fishing for the best random states.
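To make the repeated-sampling setup concrete, here is a sketch of that evaluation loop for SMOTE-NC with CatBoost, reusing the train/test split and `cat_cols` from the baseline snippet. SMOTE would use imblearn’s `SMOTE` class on the one-hot-encoded data in the same way, and Djinn’s samples would be drawn through its own interface rather than through imblearn.

```python
import numpy as np
from imblearn.over_sampling import SMOTENC
from sklearn.metrics import roc_auc_score, f1_score
from catboost import CatBoostClassifier

# SMOTE-NC needs the positions of the categorical columns.
cat_idx = [X_train.columns.get_loc(c) for c in cat_cols]

auc_scores, f1_scores = [], []
for seed in range(100):
    # Oversample the minority (churn) class up to parity in the training set.
    sampler = SMOTENC(categorical_features=cat_idx, random_state=seed)
    X_res, y_res = sampler.fit_resample(X_train, y_train)

    model = CatBoostClassifier(cat_features=cat_cols, verbose=0, random_state=seed)
    model.fit(X_res, y_res)

    auc_scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    f1_scores.append(f1_score(y_test, model.predict(X_test)))

# Summarize the score distributions rather than cherry-picking one seed.
print("ROC AUC median:", np.median(auc_scores), "IQR:", np.percentile(auc_scores, [25, 75]))
print("F1 median:     ", np.median(f1_scores), "IQR:", np.percentile(f1_scores, [25, 75]))
```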

How Do the Scores Stack Up?

First, let's take a look at the distributions of the ROC AUC scores for each model by augmentation method.

*best performing method

Wow! The ROC AUC distribution from the CatBoost model trained on Djinn-augmented data beat not only the unaugmented model’s score but also the score distributions for the SMOTE and SMOTE-NC data. We can confidently say that Djinn produces synthetic data that improves CatBoost’s performance on this imbalanced dataset. The XGBoost model also performed very well with the Djinn data, with its median ROC AUC lying above the interquartile range of the SMOTE model’s scores.

For Logistic Regression, none of the augmentation methods improved on the ROC AUC score obtained by the model trained on the original imbalanced dataset. Other experiments exploring several data augmentation methods to balance minority classes for Logistic Regression have found similar results, suggesting that data augmentation should be used with caution for this model type.

Now let's take a look at the F1 scores of the models.

*best performing method

Data augmentation drastically improves the accuracy with which our models classify customer churn. Djinn outperforms the other augmentation methods for the CatBoost model, with its entire F1 score distribution sitting above those of SMOTE and SMOTE-NC.

Djinn’s killer performance at improving CatBoost models is very encouraging. Recall from the confusion matrices for the imbalanced data that the CatBoost model had similar true positive and false negative rates - let’s see how Djinn augmentation improves these metrics.
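For reference, a comparison along these lines could look like the sketch below. `X_train_djinn` and `y_train_djinn` are hypothetical placeholders for the training set after augmenting it with Djinn-generated minority samples; everything else reuses objects from the baseline snippet.

```python
from sklearn.metrics import confusion_matrix, f1_score
from catboost import CatBoostClassifier

def true_positive_rate(model, X, y):
    # Row-normalized confusion matrix: entry [1, 1] is the true positive rate.
    cm = confusion_matrix(y, model.predict(X), normalize="true")
    return cm[1, 1]

baseline = CatBoostClassifier(cat_features=cat_cols, verbose=0).fit(X_train, y_train)
augmented = CatBoostClassifier(cat_features=cat_cols, verbose=0).fit(X_train_djinn, y_train_djinn)

print("baseline  TPR / F1:", true_positive_rate(baseline, X_test, y_test),
      f1_score(y_test, baseline.predict(X_test)))
print("augmented TPR / F1:", true_positive_rate(augmented, X_test, y_test),
      f1_score(y_test, augmented.predict(X_test)))
```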

Our Djinn model is clearly much better at correctly classifying customer churn than the baseline model, showing a 23.7% improvement in the true positive rate and an 8% increase in F1 score. These model improvements could have a dramatic impact on Telco’s bottom line by targeting customer retention efforts more accurately.

Never Struggle With An Imbalanced Dataset Again

Djinn is an AI-powered generative model specifically designed to mimic your proprietary data to solve data science problems. Here we’ve shown Djinn’s ability to generate synthetic data that improves model performance on an imbalanced dataset more than the SMOTE and SMOTE-NC methods.

If you would like to recreate this experiment, the complete Jupyter notebook can be found on GitHub.

Play With Tonic's Latest Offering Yourself 

Run, don’t walk, to djinn.tonic.ai to sign up for a free trial and pit your status quo augmentation method against Djinn with your own data!

*Note that Djinn and the other data augmentation methods are applied only to the training data in this experiment.

Madelyn Goodman, MSc
Data Science
Madelyn is a Data Science Evangelist at Tonic. She has her MSc in Epidemiology and Biostatistics from the University of the Witwatersrand. At Tonic she focuses on creating technical content to engage the data science community.
