TL;DR: In this post we demonstrate how to use Djinn by Tonic to balance an imbalanced target class with synthetic data and improve model performance. Djinn can take any structured dataset, learn its unique traits and relationships, and generate high-performing, ready-to-model synthetic data.
Already excited and want to test Djinn out for yourself? Create a free account!
Classification models generally improve as they see more data, lowering their error as they learn to distinguish between classes. When the classes are imbalanced, however, the model sees far more examples of the majority class and learns to classify it more accurately than the minority class. This is especially problematic when building models to detect rare events. Random upsampling and downsampling are the most common ways to address this problem. However, downsampling discards data and can weaken the statistical properties of the dataset, while upsampling can lead to overfit and biased models.
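To make the trade-off concrete, here is a minimal sketch of random upsampling and downsampling with pandas. The toy DataFrame and its column names are assumptions for illustration only, not part of the Telco experiment.

```python
import pandas as pd

# Toy imbalanced dataset (illustrative): 6 majority rows (0), 2 minority rows (1).
df = pd.DataFrame({"feature": range(8), "label": [0, 0, 0, 0, 0, 0, 1, 1]})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random upsampling: draw minority rows with replacement until the classes match.
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Random downsampling: keep only as many majority rows as there are minority rows.
downsampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Both results are balanced, but the upsampled minority class still contains only two distinct rows repeated several times, and the downsampled set has thrown away four of the six majority rows.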
Djinn provides an intuitive and seamless solution for data scientists building classification models on imbalanced datasets.
Djinn by Tonic is an AI-powered synthetic data platform that learns the unique patterns and properties of your proprietary datasets and generates completely new data. Here at Tonic.ai, we recognized that data scientists face challenges distinct from those faced by developers: they need data of impeccable quality to build smarter models faster and drive better business insights. We built Djinn to address these challenges, allowing teams to optimize their data science workflows and achieve better results. All you have to do is connect your database, write a SQL query for Djinn to pull in the data you want to work with, and start generating fake data.
In this post, we’ll show you how to use Djinn to balance target classes using AI-generated synthetic data. This lets you build classification models with reduced bias and increased accuracy. Check out our blog post “How to Solve the Problem of Imbalanced Datasets: Meet Djinn by Tonic” to see how classification models trained on Djinn-balanced data outperform those trained on imbalanced datasets and on datasets balanced with SMOTE and SMOTE-NC. If you'd like to follow along here, you can create a free account.
We use a Kaggle dataset from a fictional telecom company, Telco, to predict customer churn, that is, whether or not a customer has left the service in the last month. The dataset has 7,032 rows, each representing a customer, with that customer's characteristics captured in 21 columns. These columns include a mix of numeric and categorical features.
Predicting customer churn allows the telecom company to deploy targeted customer retention measures to prevent potential losses in revenue.
This dataset is clearly imbalanced, with churned customers represented only one-third as frequently as non-churned customers. We use Djinn to balance the proportion of churned to not-churned customers 50/50 by generating new data points of churned customers.
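The arithmetic behind a 50/50 balance is simple: the number of new minority rows equals the majority count minus the minority count. A minimal sketch, using a small made-up label list rather than the real Telco counts:

```python
from collections import Counter

def synthetic_rows_needed(labels):
    """Return how many minority-class rows must be added to reach a 50/50 split.

    Assumes a binary target; the function name is illustrative.
    """
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values())

# Illustrative labels only: 9 non-churned and 3 churned customers,
# mirroring the roughly 3-to-1 imbalance described above.
labels = ["No"] * 9 + ["Yes"] * 3
needed = synthetic_rows_needed(labels)
print(needed)  # 6
```

Adding those 6 churned rows gives 9 churned and 9 non-churned customers.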
First things first, we need to create an account. Follow the instructions to create an account either using your Google profile or any other email address.
Djinn can read your data directly from a CSV file or by connecting to a relational database or data warehouse. In this tutorial, we show how to use Djinn with data from a CSV file. We prepare our data to be read into Djinn by first cleaning it, defining our inputs and outputs, and splitting the data into train and test sets at random. We are now ready to create our first workspace in Djinn:
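The preparation steps described above might look roughly like the following sketch. The toy rows, the column names, the blank-string cleaning rule, and the 80/20 split ratio are all assumptions for illustration; the full experiment may differ.

```python
import pandas as pd

# Toy stand-in for the raw CSV (illustrative values and column names).
raw = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13],
    "TotalCharges": ["29.85", "1889.5", "108.15", " ", "820.5",
                     "1949.4", "301.9", "3046.05", "3487.95", "587.45"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Month-to-month", "Month-to-month",
                 "Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

# Clean: coerce the numeric column (blank strings become NaN) and drop those rows.
clean = raw.copy()
clean["TotalCharges"] = pd.to_numeric(clean["TotalCharges"], errors="coerce")
clean = clean.dropna().reset_index(drop=True)

# Define inputs (features) and output (target).
target = "Churn"
features = [col for col in clean.columns if col != target]

# Split into train and test sets at random (assumed 80/20 here).
train = clean.sample(frac=0.8, random_state=42)
test = clean.drop(train.index)

print(len(train), len(test))
```

Fixing `random_state` keeps the split reproducible, which makes it easier to compare models trained on the original and the balanced training sets.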
First, you must create and configure your workspace to source data from a CSV.
To create a new model, navigate to the “Models” page of your workspace from the left-hand menu.
1. Create your model and configure settings.
2. Enter a SQL query to get the data you are modeling - in this example we enter the following:
4. Beneath the "Model Name" box, navigate to Model Parameters to specify parameters for your Djinn model - for this experiment we'll be training for 160 epochs and we’ll allow early stopping for our model.
5. Select Save and Close to return to your “Models” page.
6. You are now ready to train Djinn on your data - click Train.
Once training is complete, we connect to the Tonic API to pull our synthetic data from Djinn directly into our Jupyter Notebook, where we use it to balance our target class.
The code snippet has the Workspace ID and Job ID prepopulated; you just need to fill in the API Token with your unique Tonic API Key.
Now that our Jupyter Notebook is connected to our Djinn model, we can sample synthetic rows of data to balance our dataset. We need 2,652 synthetic rows of churned customers to balance the Churn column. To get these rows, we sample many more data points than we need from our Djinn model, so that the sample is sure to contain enough minority-class rows for us to select and add to our dataset.
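The oversample-then-filter step can be sketched as follows. Note that `sample_synthetic` is a hypothetical local stand-in that fabricates random rows so the example runs on its own; it is not the Tonic API. The toy training set, the column names, and the 20x oversampling factor are likewise assumptions.

```python
import random
import pandas as pd

def sample_synthetic(n, seed=0):
    """HYPOTHETICAL stand-in for sampling n rows from a trained generative model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        churned = rng.random() < 0.25  # roughly mimics the original imbalance
        rows.append({"tenure": rng.randint(1, 72),
                     "Churn": "Yes" if churned else "No"})
    return pd.DataFrame(rows)

# Toy imbalanced training set (illustrative): 9 non-churned, 3 churned.
train = pd.DataFrame({
    "tenure": [5, 40, 12, 60, 33, 48, 7, 25, 70, 2, 3, 9],
    "Churn": ["No"] * 9 + ["Yes"] * 3,
})

counts = train["Churn"].value_counts()
needed = int(counts["No"] - counts["Yes"])

# Sample many more rows than needed, then keep only the minority class.
synthetic = sample_synthetic(needed * 20)
synthetic_churned = synthetic[synthetic["Churn"] == "Yes"]
if len(synthetic_churned) < needed:
    raise ValueError("Not enough minority rows; sample more synthetic data.")

# Take exactly the number of rows required and append them to the training set.
balanced = pd.concat([train, synthetic_churned.head(needed)], ignore_index=True)
print(balanced["Churn"].value_counts().to_dict())
```

Only the training set is augmented in this sketch; the test set stays untouched so that evaluation reflects real customers.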
We have now pulled in our synthetic data from Djinn to balance the proportion of churned customers to non-churned customers 50/50. Woohoo!
Now that we've balanced our target variable, we can use this data to train a CatBoost model to predict customer churn. We test this model with the receiver operating characteristic area under the curve (ROC AUC) and F1 scores.
*1 in the confusion matrix signifies customer churn
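For readers less familiar with the two metrics, the sketch below spells out their definitions in plain Python on made-up labels and scores (1 signifies churn, as in the note above). The function names and the 0.5 decision threshold are assumptions for illustration; in practice, `f1_score` and `roc_auc_score` from `sklearn.metrics` compute the same quantities.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc_from_scores(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: true labels and predicted churn probabilities.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = f1_from_labels(y_true, y_pred)
auc = roc_auc_from_scores(y_true, scores)
print(round(f1, 3), round(auc, 3))  # 0.8 0.889
```

F1 depends on the chosen threshold, while ROC AUC depends only on how the scores rank churned customers relative to non-churned ones, which is why the two are reported together.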
Without any hyperparameter tuning, our CatBoost model trained on Djinn-augmented data displays strong ROC AUC and F1 scores, demonstrating Djinn's ability to produce high-quality data that can reduce model bias for imbalanced classification problems.
Here you've seen how to get started with Djinn: creating a workspace, loading in data, training Djinn, and pulling synthetic samples into a Jupyter Notebook for modeling. Head over to GitHub to follow along with the full experiment.
Now that you know what you are doing, there are no more excuses! Head over to Djinn to create a free account and try it out with your own data.
Note: Djinn has some inherent randomness - experiments may not produce identical results every time.