Can You Generate Realistic Data With GPT-3? We Explore Fake Dating With Fake Data
Madelyn Goodman, MSc
December 14, 2022
TL;DR You’ve heard of the wonders of OpenAI’s ChatGPT by now, and maybe it’s already your best friend. But let’s talk about its older cousin, GPT-3. Also a large language model, GPT-3 can be asked to generate any kind of text from stories, to code, to even data. Here we test the limits of what GPT-3 can do, diving deep into the distributions and relationships of the data it generates.
Can You Generate Realistic Data With GPT-3?
Customer data is sensitive and involves a lot of red tape. For developers this can be a major blocker within workflows. Synthetic data can unblock teams by lifting the restrictions on developers’ ability to test and debug software and train models, so they can ship faster.
In a previous blog post, we covered how to use Faker to create a synthetic dataset. We found that Faker out of the box had the following problems:
Poor data quality
First names did not match gender and dates didn’t make sense (for example, a birthdate could come after a start date). Ordinal variables that depended on each other, such as position and salary, weren’t related. There was no evidence that interesting statistical relationships existing in the data were maintained.
Could only mask a limited number of columns
Faker uses providers as templates for certain data points. If you need to generate fake addresses, you would use the faker.providers.address provider. This limits the categories of data points you can generate.
In this blog post we test the ability of Generative Pre-trained Transformer 3 (GPT-3) to generate synthetic data with bespoke distributions. We also discuss the limitations of using GPT-3 for producing synthetic testing data, most importantly that GPT-3 cannot be deployed on-prem, which raises privacy concerns around sharing data with OpenAI.
What is GPT-3?
GPT-3 is a large language model built by OpenAI. It has around 175 billion parameters and uses deep learning methods to generate text. Insights on GPT-3 in this article come from OpenAI’s documentation.
Fake data for a fake experiment
To demonstrate how to generate fake data with GPT-3, we put on the hats of data scientists at a new dating app called Tinderella*, an app where your matches disappear every midnight - better get those phone numbers fast!
While the app is still in development, we want to make sure that we are collecting all the necessary information to test how happy our customers are with the product. We have an idea of what variables we need, but we want to go through the motions of an analysis on some fake data to make sure we set up our data pipelines appropriately.
We investigate collecting the following data points on our customers: first name, last name, age, city, state, gender, sexual orientation, number of likes, number of matches, date customer joined the app, and the customer’s rating of the app between 1 and 5.
Getting Started with the GPT-3 Completions Endpoint
First we install and import the OpenAI API library:
We set our endpoint parameters appropriately: the maximum number of tokens we want the model to generate (max_tokens), the predictability we want the model to have when generating our data points (temperature), and when we want the data generation to stop (stop).
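The original snippet for this step is not preserved in the text, so what follows is only a minimal sketch of what such a call could look like. It assumes the legacy `openai` Python package (versions before 1.0, current when this post was written) and its `openai.Completion.create` call. The model name `text-davinci-003`, the parameter values, the `OPENAI_API_KEY` environment variable, and the helper names `build_request` and `generate_text` are all assumptions made for illustration. Newer versions of the library expose a different client interface.

```python
import os

# Illustrative values only; the settings used for this post are not shown in the text.
COMPLETION_PARAMS = {
    "model": "text-davinci-003",  # assumed GPT-3 completion model
    "max_tokens": 1000,           # upper bound on the number of tokens generated
    "temperature": 0.7,           # lower values make the output more predictable
    "stop": None,                 # optional string(s) at which generation stops
}


def build_request(prompt, **overrides):
    """Merge the prompt and any per-call overrides into the request arguments."""
    return {**COMPLETION_PARAMS, "prompt": prompt, **overrides}


def generate_text(prompt, **overrides):
    """Send the prompt to the completions endpoint and return the generated string."""
    # Imported here so the helpers above can be used without the package installed.
    # Assumes: pip install "openai<1.0"
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]
    response = openai.Completion.create(**build_request(prompt, **overrides))
    return response["choices"][0]["text"]
```

Keeping the parameters in one dictionary makes it easy to change a single setting per call, for example `generate_text(prompt, temperature=0)`.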
The text completion endpoint delivers a JSON snippet containing the generated text as a string. This string needs to be reformatted as a dataframe so we can actually use the data:
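The reformatting code is also missing from this text. One way it might be done is sketched below, assuming the generated string is plain comma separated text with a header row. The helper names `parse_completion` and `to_dataframe` are hypothetical, and the sample string is a shortened copy of the response shown later in this post.

```python
import csv
import io


def parse_completion(text):
    """Turn the comma separated completion string into a list of row dicts."""
    # The completion starts with blank lines; strip them so the header comes first.
    reader = csv.DictReader(io.StringIO(text.strip()), skipinitialspace=True)
    return list(reader)


def to_dataframe(text):
    """Same parse, returned as a pandas DataFrame (requires pandas)."""
    import pandas as pd

    return pd.read_csv(io.StringIO(text.strip()), skipinitialspace=True)


# Shortened copy of the response shown later in this post.
sample = (
    "\n\nName,Age,Gender,Location,Interests,Height,Weight\n"
    "John,27,Male,New York City,Travelling,5'10,180lbs\n"
    "Samantha,25,Female,Los Angeles,Reading,5'4,125lbs"
)

rows = parse_completion(sample)
print(rows[0]["Name"], rows[0]["Location"])
```

Stripping the leading newlines before parsing is a design choice here: it keeps the header on the first line so it is read as column names rather than data.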
Constructing our prompt
Think of GPT-3 as a colleague. If you ask your coworker to do something for you, you need to be as specific and explicit as possible when describing what you want. Here we are using the text completion API endpoint of the general-purpose GPT-3 model, meaning that it was not explicitly designed for creating data. This requires us to specify in our prompt the format we want our data in - “a comma separated tabular database.” Using the GPT-3 API, we get a response that looks like this:
"\n\nName,Age,Gender,Location,Interests,Height,Weight\nJohn,27,Male,New York City,Travelling,5'10,180lbs\nSamantha,25,Female,Los Angeles,Reading,5'4,125lbs\nMatthew,30,Male,Chicago,Cooking,6'2,200lbs\nEmily,21,Female,San Francisco,Hiking,5'6,135lbs\nMichael,28,Male,Dallas,Gardening,5'11,190lbs"
Not bad! It actually gave us comma separated data that we can easily format with pandas.
Here’s the resulting dataframe from the prompt: “Create a comma separated tabular database of customer data from a dating app”:
GPT-3 came up with its own set of variables, and somehow determined exposing your weight on your dating profile was a good idea (😬). The rest of the variables it gave us were appropriate for our application and demonstrate logical relationships - names match with gender and heights match with weights. GPT-3 only gave us 5 rows of data with an empty first row, and it did not generate all of the variables we wanted for our experiment.
In the spirit of good communication with our coworker Ms. GPT-3, we will actually tell it what we want - while also taking the suggestion of including a variable for interests.
Here’s the resulting dataframe from the prompt: “Create a comma separated tabular database of customer data from a dating app with the following columns: first name, last name, age, city, state, gender, sexual orientation, interests, number of likes, number of matches, date customer joined the app, and the customer’s rating of the app between 1 and 5”:
GPT-3 did not give us any column headers, and every other row of the table was empty, leaving only 4 rows of actual customer data. It also gave us three columns of interests when we were only looking for one, but to be fair to GPT-3, we did use a plural. All that being said, the data it did produce for us isn’t half bad: names and sexual orientations track with the correct genders, the cities it gave us are in their correct states, and the dates make sense.
Hopefully if we give GPT-3 some examples it will better understand exactly what we’re looking for. Unfortunately, due to product limitations, GPT-3 can’t read an entire database to learn and generate synthetic data from, so we can only give it a few example rows.
This time, we give GPT-3 the following examples:
We also ask it to give us column headers and 50 rows of data.
The first five rows of the resulting dataframe from the prompt: "Create a comma separated tabular database with column headers of 50 rows of customer data from a dating app. Example: ID, FirstName, LastName, Age, City, State, Gender, SexualOrientation, Interests, NumberofLikes, NumberofMatches, DateCustomerJoined, CustomerRating, Df78hd7, Barbara, Prime, 23, Nashville, TN, Female, Lesbian, (Hiking Cooking Running), 2700, 170, 05/09/2017, 4.0, 87hbd7h, Douglas, Woods, 35, Chicago, IL, Male, Gay, (Baking Painting Reading), 3200, 150, 04/01/2019, 3.5, asnf84n, Randy, Ownes, 22, Chicago, IL, Male, Straight, (Running Hiking Knitting), 500, 205, 11/01/2021, 3.2"
Giving GPT-3 something to base its creation on really helped it give us what we want. Here we have column headers, no empty rows, interests being all in one column, and data that generally makes sense! Unfortunately, it only gave us 40 rows, but even so, GPT-3 just secured itself a decent performance review.
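A few-shot prompt like the one above can be assembled from a list of column names and example rows, which makes it easier to swap examples or change the requested row count. The sketch below is illustrative only: the helper `build_prompt` is hypothetical, and the wording and example rows are copied from the prompts quoted in this post.

```python
COLUMNS = [
    "ID", "FirstName", "LastName", "Age", "City", "State", "Gender",
    "SexualOrientation", "Interests", "NumberofLikes", "NumberofMatches",
    "DateCustomerJoined", "CustomerRating",
]

# Example rows taken from the prompts quoted in this post.
EXAMPLES = [
    ["df78hd7", "Barbara", "Prime", "23", "Nashville", "TN", "Female", "Lesbian",
     "(Hiking Cooking Running)", "2700", "170", "05/09/2017", "4.0"],
    ["87hbd7h", "Douglas", "Woods", "35", "Chicago", "IL", "Male", "Gay",
     "(Baking Painting Reading)", "3200", "150", "04/01/2019", "3.5"],
    ["asnf84n", "Randy", "Ownes", "22", "Chicago", "IL", "Male", "Straight",
     "(Running Hiking Knitting)", "500", "205", "11/01/2021", "3.2"],
]


def build_prompt(n_rows, columns=COLUMNS, examples=EXAMPLES, extra_instruction=""):
    """Assemble the instruction, optional extra request, header, and example rows."""
    instruction = (
        "Create a comma separated tabular database with column headers "
        f"of {n_rows} rows of customer data from a dating app."
    )
    if extra_instruction:
        instruction += " " + extra_instruction
    # Header and example rows are flattened into one comma separated run,
    # matching the layout of the prompts quoted in this post.
    example_text = ", ".join([", ".join(columns)] + [", ".join(r) for r in examples])
    return f"{instruction} Example: {example_text}"


print(build_prompt(50))
```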
Evaluating GPT-3’s Synthetic Data
The data points that interest us are not independent of each other, and these relationships give us criteria with which to evaluate our generated dataset.
Overall, GPT-3 was able to generate data with the appropriate natural relationships between columns.
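Some of these relationships can be checked row by row. The sketch below is one possible set of checks, not the evaluation used for this post: it tests that a city sits in the expected state, that the join date is a valid past date, and that the rating falls between 1 and 5. The `CITY_TO_STATE` lookup is a hypothetical table covering only the cities in our example rows, and `check_row` is an illustrative helper.

```python
from datetime import datetime

# Hypothetical lookup covering only the cities in the example rows.
CITY_TO_STATE = {"Nashville": "TN", "Chicago": "IL"}


def check_row(row):
    """Return a list of human-readable problems found in one generated row."""
    problems = []

    # Cities we know about should appear with their correct state.
    expected_state = CITY_TO_STATE.get(row["City"])
    if expected_state is not None and expected_state != row["State"]:
        problems.append(f"{row['City']} is not in {row['State']}")

    # Join dates should parse as MM/DD/YYYY and should not be in the future.
    try:
        joined = datetime.strptime(row["DateCustomerJoined"], "%m/%d/%Y")
    except ValueError:
        problems.append("join date is not a valid MM/DD/YYYY date")
    else:
        if joined > datetime.now():
            problems.append("join date is in the future")

    # Ratings were requested on a scale from 1 to 5.
    if not 1 <= float(row["CustomerRating"]) <= 5:
        problems.append("rating is outside the 1 to 5 range")

    return problems


good = {"City": "Nashville", "State": "TN",
        "DateCustomerJoined": "05/09/2017", "CustomerRating": "4.0"}
bad = {"City": "Chicago", "State": "TN",
       "DateCustomerJoined": "13/40/2021", "CustomerRating": "7"}

print(check_row(good))
print(check_row(bad))
```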
GPT-3 gave us a relatively normal age distribution that makes sense in the context of Tinderella, with most customers being in their mid-to-late 20s. It’s kind of surprising (and a little concerning) that it gave us such a spike of low customer ratings. We didn’t anticipate seeing any patterns in this variable, or in the number of likes or number of matches, so these random distributions were expected.
Initially we were surprised to find an almost even distribution of sexual orientations among customers, expecting the majority to be straight. Considering that GPT-3 was trained on text crawled from the internet, there may be some logic to this trend. Grindr, a dating app for LGBTQ+ people, has been around longer (est. 2009) than other popular dating apps such as Tinder (est. 2012) and Hinge (est. 2012). Because Grindr has been around longer, there may be more related text about the app’s target population for GPT-3 to learn from, perhaps biasing the model.
It’s nice that GPT-3 can give us a dataset with accurate relationships between columns and sensible data distributions… but can we expect more from GPT-3?
Asking GPT-3 For Statistical Relationships
We hypothesize that our customers will give the app higher ratings if they have more matches. We ask GPT-3 for data that reflects this.
Prompt: "Create a comma separated tabular database with column headers of 50 rows of customer data from a dating app. Make sure there’s a relationship between number of matches and customer rating. Example: ID, FirstName, LastName, Age, City, State, Gender, SexualOrientation, Interests, NumberofLikes, NumberofMatches, DateCustomerJoined, CustomerRating, df78hd7, Barbara, Prime, 23, Nashville, TN, Female, Lesbian, (Hiking Cooking Running), 2700, 170, 05/09/2017, 4.0, 87hbd7h, Douglas, Woods, 35, Chicago, IL, Male, Gay, (Baking Painting Reading), 3200, 150, 04/01/2019, 3.5, asnf84n, Randy, Ownes, 22, Chicago, IL, Male, Straight, (Running Hiking Knitting), 500, 205, 11/01/2021, 3.2"
From this prompt we get data with the following distribution of number of matches to customer rating.
Ok, GPT-3! It was actually able to get this one right… but is this observed relationship significant?
A quick linear regression shows that GPT-3 was able to generate data with a statistically significant, though weak, positive relationship between the number of matches someone has and the rating they give the app.
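The regression itself is not shown in this text. As a rough sketch of such a check using only the standard library, the code below fits a least-squares line and then runs a permutation test, which stands in here for the regression's usual t-test p-value (a substitution made to avoid extra dependencies; `scipy.stats.linregress` reports the slope and p-value directly). The numbers are made-up toy values, not the data GPT-3 generated, and the function names are illustrative.

```python
import random


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided p-value: share of shuffles whose slope is at least as extreme."""
    rng = random.Random(seed)
    observed = abs(ols_slope_intercept(x, y)[0])
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(ols_slope_intercept(x, shuffled)[0]) >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Toy values for illustration only; NOT the dataset generated in this post.
matches = [50, 80, 110, 140, 170, 200, 230, 260]
ratings = [2.1, 2.6, 2.4, 3.1, 3.4, 3.3, 4.0, 4.2]

slope, intercept = ols_slope_intercept(matches, ratings)
p_value = permutation_p_value(matches, ratings)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, p={p_value:.4f}")
```

A small p-value only says the slope is unlikely to be zero; how weak or strong the relationship is depends on the slope and the scatter around the line.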
Weaknesses of GPT-3 For Generating Fake Data
The generated data does not accurately reflect what our production data might actually look like. In reality our customer base would be clustered geographically, there would probably be a majority of straight users, and the first names of our customers would likely reflect the diversity of the population. Lowering the temperature of a request makes the output more predictable, which increases the likelihood of repeated data. Reducing the temperature, as expected, did not improve the variance in these columns and instead amplified the homogeneity of our name data.
GPT-3 generates data that is essentially foreign to our production data. For a use case such as the one we introduced in this article - getting data for data’s sake - this is not a problem. If you are looking to generate testing data, however, this would be a more serious limitation.
In this example we asked GPT-3 for data from 50 customers, but it returned a different number of rows each time we made the call, ranging from 8 to 50. We tweaked the parameters of the API call to address this (increasing max_tokens as much as possible, adjusting penalties up and down), but these changes did not help the model generate a consistent number of rows.
A workaround for this issue is to use a while-loop. That requires making multiple API calls, however, which is quite computationally expensive.
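A minimal sketch of that while-loop is shown below. It assumes a hypothetical `generate_rows` callable that makes one API call and returns the parsed rows; here it is replaced by a stand-in that returns batches of varying size, so the loop can be tried without calling the API. The `max_calls` cap is an addition for safety, so the loop cannot run forever if the model keeps returning nothing.

```python
def collect_rows(generate_rows, target, max_calls=10):
    """Call generate_rows() until `target` rows are collected or max_calls is hit."""
    rows = []
    calls = 0
    while len(rows) < target and calls < max_calls:
        rows.extend(generate_rows())  # each call may return a different number of rows
        calls += 1
    # Trim any overshoot from the final batch.
    return rows[:target], calls


# Stand-in for the API: three batches with 8, 20, and 30 rows.
fake_batches = iter([[{"ID": i} for i in range(n)] for n in (8, 20, 30)])

rows, calls = collect_rows(lambda: next(fake_batches), target=50)
print(len(rows), calls)
```

Each pass through the loop is a separate request, so the cost grows with the number of calls needed to reach the target.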
Computationally expensive and no on-prem deployment
Large language models are computationally very expensive: GPT-3’s 175 billion parameters require 700GB of memory. Generating thousands of rows of synthetic data would require an extreme amount of compute. Further, since GPT-3 can only be accessed through OpenAI’s API, generating test data this way may require transmitting sensitive data, an obvious non-starter in most cases.
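The 700GB figure is consistent with a simple back-of-the-envelope calculation, assuming each parameter is stored as a 32-bit float (4 bytes). That precision is an assumption made here for illustration; the post does not state it.

```python
def weights_memory_gb(n_parameters, bytes_per_parameter=4):
    """Memory needed just to hold the model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_parameters * bytes_per_parameter / 1e9


# 175 billion parameters at 4 bytes each (assumed 32-bit floats)
print(weights_memory_gb(175_000_000_000))  # 700.0
```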
GPT-3 doesn’t struggle with the same limitations that Faker does, being able to produce any data point and maintain relationships between columns. Real data is complex though, and we can’t be sure that GPT-3 is truly modeling these intricacies. It is impressive, however, that we can explicitly ask GPT-3 to give us a statistical relationship in the data and it will deliver.
GPT-3 does of course come with its own set of limitations. The most important is that OpenAI owns the model and all inputs and outputs, so real production data can never be used as examples in prompts. This limits our ability to generate data that closely matches our production data.
If you are looking for a quick way to generate a small amount of data to test a POC, then GPT-3 is a great option. If you are looking for a way to produce high-quality test data that mimics the actual relationships in your production data, then perhaps you’re looking for something more like Tonic for development and testing or Djinn for data science. 👀
*Generated using GPT-3
**Featured image “Melting clock with hearts in the style of Salvador Dali” generated using DALL-E
Madelyn Goodman, MSc
Data Science Evangelist
Data Scientist | Proficient in Machine Learning, Python, and SQL