How to Create Realistic Test Data for PostgreSQL

March 19, 2021

TL;DR: There are many resources available for creating synthetic data in PostgreSQL. There are also many challenges involved in doing this in-house. Tonic makes all of the above easy.

PostgreSQL is an open-source object-relational database out of the University of California at Berkeley that has origins as early as 1986. With 30 years of continual improvement and a rapidly growing user base, it’s a fierce contender in the SQL database sphere.

The main reasons companies choose Postgres are:

  • it is free and open source
  • it has an active community without a single controlling entity
  • it can store massive amounts of data 
  • it has a wide range of features that enable the management of datasets large or small

In Postgres, developers can work with NoSQL code and store JSON files, which means more storage options and flexibility. You'd be surprised to learn that Postgres supports a wide spectrum of format's that are leveraged by developers, including XML, spatial coordinates, key/value pairs, JSON, and more.

Some of the largest databases in the world are hosted in Postgres, including a Yahoo! database that claimed it broke a record when it reached 2 petabytes back in 2008. Postgres is no stranger to scale: it's been tried and tested in workloads that have reached hundreds of thousand TPS in heavy I/O workloads..

Companies across industries, including the likes of Northrop Grumman, NASA, and Revolt, use PostgreSQL to store sensitive data like financial transactions, classified information, PII, PHI, and confidential client data.

Creating Synthetic Data in PostgreSQL

Given its popularity, it naturally follows that there’s an ever-growing need for protecting and synthesizing data stored in Postgres. In fact, Postgres was the first database type we supported in the early days of Tonic for this very reason. But in the spirit of open source, let’s take a look at the tools available for creating mock data in-house.

For specific, step-by-step instructions on how to create mock data using PostgreSQL, here are two great resources:

Generating Fake Data Using SQL, by Vinicius Negrisolo

How to Create PostgreSQL Test Data, by Alex Thompson

There are also many tools available for teams looking to create fake data more generally. Here are just a few popular resources:

  • Mockaroo: the ever popular, freely available mock data generator
  • Mimesis: a high-performance data generator for Python
  • Pydbgen: a Python package for generating synthetic structured database tables
  • Copulas: a library to model multivariate data using copulas

But none of these address the problem of needing to synthesize data based on existing data. This is a much more complicated task.

You could use the tools above to create data that matches your schema (though it’s bound to get messy, depending on the size of your database). Or you could write scripts in house to do the same (though it’s bound to get messy, depending on the size of your database). Or you could hook your PostgreSQL database directly up to Tonic.

Mimicking Data Stored in PostgreSQL with Tonic

Tonic integrates seamlessly with PostgreSQL to create safe, realistic, de-identified data for QA, testing, and analysis. To use Tonic to generate synthetic data based on a PostgreSQL dataset, simply follow these easy steps to connect Tonic to your source and destination databases. For the source database, we recommend using a backup or fast follower database instead of connecting directly to your production environment.

Through our API or intuitive UI, you can then easily create a model of your production data that will transform that data in-flight from source to destination, to give you a fully anonymized and fully useful mimicked database to hydrate your lower environments.

Sound like the fix you were looking for? Drop us a line; we’d be happy to show you Tonic in action.

Real News — from the experts on Fake Data.