Generating High Quality Test Data for MySQL Through De-identification and Synthesis

July 22, 2021

A Little Bit About MySQL

MySQL is one of many relational database management systems (RDBMS) available on the market for organizations that need a repository for large amounts of data. MySQL is open-source—any user can use, change, or distribute the code to suit their specific purposes. Open-source databases are increasing in popularity as organizations attempt to control IT costs and leverage the collaboration and communities that come with open source solutions.

The name 'MySQL' is a combination of the name of the co-founder's daughter, and the well-known acronym, structured query language (SQL). It originated in 1995 in Sweden by MySQL AB and by 2001 had reached over two million active installations. In 2010 it was acquired by Sun Microsystems, now Oracle Corporation. One fork of MySQL is MariaDB, another RDBMS created by developers concerned about Oracle's acquisition of the platform. Percona was also forked from MySQL.

MySQL is one of the most popular databases on the market, ranking number two on db-engines.com. It’s most widely used for web applications, with some of the world's biggest companies like Facebook and Uber leveraging this capability. Analyticsindiamag.com ranked it the number one most popular database used by developers.

While it tops the charts in popularity, it has several primary use cases due its well-known security, high availability, and integration features. Here are two of its most common uses:

  • eCommerce applications — MySQL integrates easily with other applications such as ERP, CRM, POS systems, and more. Because it can deliver information across a wide range of critical systems and can deliver high availability, it's a popular option for eCommerce retailers.
  • Web development — Particularly for smaller companies, MySQL is a popular choice for web developers who want to control costs. Additionally, it is incorporated into Linux, Apache, MySQL, and PHP (LAMP), which is necessary for developers wanting to build on Drupal or WordPress. 90-percent of websites use MySQL.

Issues Creating Realistic Test Data in MySQL

While MySQL clearly has many well-documented uses, challenges arise when organizations attempt to create homegrown mock data for testing, QA, and analysis. For retail and eCommerce companies, in particular, protecting test data privacy through anonymization is critical to reducing the risk of data breaches. Generating de-identified data for MySQL that protects PII and complies with regulations like GDPR and HIPAA is not as easy as it sounds. Here are five common reasons developers struggle with this process:

1) Using scripts or manual techniques is extremely time-consuming

It can take days or even weeks to generate enough dummy data to properly test your programs, depending on how much data you require. This is time you could use to get a jumpstart on bug fixes and additional features or even get ahead of your deadlines.

2) Fake data created using scripts looks like...fake data

For your synthetic data to be effective, it needs to resemble the complexity of your production data as closely as possible. Hashing your data in-house can certainly get the job done, but your ability to replicate real-world scenarios will be very limited - which means your testing won't be effective either.

3) It's easy for PII to leak into your new data set

Depending on how you’re creating MySQL test data, it’s not uncommon for pieces of PII to get pulled in through unstructured or nested data, especially if you are storing blobs, XML or JSON. And even when PII is anonymized, reverse engineering can still pose a threat to your test data (read more here) and, by extension, to your job.

4) It's difficult to link columns of related data

To create realistic mock data for MySQL, your data likely needs a significant amount of linking to accurately represent relationships across your dataset. This adds an incredible degree of complexity when trying to build an in-house solution, and many tools available on the market don't offer the capability off the shelf.

5) Maintaining and updating your schemas is tedious and time-consuming

Nobody's data is static. It is constantly building and updating based on sources inside and outside your organization. You can lose a lot of labor hours keeping your test data up-to-date with your production data.

Solutions for Mimicking Data Stored in MySQL with Tonic

The best method of creating realistic test data in MySQL using de-identification is to leverage best-in-class tools and techniques that will significantly reduce your time investment in synthetic data generation while also protecting PII, linking critical fields, and keeping your test data as fresh as your production data. You could cobble together some online scripts like this and leverage other free tools to get part of the way there, or you could consider using a complete, secure, one-stop-shop like Tonic for data generation.

Tonic provides an ever-growing list of generators, the ability to add foreign key constraints, and advanced subsetting capabilities, to create test data in MySQL that looks and feels like your production data. Here are a few benefits you'll gain from using Tonic:

  • Save Time — Connecting Tonic to your database is simple and quick. Just follow these steps and you'll be ready to start generating mock data in MySQL.
  • Protect Your Data — Your data is safe with features like privacy scan, which automatically locates and de-identifies sensitive data, and schema change alerts, which notify you of any changes to your schema.
  • Create True Mock Production Data — You can reliably reproduce your production bugs in lower environments with Tonic's custom data generators, dependency linking, input-to-output consistency, and primary and foreign key synthesis.
  • Keep Mock Data Consistently Fresh — Run generations as often as you need, even several times a day so that your data never breaks or gets outdated.

Ready to create safe data that truly mimics your MySQL database? Get in touch with our team to see the tools that deliver results.

Real News — from the experts on Fake Data.