Four Ways to Generate Fake Data That Aren’t Working as Well as You Think

April 7, 2022

When testing an application, it’s common to generate fake data to simulate user entry, database contents, and high throughput scenarios. There are many techniques for creating that fake data — from ad hoc, small-scale data that developers create on the fly, to data generated using a data synthesis platform.

Each strategy for producing fake data has unique benefits… but not all of them produce the same quality of results. (Ask us how we know!)

Let’s take a look at four common ways to generate fake data that don’t always pan out… plus, a solution that works.

Four Approaches to Generating Test Data In-House

Traditional, in-house approaches to generating test data fall into four categories: dummy data, mock data, “anonymized” data, and data bursts.

Let’s examine these tried-and-true methods before exploring the more modern data synthesis approach.

1. Dummy Data

Dummy data is the simplest in-house option: developers create it on the fly as a placeholder for actual data. It often doesn’t look like real data — an address field might contain random characters rather than anything resembling an actual address.

Dummy data is quick and easy to make. It can be useful for testing a specific edge case, as developers can more carefully craft the data to align with their needs. 

The problem? It isn’t a scalable solution, making it viable only for small-scale, ad hoc testing.
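To make the idea concrete, here’s a minimal sketch of what a dummy record typically looks like. The field names (`name`, `address`, `zip`) are illustrative, not from any particular schema:

```python
import random
import string

def dummy_record():
    """Build one placeholder record filled with random characters.

    The values satisfy field presence and length, but make no attempt
    to look like real names or addresses."""
    junk = lambda n: "".join(random.choices(string.ascii_letters + string.digits, k=n))
    return {"name": junk(8), "address": junk(16), "zip": junk(5)}

record = dummy_record()
```

A record like this is fine for checking that a form accepts input or a column isn’t empty, but it will never exercise logic that depends on realistic values.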

2. Mock Data

Mock data is another in-house solution, though it can also be created using generators or open-source libraries. It’s similar to dummy data, but because developers use automated tools to produce it, it’s typically more consistent.

Mock data can be generated against a set of rules to ensure this consistency. This enables it to integrate into a software development pipeline, automating the creation of test data before tests are run. 

For example, mock data tools can generate long or short strings, add special characters to strings, and insert random NULL fields. They can often produce data that resembles real data, such as people's names, company names, and addresses. 
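A rule-based mock generator can be sketched in a few lines. This is a toy illustration, not any particular library: the name pools, NULL rate, and special-character rule are all assumptions chosen to mirror the rules described above.

```python
import random

random.seed(7)  # fixed seed so pipeline runs are repeatable

FIRST = ["Ada", "Grace", "Alan", "Edsger"]
LAST = ["Lovelace", "Hopper", "Turing", "Dijkstra"]

def mock_user(null_rate=0.1, special_rate=0.2):
    """Generate one user row under simple rules: realistic-looking names,
    an occasional NULL email, and occasional special characters in names."""
    first, last = random.choice(FIRST), random.choice(LAST)
    if random.random() < special_rate:
        last += "-O'Brien"  # inject hyphen and apostrophe edge cases
    email = None if random.random() < null_rate else f"{first}.{last}@example.com".lower()
    return {"first_name": first, "last_name": last, "email": email}

rows = [mock_user() for _ in range(100)]
```

Because the output is rule-driven and seeded, the same test data can be regenerated on every pipeline run, which is exactly what makes mock data easy to automate.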

The problem? Although this data better mimics reality, it remains challenging to maintain referential integrity between mock data tables within more complex systems.

3. “Anonymized” Data

Another common solution is anonymized data. This involves taking real data and masking any personally identifiable information (PII). Anonymized data provides the benefits of realism without the security risks associated with using PII.

The problem? If done in-house, using anonymized data significantly increases the burden on in-house development teams, as they must ensure that anonymization software works correctly and all PII data points are identified.
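As a rough sketch of what in-house masking involves, the snippet below hashes a fixed list of known PII fields. The field list is an assumption, and that’s precisely the weak point: any PII field the team fails to list passes through untouched.

```python
import hashlib

PII_FIELDS = {"name", "email", "ssn"}  # assumed; missing a field here leaks real data

def mask(row):
    """Replace known PII values with a truncated SHA-256 digest.

    A deterministic hash keeps the same input mapping to the same output,
    so joins across tables still line up after masking."""
    out = {}
    for key, value in row.items():
        if key in PII_FIELDS and value is not None:
            out[key] = hashlib.sha256(value.encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

masked = mask({"name": "Jane Doe", "email": "jane@example.com", "plan": "pro"})
```

Even this toy version shows the maintenance burden: the team owns both the masking logic and the ever-growing inventory of fields that must be covered by it.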

4. Data Bursts

Data bursts are useful for stress testing a system: a generator produces a large volume of data in a short time to verify that the system under test can keep pace with the load. Bursts can also measure performance and detect degradation under heavy load.
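The shape of a burst generator is simple. In this hedged sketch, `handler` stands in for whatever ingests the data (an insert call, an API client, a message producer), and the batch and payload sizes are arbitrary:

```python
import time

def burst(handler, total=100_000, batch=1_000):
    """Feed `handler` rows in large batches and report throughput."""
    start = time.perf_counter()
    sent = 0
    while sent < total:
        # Cheap, uniform rows: a burst cares about volume, not realism.
        rows = [{"id": sent + i, "payload": "x" * 64} for i in range(batch)]
        handler(rows)
        sent += batch
    elapsed = time.perf_counter() - start
    return sent, sent / elapsed  # rows sent, rows per second

count, rate = burst(lambda rows: None)  # no-op handler for illustration
```

Note how little control the approach gives you over the data itself — the rows are deliberately uniform, which is why bursts are easy to run but hard to wield with precision.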

The problem? The process prioritizes quantity, making it fairly simple for developers to use — but difficult to wield with precision.

While each of these fake data generation methods has its uses, they ultimately fall short of providing a robust synthesized data solution.

A Better Way to Generate Fake Data

Although the four previous approaches may sometimes be sufficient, they’re also cumbersome — especially when developers must combine multiple techniques. This is where a data mimicking tool like Tonic comes in. 

Data mimicking is a new approach to generating fake test data that combines the best aspects of data synthesis with anonymization and privacy. The goal of any test data is to resemble production data as closely as possible. Tonic synthesizes test data from production data to ensure that your testing environments align with the real world.

We believe that modern problems require modern solutions. That’s why Tonic was designed with the following capabilities:

  • All the required tools for synthesizing data in one place. This prevents developers from having to integrate with multiple generators and open source providers, drastically simplifying the testing suite. 
  • Our subsetting capability enables us to subset production data across multiple tables while retaining referential integrity — something that’s often difficult to achieve and that requires manual effort with traditional testing approaches.
  • Tonic is an enterprise-ready platform that supports the most popular databases. So, you can easily integrate it regardless of your tech stack. It also provides a REST API for integration into more custom solutions. 
  • Access to key enterprise features including single sign-on, role-based access control, and audit logs. 
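To see why referential integrity makes subsetting hard, here is a generic illustration with two in-memory tables — this is not Tonic’s implementation, and the table and column names are invented for the example:

```python
# A parent table and a child table linked by a foreign key.
users = [{"id": i, "name": f"user{i}"} for i in range(10)]
orders = [{"id": 100 + i, "user_id": i % 10} for i in range(30)]

def subset(users, orders, keep_user_ids):
    """Keep a slice of users, then keep only the orders whose foreign
    key still resolves, so the subset never contains dangling references."""
    kept_users = [u for u in users if u["id"] in keep_user_ids]
    kept_orders = [o for o in orders if o["user_id"] in keep_user_ids]
    return kept_users, kept_orders

sub_users, sub_orders = subset(users, orders, {1, 2, 3})
# Every order in the subset references a user that exists in the subset.
assert all(any(o["user_id"] == u["id"] for u in sub_users) for o in sub_orders)
```

With two tables this is trivial; across dozens of tables with chained and circular foreign keys, doing it by hand is exactly the manual effort the bullet above refers to.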

Still curious? Check out these case studies to see how Tonic.ai customers are thriving with our solution in place. 

When It Comes to Faking It, Tonic is Just… Better 

As testing suites continue to require fake data generation, traditional approaches have become insufficient. Developers can no longer afford to spend time locating and configuring data generation libraries or wrestling with data integrity and consistency — nor should they have to grapple with masking and anonymizing PII.

Tonic offers data mimicking capabilities that combine the best of the traditional approaches into one enterprise-ready solution that securely handles PII data. With Tonic, developers can spend more time running tests, and less time configuring them.

Visit Tonic.ai to learn more about data mimicking and how Tonic can modernize your company’s testing process.

Real News — from the experts on Fake Data.