
How to Create Realistic Test Data for Amazon Redshift

Abigail Sims
October 10, 2022

    After debuting in 2013 as a direct competitor to Oracle, Amazon Redshift firmly planted itself in the industry. Not only is the name an unsubtle jab at “Big Red” (aka Oracle), meant to draw users to Amazon’s platform, but Redshift also comes packaged nicely with the rest of Amazon’s cloud-native tools. It’s a popular option for organizations that need speed and size in one package.

    With its built-in massively parallel processing (MPP) and support for a gigantic 16 petabytes per cluster, it’s clear to see why. MPP lets Redshift distribute storage and query execution across nodes for fast performance while remaining a columnar data store. It also offers robust querying tools for business intelligence teams and data analysts alike.

    However, as we all know, using your production data for testing can be risky. (Read as: A big no-no.) As companies look to solve the test data problem, many turn to synthetic data solutions to create realistic but safe testing data.

    Without further ado, let’s break down how to create realistic test data for Amazon Redshift. 

    Challenges When Creating Test Data For Amazon Redshift

    Before we get into the “how,” we need to talk about some issues with Redshift that you’ll need to be aware of. None of these are dealbreakers, but you’ll most likely run into them when using the cloud data warehouse.

    Challenge 1: No Foreign/Primary Key Constraints

    One of the core aspects of relational databases is the use of primary and foreign keys. A primary key marks a column or set of columns as unique, so a table can’t contain duplicate entries. A foreign key, on the other hand, links two tables together on common columns and maintains data integrity when they’re joined.

    Amazon Redshift doesn’t enforce these constraints. Instead, it expects you to declare them for query planning and then rely on your extract, transform, and load (ETL) process, or some other process in your system, to keep the data clean. This is one of the ways Redshift manages to be lightning fast with inserts, but it comes at a cost to uniqueness.
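    You can see this for yourself with a couple of inserts. Below is a minimal sketch, assuming you have a Redshift cluster to connect to; the host, credentials, and table are hypothetical. Redshift speaks the Postgres wire protocol, so psycopg2 works as a client.

```python
# A quick demonstration that Redshift records but does not enforce
# primary keys. All connection details below are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INT PRIMARY KEY, email VARCHAR(128));")

# On Postgres, the second insert would raise a unique-violation error.
# Redshift accepts both rows: the PRIMARY KEY is informational, used
# by the query planner, not enforced at write time.
cur.execute("INSERT INTO users VALUES (1, 'a@example.com');")
cur.execute("INSERT INTO users VALUES (1, 'b@example.com');")

cur.execute("SELECT COUNT(*) FROM users;")
print(cur.fetchone()[0])  # 2 -- duplicate primary keys made it in
conn.commit()
```

    Any test data you load has to arrive already deduplicated, because the database won’t catch the problem for you.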

    Challenge 2: No MPP When Using Data From Outside Sources

    Amazon doesn’t support most outside data sources with Redshift’s MPP-based parallel loading, which may put you in a bind. Officially, only Amazon S3, Amazon EMR, and DynamoDB are supported for parallel uploads.

    Instead, it’s up to developers to find a workaround if they’re not already transferring data from one of those sources. This means your synthetic data can take a long time to move into Redshift. There are some workarounds, though, such as using JDBC inserts or staging the data in S3 as part of your ETL process (which is how Tonic gets the job done).
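    To make the difference concrete, here’s a hedged sketch of the fast path: stage a file in S3, then bulk-load it with Redshift’s COPY command so the cluster can ingest it in parallel. The bucket, table, IAM role, and connection details are hypothetical.

```python
# Sketch: stage data in S3, then bulk-load it with COPY so Redshift
# can parallelize the ingest across its nodes. Names are hypothetical.
import boto3
import psycopg2

# 1. Upload the data file to the staging bucket.
s3 = boto3.client("s3")
s3.upload_file("users.csv", "my-staging-bucket", "synthetic/users.csv")

# 2. Have Redshift pull it in parallel with COPY. This is dramatically
#    faster than row-by-row INSERTs over JDBC/ODBC.
conn = psycopg2.connect(host="...", port=5439, dbname="dev",
                        user="awsuser", password="...")
with conn.cursor() as cur:
    cur.execute("""
        COPY users
        FROM 's3://my-staging-bucket/synthetic/users.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)
conn.commit()
```

    This S3-staging route is exactly why Tonic folds the transfer into the ETL process rather than pushing rows over JDBC.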

    Challenge 3: Large Datasets Make Data Anonymization Hard

    Redshift supports up to 16 petabytes of data, meaning your organization is most likely using it because it has a lot of data. However, trying to create test datasets from a volume this large is incredibly difficult. Without built-in automation and data generation tools to help, capturing a realistic image of your production database can be nearly impossible.

    Finding an Effective Test Data Solution

    Amazon Redshift makes generating synthetic data difficult because, regardless of what solution you use, it needs to:

    • Adapt to Redshift’s lack of primary/foreign key constraints.
    • Sufficiently mask or protect personally identifiable information.
    • Handle large datasets by minimizing their footprint.
    • Take full advantage of Redshift’s speed.

    Keeping all of these in mind while staying as efficient as possible can prove challenging, but it’s not impossible. You may be able to accomplish this with character scrambling, or with neural networks like generative adversarial networks (GANs). Still, neither approach is perfect on its own, as the sketch below suggests.
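    To illustrate the simpler of the two, here’s a minimal character-scrambling sketch: it preserves each character’s class and the string’s shape while destroying the actual values. This is a generic illustration, not Tonic’s implementation.

```python
# Minimal character scrambling: replace every letter with a random
# letter and every digit with a random digit, preserving case and
# string shape. A generic illustration, not Tonic's implementation.
import random
import string

def scramble(value: str) -> str:
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(random.choice(pool))
        else:
            out.append(ch)  # keep punctuation/format characters intact
    return "".join(out)

print(scramble("jane.doe@example.com"))  # e.g. "xqzr.tvk@mwpaqle.com"
```

    The trade-off is visible in the output: the format survives, but the statistical relationships in your data don’t, which is where model-based approaches like GANs come in. Neither alone gives you both format fidelity and statistical realism.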

    How Tonic Creates Synthetic Data for Testing

    So now that we know what to look for, how can we overcome these pitfalls? With Tonic, you can safely generate realistic test data for Redshift through our one-stop shop for synthetic data.

    Tonic uses data generators to do exactly what they’re named for. With generators, you can create synthetic data in a multitude of ways. Whether it’s scrambling data, search-and-replace, ML-based generation, or more, we offer plenty of ways to get it done.

    How to Create Test Data for Amazon Redshift with Tonic

    With Tonic, we give you a fantastic platform to synthesize data while still benefiting from the bevy of features Redshift is known for. Tonic acts as a middleman between your production database and a new Redshift test database by using S3 as a pipeline for your data.

    To break it down further, let’s cover the steps involved in the process.

    Configuring Your Environment 

    French chefs have a saying, “mise en place,” which means “putting everything in place.” That’s exactly what we’re going to do here. You’ll need to set up your instance profile and database permissions, an AWS Lambda that can access the right services, and some specific environment variables to get your Tonic workspace ready.

    Due to how Redshift works, Tonic has to act as a middleman to generate synthetic data properly. We’ll need two different Redshift databases: your production database to provide the source material, and a new Redshift instance to receive the synthetic data.
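    As a rough sketch of the permissions piece, the Lambda’s role needs read/write access to the staging S3 bucket, along the lines of the policy below. The bucket and policy names are hypothetical; check Tonic’s documentation for the exact permissions required.

```python
# Sketch: create an IAM policy granting read/write access to the S3
# staging bucket. Bucket and policy names are hypothetical; consult
# Tonic's docs for the exact permissions it needs.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-tonic-staging-bucket",
            "arn:aws:s3:::my-tonic-staging-bucket/*",
        ],
    }],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="TonicStagingBucketAccess",  # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)
```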

    Setting up Your Workspaces 

    For your source database, you’ll enter all the details about that server into Tonic, such as the server name, credentials, and the path to the S3 bucket where the data will be stored temporarily. There are also a couple of options here for blocking data generation on schema changes and preserving database owners, for specific use cases.

    Follow the same steps for the destination, and you’re off!

    Generating Synthetic Data

    After we’ve got it all set up, it’s time to fake it ’til you make it. Tonic takes care of the rest, allowing you to set up generators for your columns and get to it! As you go through this process, Tonic temporarily uses an S3 bucket to store the synthetic data. The AWS Lambda takes files from the bucket, writes them to a new file, and sends it on to the new Redshift database.
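    For a sense of what that last hop can look like, here’s a generic sketch of an S3-triggered Lambda that loads a newly staged file into Redshift using the Redshift Data API. This illustrates the pattern, not Tonic’s actual Lambda; the cluster, database, and table names are hypothetical.

```python
# Generic sketch of an S3-triggered Lambda that COPYs a freshly staged
# file into Redshift via the Redshift Data API. Not Tonic's actual
# Lambda; all identifiers below are hypothetical.
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # S3 put events carry the bucket and key of the new object.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    redshift_data.execute_statement(
        ClusterIdentifier="my-test-cluster",  # hypothetical
        Database="tonic_output",              # hypothetical
        DbUser="awsuser",                     # hypothetical
        Sql=f"""
            COPY synthetic_users
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS CSV;
        """,
    )
```

    Using the Data API here keeps the Lambda free of database drivers and connection pooling; it just hands the COPY statement to Redshift and returns.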

    For more information on capabilities and limitations, check out our handy documentation.

    And there you have it, folks. That’s how to create realistic test data for Amazon Redshift using Tonic!

    Got more questions? We’ve got answers. Reach out to us at hello@tonic.ai, or book a demo with our team today to get a complete tour of Tonic!

    Abigail Sims
    Marketing
    As a reformed writer now deep in the marketing machine, Abigail can (and will) create narrative-driven content for any technical vertical. With five years of experience telling brand stories for tech startups and small businesses, she thrives at the intersection of complex data and creative communication.
