Technical deep dive

How to generate safe, useful test data for Amazon Redshift

Author

Chiara Colombi

Author

June 18, 2021

Let's Talk About Redshift

In 2013, Amazon Redshift hit the data warehousing market. It came in with a subtle but aggressive message: the name "redshift" was chosen as a hint to shift from Oracle, aka "Big Red", to Redshift. Developers based the service on an older version of PostgreSQL 8.0.2 and built it to handle large datasets — up to 16 petabytes on a cluster.

As a fully managed data warehouse service in the cloud, Redshift comes with several attractive benefits. In particular, it offers fast query performance using the same SQL-based tools and business intelligence applications that developers already use today.

Redshift is built on top of technology from the massive parallel processing (MPP) data warehouse company ParAccel. Its beta release was in November 2012, with the full release hitting the stands later in February 2013. As of this writing, it ranks 18th for relational databases and 32nd overall on db-engines.com.

Major companies like Capital One, Nasdaq, Warner Brothers, Pizza Hut, and others that process massive amounts of data have adopted this cloud service. By investing in Redshift, they can ingest billions of records, load data and run queries much faster, and run real-time data analysis.

The core infrastructure component of an Amazon Redshift data warehouse is a cluster, which is composed of one or more compute nodes. It achieves efficient storage and optimum query performance through a combination of MPP, columnar data storage, and very efficient, targeted data compression encoding schemes. Because of this, it is optimized for high-performance analysis and reporting of extensive datasets.

With that information under your belt, let's talk about the challenges and solutions of creating mock data for Redshift.

Challenges of Data Anonymization with Amazon Redshift

Amazon Redshift is outstanding in its ability to ingest and transform a lot of data quickly. But it comes with some challenges, particularly when you're generating data based off of an existing Redshift database.

The most glaring challenge is that Redshift doesn't enforce primary and foreign key constraints. You can create them as a query optimizer method, but they will be informational only. This means that if you define a primary key, the column will still accept duplicate values. Similarly, if you define a foreign key, the column will still allow a value that does not exist in the referenced table. It is up to you to manually confirm that a column's values are unique.

Another potential challenge of creating synthetic datasets for Redshift is with MPP, the hallmark of Amazon Redshift. If you are creating your test data using an outside source and loading it into Redshift, you may not be able to take advantage of MPP. While MPP will speed data loading from Amazon S3, relational DynamoDB, or Amazon EMR, parallel loading isn't supported on any other sources. You'll need to rely on JDBC inserts, scripts, or an ETL solution to get your test data loaded. Optional approaches to speed up the process of loading your data into Redshift include ensuring your files are in S3 format, splitting your data into multiple files, and compressing them before uploading. It's not a complicated process once you have it down, but can require more than a few mouse clicks.

On top of these challenges, there is also the simple problem of data volume. Redshift has the capability of storing hundreds of tables, thousands of rows, and petabytes of data. With high complexity, traditional data masking and anonymization solutions will not cut it when generating test data aimed at mimicking an entire data ecosystem. To create accurate test data, you'll need a more sophisticated solution designed for today’s data, with built-in automations and data generation tools that work together to tackle a variety of data scenarios. Without this, you run the risk of delivering non-representative data that will only impede rather than enable your work.

In addition, the legal challenges of data privacy regulations require a degree of expertise in execution to avoid costly consequences. The challenge of creating synthetic data that represents production data while fully protecting PII is incredibly nuanced, especially when you're handling financial, healthcare, or demographic data. And with new, location-specific regulations arriving with increasing frequency, there's a lot of detail to keep up with to ensure your test data is compliant.

So how do you take advantage of the benefits of Amazon Redshift while overcoming the challenges of creating realistic and compliant mock data for Redshift?

Tools for Mimicking Data in Redshift with Tonic

The strongest solution provides a combination of best-in-class techniques and tools that can work together seamlessly to generate test data with an approach best suited to the specific data at hand. For some data in a Redshift database, that may involve using GANs, while for other data, character scrambling will suffice.

Understanding this complexity and how different data requires different approaches, we’ve built Tonic’s platform as a one-stop shop for data generation. Our growing list of generators work side by side to mimic your data based on its needs. Add to that our other tools like creating foreign key constraints and subsetting, and the data generation process becomes even more streamlined.

Here are just a few ways Tonic circumvents the challenges of creating mock data for Redshift:

Primary and Foreign Key Synthesis: Tonic’s primary and foreign key generators allow you to add these constraints to your data model, in the event that they aren’t currently assigned in your database. This is particularly useful for Redshift users, given that Redshift doesn’t enforce PK and FK constraints, which are necessary for preserving referential integrity when generating mock data.

Optimized Data Loading: Tonic sits between your Redshift production database and a separate Redshift test database to be hydrated with Tonic's output. Because Tonic is connected to both, it can “unload” the source data into S3 files then rapidly load the S3 files of new mimicked test data directly into the destination database.

Minimizing Large Data Footprints: Tonic uses multi-database subsetting to synthesize your data no matter the size of your database. This allows for creating a coherent slice of data across all your databases that preserves referential integrity while shrinking petabytes of data down to gigabytes.

Proactive Data Protection: Using the privacy scan feature, Tonic eliminates hours of manual work by automatically locating and de-identifying sensitive information (PII/PHI) throughout a database. Tonic also delivers schema change alerts to proactively keep sensitive production data from leaking into lower environments.

Ready to create safe test data you can rely on? Get in touch with our team; we’d be happy to show you the tools that can make an impact.

Want to make your data usable?

Unblock product innovation with safe, high-fidelity data de-identification and synthesis.

Book a demo

Chiara Colombi

Director of Product Marketing

A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.