Faking your data for testing in Snowflake is hard. First off, it’s a heck of a lot of data with a heck of a lot of complex dependencies. Second, if you’re working with sensitive data, the sheer amount of it riddled throughout your database increases your risk of PII slipping through to your lower environments.
We all agree on why you need fake data, but you may not know how to get the safe, high-quality data you need for Snowflake. There aren’t a lot of synthetic data options on the market to begin with, and even fewer work directly with Snowflake’s vast database capabilities. We’re proud to be one of those few, pioneering synthetic data generation in Snowflake with our uniquely architected native integration.
Today, we’ll take a look at some reasons to create realistic fake data based on your production data within Snowflake, plus the approaches we’ve taken in building Tonic to meet these needs.
TL;DR: Using fake data for testing in Snowflake and need better data? Tonic does that. Talk to us about it.
De-identifying your production data in Snowflake provides the combo-pack of compliance and effective test data required by today’s dev teams.
First, let’s get the compliance component out of the way. Even the strictest privacy laws like GDPR and CCPA deem de-identified data that’s been effectively masked or mimicked safe for use outside of production environments. Simply put, if you de-identify your data, congrats! You’re compliant, and that data can be safely shared with your teams as they build and test. If you don’t de-identify your data, and instead pull production data directly into staging… 😬
We get the temptation, and we understand the motivation behind it. In fact, it’s our second point: effective test data. The quality of your testing is only as good as the quality of your test data. Prod data is tempting but, these days, it’s off-limits. Improvising scripts to spin up random data at Snowflake scale isn’t going to get you the quality you need. To get data that looks like production, you need to create data based on production. You need to de-identify.
Let’s put this into context with a few Snowflake use cases:
One of the primary reasons to de-identify your data in Snowflake is for building and testing your applications. Snowflake users can build apps inside of the cloud using the vast database resources available, but it’s critical to put some safety checks in place to make sure that data stays safe. Instead of working directly with production data, de-identify your data in Snowflake with a tool like Tonic and use that fake data to hydrate your testing, staging, and QA environments.
What should that look like? Automatic scanning for sensitive data throughout your database, schema change alerts, dozens of generators designed to handle any data type you can throw their way, and mathematical guarantees of data privacy—all accessible by way of an intuitive platform with collaboration features and by way of API.
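To make the scanning idea a bit more concrete, here is a minimal Python sketch of flagging columns that look sensitive. It is an illustration only, not a description of how Tonic's scanner works: the patterns, the `scan_columns` helper, and the 80% threshold are all assumptions, and it presumes you have already pulled a small sample of rows out of Snowflake as a list of dictionaries.

```python
import re

# Hypothetical patterns for illustration; a real scanner would cover far more types.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\-\s().]{8,}\d$"),
}


def scan_columns(rows, threshold=0.8):
    """Return {column: label} for columns whose sampled values mostly match a pattern.

    `rows` is a list of dicts sharing the same keys (an assumed sample format).
    """
    findings = {}
    if not rows:
        return findings
    for column in rows[0]:
        # Ignore nulls so sparse columns are judged on the values they do have.
        values = [str(r[column]) for r in rows if r.get(column) is not None]
        if not values:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                findings[column] = label
                break  # the first matching pattern wins
    return findings


sample = [
    {"id": 1, "contact": "ana@example.com", "tax_id": "123-45-6789", "plan": "pro"},
    {"id": 2, "contact": "bo@example.org", "tax_id": "987-65-4321", "plan": "free"},
]
flagged = scan_columns(sample)
print(flagged)
```

Pattern matching on sampled values only catches the obvious cases. Free-text fields, names, and addresses need more than regular expressions, which is part of why hand-rolled scanners tend to miss things.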
Another fundamental reason to fake your data in Snowflake is to fully tokenize your data warehouse. Doing so removes the security and compliance concerns around retaining large amounts of data and enables safe long-term data storage.
What should that look like? Eliminate all PII/PHI using data tokenization at scale, protecting data privacy while preserving your team’s capacity for data analytics. Tonic’s solution preserves the relationships between data columns while wiping sensitive information from those columns using randomization. Your data analysts keep all the richness and value they need for BI and ML. And if the data is ever compromised, it has already been stripped of PII and has no value to outside attackers.
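As a rough illustration of how tokenization can keep relationships intact, the Python sketch below replaces a sensitive value with a keyed hash, so the same input always maps to the same token and joins across tables still line up. This swaps in HMAC-SHA256 for simplicity; it is not Tonic's randomization approach, and the `tokenize` and `tokenize_rows` helpers, the `tok_` prefix, and the demo key are all hypothetical.

```python
import hashlib
import hmac


def tokenize(value, secret_key, prefix="tok_"):
    """Deterministically replace a value with a keyed-hash token (assumed scheme)."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:16]  # truncated for readability in this sketch


def tokenize_rows(rows, sensitive_columns, secret_key):
    """Tokenize only the named columns; leave every other value untouched."""
    return [
        {
            col: tokenize(val, secret_key)
            if col in sensitive_columns and val is not None
            else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Demo key only; a real key would live in a secret store, never in code.
key = b"demo-key-keep-real-keys-in-a-secret-store"

customers = [{"email": "ana@example.com", "region": "EU"}]
orders = [
    {"email": "ana@example.com", "total": 42},
    {"email": "bo@example.org", "total": 7},
]

safe_customers = tokenize_rows(customers, {"email"}, key)
safe_orders = tokenize_rows(orders, {"email"}, key)

# The same email yields the same token in both tables, so joins still work.
print(safe_customers[0]["email"] == safe_orders[0]["email"])
```

Because the mapping is deterministic, analysts can still count distinct customers or join orders to customers without ever seeing an email address. The trade-off is that anyone holding the key can recompute tokens for guessed inputs, so a keyed hash on its own is pseudonymization rather than a mathematical privacy guarantee.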
In both of these cases, de-identifying your data in Snowflake ensures compliance throughout your SDLC and data analytics workflows. Working with a powerful cloud-based platform has opened many doors for offshore development, but privacy standards and compliance needs have historically limited those opportunities. It’s tough: You want to empower teams, but not at the risk of your customers’ data… Happily, that’s no longer a choice you have to make.
What should it look like? Tonic is working with Snowflake users running HITRUST environments that comply with the most stringent requirements in the world. Real fake data generated at data warehouse scale is making it happen.
So, we’ve sold you on fake data in Snowflake. Now, where should you go to get it?
Lucky for you, we know a fake data guy (and gal). Several of them, in fact. Tonic’s unique data synthesis solution allows you to connect natively with Snowflake, and work with the best fake data on the market in your own testing environments. We’re pioneering a new frontier of Snowflake support here—there’s no one out there doing what we’re doing. Our synthetic data generator is specifically architected for Snowflake data and Snowflake users like you.
These are just a few of Tonic’s capabilities within Snowflake (and beyond):
In data warehouses, de-identifying data is HARD. We make it easy, seamless, and fully integrated. Our all-in-one platform solution provides secure data de-identification that generates realistic, useful data for development and testing… and eliminates the need to build time-consuming, faulty in-house de-identification workarounds.
If you’re a developer who’s either using Snowflake or thinking about doing so, we hope this article has given you a few useful insights into how you can de-identify your data in Snowflake effectively and realistically. To learn more about Tonic’s integration with Snowflake, check out our Snowflake + Tonic webinar, where we take a deep dive into the features this collaboration enables and explore several real-world examples. Come fake with us!