Taking the first steps towards masking your data can be a daunting prospect, to say the least. Especially if you have an 8-PB database that serves multiple use cases for your organization, you may have more data than you know how to deal with. You’ve got the QA team running regressions in the test environment, development teams creating local environments, and a data science team doing ad-hoc analysis. When it comes to data protection strategies, where do you even start?
Before jumping headfirst into masking your database, it’s important to set aside time to identify which data is required for your data consumers and how often the corresponding test data needs to be updated. Look at each chunk of data and decide on a strategy based on how the data is used. From there, you can decide on a strategy for how to process and maintain the data.
Doing this work upfront can save configuration time, keep storage costs down, minimize the time required to generate your test data, and generally increase the quality of the data you are creating. Once that work is complete, you can focus on masking the data that’s left—thereby reducing the amount of data you have to mask in the first place.
Once you’ve separated the data you need from the data you don’t, you can get into the serious business of de-identifying and anonymizing.
Next, we’ll take a look at three data protection strategies within Tonic that can help you keep your data (and your customers!) safe.
Our first data protection option is truncate mode. When a table is in truncate mode, Tonic doesn’t copy the data from that table to the destination database. As mentioned above, for tables that aren’t relevant for your use cases, the best option can be to leave them out of your test data entirely. This can be easier than you think! For example, tables containing logs often comprise a significant amount of storage, but are generally not needed for testing purposes.
Only masking the data you actually need reduces configuration overhead by eliminating columns you’d otherwise have to mask and helps generation jobs run faster. It minimizes privacy risk — the less data that exists in the test database, the less risk that sensitive data leaks through.
In Tonic, you can “truncate” tables, meaning that Tonic will drop all data from a particular table in the destination database. The table schema and any constraints associated with the table remain. Since Tonic knows which relationships exist between a truncated table and other tables, it helps prevent data integrity issues by alerting you to the discrepancy so you can truncate those related tables as well.
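To make the effect concrete, here’s a minimal sketch (not Tonic’s actual implementation—the table names and SQL are illustrative) of what truncate mode effectively produces in the destination: the table definition and its constraints survive, but no rows are copied over.

```python
import sqlite3

# Build a tiny "destination" database with a parent table and a logs table
# that references it via a foreign key constraint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "message TEXT, FOREIGN KEY (user_id) REFERENCES users(id))"
)
conn.execute("INSERT INTO logs (user_id, message) VALUES (1, 'login')")

# "Truncate" the logs table: drop all rows, keep the table definition.
conn.execute("DELETE FROM logs")

row_count = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'logs'"
).fetchone()[0]
print(row_count)                # 0 -- no rows remain...
print("FOREIGN KEY" in schema)  # True -- ...but the schema and constraint survive
```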
Another common reason that tables are truncated is when organizations, particularly those with a service-oriented model, want to segment test data by team. In this case, you can route the test data for a service into a separate test database and truncate all irrelevant tables.
The second option for tables is preserve destination mode. Sometimes data in tables is relevant to the test case, but it doesn’t need to be updated frequently, and there are no close relationships between that data and other parts of the database. For example, suppose you have a user history table that stores the history of actions a user takes. This data does not need to be updated, but you do want it there in order to populate the UI for testing purposes. In this case, the data won’t change after the initial test data is created, so preserve destination mode is your best option. In scenarios like this, the data isn’t changing frequently and sometimes doesn’t contain sensitive information at all.
When this is the case, generating new test data for these tables would have a negative impact on performance, extending the data generation job unnecessarily.
Instead, depending on whether those tables contain sensitive information, you can generate the initial test data or restore a backup of the source database, then put the tables into what Tonic refers to as “preserve destination” mode.
For tables in this mode, Tonic essentially ignores the tables during data generations but doesn’t get rid of them. As long as there are no schema changes to those tables, you’ll be able to generate new data in other tables while retaining the same data across multiple data generations. Nice!
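Conceptually, a data generation run is deciding what to do with each table based on its mode. Here’s a rough sketch of that decision—the mode names and config shape are illustrative, not Tonic’s API:

```python
# Hypothetical per-table mode configuration; table names are examples.
TABLE_MODES = {
    "users": "masked",
    "user_history": "preserve_destination",
    "audit_logs": "truncate",
}

def plan_generation(table_modes):
    """Decide what a data-generation run does with each table."""
    plan = {}
    for table, mode in table_modes.items():
        if mode == "preserve_destination":
            # Leave whatever is already in the destination untouched.
            plan[table] = "skip: keep existing destination data"
        elif mode == "truncate":
            # Keep the schema, copy no rows.
            plan[table] = "clear: keep schema, copy no rows"
        else:
            # Mask and copy fresh data from the source.
            plan[table] = "regenerate: mask and copy fresh data"
    return plan

plan = plan_generation(TABLE_MODES)
print(plan["user_history"])  # skip: keep existing destination data
```

The useful property is that preserved tables cost a generation run nothing: they are never read, masked, or written.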
And finally, we have incremental mode. For cases where you need to keep the test data as up-to-date as possible, you may prefer to run data generations frequently while having Tonic process the latest changes from the source database. For example, perhaps you need the corresponding test database to reflect recent customer transactions from production.
When tables are in “incremental” mode in Tonic, rather than replacing all the test data, Tonic compares the source data row by row with the test data to see whether the updated date is newer than that in the destination, and only replaces the rows that have changed. As long as there are no schema changes to those tables, you’ll be able to continue operating in this mode, and the data will stay consistently fresh and as close as possible to what exists in the source database.
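The row-comparison logic can be sketched roughly like this—a simplification, with illustrative row shapes and a hypothetical `mask` function standing in for whatever masking is configured:

```python
from datetime import datetime

def incremental_refresh(source_rows, dest_rows, mask):
    """Replace only rows whose updated_at in the source is newer than
    the destination copy; all other destination rows are left alone."""
    dest = {row["id"]: row for row in dest_rows}
    replaced = 0
    for row in source_rows:
        existing = dest.get(row["id"])
        if existing is None or row["updated_at"] > existing["updated_at"]:
            dest[row["id"]] = mask(row)  # re-mask only the changed row
            replaced += 1
    return list(dest.values()), replaced

source = [
    {"id": 1, "email": "a@prod.example", "updated_at": datetime(2023, 5, 2)},
    {"id": 2, "email": "b@prod.example", "updated_at": datetime(2023, 1, 1)},
]
dest = [
    {"id": 1, "email": "masked1@test", "updated_at": datetime(2023, 4, 1)},
    {"id": 2, "email": "masked2@test", "updated_at": datetime(2023, 1, 1)},
]
mask = lambda r: {**r, "email": f"masked{r['id']}@test"}

rows, replaced = incremental_refresh(source, dest, mask)
print(replaced)  # 1 -- only row 1 changed in the source, so only it is replaced
```

Because unchanged rows are never touched, the cost of a run scales with how much the source changed, not with the size of the table.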
When the majority of tables are in incremental mode, data generations can be very fast. This can be particularly beneficial for large databases of multiple terabytes… or petabytes!
The best data protection strategy for your database will ultimately come down to what you’re using your database for. You’ll want to consider the use cases it addresses, how often its schema changes, and how often the data needs to be updated. By taking these factors into account, you’ll be able to work more efficiently by homing in on the data you need to protect and optimizing for the fastest data generations possible.
Want to learn more about data protection? Check out our ebook, The Subtle Art Of Giving A F*** About Data Privacy to get a serious crash course on how to keep your database under lock and key. And if you’ve got more questions, of course feel free to drop us a line at email@example.com! We’re always down to talk shop.
Enable your developers, unblock your data scientists, and respect data privacy as a human right.