
Ensuring data privacy with Privacy Rankings in Tonic Structural

Author
Yuri Shadunsky
July 1, 2024
Learn how to configure data de-identification to ensure data privacy in your test data, in order to meet compliance requirements while also preserving the utility of your data for software testing and development.

What does it mean to keep data private?

Our customers come to us because they want to keep their data safe while creating realistic test data. Yet, safe means something different to every customer. For instance, if you are in the US healthcare industry, you’ll want to adhere to HIPAA guidelines, while if you are in financial services, you’ll have to navigate PSD2 and PCI compliance specifically. And then there are the general data privacy laws that you have to navigate if you deal with any sort of customer data:

  • If you are dealing with customer data from Europe, you need to adhere to GDPR, different data protection authorities’ interpretations of the GDPR (there are 28), and data protection regulations that each country has in addition to the GDPR.
  • If you store any customer data in the US, you need to make sure you are SOC 2 certified and abide by local state regulations such as the CCPA in California.

Regulations aren’t the only reason though—companies care about data privacy because their customers care about it. No customer wants to be on the wrong end of a data breach, or even an accidental test email. Customers want to know their data is in safe hands when they hand it over to businesses.

These concerns are what drive our products’ development. And since those businesses (aka, our customers) come to us for the solution, we want to make sure they understand how to use our products to effectively protect their data. Within our test data platform, Tonic Structural, we provide a comprehensive library of versatile data generators and features like subsetting to keep your test data private and secure. But we often get asked the question: “How do I know that my generated data is secure?” With that in mind, we developed a proprietary Privacy Ranking system to help you understand at a glance how to secure your data to the level of privacy that you need.

What are Tonic Structural’s privacy rankings?

Tonic Structural’s privacy rankings rank the privacy level of the generators that you use on a scale of 1 to 6, where 1 is fully private and 6 is fully passed through (in other words, the data remains in its original state, unchanged). We established this proprietary ranking system to help our customers understand how their data has been transformed. The privacy rankings for your configured generators can be found in your workspace’s Privacy Report.

Each column within your database receives its own ranking based on the generator configuration applied. Everything with a ranking of 1 to 4 has been fully de-identified. A ranking of 5 indicates that the data within the column may only be partially de-identified, and a ranking of 6 indicates that no de-identification or transformation has been applied.

Let’s take a closer look. The six privacy rankings, including descriptions, examples, and uses, are as follows; a short illustrative sketch contrasting several of these levels appears after the list.

Ranking 1: The generated data is completely free of source data; no information about the original data can be uncovered or reverse engineered from the output data.

  • Example generators that achieve this ranking: Random Boolean, Random Integer, Constant, Null
  • Use: When you need the highest level of data privacy, and no information about the original data should be recoverable.

Ranking 2: The generator uses the original data in a way that obscures the original data points. Changing individual data points in the original data does not change the output data, but the overall shape of the output data can provide high-level information about the range of values that exist in the input data.

  • Examples: Continuous and Categorical generators, when set to differentially private
  • Use: When you need high privacy but can tolerate some indirect information about the original data’s shape, such as the range of values that exist in the source.

Ranking 3: The generator uses the underlying data in a way that cannot be reversed but can reveal which values exist in the original data.

  • Example: Categorical generator, when not set to differentially private
  • Use: When you need irreversible transformations but can tolerate the identification of existing values. For example, using the categorical generator on a column that contains states is a frequent use case—there might be no risk in knowing what states exist in the initial data set as long as they are not mapped to the same set of records.

Ranking 4: Data is transformed securely but with consistent values, which can introduce a slight risk. Toggling consistency on in a generator’s configuration means that the same input will always generate the same de-identified output. Someone who knows the source data and the frequency of the source values might be able to draw a connection between input and output values depending on the consistency configuration.

  • Examples: Name generator with consistency, Integer Key generator with consistency
  • Use: When you need consistency to maintain relationships or referential integrity in your data, and you can either tolerate a slight risk of re-identification or remove that risk by making the values consistent to a unique identifier.

Ranking 5: Some values within the data might not be de-identified. This applies to generators that have sub-fields and do not have a default generator option, meaning that there is always a chance that some data within the fields will not be protected.

  • Examples: HTML Mask, JSON Mask, Regex Mask, XML Mask
  • Use: When using complex data structures where some parts might remain unprotected. If this risk is not tolerable, you should use a generator that guarantees full protection or ensure that you configure the generator in a way that protects all sub-fields.

Ranking 6: Data is not transformed or de-identified. The Passthrough generator is applied, which, aptly, passes the source data directly through to the output unchanged.

  • Use: For non-sensitive data, when no data protection is required, and the original data can be passed through unchanged.
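
To make these levels concrete, here is a minimal, hypothetical sketch in Python. The toy transformations below are not Tonic Structural’s generators; they simply illustrate, for a single column of names, the kind of relationship each ranking level permits between the source data and the output. The secret key and naming scheme are assumptions for illustration only.

```python
import hashlib
import random

# Toy transformations in the spirit of several ranking levels. These are NOT
# Tonic Structural's generators; they only illustrate the kind of relationship
# each level allows between source and output data.

names = ["Jane", "Cynthia", "Jane", "Marcus"]

# Ranking 6 (passthrough): the output is the source data, unchanged.
passthrough = list(names)

# Roughly ranking 3: outputs are drawn only from values that already exist in
# the source, so the value set is revealed, but rows are detached from their
# original records.
categorical_like = random.sample(names, k=len(names))

# Roughly ranking 1: the output has no relationship to the source at all.
fully_random = [f"user_{random.randint(0, 10**6)}" for _ in names]

# Roughly ranking 4: a consistent replacement derived from a keyed hash of the
# source value, so the same input always yields the same output.
SECRET = b"workspace-secret"  # hypothetical secret, not a Tonic setting

def consistent_replacement(value: str) -> str:
    digest = hashlib.sha256(SECRET + value.encode()).hexdigest()
    return f"user_{digest[:8]}"

consistent = [consistent_replacement(n) for n in names]

print(passthrough)       # ['Jane', 'Cynthia', 'Jane', 'Marcus']
print(categorical_like)  # the same values in a shuffled order
print(fully_random)      # values unrelated to the source
print(consistent)        # both 'Jane' rows map to the same token
```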

How can the rankings be used for effective de-identification?

Ultimately, in using Tonic Structural, your goal is to be confident you have created a synthetic dataset that keeps your data private while also keeping it realistic enough to boost developer efficiency and help you find bugs sooner. You can leverage the privacy rankings to understand whether you have achieved this goal.

If your selected generator is ranked between 1 and 3, those rankings guarantee that those fields cannot be reversed or re-identified for a given record. They ensure the utmost in privacy. To dial up the realism in your data, meanwhile, you will occasionally want to add consistency to your generators. In adding consistency, your generator will have a privacy ranking of 4, introducing a slight risk of re-identification, as we mentioned earlier. That said, there are ways to use consistency that avoid introducing risk. Let’s look at the two primary ways consistency can be configured, so you can choose the approach that makes sense for your use case and acceptable risk: self-consistency, and consistency with another column.

Self-consistency

Toggling self-consistency on for a generator causes the same input within that column to always generate the same output.

For example, if you make a name generator consistent to itself, every instance of Jane in that column’s source data will be transformed into the same value in the generated output, e.g. Sam. This self-consistency can introduce a slight risk of re-identification. How? A malicious actor could use frequency analysis to identify some of the higher-likelihood names. Publicly available data on the frequency of names in the US could be used to reverse engineer the names in your sample, given a large enough sample size.

Granted, the risk in this general example is quite low. Your particular data may include fields that, when self-consistent, could entail a higher risk of re-identification, depending on the publicly available data a malicious actor could find to pair it with (this risk is primarily associated with names and birth dates). On the other hand, you can safely use self-consistency on unique identifiers such as account numbers without any risk, so long as those unique identifiers are not publicly available.
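
As a hypothetical illustration of why this risk exists, the sketch below builds a self-consistent mapping by hand (this is not Tonic’s Name generator, and the fake-name pool is invented for the example) and shows that the output preserves the source’s frequency distribution, which is exactly the signal a frequency analysis relies on.

```python
from collections import Counter

# A hypothetical self-consistent name mapping (illustrative only -- not
# Tonic's Name generator): each distinct source name is assigned one fake
# name, and the same source name always receives the same fake name.
FAKE_NAMES = ["Sam", "Pat", "Alex", "Robin", "Casey"]
mapping: dict[str, str] = {}

def self_consistent(name: str) -> str:
    if name not in mapping:
        mapping[name] = FAKE_NAMES[len(mapping) % len(FAKE_NAMES)]
    return mapping[name]

source = ["Jane", "Jane", "Jane", "Cynthia", "Marcus", "Jane"]
output = [self_consistent(n) for n in source]

# The values change, but the frequency distribution does not: the most common
# output token appears exactly as often as the most common source name, which
# is what a frequency-analysis attack exploits.
print(Counter(source).most_common())  # [('Jane', 4), ('Cynthia', 1), ('Marcus', 1)]
print(Counter(output).most_common())  # [('Sam', 4), ('Pat', 1), ('Alex', 1)]
```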

Consistency with another column

To achieve a high degree of realism in your data while maintaining a high degree of confidence in the privacy your transformations achieve, you can set a column’s consistency to generate data that is consistent with another column.

For example, instead of configuring a name generator to be consistent to itself, you can configure it to be consistent to a unique identifier like an account number. In this way, Jane would be de-identified consistently every time it is associated with a specific account number. A Jane with account number 35781 could become Sam, and a Jane with account number 82902 could become Pat. Likewise, a Cynthia with account number 20823 could also become Pat. Instead of having the name values drive consistency, it is the account numbers (aka unique identifiers) that drive consistency, meaning that the data tied to a particular account will generate a realistic, consistent shape throughout your output data. In other words, it will generate useful data that functions like production data in your testing workflows. This makes your data fully realistic and prevents any risk of a frequency analysis attack.
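
To show how the key column, rather than the name itself, drives the mapping, here is a minimal, hypothetical sketch (again, not Tonic’s actual generator; the secret and fake-name pool are assumptions for illustration):

```python
import hashlib

# Hypothetical sketch of consistency driven by another column (illustrative
# only -- not Tonic's generator): the replacement name is derived from the
# account number, not from the source name.
SECRET = b"workspace-secret"  # hypothetical secret
FAKE_NAMES = ["Sam", "Pat", "Alex", "Robin", "Casey"]

def name_consistent_to_account(account_number: str) -> str:
    digest = hashlib.sha256(SECRET + account_number.encode()).digest()
    return FAKE_NAMES[int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)]

rows = [("Jane", "35781"), ("Jane", "82902"), ("Cynthia", "20823")]
for name, account in rows:
    print(name, account, "->", name_consistent_to_account(account))

# The two Janes receive different replacements because their account numbers
# differ, so name frequencies in the output no longer mirror the source, while
# every row tied to a given account still sees the same replacement name.
```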

A diagram illustrating how multiple columns, like email, name, and phone number, can each be configured to be consistent to an Account Number, to increase realism while preserving privacy.

How do I use Tonic Structural to understand the quality of my de-identification?

Once you’ve completed generator configuration in Structural, how do you know you’ve done a good job? Simple: by reviewing your workspace’s Privacy Report, which captures the details of the data protection achieved in your workspace. Structural offers the Privacy Report in CSV and PDF formats, each of which has its own uses.

If you are a compliance officer who really wants to get in the weeds and understand how each individual field was de-identified, use the Privacy Report CSV. It provides a row for each field in your database that gives you the details of the data in that field (column name, table name, etc.) and the details of the configuration assigned to it (generator, Tonic’s assessment of sensitivity, differential privacy setting, and privacy ranking).
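
If you want to triage that CSV programmatically, a short script can flag the columns ranked 5 or 6 for a closer look. The sketch below assumes hypothetical column headers and a hypothetical file name; adjust them to match the report you actually export.

```python
import csv

# A minimal sketch of triaging the Privacy Report CSV. The exact column
# headers ("Table Name", "Column Name", "Privacy Ranking") are assumptions
# for illustration -- check the headers in the report you export.
REPORT_PATH = "privacy_report.csv"  # hypothetical file name

with open(REPORT_PATH, newline="") as f:
    rows = list(csv.DictReader(f))

# Flag anything that is only partially de-identified (5) or passed through (6).
needs_review = [
    (row["Table Name"], row["Column Name"], row["Privacy Ranking"])
    for row in rows
    if int(row["Privacy Ranking"]) >= 5
]

print(f"{len(needs_review)} of {len(rows)} columns need a closer look:")
for table, column, ranking in needs_review:
    print(f"  {table}.{column} (ranking {ranking})")
```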

If you are looking for a high-level summary of the configuration of your workspace, make use of the PDF and specifically the summary charts. Example charts are pictured below; they provide an overview of how many columns you have and what the level of protection is across those columns.

A screenshot of a Privacy Report generated by Tonic Structural

Conclusion

Our goal is to help you keep your customers’ data safe and your developers productive. We believe that there shouldn’t be a tradeoff between data privacy and developer experience. Tonic Structural provides you with the tools you need to get realistic, safe test data quickly and confidently. We are constantly coming out with expanded features and products to make it easier to get there—let us know what you think and what capabilities you might want next!
