Our customers come to us because they want to keep their data safe while creating realistic test data. Yet, safe means something different to every customer. For instance, if you are in the US healthcare industry, you’ll want to adhere to HIPAA guidelines, while if you are in financial services, you’ll have to navigate PSD2 and PCI compliance specifically. And then there are the general data privacy laws that you have to navigate if you deal with any sort of customer data.
Regulations aren’t the only reason though—companies care about data privacy because their customers care about it. No customer wants to be on the wrong end of a data breach, or even an accidental test email. Customers want to know their data is in safe hands when they hand it over to businesses.
These concerns are what drive our products’ development. And since those businesses (aka, our customers) come to us for the solution, we want to make sure they understand how to use our products to effectively protect their data. Within our test data platform, Tonic Structural, we provide a comprehensive library of versatile data generators and features like subsetting to keep your test data private and secure. But we often get asked the question: “How do I know that my generated data is secure?” With that in mind, we developed a proprietary Privacy Ranking system to help you understand at a glance how to secure your data to the level of privacy that you need.
Tonic Structural’s privacy rankings rate the privacy level of the generators that you use on a scale of 1 to 6, where 1 is fully private and 6 is fully passed through (in other words, the data remains in its original state, unchanged). We established this proprietary ranking system to help our customers understand how their data has been transformed. You can find the privacy rankings for your configured generators in your workspace’s Privacy Report.
Each column within your database receives its own ranking based on the generator configuration applied. Everything with a ranking of 1 to 4 has been fully de-identified. A ranking of 5 indicates that the data within the column may only be partially de-identified, and a ranking of 6 indicates that no de-identification or transformation has been applied.
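To make the scale concrete, here is a minimal Python sketch of the thresholds described above. It is an illustration only, not part of Tonic Structural: the function name `deidentification_status` and the sample column names are assumptions made up for this example.

```python
def deidentification_status(ranking: int) -> str:
    """Map a privacy ranking (1-6) to the protection level described above."""
    if ranking not in range(1, 7):
        raise ValueError(f"privacy ranking must be 1-6, got {ranking}")
    if ranking <= 4:
        return "fully de-identified"
    if ranking == 5:
        return "possibly partially de-identified"
    return "not transformed"  # ranking 6: data is passed through unchanged


# Hypothetical column -> ranking pairs, for illustration only.
column_rankings = {"users.first_name": 4, "users.notes": 5, "users.id": 6}

# Columns ranked 5 or 6 are the ones worth a second look.
needs_review = [col for col, rank in column_rankings.items() if rank >= 5]
```

In this sketch, `needs_review` collects the columns ranked 5 or 6, which are the ones that may still contain source data.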
Let’s take a closer look. The six privacy rankings are described below.
Ranking 1: The generated data is completely free of source data and no information about it can be reverse engineered. There is no way to uncover information about the original data from the output data.
Ranking 2: The generator uses the original data in a way that obscures the original data points. Changing individual data points in the original data does not change the output data, but the overall shape of the output data can provide high-level information about the range of values that exist in the input data.
Ranking 3: The generator uses the underlying data in a way that cannot be reversed but can identify values that exist in the original data.
Ranking 4: Data is transformed securely but with consistent values, which can introduce a slight risk. Toggling consistency on in a generator’s configuration means that the same input will always generate the same de-identified output. Someone who knows the source data and the frequency of the source values might be able to draw a connection between input and output values depending on the consistency configuration.
Ranking 5: Some values within the data might not be de-identified. This applies to generators that have sub-fields and do not have a default generator option, meaning that there is always a chance that some data within the fields will not be protected.
Ranking 6: Data is not transformed or de-identified. The Passthrough generator is applied which, aptly, passes the source data directly through in the output data.
Ultimately, in using Tonic Structural, your goal is to be confident you have created a synthetic dataset that keeps your data private while also keeping it realistic enough that it accelerates your developer efficiency and helps you find bugs sooner. You can leverage the privacy rankings to understand whether you have achieved this goal.
If your selected generator is ranked between 1 and 3, the ranking guarantees that those fields cannot be reversed or re-identified for a given record. These rankings offer the utmost in privacy. To dial up the realism in your data, meanwhile, you will occasionally want to add consistency to your generators. With consistency added, your generator will have a privacy ranking of 4, which introduces a slight risk of re-identification, as we mentioned earlier. That said, there are ways to use consistency that avoid introducing risk. Let’s look at the two primary ways consistency can be configured, so you can choose the approach that fits your use case and acceptable risk: self-consistency, and consistency with another column.
Toggling self-consistency on for a generator causes the same input within that column to always generate the same output.
For example, if you make a name generator consistent to itself, every instance of Jane in that column’s source data will be transformed into the same value in the generated output, e.g., Sam. This self-consistency can introduce a slight risk of re-identification. How? A malicious actor could use a frequency analysis to identify some of the more common names. Given a large enough sample size, publicly available data on the frequency of names in the US could be used to reverse engineer the names in your sample.
Granted, the risk in this general example is quite low. Your particular data may include fields that, when self-consistent, could entail a higher risk of re-identification, depending on the publicly available data a malicious actor could find to pair it with (this risk is primarily associated with names and birth dates). On the other hand, you can safely use self-consistency on unique identifiers such as account numbers without any risk, so long as those unique identifiers are not publicly available.
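The frequency risk is easier to see in a small sketch. The Python below is not how Structural implements consistency; it assumes a simple keyed hash (HMAC) that picks a replacement from a short list, and the names `consistent_name`, `REPLACEMENT_NAMES`, and `SECRET_KEY` are invented for this example.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical replacement pool and key, for illustration only.
REPLACEMENT_NAMES = ["Sam", "Pat", "Alex", "Robin", "Casey", "Jordan"]
SECRET_KEY = b"example-only-key"


def consistent_name(source_value: str) -> str:
    """Self-consistency: the same source value always yields the same output."""
    digest = hmac.new(SECRET_KEY, source_value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


source_column = ["Jane", "Jane", "Jane", "Cynthia", "Jane", "Omar"]
output_column = [consistent_name(name) for name in source_column]

# Every Jane maps to one replacement, so the most frequent output value
# appears at least as often as Jane did in the source. That preserved
# frequency is what a frequency analysis would exploit.
most_common_output, output_count = Counter(output_column).most_common(1)[0]
```

The replacement values themselves reveal nothing, but the shape of the distribution survives the transformation. On a column of unique identifiers every value appears once, so there is no frequency signal to exploit.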
To achieve a high degree of realism in your data while maintaining a high degree of confidence in the privacy your transformations achieve, you can set a column’s consistency to generate data that is consistent with another column.
For example, instead of configuring a name generator to be consistent to itself, you can configure it to be consistent to a unique identifier like an account number. In this way, Jane would be de-identified consistently every time it is associated with a specific account number. A Jane with account number 35781 could become Sam, and a Jane with account number 82902 could become Pat. Likewise, a Cynthia with account number 20823 could also become Pat. Instead of having the name values drive consistency, it is the account numbers (aka unique identifiers) that drive consistency, meaning that the data tied to a particular account will generate a realistic, consistent shape throughout your output data. In other words, it will generate useful data that functions like production data in your testing workflows. This makes your data fully realistic and prevents any risk of a frequency analysis attack.
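The same idea can be sketched with the account number, rather than the name, driving the output. Again, this is an assumption-laden illustration and not Structural’s implementation: the keyed hash, `name_consistent_to`, `REPLACEMENT_NAMES`, and `SECRET_KEY` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical replacement pool and key, for illustration only.
REPLACEMENT_NAMES = ["Sam", "Pat", "Alex", "Robin", "Casey", "Jordan"]
SECRET_KEY = b"example-only-key"


def name_consistent_to(account_number: str) -> str:
    """Pick a replacement name from the account number, not the source name."""
    digest = hmac.new(SECRET_KEY, account_number.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


# (account_number, source_name) rows; account 35781 appears twice.
rows = [("35781", "Jane"), ("82902", "Jane"), ("35781", "Jane"), ("20823", "Cynthia")]

# The source name is never read, so how often "Jane" occurs cannot
# influence how often any replacement name occurs. In practice the
# account number column would get its own generator as well.
output_rows = [(account, name_consistent_to(account)) for account, _name in rows]
```

Both rows for account 35781 receive the same replacement name, while the Jane on account 82902 is free to receive a different one. With a pool this small, two accounts can also land on the same name, much as Jane and Cynthia both become Pat in the example above.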
Once you’ve completed generator configuration in Structural, how do you know you’ve done a good job? Simple: by reviewing your workspace’s Privacy Report, which captures the details of the data protection achieved in your workspace. Structural offers the Privacy Report in CSV and PDF formats, each of which has its own uses.
If you are a compliance officer who really wants to get in the weeds and understand how each individual field was de-identified, use the Privacy Report CSV. It provides a row for each field in your database, with the details of the field itself (column name, table name, etc.) and the details of the configuration assigned to that field (generator, Tonic’s assessment of sensitivity, differential privacy setting, and privacy ranking).
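If you want to review the CSV programmatically, a short script can tally columns by ranking and flag the ones ranked 5 or 6. The sketch below is only an assumption about what such a script might look like: the header names (`table_name`, `column_name`, `generator`, `privacy_ranking`) and the sample rows are invented, so check them against the headers in your own export.

```python
import csv
import io
from collections import Counter

# Invented sample data; the real export's headers and values may differ.
sample_report = """table_name,column_name,generator,privacy_ranking
users,first_name,Generator A,4
users,email,Generator B,3
users,notes,Passthrough,6
orders,payload,Generator C,5
"""


def summarize(report_file):
    """Count columns per ranking and list the columns ranked 5 or 6."""
    counts = Counter()
    flagged = []
    for row in csv.DictReader(report_file):
        ranking = int(row["privacy_ranking"])
        counts[ranking] += 1
        if ranking >= 5:
            flagged.append(f'{row["table_name"]}.{row["column_name"]}')
    return counts, flagged


counts, flagged = summarize(io.StringIO(sample_report))
```

To run this against a real export, pass an open file handle instead of the `io.StringIO` wrapper and adjust the header names as needed.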
If you are looking for a high-level summary of the configuration of your workspace, make use of the PDF and specifically the summary charts. Example charts are pictured below; they provide an overview of how many columns you have and what the level of protection is across those columns.
Our goal is to help you keep your customers’ data safe and your developers productive. We believe that there shouldn’t be a tradeoff between data privacy and developer experience. Tonic Structural provides you with the tools you need to get realistic, safe test data quickly and confidently. We are constantly coming out with expanded features and products to make it easier to get there—let us know what you think and what capabilities you might want next!