Maintaining data relationships in Structural generation output

March 26, 2024

When you use Tonic Structural to de-identify data, your first priority is likely to make sure that all of the sensitive values in your source data are protected in the output data.

But there is another aspect of the data transformation that, depending on your use case, is almost equally important—that the output data provides a realistic representation of your source data. And a key element of realistic output data is to maintain the existing relationships in your data—not just the more formal relationships enforced by built-in primary and foreign keys, but also:

Primary and foreign key relationships that aren’t enforced in your database
Relationships between different types of databases
Statistical relationships between different types of columns

Having realistic data improves the quality of your testing, which enables you to catch more bugs and catch them earlier.

In this guide, we'll talk about how the following Structural options help you to generate more realistic output data that preserves the relationships in your source data:

Column consistency
Cross-generation consistency
Generator linking
Column partitioning
Virtual foreign keys

Column consistency

Let's start with column consistency. In Structural, consistency means that a given value in a source data column is de-identified to the same value in the destination data. David Smith in the source data is always Michael Jones in the output data.

So how does that result in more realistic data? For one thing, it helps to reproduce what we call the cardinality of the data—the number of rows that have a given value or set of values. So if 50% of the records have David Smith, then with consistency enabled, 50% of the records also have Michael Jones.

Structural supports two types of column consistency: self-consistency and consistency with another column.

Consistency is available for selected Structural generators. Some generators only support self-consistency, and others support both.

Self-consistency

The first and most basic type of column consistency is self-consistency.

Self-consistency applies within a single column. For example, in a column that contains a first name, if the Name generator is self-consistent, then every time Structural sees the first name David, it replaces it with the name Michael.

Let's look at an example. Here's some preview data for a first name column that uses the Name generator. With consistency turned off, you can see that the same first name Melissa in the source data is changed to two different values—Jimmy and Cary—in the output data.

A screenshot from the Tonic Structural UI, showing the first name Melissa de-identified into two different values.

And here's the same data with consistency turned on. Melissa now changes to Rosella both times.

A screenshot from the Tonic Structural UI, showing the first name Melissa de-identified into the same value in two instances.

Primary key generators all support self-consistency. Because the generator you assign to a primary key is automatically assigned to the corresponding foreign key columns, the proportion of foreign key column values is the same in the source and output data.

And if you make the same generator self-consistent for different columns of the same data type, then the same source values in all of those columns always produce the same output values. So for all self-consistent Name generator columns across the source data, David becomes Michael.
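To make the idea concrete, here is a minimal Python sketch of self-consistency. It is not how Structural implements the Name generator; the replacement pool and the function names are assumptions made for illustration. The point is only that a consistent replacement depends on the source value alone, so repeated source values produce repeated output values.

```python
import hashlib
import random

# Hypothetical replacement pool. A real generator would draw from a far larger set.
REPLACEMENT_NAMES = ["Michael", "Rosella", "Jimmy", "Cary", "Peter", "Anna", "Lena", "Omar"]


def consistent_name(source_value: str) -> str:
    """Pick a replacement that depends only on the source value (consistency on)."""
    digest = hashlib.sha256(source_value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


def random_name(rng: random.Random) -> str:
    """Pick a replacement without looking at the source value (consistency off)."""
    return rng.choice(REPLACEMENT_NAMES)


source_column = ["Melissa", "David", "Melissa", "David", "David"]

# With consistency, every Melissa gets one replacement and every David gets one.
consistent_output = [consistent_name(value) for value in source_column]

# Without consistency, the two Melissa rows may or may not match.
rng = random.Random()
inconsistent_output = [random_name(rng) for _ in source_column]
```

With a pool this small, two different source names can land on the same replacement. The sketch only demonstrates that equal inputs give equal outputs.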

Consistency with another column

Consistency with another column is slightly different. When a column is consistent with another column instead of with itself, a given value in the other column always produces the same output value in the current column.

This is very much about preserving those "unofficial" relationships between values.

For example, you use the IP Address generator to make an IP address column consistent with a username column. In the output, for every instance of the value user1, the IP address is the same. This makes for more realistic data—a given user is more likely to always be associated with the same IP address.
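The username and IP address example can be sketched in Python as follows. This is an illustration under assumptions, not the IP Address generator itself: the function name is hypothetical, and the username is left unchanged here to keep the example short. The output IP address is derived from the username rather than from the original IP address.

```python
import hashlib
import ipaddress


def ip_for_user(username: str) -> str:
    """Derive an IPv4 address from the value of the other column (the username)."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    # The first 4 bytes of the digest give an integer in the IPv4 range.
    return str(ipaddress.IPv4Address(int.from_bytes(digest[:4], "big")))


# (username, original IP address) pairs; user1 appears with two different addresses.
rows = [
    ("user1", "10.0.0.5"),
    ("user2", "10.0.0.9"),
    ("user1", "192.168.1.20"),
]

# The original IP address is ignored; only the username decides the output.
output = [(username, ip_for_user(username)) for username, _ in rows]
```

In the output, both user1 rows carry the same IP address even though the source addresses differ.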

Cross-generation consistency

Column consistency is useful, but on its own it doesn't apply across different workspaces or data generations.

For example, the first data generation might change all instances of David to Michael, while the second data generation changes all instances of David to Peter.

Having consistent values across generations can be especially useful for automated testing. If a column always contains the same values, then you can more easily set up your testing to work with specific values.

To enable consistency across data generations, Structural allows you to specify a statistics seed. A statistics seed is a 32-bit signed integer. When a statistics seed is set, Structural generates values in the same way for every data generation. So David becomes Michael every single time.
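The effect of a seed can be sketched in Python. This is an assumption-laden illustration, not Structural's algorithm: the name pool, `make_generation`, and the use of HMAC are all hypothetical. Each data generation derives a key. With a seed the key is fixed, and without one the key is random for each generation.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Optional

NAMES = ["Michael", "Peter", "Rosella", "Jimmy", "Cary", "Anna", "Lena", "Omar"]
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def make_generation(seed: Optional[int]) -> Callable[[str], str]:
    """Return the name mapping used by one data generation."""
    if seed is None:
        # No seed: a fresh random key, so the mapping changes between generations.
        key = secrets.token_bytes(16)
    else:
        if not INT32_MIN <= seed <= INT32_MAX:
            raise ValueError("seed must fit in a 32-bit signed integer")
        key = seed.to_bytes(4, "big", signed=True)

    def mapper(value: str) -> str:
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
        return NAMES[int.from_bytes(digest[:8], "big") % len(NAMES)]

    return mapper


first_run = make_generation(seed=12345)
second_run = make_generation(seed=12345)
same_across_runs = first_run("David") == second_run("David")  # True with a shared seed
```

Within a single generation, both variants are still self-consistent. The seed only decides whether two separate generations agree with each other.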

You can enable cross-generation consistency across all workspaces, or for individual workspaces. This includes generations across workspaces that use different data connectors.

Enabling cross-generation consistency across all workspaces

To enable cross-generation consistency across all of your workspaces, set the statistics seed as the value of the Structural environment setting TONIC_STATISTICS_SEED.

Configuring cross-generation consistency for a workspace

You can also enable or disable cross-generation consistency for a specific workspace.

In the workspace settings, use the Override statistics seed configuration to override the environment setting.

A screenshot of the Tonic Structural UI showing the options for overriding the statistics seed.

So if Structural is set up to use cross-generation consistency, you can configure an individual workspace to not use it.

And if Structural does not use cross-generation consistency, you can provide a statistics seed for that workspace.
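The two cases above amount to a simple precedence rule, sketched here in Python. The function name, the `DISABLED` marker, and the way the override is represented are assumptions for illustration only. They do not reflect how Structural stores the setting.

```python
from typing import Optional, Union

# Hypothetical marker meaning "this workspace does not use a statistics seed".
DISABLED = "disabled"


def effective_seed(
    environment_seed: Optional[int],
    workspace_override: Union[int, str, None] = None,
) -> Optional[int]:
    """Return the seed a workspace would use, or None for no cross-generation consistency."""
    if workspace_override is None:
        # No override: fall back to the environment setting (which may be unset).
        return environment_seed
    if workspace_override == DISABLED:
        # The workspace opts out even if the environment provides a seed.
        return None
    # The workspace supplies its own seed.
    return workspace_override
```

For example, `effective_seed(42, DISABLED)` returns `None`, and `effective_seed(None, 7)` returns `7`.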

Generator linking

Generator linking is very much about preserving relationships between columns, specifically columns that are in the same table and that use the same generator.

You link columns to indicate that they are tightly coupled and that Structural must take that relationship into account when it generates the data.

The most common use of generator linking is for address fields. Linking helps ensure that the addresses in the output data make sense.

For example, the source data contains city and state values.

A screenshot of the Tonic UI, showing City and State pairings in two columns.

Both columns use the Address generator, but without linking, the output data contains pairs that don't make sense, like San Diego, Pennsylvania, and New York, California.

A screenshot of the Tonic UI, showing City and State mispairings in two columns.

But if you link the city column to the state column, then the output city and state pairings are realistic—San Diego is restored to its rightful place in California.
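The difference between unlinked and linked columns can be sketched in Python. This is an illustration only: the rows and function names are hypothetical, and sampling whole pairs from a small list stands in for whatever the Address generator actually does. Unlinked columns are generated independently, while linked columns are generated together.

```python
import random

# Hypothetical valid (city, state) pairs.
VALID_PAIRS = [
    ("San Diego", "California"),
    ("New York", "New York"),
    ("Pittsburgh", "Pennsylvania"),
    ("Austin", "Texas"),
]


def generate_unlinked(row_count: int, rng: random.Random) -> list:
    """Pick city and state independently, so mismatched pairs can appear."""
    cities = [city for city, _ in VALID_PAIRS]
    states = [state for _, state in VALID_PAIRS]
    return [(rng.choice(cities), rng.choice(states)) for _ in range(row_count)]


def generate_linked(row_count: int, rng: random.Random) -> list:
    """Pick city and state together, so every pair is a valid one."""
    return [rng.choice(VALID_PAIRS) for _ in range(row_count)]


rng = random.Random(0)
unlinked_rows = generate_unlinked(10, rng)
linked_rows = generate_linked(10, rng)
```

Every row in `linked_rows` is a pair from `VALID_PAIRS`, while `unlinked_rows` can contain combinations such as San Diego, Pennsylvania.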

Column partitioning

Column partitioning is another Structural feature that is about preserving relationships between columns. It allows the value of a column to be based on the values of other columns. It offers another way to generate a more realistic set of output values.

Partitioning is available for columns that use the Continuous and Event Timestamps generators, which both produce a distribution of values. You partition those columns based on other columns that use either Passthrough (the values are passed through as is) or the Categorical generator (the existing values are shuffled among the rows).

Step 1 - Creating the distributions

For each value or combination of values in the partitioning columns, Tonic Structural first generates a distribution of values for the original column.

For example, you partition an Income column by an Occupation column. For each Occupation value, Structural generates a distribution of Income values. In other words, it generates a range of incomes for each occupation, such as Doctor and Construction Worker.

For multiple columns, the distribution is for each combination of column values. For example, you partition Income by both Occupation and Region. Structural creates a distribution of income values for each combination of occupation and region. So there is a distribution for Doctor and Northeast, and a different distribution for Doctor and Southeast.

Step 2 - Selecting a value from the appropriate distribution

For each record in the destination database, Structural sets the value of the partitioned column to a value from the appropriate distribution. The distribution that Structural uses is based on the value of the partitioning columns in the destination database, not the original value of the partitioning columns in the source database.

To continue our example, assume that the Occupation column uses the Categorical generator. During data generation, Structural assigns to each record a random occupation value from the current values. For one of the records, the occupation value is Doctor in the source database and Construction Worker in the destination database.

For the Income column for that record, Structural assigns a value from the distribution of income values for the Construction Worker occupation. In other words, it assigns an income value that is realistic for the destination occupation value based on the source data.
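The two steps can be sketched in Python. This is a simplified illustration under assumptions: the sample rows are made up, a uniform draw between the minimum and maximum source income stands in for the real distribution, and a plain shuffle stands in for the Categorical generator.

```python
import random
from collections import defaultdict

# Hypothetical source rows: (occupation, income).
SOURCE_ROWS = [
    ("Doctor", 210000),
    ("Doctor", 250000),
    ("Doctor", 230000),
    ("Construction Worker", 52000),
    ("Construction Worker", 61000),
    ("Construction Worker", 58000),
]


def build_distributions(rows):
    """Step 1: collect the source incomes for each occupation."""
    incomes = defaultdict(list)
    for occupation, income in rows:
        incomes[occupation].append(income)
    return incomes


def generate(rows, rng):
    """Step 2: assign each record an income drawn for its destination occupation."""
    distributions = build_distributions(rows)
    # Stand-in for the Categorical generator: shuffle occupations among the rows.
    destination_occupations = [occupation for occupation, _ in rows]
    rng.shuffle(destination_occupations)
    output = []
    for occupation in destination_occupations:
        low = min(distributions[occupation])
        high = max(distributions[occupation])
        # The income comes from the destination occupation's range, not the source row's.
        output.append((occupation, round(rng.uniform(low, high))))
    return output


destination_rows = generate(SOURCE_ROWS, random.Random(1))
```

A record that ends up as Construction Worker always receives an income within the Construction Worker range, regardless of the occupation it had in the source data.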

Virtual foreign keys

Primary and foreign key relationships are vital to any set of data.

Structural enforces relationships between primary and foreign key columns—a foreign key column always uses the same generator configuration as the primary key column.

These relationships are especially important for subsetting, where Structural generates a smaller subset of data that still has referential integrity. The relationships ensure that your subset has all of the records it needs.

A flowchart showing the relationships between primary and foreign keys in a database.

But what if your data contains these relationships, but your database either doesn't define foreign keys at all or is missing some of them? How do you tell Structural about those relationships?

The virtual foreign key option in Structural allows you to define primary key and foreign key relationships. Structural can then respect those relationships during generator configuration and data generation.
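As a rough Python sketch of the idea, a virtual foreign key can be thought of as a declared link that makes the foreign key column follow the primary key column's transformation. The table layout, the declaration format, and `mask_key` are all hypothetical. They are not Structural's configuration format or its key generators.

```python
import hashlib

# Hypothetical tables with no foreign key constraint between them.
TABLES = {
    "customers": [{"id": 101, "plan": "pro"}, {"id": 102, "plan": "free"}],
    "orders": [
        {"order_id": 1, "customer_ref": 101},
        {"order_id": 2, "customer_ref": 102},
        {"order_id": 3, "customer_ref": 101},
    ],
}

# Hypothetical declaration: orders.customer_ref refers to customers.id.
VIRTUAL_FOREIGN_KEYS = [
    {"fk_table": "orders", "fk_column": "customer_ref",
     "pk_table": "customers", "pk_column": "id"},
]


def mask_key(value: int) -> int:
    """Consistent key transformation: the same input always gives the same output."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def generate(tables, virtual_foreign_keys):
    """Apply the same key transformation to each declared primary and foreign key column."""
    key_columns = set()
    for fk in virtual_foreign_keys:
        key_columns.add((fk["pk_table"], fk["pk_column"]))
        key_columns.add((fk["fk_table"], fk["fk_column"]))
    return {
        table: [
            {col: mask_key(val) if (table, col) in key_columns else val
             for col, val in row.items()}
            for row in rows
        ]
        for table, rows in tables.items()
    }


output = generate(TABLES, VIRTUAL_FOREIGN_KEYS)
```

Because both columns use the same transformation, every `customer_ref` in the output orders still matches an `id` in the output customers.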

Recap

The Structural features that we covered here are all ways to help ensure that your transformed output data maintains the relationships that are in your source data. This produces more realistic, useful data and can help support automated testing.

Column consistency ensures that a given source value produces the same destination value.

Cross-generation consistency extends the consistency across data generations and databases.

Linking preserves the relationships between fields in a table, such as address fields.

Column partitioning also helps to preserve relationships between field values, so that a destination value makes sense in the context of other fields.

Virtual foreign keys allow you to manually configure foreign key relationships that are present but that are not formally defined in your data.

To learn more about how to de-identify your data while preserving its relationships and underlying business logic, connect with our team or start a free trial of Tonic Structural today.