Customizing foreign key distribution in synthetic data with Tonic Fabricate

Author

May 29, 2025

Tonic Fabricate is a data synthesis platform for generating data from scratch, including relational databases, mock APIs, and unstructured datasets, to fuel product development and testing.

When you create a database and add tables in Fabricate, you can set up foreign key relationships between the tables to mimic the relationships and interdependencies required by your app.

For example, in a database of ecommerce transactions, for each order in the orders table:

The product column is a foreign key to the products table
The customer column is a foreign key to the customers table.

By default, the foreign key column contains a random number of instances of each value from the referenced column. So in our example, each product and each customer is used in a random number of orders.

However, you might want to have a little more control over how Fabricate uses the records from the referenced table. For example, you might want to only use customers from a specific location, or only allow any given product to be used up to 5 times.

In this guide, we'll show you how to configure how Fabricate populates a foreign key column to heighten the realism of your synthetic data and to better reflect how data behaves within your product.

Throughout, we'll be configuring a foreign key from an orders table to a customers table. In other words, configuring how often each customer is used.

Using a column to weight the referenced rows

One option is to assign a weight value to each row in the referenced table. The weight value is based on a value in the referenced table. Records with a higher weight are used more often.

From the Relative Prevalence dropdown list, select the column that contains the weight value.

The column with the weight information often uses the SQL generator to generate the value. It is also usually excluded from the data export.

Note that when you use a relative prevalence column, the number of rows does not change. For example, if your table previously contained 100 rows, then when you select a relative prevalence column, the number of rows stays at 100. The only thing that changes is the distribution of the rows within the table.

To use our example scenario (managing how customers are referred to in an orders table), when we reference customers from the customers table, we want most customers to be from California, a smaller number of customers to be from New York, and no customers from other states.

To do that, in the customers table, we add a weight column that uses the SQL generator, then provide SQL that checks the value of the state where the customer is located:

If the state is California, set weight to 10.
If the state is New York, set weight to 5.
Otherwise, set weight to 0.

We then select the weight column from the Relative Prevalence dropdown list.

Let's see how the orders table changes based on that configuration. Before we added the relative prevalence, the customers came from all across the country.

After we set the relative prevalence, the customers are all located in California and New York.

Synthesize realistic data for greenfield product development and AI model training.

Unblock product innovation with synthetic data generated from scratch and on demand.

Book a demo

Configuring how often to use each referenced row

The other way to manage foreign key distribution is to configure the cardinality. The cardinality determines the number of times to use each row from the referenced table.

Unlike the relative prevalence configuration, setting the cardinality does affect the number of rows in the table. When you set a cardinality, the row count fields at the top of the table are replaced with the formula that Fabricate uses to determine the number of rows.

For example:

To use a cardinality setting:

Check Base the number of rows on this foreign key relationship.
From the Distribution dropdown list, select the type of distribution to use.
Configure the distribution.

Let's go through the distribution options and how they work, starting with the simplest and ending with the most complex.

Fixed distribution

The Fixed distribution is very simple: use each row in the referenced table a specified number of times.

For a Fixed distribution, in the Exactly field, specify the number of times to use each row. For example, to use each row 5 times, set Exactly to 5.

The number of rows in the current table then becomes the number of rows in the referenced table multiplied by the value of Exactly.

For our example scenario, let's say there are 25 customers in the customers table, and we set Exactly to 5. As a result, each customer is used 5 times, and the orders table contains 125 rows - 5 rows per customer multiplied by 25 customers.

Uniform distribution

Next up is Uniform distribution. Instead of a fixed number of references, the Uniform distribution causes Fabricate to use each row a number of times that is within a specified range.

For a Uniform distribution:

In the Minimum field, provide the minimum number of times to use each record.
In the Maximum field, provide the maximum number of times to use each record.

For our example scenario, we configure a minimum of 1 and a maximum of 5. That means that each customer is used anywhere between 1 and 5 times in the orders table.

The total number of rows varies depending on how Fabricate handles the distribution. For example, when there are 25 customers, the number of rows could be anything from 25 (all customers used once) to 125 (all customers used 5 times).

Normal distribution

Similar to Uniform, but slightly more complex, is Normal distribution, which provides a more nuanced distribution of the referenced rows to give you greater control over the output.

In addition to the minimum and maximum, you also set a mean and a standard deviation.

From column

For the From Column distribution, Fabricate pulls the number of times to use each row from a column in the referenced table.

This is somewhat similar to the relevant prevalence configuration. You set up a column in the referenced table that uses the SQL generator to produce the value. But instead of a weight, the value is the number of times to use the row.

For the From Column distribution, from the column dropdown list, select the column in the referenced table that contains the number of times to use the row.

For our example scenario, we set up a use_count column in the customers table.

For customers from California, use_count is an integer between 1 and 10.
For customers from New York, use_count is an integer between 1 and 5.
For other customers, use_count is 0.

We then select use_count as the column to use to determine the number of times to use the row.

You can see that this is similar to the weight column that we configured for the relative prevalence, and has a similar effect: most customers are from California, fewer are from New York, and no customers are from other states.

Until column <= 0

And finally there's Until column <= 0.

This is a bit different from the other options. It indicates to use a referenced row until the value of a selected column in the current table (NOT in the referenced table) is equal to or less than 0.

You use this distribution in a use case where there is a specific amount to be consumed. That could be something like money, time, or distance. When that amount is consumed, the referenced row is no longer used.

A common use case for this is something like loan payments. A loans table contains a list of loans, and a loan payments table records the payments against the loans. Once a loan is paid off, you don't want to record any more payments against it.

To enforce this, you would add a SQL column to the loan payments table that tracks the remaining balance of the loan after the most recent payment. When the remaining balance is 0, the loan payments table no longer refers to that loan.

Realistic data synthesis with Tonic Fabricate

To recap, when a table contains a foreign key to another table, you can configure how Fabricate uses the records from that other table.

One option is to assign a weight to the referenced records. This option changes the distribution of the referenced records, but does not change the total number of records in the table that references it.

The other option is to use a cardinality option to determine how often to use referenced rows. This can be as simple as using each row the same number of times, or you can use more complex options that provide a range of values or use a calculated value. Using a cardinality option does affect the number of rows in the table.

Configuring a foreign key distribution option can increase the realism of your synthetic data and allows you to generate different sets of data to accommodate different use cases.

Want to learn more about Tonic Fabricate? Get started for free or connect with our team today.

Customizing foreign key distribution in synthetic data with Tonic Fabricate

Using a column to weight the referenced rows

Configuring how often to use each referenced row

Fixed distribution

Uniform distribution

Normal distribution

From column

Until column <= 0

Realistic data synthesis with Tonic Fabricate

Related Guides

Creating an enterprise test data strategy with Tonic Structural

How to generate synthetic data: a comprehensive guide

Guide to synthetic test data generation

Make your sensitive data usable for testing and development.