Technical deep dive

Database subsetting is not a piece of cake, so we baked Condenser 2.0 just for you

Author

Andrew Colombi, PhD

Author

August 13, 2019

It’s hard to believe it but it’s been a year since we released the original version of Condenser, our open source database subsetting tool. Since then, Condenser has been deployed in a variety of situations, and we’ve learned a bunch along the way, specifically, what you need to make it work best for you. The culmination of this is the release of Condenser 2.0! Rejoice! 🎉

In this post, we’ll explore what we’ve learned about subsetting and the challenges our new release helps you overcome. If you’re new to subsetting and wondering how it could fit into your workflow, here are a few possibilities:

You’d like to use your production database in staging or test environments but the database is very large and you want to use only a portion of it.
You’d like a test database that contains a few specific rows from production (and related rows from other tables) so you can reproduce a bug.
You’re training a model on a massive dataset but your iteration cycle is slow because of the scale. You can create an unbiased subset of data to be used for testing and model evaluation.
You want to share data with others but you don’t want them to have all of it. Because #dataprivacy.

Condenser uses foreign key to traverse your database and grab only what’s necessary to extract the subset you’re looking for. In other words, if you have a database (🎂) that you’d like just a piece of (🍰), Condenser can make it for you.

Sweet, right? Sweet…but not actually a piece of cake.

The Challenges of Subsetting

Oh so many challenges. We explored the core questions in our original post announcing Condenser a year ago. Here, we’ll focus on the issues we encountered that led to the release of 2.0.

Direct vs. Iterative Subsetting

When it comes right down to it, this is the classic conundrum of communication. What you say isn’t always what your listener hears. Or, in our case, what we asked for in an algorithm wasn’t always what our program returned.

Condenser was originally released with an Iterative Subset algorithm. In short, Iterative Subset always starts at one end of a topological sort of your database’s tables and works its way across the sort to fill every table. It searches for the optimal subset iteratively by changing the rows in the starting table. Iterative Subset stops when some criteria has been met. This was purposefully vague because you can, in principle, provide almost any criteria. The main advantages of Iterative Subset are that it’s fast and you can express flexible termination criteria.

In practice, however, we have since found Direct Subset to be consistently better at creating subsets that match our expectations. Direct Subset works by starting from a set of initial targets. These are tables with subsetting goals, e.g. give me 5% of the users table and 5% of the vendors table. Direct subset begins by producing those subsets in your output database. Then it makes two passes on your database, ordered by a topological sort of the tables. In the first phase, it greedily includes rows from tables that are upstream of the initial targets. For example, the events table is upstream of users if there is a foreign key in events to users. The second phase grabs all required keys given the greedily derived database.

To contrast the two: Direct Subset always starts with the initial targets, which can be anywhere in the topological sort, and Iterative Subset always starts at the beginning of the topological sort. Direct Subset always ends after two passes, and Iterative Subset continues iterating until meeting a desired subset criteria. Iterative Subset has flexible criteria for ending iteration, Direct Subset merely guarantees that the initial targets will be as specified.

After extensive testing, we found Direct Subset produced more practical subsets than Iterative Subset (usually by bringing in more data), though it was often slower (usually because it brought in more data). So for Condenser 2.0, we replaced Iterative Subset with Direct Subset.

If you want to know even moar about how these two approaches work, feel free to email us at hello@tonic.ai. I’m confident my enthusiasm for subsetting will outlast yours. 😇

Foreign Key Handling

Given that Condenser relies on foreign keys to create its subsets, how they are handled is fundamental to the subsetting process. Condenser 1.0 was limited in the kinds of foreign keys it supported, thereby limiting the tool’s capacity to subset a variety of databases. It also didn’t allow for much configuration in terms of foreign key constraints. AKA complex subsetting be damned. To lift these restrictions and greatly expand Condenser’s capabilities, we’ve made two major improvements in terms of foreign key handling.

We now support many kinds of foreign keys, including most kinds of composite foreign keys. Furthermore, we added support for every data type as a foreign key. Note: foreign keys still must point to a primary key. (If you have a use case for a foreign key pointing to a unique index that’s not a primary key, file an issue in Github. We just haven’t run into that use case in practice, yet.)
Foreign keys can now be specified as part of the configuration. Many times, foreign keys exist in a database but the corresponding foreign key constraints do not. The fk_augmentation field in config.json lets you specify additional foreign keys, which will be unioned with the foreign key constraints present in the database.

Elephants and Dolphins are Besties

This one was a no brainer. Why offer an open source subsetting tool for only one of the most popular open source databases when, with just a little more effort, you can do it for two? Spread the open source love! In addition to Postgres, we’re happy to announce that Condenser 2.0 also supports MySQL.

Additional Enhancements

Consider these the icing on the cake:

The first is the ability to specify subsets with a WHERE clause as part of the definition of initial_targets. This is very useful when you want to pull out specific parts of a database, e.g. everything associated with customer_id = 5, rather than simply shrinking a database.
We’ve also added the concept of upstream_filters. During subsetting’s greedy phase (read the description of Direct Subset for more info), the subsetter grabs as many rows as possible from upstream tables. upstream_filters gives you an opportunity to restrict how greedy it is. For example, suppose you want the subset “all users in Brazil, and the last 30 days of web traffic”. You might specify this with an initial target on users with a where country = 'Brazil', and an upstream filter on events with a condition current_date - event_date < 30. You can specify an upstream filter on a table or on a column. When it’s on a column, it applies to all tables with that column.
keep_disconnected_tables has been added to let you control what happens to tables that aren’t connected to any initial_targets; this includes transitive connections.
We cleaned up the configuration so that it’s more consistent and there is only one file.
And of course, bug fixes…

Better Documentation and Examples

The metaphor’s not dead yet, folks. Here’s your cherry on top. Condenser now ships with example-config.json that gives examples of what you can do in config.json, as well as a shell config.json. And README.md has been augmented with a bunch more on how to install and configure Condenser. With that said, no documentation is ever perfect. Feel free to contact us at hello@tonic.ai if you run into any problems.

TL/DR? Here’s What’s New

“Direct Subset” has replaced the original “Iterative Subset” algorithm.
Many restrictions on the kinds of foreign keys supported have been lifted.
Foreign keys can be imported via JSON (in case your database doesn’t have foreign key constraints).
MySQL support has been added.
Many new configuration points to fine tune the subset have been introduced.
Better documentation.
Tons of bug fixes.

Need More Speed and Flexibility?

If Condenser doesn’t quite cut it for you, Tonic also offers a commercial subsetter. The commercial subsetter has more advanced algorithms for subsetting very large databases, additional subsetting options, a UI, and job scheduling. For more information, contact us at hello@tonic.ai.

Conclusion

Over the last year, Condenser has been a great tool for us as well as our customers. This is what inspired these additional investments in Condenser. We hope it’s helpful for you too. So, happy birthday Condenser! 🥳 And here’s to many more! 🥂

Want to make your data usable?

Unblock product innovation with safe, high-fidelity data de-identification and synthesis.

Book a demo

Andrew Colombi, PhD

Co-Founder & CTO

Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.