Database Subsetting Is Not a Piece of Cake, So We Baked Condenser 2.0 Just for You
Andrew Colombi, PhD
August 13, 2019
It’s hard to believe it but it’s been a year since we released the original version of Condenser, our open source database subsetting tool. Since then, Condenser has been deployed in a variety of situations, and we’ve learned a bunch along the way, specifically, what you need to make it work best for you. The culmination of this is the release of Condenser 2.0! Rejoice! 🎉
In this post, we’ll explore what we’ve learned about subsetting and the challenges our new release helps you overcome. If you’re new to subsetting and wondering how it could fit into your workflow, here are a few possibilities:
You’d like to use your production database in staging or test environments but the database is very large and you want to use only a portion of it.
You’d like a test database that contains a few specific rows from production (and related rows from other tables) so you can reproduce a bug.
You’re training a model on a massive dataset but your iteration cycle is slow because of the scale. You can create an unbiased subset of data to be used for testing and model evaluation.
You want to share data with others but you don’t want them to have all of it. Because #dataprivacy.
Condenser uses foreign key to traverse your database and grab only what’s necessary to extract the subset you’re looking for. In other words, if you have a database (🎂) that you’d like just a piece of (🍰), Condenser can make it for you.
Sweet, right? Sweet…but not actually a piece of cake.
The Challenges of Subsetting
Oh so many challenges. We explored the core questions in our original post announcing Condenser a year ago. Here, we’ll focus on the issues we encountered that led to the release of 2.0.
Direct vs. Iterative Subsetting
When it comes right down to it, this is the classic conundrum of communication. What you say isn’t always what your listener hears. Or, in our case, what we asked for in an algorithm wasn’t always what our program returned.
Condenser was originally released with an Iterative Subset algorithm. In short, Iterative Subset always starts at one end of a topological sort of your database’s tables and works its way across the sort to fill every table. It searches for the optimal subset iteratively by changing the rows in the starting table. Iterative Subset stops when some criteria has been met. This was purposefully vague because you can, in principle, provide almost any criteria. The main advantages of Iterative Subset are that it’s fast and you can express flexible termination criteria.
To contrast the two: Direct Subset always starts with the initial targets, which can be anywhere in the topological sort, and Iterative Subset always starts at the beginning of the topological sort. Direct Subset always ends after two passes, and Iterative Subset continues iterating until meeting a desired subset criteria. Iterative Subset has flexible criteria for ending iteration, Direct Subset merely guarantees that the initial targets will be as specified.
After extensive testing, we found Direct Subset produced more practical subsets than Iterative Subset (usually by bringing in more data), though it was often slower (usually because it brought in more data). So for Condenser 2.0, we replaced Iterative Subset with Direct Subset.
If you want to know even moar about how these two approaches work, feel free to email us at firstname.lastname@example.org. I’m confident my enthusiasm for subsetting will outlast yours. 😇
Foreign Key Handling
Given that Condenser relies on foreign keys to create its subsets, how they are handled is fundamental to the subsetting process. Condenser 1.0 was limited in the kinds of foreign keys it supported, thereby limiting the tool’s capacity to subset a variety of databases. It also didn’t allow for much configuration in terms of foreign key constraints. AKA complex subsetting be damned. To lift these restrictions and greatly expand Condenser’s capabilities, we’ve made two major improvements in terms of foreign key handling.
Elephants and Dolphins are Besties
This one was a no brainer. Why offer an open source subsetting tool for only one of the most popular open source databases when, with just a little more effort, you can do it for two? Spread the open source love! In addition to Postgres, we’re happy to announce that Condenser 2.0 also supports MySQL.
Consider these the icing on the cake:
Better Documentation and Examples
TL/DR? Here’s What’s New
“Direct Subset” has replaced the original “Iterative Subset” algorithm.
Many restrictions on the kinds of foreign keys supported have been lifted.
Foreign keys can be imported via JSON (in case your database doesn’t have foreign key constraints).
MySQL support has been added.
Many new configuration points to fine tune the subset have been introduced.
Tons of bug fixes.
Need More Speed and Flexibility?
If Condenser doesn’t quite cut it for you, Tonic also offers a commercial subsetter. The commercial subsetter has more advanced algorithms for subsetting very large databases, additional subsetting options, a UI, and job scheduling. For more information, contact us at email@example.com.
Over the last year, Condenser has been a great tool for us as well as our customers. This is what inspired these additional investments in Condenser. We hope it’s helpful for you too. So, happy birthday Condenser! 🥳 And here’s to many more! 🥂
Andrew Colombi, PhD
Co-Founder & CTO
Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.