Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.
It’s hard to believe it but it’s been a year since we released the original version of Condenser, our open source database subsetting tool. Since then, Condenser has been deployed in a variety of situations, and we’ve learned a bunch along the way, specifically, what you need to make it work best for you. The culmination of this is the release of Condenser 2.0! Rejoice! 🎉
In this post, we’ll explore what we’ve learned about subsetting and the challenges our new release helps you overcome. If you’re new to subsetting and wondering how it could fit into your workflow, here are a few possibilities:
Condenser uses foreign key to traverse your database and grab only what’s necessary to extract the subset you’re looking for. In other words, if you have a database (🎂) that you’d like just a piece of (🍰), Condenser can make it for you.
Sweet, right? Sweet…but not actually a piece of cake.
Oh so many challenges. We explored the core questions in our original post announcing Condenser a year ago. Here, we’ll focus on the issues we encountered that led to the release of 2.0.
When it comes right down to it, this is the classic conundrum of communication. What you say isn’t always what your listener hears. Or, in our case, what we asked for in an algorithm wasn’t always what our program returned.
Condenser was originally released with an Iterative Subset algorithm. In short, Iterative Subset always starts at one end of a topological sort of your database’s tables and works its way across the sort to fill every table. It searches for the optimal subset iteratively by changing the rows in the starting table. Iterative Subset stops when some criteria has been met. This was purposefully vague because you can, in principle, provide almost any criteria. The main advantages of Iterative Subset are that it’s fast and you can express flexible termination criteria.
To contrast the two: Direct Subset always starts with the initial targets, which can be anywhere in the topological sort, and Iterative Subset always starts at the beginning of the topological sort. Direct Subset always ends after two passes, and Iterative Subset continues iterating until meeting a desired subset criteria. Iterative Subset has flexible criteria for ending iteration, Direct Subset merely guarantees that the initial targets will be as specified.
After extensive testing, we found Direct Subset produced more practical subsets than Iterative Subset (usually by bringing in more data), though it was often slower (usually because it brought in more data). So for Condenser 2.0, we replaced Iterative Subset with Direct Subset.
If you want to know even moar about how these two approaches work, feel free to email us at firstname.lastname@example.org. I’m confident my enthusiasm for subsetting will outlast yours. 😇
Given that Condenser relies on foreign keys to create its subsets, how they are handled is fundamental to the subsetting process. Condenser 1.0 was limited in the kinds of foreign keys it supported, thereby limiting the tool’s capacity to subset a variety of databases. It also didn’t allow for much configuration in terms of foreign key constraints. AKA complex subsetting be damned. To lift these restrictions and greatly expand Condenser’s capabilities, we’ve made two major improvements in terms of foreign key handling.
This one was a no brainer. Why offer an open source subsetting tool for only one of the most popular open source databases when, with just a little more effort, you can do it for two? Spread the open source love! In addition to Postgres, we’re happy to announce that Condenser 2.0 also supports MySQL.
Consider these the icing on the cake:
If Condenser doesn’t quite cut it for you, Tonic also offers a commercial subsetter. The commercial subsetter has more advanced algorithms for subsetting very large databases, additional subsetting options, a UI, and job scheduling. For more information, contact us at email@example.com.
Over the last year, Condenser has been a great tool for us as well as our customers. This is what inspired these additional investments in Condenser. We hope it’s helpful for you too. So, happy birthday Condenser! 🥳 And here’s to many more! 🥂