Building vs buying test data infrastructure for ephemeral data environments

Author

Andrew Colombi, PhD

February 23, 2024

Last month we launched Tonic Ephemeral, our database-on-demand platform for test data infrastructure. While building it, we learned a few things about running Kubernetes-backed data infrastructure that I’d like to share with anyone interested in building something similar on their own. This post highlights a few interesting technology and product problems we had to solve to bring Ephemeral into existence.

The secret to efficient data loading

Once you’ve got containers running your favorite database, you need to load them with data efficiently. At Tonic, we’ve spent a lot of time thinking about this problem; at this point, I’d say we’ve got folks on our team who are experts at it. Tools like pg_restore (for Postgres) or impdp (for Oracle) are the obvious choice. But starting an empty database and then populating it with these tools is a sure way to create long waits before your database is ready. At Tonic, our CI uses ephemeral databases, and we found that loading them with pg_restore was far too slow for CI/CD.

Tools like pg_restore and impdp read dump formats that are meant to store data in a portable way, so that you can migrate it from one database to another. The downside of this approach is that the data must be reinterpreted on its way into the database. This means, for example, that indexes must be rebuilt and constraints re-checked. These operations are very expensive even in modestly sized databases.

We’ve found the best approach is to operate at the file-system level. With some insider knowledge of how the database works, you can carve out the bits of the database that are just the data. Of course, databases aren’t expecting this kind of operation (it’s rude to go digging into an application’s files), so you need to be careful if you do. For example, databases often let users configure alternative paths to store data (e.g. “tablespaces” in Oracle and “filegroups” in SQL Server), so the data may not all live in one place. Databases also make no promises about the state of their files while in operation. In short, if you copy files carelessly, you might end up with a partial, corrupt set of files. And nobody wants that.
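To make the precautions above concrete, here is a minimal sketch of a file-level copy for Postgres. It is an illustration under stated assumptions, not how Ephemeral does it: the function name clone_data_dir and the idea of a pre-loaded "template" data directory are hypothetical. It relies on two Postgres behaviors: the server keeps a postmaster.pid file in its data directory while running, and tablespaces appear as symlinks under pg_tblspc.

```python
import shutil
from pathlib import Path


def clone_data_dir(template: Path, target: Path) -> None:
    """Copy a cleanly shut down Postgres data directory to a new location.

    Hypothetical helper: `template` is assumed to be a data directory that
    was loaded once and then stopped cleanly.
    """
    # Postgres keeps postmaster.pid in the data directory while the server
    # runs (and a stale one can survive a crash). Either way, the files on
    # disk cannot be trusted to be consistent.
    if (template / "postmaster.pid").exists():
        raise RuntimeError("template is running or crashed; stop it cleanly first")

    # Tablespaces show up as symlinks under pg_tblspc that point outside the
    # data directory, so a plain copy would miss the data behind them.
    tblspc = template / "pg_tblspc"
    if tblspc.is_dir() and any(p.is_symlink() for p in tblspc.iterdir()):
        raise RuntimeError("template uses tablespaces outside the data directory")

    # symlinks=True copies links as links instead of following them.
    shutil.copytree(template, target, symlinks=True)

    # Postgres refuses to start unless the data directory is 0700 or 0750.
    target.chmod(0o700)
```

A new database container could then be pointed at the copied directory instead of restoring a dump, which skips the index rebuilds and constraint checks entirely. On a copy-on-write file system or volume snapshot, the copy step itself can be replaced with something far cheaper.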

Automatic shutdowns

One thing we really focused on for Tonic Ephemeral is automatic database shutdown. Test infrastructure should only run when it’s needed. Running an unused database is a quick way to waste money, and test infrastructure is especially susceptible to being left running by accident. To do this well, you need to know when a database is in use; otherwise you can’t tell when it’s idle and safe to shut down. And you might be surprised to hear that this is actually pretty darned hard to do. There are two main things to consider here: connection activity and query activity. But not all connections, and not all activity, are created equal, and this is where things get tricky. Can you distinguish a developer’s client idly keeping a connection open from an application that’s holding a connection? If you’re not careful, it’s really easy to let idle connections ruin your automatic shutdown logic.
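As a rough illustration of separating connection activity from query activity, the sketch below judges a database by what its sessions are doing rather than by whether any are connected. The Session type and both function names are hypothetical. In Postgres, the two fields could be filled from the state and state_change columns of pg_stat_activity; treating an open transaction as activity is a policy assumption, not a rule from Ephemeral.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Session:
    state: str              # e.g. "active", "idle", "idle in transaction"
    state_change: datetime  # when the session last changed state


def last_activity(sessions: Iterable[Session], now: datetime) -> Optional[datetime]:
    """Return the most recent moment any session was doing work, or None."""
    latest: Optional[datetime] = None
    for s in sessions:
        # An idle session was last busy when it went idle. Anything else
        # (running a query, holding a transaction open) counts as busy now.
        seen = s.state_change if s.state == "idle" else now
        if latest is None or seen > latest:
            latest = seen
    return latest


def is_inactive(sessions: Iterable[Session], now: datetime,
                idle_timeout: timedelta) -> bool:
    """True when no session has done any work for at least idle_timeout."""
    seen = last_activity(sessions, now)
    return seen is None or now - seen >= idle_timeout
```

With this approach, a SQL client left open overnight holds a connection but stops counting as use once it has been idle longer than the timeout, while a single running query keeps the database alive.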

In Ephemeral, we offer a few ways to schedule automatic shutdown of a database. First, you can set business hours for your DB. No sense in running the DB if there’s no one around to use it. Second, you can schedule the DB to shut down after a period of inactivity. Even for the business hours use case, you still need to understand inactivity. If Sam works late on Tuesday, it’s going to be especially frustrating if the database he’s using automatically shuts down at 6pm when he’s in the middle of actively querying it. Ephemeral won’t shut down a database that’s actively in use, irrespective of the schedule you put on it.

A screenshot of Tonic Ephemeral's UI for setting database expiration timers.
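One way to combine a business-hours schedule with an inactivity check is sketched below. This is a hypothetical policy, not Ephemeral’s actual logic: the function name, the parameters, and the idea of a shorter "grace" window outside business hours are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta


def should_shut_down(now: datetime, last_activity: datetime,
                     open_at: time, close_at: time,
                     idle_timeout: timedelta, grace: timedelta) -> bool:
    """Decide whether a database can be stopped right now.

    Inside business hours, wait for the full idle_timeout. Outside them,
    stop as soon as the database has been quiet for the shorter grace
    period, so a database in active use is never stopped by the schedule.
    """
    idle_for = now - last_activity
    in_hours = open_at <= now.time() < close_at
    return idle_for >= (idle_timeout if in_hours else grace)
```

With business hours of 9am to 6pm and a ten-minute grace period, Sam’s database stays up at 6:30pm as long as he ran a query within the last ten minutes, and shuts down shortly after he stops.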

The long last mile

And of course solving the meaty technical problems is only half the battle. (Actually, maybe not even half!) To deliver the promise of ephemeral environments to your engineers, there’s so much more.

An intuitive, easy-to-use user interface. Not all engineers want to become infrastructure experts. Giving users a clear and simple way to provision databases means hiding that complexity without taking away the flexibility they need.

Comprehensive APIs. Ephemeral environments are great for automation too. That means creating equally great APIs!

Access controls and auditing. Enterprise software needs enterprise access controls. Services that provision infrastructure need mechanisms to control and monitor access.

Maintenance. We all know that software is never “done.” As database and cloud technologies evolve, so too must the test data infrastructure software that manages it all.

The point is, while software developers gravitate toward estimating costs tied to the meaty technical challenges, often it’s the little things that add up (and that persist over time).

Closing thoughts

Building Tonic Ephemeral has given us a deep appreciation for the challenges of creating a self-service ephemeral data infrastructure platform. If building and maintaining all of this doesn’t seem appealing to you, but you still want the benefits of ephemeral infrastructure, we have something that might interest you. ;-)

Andrew Colombi, PhD

Co-Founder & CTO

Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.