TL;DR: Docker is a great tool for packaging and distributing test databases among your team and testing environment. We’ve created a starter repository to show these concepts in PostgreSQL and MySQL on Github if you’d like to dig in immediately while reviewing our process below.
Containers are a valuable resource for us developers. They give you the ability to package your production code, dependencies, and assets into a standardized unit. Once conveniently packaged, running your application in a variety of environments is a breeze. And yet despite containerization’s immense value, we often see teams overlook containers entirely when it comes to managing test databases. At Tonic, whether we’re helping our customers improve testing with high-quality test databases or building our own testing environments, we rely on Docker as a key part of the pipeline. Our customers are finding a lot of value in the approach, so we thought we’d share our strategy.
The Trouble With Test Databases
Let’s take a look at the typical growth of a test database for a team:
- You and your team realize that testing in production is a bad idea. You’ve heard about it or you’ve felt the pain yourself: mistakes that lead to data loss and time spent restoring from backups; poor reliability in production due to increased database load; the security team raising their eyebrows at giving too many people access to sensitive data; and the omnipresent anxiety that testing might take down production.
- You decide you need data that’s similar to your production environment, so you write some scripts to generate some fake data (like faker), use some free or paid services to generate some random data for you (like Mockaroo), or extract data from production and attempt to anonymize it later (dealing with the mess of maintaining referential integrity).
- You shove either those scripts or data files into your code repo.
- You write in a README the loader command to get it into your database of choice, and like a rite of passage, everyone on the team struggles through getting it to work during onboarding.
And struggle you will! Databases are notorious for requiring large installations with a multitude of dependencies, navigating arcane configuration, and the extensive work of establishing the test dataset: creating schemas, creating tables, and finally loading your generated data.
Many teams will have a separate installation process for each operating system their developers use, each of which usually takes a good bit of trial and error after poring over database documentation. Others will set up a staging or test server for their team to use, but it risks becoming out-dated without a regular rebuild, it means a lot of coordination between team members, and there is rarely a one-size-fits-all test environment. For example, when you want to test the scale of your application, your entire team is saddled with a giant database that slows everyone down; likewise, too small of a database can limit effective testing for certain projects.
Wouldn’t it be great if there were a tool that made it easy to package a database, its dependencies, loader scripts, and its data for any operating system? That any team member could use to easily test their code against a test environment, be it on their local machine or on a quickly spun-up test server in the cloud?
Doing Better with Docker
Good news, everyone! There is a better way! The many benefits that Docker provides for shipping your code in production work likewise for testing. You can create a database that is easy to distribute, deploy, and reset so that individuals and teams can work effectively without stepping on each other’s toes. For larger organizations, you can even package multiple test databases that each contain different tables or amounts of data, easily available to everyone in your engineering org. Best of all, you don’t need a different local installation README for each operating system—developers can just use the OS they feel most productive in without the headaches of the past.
This configuration does the following:
Right off the bat, we’ve removed a lot of work from the README, and if you have data or data-generating scripts in your repo, you’ve made it much easier to get up and running quickly by simply pointing them to load into the database via the host part. Many databases have official images, including Postgres, MySQL, MS SQL Server, Oracle, and Mongo, among others.
Even Better: Packaging Your Data & Scripts with Docker
Instead of just using the available database image and calling it a day, you can easily create a new Docker image based off of your database’s official image. Many official images have an entry point folder that allows you to run scripts upon initialization, enabling the container to be immediately useful as soon as it’s up. To do this, you’ll likely need to add a Dockerfile and a build script, and potentially any data import scripts.
Making a basic Dockerfile is not hard at all, in fact it’s only two lines:
Here we did the following:
In our code repository, you’ll see we’ve commented out additional options to consider for your use case, such as:
- Adding any additional dependencies, like database extensions or certificates using typical package manager commands.
- Adding data files to an accessible directory for use with data import scripts.
- Adding custom scripts to import data outside of the container, e.g. via S3 or other data storage.
Next, you’ll need to create SQL loader scripts that create your schema and load in your data. Typically the easiest way to do this is by using database dump commands with an existing test database. We recommend three scripts: the first one creates your schema without constraints, the second one loads your data, and the third one adds your constraints to all of your tables. This way, you’re able to load your data without worrying about the order in which it’s loaded (which would matter if foreign key constraints were already in place).
Some databases will turn off constraints using a data import tool until the import is complete, which means you can just keep your constraints in the schema script. For simplicity, in our example code we only use two scripts since our data script loads tables in order and doesn’t break referential integrity.
Sticking with PostgreSQL as our example, you can run the following command to get a dump of your existing schema and constraints:
Followed by a similar command to get just the data:
Notably, we’ve added numbers to the beginning of each filename to ensure that the schema script is run before the data load script. (Postgres runs the scripts in this folder in alphabetical order.)
Take a look at our code to see the full output of both of these files. Feel free to modify these files as you like or write them from scratch, especially if you plan to load your data using a COPY or LOAD command.
Modifying your docker-compose.yml
Setting up CI
Now you can check your Dockerfile, scripts, and data files into a code repo, and create a build for the Docker container in your CI service of choice using your build script. Any time the scripts or data files are changed, we recommend triggering a new release build for the container and pushing it to your container repository for use by your entire team (of course, after you’ve checked that nothing new caused the build to break 😉).
Next Level Test Data Management
If you’ve followed the steps thus far, you’ll likely find that your testing setup is much more reliable and useful to your entire team, and it’s a big leap forward in efficiency.
From here, there are a lot of ways that you can utilize your test image to make your team more productive. You can load a much larger dataset using custom data import scripts from services like S3. You can create dedicated test servers for bigger experimentation before a reset. You can create multiple test images covering a variety of use cases, e.g., target different tables for different teams, allowing engineers to better focus on the test cases that matter to them.
Of course, you’ve likely recognized that one of the challenging parts of keeping test infrastructure useful is that it needs to maintain a resemblance to your production data. This means adding new tables, adding new data, modifying existing tables and data, etc. on a regular basis. Otherwise, developers will simply revert to testing against production again, or be left to test against a fragile, costly staging environment—even if it’s easy to deploy!
Creating up-to-date test data from your production environments is Tonic’s specialty. Tonic makes it easy to copy your schema, shrink your database, and obfuscate your data for privacy while maintaining statistical relationships and referential integrity throughout your team’s data. What’s more, Tonic alerts you to schema changes to protect against data leaks and ensure that your test data never breaks or gets outdated. Interested? Shoot us an email at firstname.lastname@example.org.
Getting test data right is a difficult but immensely valuable problem to solve for your team. Docker is one of many tools that makes it much easier to maintain your test infrastructure in a way that enhances flexibility while keeping server costs and team hours down. Happy building!
Hat tip to Craig Kerstiens of Crunchy Data and postgresweekly.com, and Kevin Xu of interconnected.blog for weighing in on our process and providing some great feedback as we crafted this post.