
Using Docker to Manage Your Test Database(s)

Johnny Goodnow
December 10, 2020

    TL;DR: Docker is a great tool for packaging and distributing test databases across your team and testing environments. We’ve created a starter repository on GitHub that demonstrates these concepts for PostgreSQL and MySQL, if you’d like to dig in immediately while reviewing our process below.

    Containers are a valuable resource for us developers. They give you the ability to package your production code, dependencies, and assets into a standardized unit. Once conveniently packaged, running your application in a variety of environments is a breeze. And yet despite containerization’s immense value, we often see teams overlook containers entirely when it comes to managing test databases. At Tonic, whether we’re helping our customers improve testing with high-quality test databases or building our own testing environments, we rely on Docker as a key part of the pipeline. Our customers are finding a lot of value in the approach, so we thought we’d share our strategy.

    The Trouble With Test Databases

    Let’s take a look at the typical growth of a test database for a team:

    1. You and your team realize that testing in production is a bad idea. You’ve heard about it or you’ve felt the pain yourself: mistakes that lead to data loss and time spent restoring from backups; poor reliability in production due to increased database load; the security team raising their eyebrows at giving too many people access to sensitive data; and the omnipresent anxiety that testing might take down production.
    2. You decide you need data that’s similar to your production environment, so you write scripts to generate fake data (using a library like Faker), use a free or paid service to generate random data for you (like Mockaroo), or extract data from production and attempt to anonymize it after the fact (and deal with the mess of maintaining referential integrity).
    3. You shove either those scripts or data files into your code repo.
    4. You write in a README the loader command to get it into your database of choice, and like a rite of passage, everyone on the team struggles through getting it to work during onboarding.

    And struggle you will! Databases are notorious for requiring large installations with a multitude of dependencies, arcane configuration, and the extensive work of establishing the test dataset: creating schemas, creating tables, and finally loading your generated data.

    Many teams will have a separate installation process for each operating system their developers use, each of which usually takes a good bit of trial and error after poring over database documentation. Others will set up a staging or test server for their team to use, but it risks becoming outdated without regular rebuilds, it requires a lot of coordination between team members, and there is rarely a one-size-fits-all test environment. For example, when you want to test the scale of your application, your entire team is saddled with a giant database that slows everyone down; likewise, too small a database can limit effective testing for certain projects.

    Wouldn’t it be great if there were a tool that made it easy to package a database, its dependencies, loader scripts, and its data for any operating system? That any team member could use to easily test their code against a test environment, be it on their local machine or on a quickly spun-up test server in the cloud?

    Doing Better with Docker

    Good news, everyone! There is a better way! The many benefits that Docker provides for shipping your code to production apply just as well to testing. You can create a database that is easy to distribute, deploy, and reset so that individuals and teams can work effectively without stepping on each other’s toes. For larger organizations, you can even package multiple test databases that each contain different tables or amounts of data, easily available to everyone in your engineering org. Best of all, you don’t need a different local installation README for each operating system; developers can just use the OS they feel most productive in without the headaches of the past.

    The Basics

    The simplest way to get started is to use a vanilla database image for your database of choice. By using docker-compose, you can set up the configuration once, and it’s just a docker-compose up -d to start the database. Here’s a basic configuration below:

    Code: https://gist.github.com/omedhabib/af9258ab47f65231bd868afa0ed69542.js
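
    For reference, a minimal docker-compose.yml along these lines might look like the sketch below; the gist above is the canonical version, and its exact keys and values may differ slightly.

    version: "3"
    services:
      testdb_postgres:
        container_name: testdb_postgres
        image: postgres:12
        ports:
          - "5432:5432"
        volumes:
          - /tmp:/tmp
        environment:
          POSTGRES_USER: user
          POSTGRES_PASSWORD: password
          POSTGRES_DB: testdb_postgres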

    This configuration does the following:

    • Creates a container named testdb_postgres that runs off of a PostgreSQL 12 image
    • Maps Postgres’s default port 5432 to port 5432 on your host machine
    • Mounts the container’s /tmp folder to your host machine’s /tmp folder
    • Creates a new user with credentials user/password, and a new database called testdb_postgres

    Right off the bat, we’ve removed a lot of work from the README, and if you have data or data-generating scripts in your repo, you’ve made it much easier to get up and running quickly by simply pointing them to load into the database via the host port. Many databases have official images, including Postgres, MySQL, MS SQL Server, Oracle, and Mongo, among others.
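
    For example, with the container above running, a data file or generator script can be loaded over the exposed host port with the standard client tools; the seed.sql filename here is just a placeholder:

    psql -h localhost -p 5432 -U user -d testdb_postgres -f seed.sql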

    Even Better: Packaging Your Data & Scripts with Docker

    Instead of just using the available database image and calling it a day, you can easily create a new Docker image based on your database’s official image. Many official images have an entry point folder that lets you run scripts upon initialization, so the container is immediately useful as soon as it’s up. To do this, you’ll likely need to add a Dockerfile, a build script, and potentially some data import scripts.

    Dockerfile

    Making a basic Dockerfile is not hard at all; in fact, it’s only two lines:

    Code: https://gist.github.com/omedhabib/56d78dad8c34ea169fe0d599207cefdb.js
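
    A Dockerfile along those lines might look like the sketch below, assuming the postgres:12 base image from the compose file and the sql subdirectory described next; the gist above has the exact version:

    FROM postgres:12
    COPY sql/ /docker-entrypoint-initdb.d/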

    Here we did the following:

    1. We started by defining a Dockerfile with a FROM declaration pointing to the base image of your database.
    2. We copied over all of our data import scripts (defined in a subdirectory named sql) to the /docker-entrypoint-initdb.d/ directory. Postgres defines this folder for SQL scripts to be run during initialization.

    In our code repository, you’ll see we’ve commented out additional options to consider for your use case, such as:

    • Adding any additional dependencies, like database extensions or certificates, using typical package manager commands.
    • Adding data files to an accessible directory for use with data import scripts.
    • Adding custom scripts to import data outside of the container, e.g. via S3 or other data storage.

    Build Script

    Here we simply create a shell script to make it easier to build the image. Our build.sh contains the following command, which merely tags the new image as testdb_postgres and specifies the Dockerfile to use.

    Code: https://gist.github.com/chiarabeth/5fa80ff194eba5a2932c52ea97f62197.js
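
    The command in question is roughly the following (a sketch; see the gist above for the exact script):

    #!/bin/bash
    docker build -t testdb_postgres -f Dockerfile .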

    If you were building this in a CI environment, we’d recommend giving the tag a unique version for each build and release, passed as a script argument, such as testdb_postgres:1.0.4.

    SQL Scripts

    Next, you’ll need to create SQL loader scripts that create your schema and load in your data. Typically the easiest way to do this is by using database dump commands with an existing test database. We recommend three scripts: the first one creates your schema without constraints, the second one loads your data, and the third one adds your constraints to all of your tables. This way, you’re able to load your data without worrying about the order in which it’s loaded (which would matter if foreign key constraints were already in place).
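
    As a sketch of what that three-script split might contain, here’s a minimal example with a hypothetical pair of users and orders tables; the table names, columns, and file paths are purely illustrative:

    -- 01_schema.sql: create tables, leaving out keys and constraints
    CREATE TABLE users (id INT, name TEXT);
    CREATE TABLE orders (id INT, user_id INT);

    -- 02_data.sql: load data without worrying about order
    COPY users FROM '/tmp/users.csv' WITH (FORMAT csv);
    COPY orders FROM '/tmp/orders.csv' WITH (FORMAT csv);

    -- 03_constraints.sql: add keys and constraints once the data is in place
    ALTER TABLE users ADD PRIMARY KEY (id);
    ALTER TABLE orders ADD PRIMARY KEY (id);
    ALTER TABLE orders ADD FOREIGN KEY (user_id) REFERENCES users (id);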

    Some databases’ data import tools will disable constraints until the import is complete, which means you can just keep your constraints in the schema script. For simplicity, our example code uses only two scripts, since our data script loads tables in order and doesn’t break referential integrity.

    Sticking with PostgreSQL as our example, you can run the following command to get a dump of your existing schema and constraints:

    Code: https://gist.github.com/chiarabeth/9f9914173829a9d10714d01f1b3e8850.js
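
    With pg_dump, the schema-only dump looks something like this; the connection flags match the compose file above and the output filename follows the numbering convention noted below, but adjust both for your own setup:

    pg_dump -h localhost -p 5432 -U user --schema-only testdb_postgres > sql/01_schema.sql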

    Followed by a similar command to get just the data:

    Code: https://gist.github.com/chiarabeth/9ca54d80d5ad2348c800e732ca204a2c.js
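
    Under the same assumptions, that command would be along the lines of:

    pg_dump -h localhost -p 5432 -U user --data-only testdb_postgres > sql/02_data.sql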

    Notably, we’ve added numbers to the beginning of each filename to ensure that the schema script is run before the data load script. (Postgres runs the scripts in this folder in alphabetical order.)

    Take a look at our code to see the full output of both of these files. Feel free to modify these files as you like or write them from scratch, especially if you plan to load your data using a COPY or LOAD command.

    Modifying your docker-compose.yml

    Lastly, update the image you’re using to the name of your newly tagged one: image: testdb_postgres. If you’re versioning your image when you build it, we recommend specifying a stable release tag such as testdb_postgres:stable so that users can pull the latest update to that tag with docker-compose pull.
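
    In the compose file, that change is just the image line, shown here with the stable tag convention:

    services:
      testdb_postgres:
        image: testdb_postgres:stable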

    Setting up CI

    Now you can check your Dockerfile, scripts, and data files into a code repo, and create a build for the Docker image in your CI service of choice using your build script. Any time the scripts or data files change, we recommend triggering a new release build for the image and pushing it to your container registry for use by your entire team (of course, after you’ve checked that nothing new caused the build to break 😉).
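
    In most CI services, the release job boils down to a few shell steps along these lines; the registry path and version number are placeholders:

    ./build.sh
    docker tag testdb_postgres registry.example.com/testdb_postgres:1.0.4
    docker tag testdb_postgres registry.example.com/testdb_postgres:stable
    docker push registry.example.com/testdb_postgres:1.0.4
    docker push registry.example.com/testdb_postgres:stable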

    Once everything is set up, it should be as simple as distributing the docker-compose.yml file to your team and running a docker-compose up.
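
    Pulling the latest image and starting it then looks like this on each developer’s machine:

    docker-compose pull
    docker-compose up -d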

    Next Level Test Data Management

    If you’ve followed the steps thus far, you’ll likely find that your testing setup is much more reliable and useful to your entire team, and it’s a big leap forward in efficiency.

    From here, there are a lot of ways that you can utilize your test image to make your team more productive. You can load a much larger dataset using custom data import scripts from services like S3. You can create dedicated test servers for bigger experimentation before a reset. You can create multiple test images covering a variety of use cases, e.g., targeting different tables for different teams, allowing engineers to better focus on the test cases that matter to them.

    Of course, you’ve likely recognized that one of the challenging parts of keeping test infrastructure useful is that it needs to maintain a resemblance to your production data. This means adding new tables, adding new data, modifying existing tables and data, etc. on a regular basis. Otherwise, developers will simply revert to testing against production again, or be left to test against a fragile, costly staging environment—even if it’s easy to deploy!

    Creating up-to-date test data from your production environments is Tonic’s specialty. Tonic makes it easy to copy your schema, shrink your database, and obfuscate your data for privacy while maintaining statistical relationships and referential integrity throughout your team’s data. What’s more, Tonic alerts you to schema changes to protect against data leaks and ensure that your test data never breaks or gets outdated. Interested? Shoot us an email at hello@tonic.ai.

    Getting test data right is a difficult but immensely valuable problem to solve for your team. Docker is one of many tools that makes it much easier to maintain your test infrastructure in a way that enhances flexibility while keeping server costs and team hours down. Happy building!

    Hat tip to Craig Kerstiens of Crunchy Data and postgresweekly.com, and Kevin Xu of interconnected.blog for weighing in on our process and providing some great feedback as we crafted this post.

    Johnny Goodnow
    Engineering
    I've been working in technology for the past decade in Silicon Valley and San Francisco in a wide variety of roles. My best work is at the intersection of combining my deep technical expertise with my ability to relate to and cultivate a wide variety of audiences. At Tonic, I'm tackling full-stack development, with a focus on multi-database subsetting, to get developers the data they need and advance data privacy across organizations.
