What is Data Subsetting? The Art and Science of Only Using the Data You Need

Yuri Shadunsky
October 17, 2022
What is Data Subsetting? The Art and Science of Only Using the Data You Need
In this article

    Loved our recent blog on data protection strategies? Today, we’ll take a deeper dive into another powerful data management strategy called subsetting. What’s subsetting, you ask? It’s the art and science of creating a coherent slice of data across databases of different types for targeted use cases without breaking referential integrity.

    Specifically, we’ll discuss the Tonic feature called subsetting—and how you can use it to make fit for purpose private databases for all of your stakeholders.

    Heads-up, we will be drawing inspiration from the hit Amazon TV show Rings of Power for example stakeholders. You’ve been warned. 

    Identifying Your Data Needs

    The first step to successful subsetting is identifying the type of data you need to work with. So let’s talk through your different data needs:

    1. Production database

    This is your one database to rule them all (or many databases if you are a microservices shop). It has all of your most sensitive, most important data and needs to be accurate and performant at all times. Let’s call this database Sauron.

    2. Key testing databases

    These are the databases that need large swaths of the data (de-identified, of course!). This is your staging database that gets the last run of tests before sending things to production; your QA database that tests things at scale. And perhaps you have 3 of these types of databases. So, let’s call these databases your Elves.

    3. Customer debugging databases 

    Now, you’re a successful ring maker and you’ve got all of these customers. And to support these customers, you hire 7 customer support engineers (call them dwarves). The customer support engineers are each working on a specific customer and need a de-identified database scoped just to the customer with the bug.

    4. Local test databases 

    As a growing ring-maker, you’ve got a small but mighty team of 9 human engineers. Each of them needs a de-identified database that can fit on their laptop so they can quickly test out their features and identify bugs before sending their code to staging. (Not to worry, these databases won’t turn your engineers into Nazgul.)

    Now, that we have the use cases, let’s take a pause to briefly explain what subsetting is.

    What is Subsetting?

    Subsetting is a feature in Tonic that enables you to create a small, representative slice of your data while maintaining referential integrity. There are two ways in which you can subset that we’ll discuss below: custom where clauses or target percentages. Returning back to our above use cases, here is a walk through for how we might execute each of them within the Tonic subsetter.

    Production / Sauron database

    This database is your secret sauce! There is no subsetting involved here. But keep this data protected at all costs and use it only for key customer use cases.

    Key Testing / Elven databases

    These databases need to have statistical and size representativeness. However, that does not mean you merely want a de-identified production database. You can subset these down a little bit to reduce storage costs. For example, if you have a 150 GB production database, let’s aim to have these databases at 50 GB each, cutting your storage costs substantially. To achieve this, we might aim the target percentage of your primary table (e.g. customers) to 35%. The subsetting tool will then traverse all the appropriate relationships to ensure referential integrity is complete.

    Customer debugging / Dwarven databases

    For these databases, you have two goals: make it as easy as possible to reproduce the bug while keeping storage costs to a minimum. In this case, we’d recommend using where clauses along the lines of:

    • customer = Gondor or customer = the-shire
    • It’s unclear how big a database you will end up with as Gondor might have many more rows than the shire, but that is less important than giving your customer support engineers the exact set of de-identified data they need to debug the problem.

    Local / Human databases

    For these databases, size is the primary concern. You want your developers to be able to use these representative databases on their laptops and quickly catch bugs before sending their code to the Elven databases. Specifically, maybe you want devs to run their local unit tests in less than one minute. The exact size you need will vary based on your unit tests, but for the sake of our example let’s assume that means you need your database to be less than 10 GB. So, here, let’s choose a target percentage. Given our 150 GB starting size, let’s set the target percentage on your primary table to 5%.

    Now, you have all of your databases fit for purpose for their needs! All of your users have all the data they need and nothing more: Sauron is used for your most important production use cases, Elves are used for your last line of testing, Dwarves are used for debugging customer issues and your Human databases are used for developers' local testing. 

    With fit for purpose subsetting configurations, each stakeholder can get data at the frequency they require—and better yet, you schedule the ones that make sense so data is always ready. Additionally, you can leverage a single set of de-identification policies to protect all of them, or customize if security access levels are different (this is made possible with workspace inheritance!).

    Why Should I Use Subsetting?

    A very relevant question, we’re glad you asked. Subsetting improves your software development in the following ways:

    • You find bugs faster — Your developers can run more tests locally with a high quality testing database that is small enough to fit on their laptops.
    • You can share data with offshore developers by sending them small quantities of protected data.
    • You save money by reducing your storage costs — instead of testing on a database that is exactly the same size as production, you can test on a database 1/3rd the size that is exactly the same quality.
    • Higher compliance — less data floating around makes it easy to stay compliant with regulations like GDPR, SOC2, etc.
    • Faster iteration cycle — smaller databases for testing mean your devs can iterate faster and release high quality features sooner.

    These are just a few examples of reasons to implement subsetting!

    In conclusion, there’s one ring to rule them all when it comes to smaller and more efficient databases—and it’s called subsetting with Tonic.

    Want to learn more about how subsetting can give your team forbidden power over all the lands and peoples of your database? Check out eBay’s comprehensive case study on how Tonic transformed their database—or book a demo today to take our platform for a spin yourself. 

    Yuri Shadunsky
    Senior Product Manager
    Technical and customer-centric product leader with 10 years’ experience building cross-functional partnerships to solve user needs.

    Fake your world a better place

    Enable your developers, unblock your data scientists, and respect data privacy as a human right.