The Value of Database Subsetting

Yuri Shadunsky
March 28, 2023

    Subsetting is the unsung hero of modern test data workflows. When you need to de-identify production data for targeted debugging or for development in local dev environments, you rarely need your entire production database, and using all of it is often inefficient. What you need is a subset, and grabbing that in a referentially intact way across the tables of your database is no easy task. That is why it’s been a focus of ours since Tonic’s earliest days.

    Over the years, we’ve pioneered subsetting with our industry-leading, patented approach that countless customers rely on, from eBay to Kin Insurance, to get their developers the tailored test data they need. Today, we’re excited to spotlight several recent innovations in our subsetter that are making this powerful solution even more impactful.

    But before we jump into those features, let’s talk about why you should care: what are the use cases and what value does subsetting deliver for each?

    Real-World Use = Real-World Value

    Tonic’s subsetter allows you to create a realistic slice of data that only has the data you need, and nothing else. Here’s why you might need that:

    • Local developer environments: Shape and size your test datasets to specifically support the areas of your product each of your developers is working on. Subsets give your developers more manageable test datasets that run more effectively in their local environments. The value: optimized developer productivity, accelerated testing cycles, decreased storage and compute costs.
    • Targeted debugging: Got a customer with a problem? Or maybe a specific area of your product that is struggling with a segment of your customers? Home in on the data behind the issue so you can rapidly reproduce what’s wrong and push out the fix your customers need. The value: faster debugging, improved quality, faster time-to-value, happier customers.
    • Safe-to-share datasets: Whether for off-shore teams or for partners and customers that need access to only certain segments of your database, subsetting makes it easy to add a further layer of protection in granting data access. You can combine subsetting and de-identification to both minimize and protect your shared data footprint, or simply create subsetted datasets on a per partner/customer basis to include only the data they should see. The value: decreased risk, increased efficiency, optimized productivity.
    • Data minimization: Big data comes with big risk. The more you can minimize your footprint, even when de-identifying your data, the lower your risk becomes. Subsetting is a secure, cost-effective way to ensure that you’re only provisioning as much data as your developers need to get their work done well. The value: decreased risk, decreased storage costs, increased data security.

    Essentially, subsetting your test data allows you to work faster and more effectively in delivering quality software, while saving your team money and reducing your risk. To dive deeper into what subsetting might look like at your organization (or in the land of Middle-earth), check out this blog post written by yours truly. We dropped this link earlier, too, but in case you missed it: eBay shrank their test database by 1000x using the Tonic subsetter.

    Continuous Innovation

    Now that you’re up to speed on subsetting, let’s explore those new features we recently released that make Tonic’s subsetter even more impactful (and, dare we say, delightful!):

    • Graph View for more control, visibility, and efficiency: Graph View is the jewel in our visualization crown. This new UI within the platform makes it even easier for you to rapidly get the exact subset of data you need. It equips you with a graphical representation of your database, its relationships, and the data that your current subset and upcoming subset configuration are capturing. As you change your subsetting configuration, you’ll see in real time which tables will be included in or excluded from your next subset, and why. Customers have told us that it’s helping them better understand their own database structures even before they get started with subsetting. With Graph View, you become the expert on the shape and structure of your database. AKA: you get to play DBA. 😏
    • Filtering optional data: Our subsetter’s algorithm intelligently pulls the records you need to maintain referential integrity, along with optional records aimed at ensuring that your subset maintains the look and feel of your full dataset. Sometimes, you might want to prune some of those optional records in the interest of slimming down the size of your subset. To meet this need, we give you the ability to remove some of those records from your subset through filtering.
    • Table filtering for data warehouses and Spark: When it comes to subsetting, data warehouse users don’t have quite the same technical needs as users of traditional relational DBs (e.g. referential integrity isn’t an issue when all your data lives in one table). But the core value that subsetting offers still applies. Data warehouse users can now reap those same benefits by way of table filtering in Tonic: a simple WHERE clause is all it takes to minimize data stored in BigQuery, Redshift and Snowflake, or connected to Tonic via Spark/Databricks. Use filtering to get only the data you need in your destination database, so you can speed up your generations, run your tests faster, and maximize your data security.
    • Workspace inheritance: In case you’re wondering how subsetting fits within the rest of Tonic’s platform, and how you can maximize its use at enterprise scale, we’ve got news here, too. Our workspace inheritance functionality is aimed directly at helping you optimize your subsetting configurations. It streamlines your test data workflows by enabling you to de-identify your data once, and then create child workspaces that inherit all of those de-identification settings and from which you can easily generate subsets specific to different teams. De-identify once; subset as often and as uniquely as you please.
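
    To make two of the mechanics in this list concrete, here is a minimal Python sketch: filter a target table with a WHERE clause, then pull in the parent records those rows reference so the subset stays referentially intact. This is a toy illustration under stated assumptions, not Tonic’s patented algorithm. The customers/orders schema, the build_source and subset_orders helpers, and the sample rows are all hypothetical, and the example runs against an in-memory SQLite database from Python’s standard library.

```python
import sqlite3


def build_source() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for a production database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            total REAL
        );
        INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
        INSERT INTO orders VALUES
            (10, 1, 25.0), (11, 2, 990.0), (12, 3, 1200.0), (13, 2, 15.0);
        """
    )
    return conn


def subset_orders(conn, where_sql, params=()):
    """Return (orders, customers): the filtered orders plus the customers they reference."""
    # Step 1: the target filter, a plain WHERE clause on the table of interest.
    # where_sql is interpolated directly, so it must come from a trusted source.
    orders = conn.execute(
        f"SELECT id, customer_id, total FROM orders WHERE {where_sql} ORDER BY id",
        params,
    ).fetchall()

    # Step 2: follow the foreign key upstream so that no order in the subset
    # points at a customer that is missing from the subset.
    parent_ids = sorted({row[1] for row in orders})
    if not parent_ids:
        return orders, []
    placeholders = ",".join("?" * len(parent_ids))
    customers = conn.execute(
        f"SELECT id, region FROM customers WHERE id IN ({placeholders}) ORDER BY id",
        parent_ids,
    ).fetchall()
    return orders, customers


source = build_source()
orders, customers = subset_orders(source, "total > ?", (500,))
print(orders)     # the two orders with a total above 500
print(customers)  # only the customers those two orders belong to
```

    With the filter total > ? and the parameter 500, the sketch keeps orders 11 and 12 along with customers 2 and 3. Customer 1 is left out because nothing in the subset references it. A real schema would require repeating the second step across every foreign key relationship, which hints at why doing this well across a whole database is hard.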

    The Subsetter You Deserve

    Subsetting provides developers with safe, realistic test data that is manageable on a laptop. Local development saves time previously spent navigating VPN access and waiting to access data. It also reduces storage costs by removing the need to run parallel full replica environments for development and debugging purposes. Targeted subsets make it more efficient to reproduce defects. And with a streamlined development process, developers can easily build and test features locally, on realistic data, allowing them to work on their own “branch” of data. The end result: faster time to value at higher quality.

    We built our subsetter, and continue to innovate on it, to be a state-of-the-art solution for the world’s leading brands and largest development teams. We know what it takes to shrink some of the most massive sensitive datasets down to mere gigabytes of actionable data. And we’d love to hear what you’d like to see subsetting do next. Reach out to us by way of our contact form, or book some time directly with our team.

    Yuri Shadunsky
    Senior Product Manager
    Technical and customer-centric product leader with 10 years’ experience building cross-functional partnerships to solve user needs.
