
Build vs buy: your guide to finding scalable, efficient test data solutions

Andrew Colombi, PhD
Co-Founder & CTO
May 15, 2025

Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later started Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and to supply development teams with the resource-saving tools they most need.

How to approach the test data challenge

You can’t build quality software without quality test data. With data privacy regulations limiting access to the data developers once relied on, organizations are faced with a choice: generate test data by building a process in-house, or buy a test data solution available on the market.

At Tonic.ai, we don’t believe there’s one right answer; build vs. buy is rarely a trivial question. There are, however, key considerations that serve as waypoints as you navigate the path to a decision. In our work with customers like eBay, JPMorgan Chase, and CVS Health, we’ve gained a deep understanding of the technical complexities, potential speed bumps, and critical requirements stakeholders face in the realm of test data.

In this guide, we’ll share what we’ve learned over the years to inform your build vs. buy decision. At the end, you’ll find a checklist to help you consolidate your thoughts and choose the solution that’s best for you.

Signs that you should buy

Here are some signals that buying a test data solution is your best bet.

1. Regulatory compliance is a must.

The stringent requirements of today’s data privacy regulations are most easily and securely met by built-for-purpose solutions that offer capabilities designed to ensure compliance with laws like HIPAA and GDPR. A solution provider not only brings in-depth regulatory knowledge and certified technologies; they also bring experience from supporting many other organizations in their compliance needs.

2. You have a low tolerance for risk.

Working with a vendor who has expertise in data de-identification significantly reduces the risks involved in sensitive data management. What’s more, internal tools rarely get sufficient investment in risk management features, such as role-based access control and auditing. These features, while less glamorous, are key tools for reducing risk.

3. You’re working with large-scale data and/or multiple data sources.

Modern services-oriented architectures distribute data across multiple services, which often means multiple teams must create test data. Datasets that must work together are easier to create on a common platform that maintains relationships and ensures cross-database consistency. Very large datasets are also harder to work with; having experts on the job can be critical to creating functional data. Lastly, large-scale data can require advanced capabilities like database subsetting, which robust solutions should provide.

4. You need a standardized approach via a consistent solution.

When multiple teams are de-identifying production data, the ability to standardize and set policies within the de-identification solution significantly reduces the risk of misconfiguration. A platform that decentralizes test data generation while keeping processes in line with organization-wide standards ensures that test datasets across multiple systems remain compatible, and it enforces data governance and protection requirements efficiently.

5. Consistent test data is key to data quality.

The ability to de-identify production data consistently across your database, no matter how complex your data types get (e.g., JSON or regex-defined formats), is critical to creating useful test data. Solutions that offer capabilities like format-preserving encryption and the ability to handle identifiers, including primary keys and foreign keys, provide the tools you need to achieve the level of data quality you require.
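
To make consistency concrete, here is a minimal sketch of deterministic pseudonymization in Python. It uses keyed hashing as a stand-in for true format-preserving encryption, and every name in it (the key, the sample tables) is hypothetical:

```python
# Sketch: deterministic pseudonymization, so identical inputs always map
# to identical outputs and primary/foreign key joins survive masking.
# Keyed hashing stands in for format-preserving encryption here; the
# key and sample rows are hypothetical.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-keep-out-of-source-control"  # hypothetical key

def pseudonymize(value: str, field: str) -> str:
    """Return a stable pseudonym for (field, value)."""
    digest = hmac.new(SECRET_KEY, f"{field}:{value}".encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

users = [{"user_id": "u-1001", "email": "jane@example.com"}]
orders = [{"order_id": "o-9", "user_id": "u-1001"}]  # foreign key into users

for row in users:
    row["user_id"] = pseudonymize(row["user_id"], "user_id")
for row in orders:
    row["user_id"] = pseudonymize(row["user_id"], "user_id")

assert users[0]["user_id"] == orders[0]["user_id"]  # the join still works
```

Because the mapping is deterministic, the masked foreign key in orders still points at the masked primary key in users, so joins keep working in test environments.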

6. Your data and schema change regularly.

A fast-moving schema will likely require more frequent updates to your test data, but even a slow-moving schema may need them. Consider, for example, banking software: stale data will skew your testing as financial transactions on accounts get older and older. Frequent data updates require a high-performance data de-identification solution, especially if you’re working with large volumes of data.
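
As a small illustration of keeping data fresh, the sketch below shifts every transaction date forward by the age of the snapshot on each refresh; the snapshot date and field names are hypothetical:

```python
# Sketch: advance transaction dates by the age of the snapshot on each
# refresh so de-identified banking data doesn't go stale. The snapshot
# date and field names are hypothetical.
from datetime import date, timedelta

def shift_dates(transactions: list[dict], snapshot_taken: date) -> list[dict]:
    """Shift every posted_on date forward by the snapshot's age."""
    offset: timedelta = date.today() - snapshot_taken
    return [{**txn, "posted_on": txn["posted_on"] + offset} for txn in transactions]

txns = [{"amount": 42.50, "posted_on": date(2024, 1, 15)}]
fresh = shift_dates(txns, snapshot_taken=date(2024, 2, 1))
# A transaction that was two weeks old when the snapshot was taken stays
# two weeks old relative to today's refresh.
```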

Risks involved in buying

Buying comes with a bevy of benefits, especially when managing large, complex, sensitive data across multiple teams. However, there are some potential stumbling blocks. Here’s what to look out for when buying a solution, along with some tips for mitigating each concern.

  1. Regulated data is subject to strict requirements around data retention, deletion, and storage. Using a third-party solution to process your regulated data adds another environment in which data retention must be managed.
    • Tip
      When vetting a vendor, ask them about their data retention policies, especially with respect to data provided for de-identification purposes. Is it stored on their servers? If so, for how long and for what purpose?
  2. Whatever data regulations apply to your handling of your data must apply to the vendor as well. Your vendor should have the appropriate certifications and security protocols in place to align with your compliance requirements.
    • Tip
      Ask your vendor to provide evidence of their security practices, including their SOC 2 report. If your data is subject to HIPAA, make sure your vendor has the correct controls in place for working with HIPAA data. The option of working with an external expert determinator is a plus.
  3. Your vendor could pivot or go out of business. This can mean that they will end support for the solution you’ve come to rely on.
    • Tip
      Request code escrow with a third party, so that even if the vendor disappears, old versions of the software will remain available to you.
  4. You have less control over the solution. While you can request features, they may not make it onto the product’s roadmap. And when they do make it onto the roadmap, you may need to wait longer than you’d prefer to see them built.
    • Tip
      Take time to clearly define the capabilities you're looking for before researching solutions: create a list of essential features, nice-to-haves, deal-breakers, and niche use cases you may have. Gather data from vendors about customer support, data type coverage, and any native integrations you require. This will help you make a more informed decision from the outset.

Signs that you should build

Here are some signs that building a test data solution in-house may serve you better than purchasing one from a vendor.

1. Your data de-identification needs are simple.

If you’re handling basic types of PII and you know exactly where your sensitive data lives in your database, you may be able to catch it all with a homegrown solution. You can also steer clear of a lot of complexity in the tool you build, so long as you don’t require consistent data de-identification and your test data from different sources doesn’t need to form a coherent dataset across all sources.
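
For a sense of what the simple case looks like, here is a sketch that overwrites known PII columns in a copy of production. The database file, table, and column names are hypothetical, and note that this random replacement is not consistent across runs or data sources:

```python
# Sketch: a homegrown masker for the simple case where you know exactly
# which columns hold PII. The database file, table, and columns are
# hypothetical. The replacement is random, i.e. NOT consistent across
# runs or across data sources.
import random
import sqlite3

FAKE_FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]

conn = sqlite3.connect("test_copy.db")  # a copy of production, never production itself
cur = conn.cursor()

rows = cur.execute("SELECT rowid FROM customers").fetchall()
for (rowid,) in rows:
    cur.execute(
        "UPDATE customers SET first_name = ?, email = ? WHERE rowid = ?",
        (random.choice(FAKE_FIRST_NAMES), f"user{rowid}@example.com", rowid),
    )
conn.commit()
conn.close()
```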

2. Your datasets are small.

This speaks to the complexity considerations in the first item in this list, and also to performance and scalability. Smaller datasets are easier to mimic with simple tooling or internal scripts and faster to spin up without as much need for processing power.

3. The de-identification process needs to integrate with homegrown solutions.

If you have an internal database or CI/CD process that would require you to build extensive customization around a vendor’s solution, you may find that you need to build your own method for data de-identification in order to get something that works with your existing tech stack.
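
As one possible shape for that integration, the sketch below wires a refresh-and-mask step into a CI pipeline, assuming a Postgres snapshot and an in-house masking script; every name here (snapshot.dump, mask.py, the connection URL) is a placeholder:

```python
# Sketch: a refresh step a CI pipeline could run before integration tests.
# Assumes a nightly Postgres snapshot and a homegrown masking script; the
# file names and connection URL below are hypothetical placeholders.
import subprocess

def refresh_ci_database(test_db_url: str) -> None:
    """Restore the latest snapshot, then mask it in place before tests run."""
    # 1. Load the snapshot into the CI test database.
    subprocess.run(
        ["pg_restore", "--clean", "--no-owner", "--dbname", test_db_url, "snapshot.dump"],
        check=True,
    )
    # 2. Run the in-house de-identification step (see the earlier masking sketch).
    subprocess.run(["python", "mask.py", test_db_url], check=True)

if __name__ == "__main__":
    refresh_ci_database("postgresql://ci-user@ci-host/test_db")
```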

4. Your data isn’t changing.

If your schema or production data aren’t expected to evolve much over time, you may get by with infrequent updates to your test data. Consider whether your team needs to refresh their test data a few times a year or a few times a week. If it’s just a few times a year, purchasing a solution may not make sense.

5. Your use cases are niche, requiring per-team solutions.

If each team would be better served with bespoke approaches to de-identification built specifically for their needs, and those approaches don’t need to speak to each other or follow organization-wide standards or policies, having each team build its own solution would make sense.

6. You don't need advanced features like subsetting.

De-identification is just one piece of the test data puzzle. Additional capabilities like subsetting require a comparable investment of resources to build. If the list of features you need is limited, building a solution internally can be more feasible.
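
To give a flavor of what subsetting involves, here is a deliberately naive sketch that preserves referential integrity for a single parent-child relationship; a real subsetter has to walk the entire foreign-key graph. The schema and database files are hypothetical:

```python
# Sketch: naive subsetting that preserves referential integrity for one
# parent-child pair by taking a slice of parents and only the children
# that reference them. A real subsetter must walk the whole foreign-key
# graph. The schema and database files are hypothetical.
import sqlite3

src = sqlite3.connect("masked_full.db")  # already de-identified full copy
dst = sqlite3.connect("subset.db")

dst.executescript("""
CREATE TABLE users  (user_id TEXT PRIMARY KEY, email TEXT);
CREATE TABLE orders (order_id TEXT PRIMARY KEY, user_id TEXT REFERENCES users);
""")

# Take roughly 5% of users, then only their orders, so every FK resolves.
users = src.execute(
    "SELECT user_id, email FROM users WHERE abs(random()) % 20 = 0"
).fetchall()
dst.executemany("INSERT INTO users VALUES (?, ?)", users)

ids = [u[0] for u in users]
if ids:
    marks = ",".join("?" * len(ids))
    orders = src.execute(
        f"SELECT order_id, user_id FROM orders WHERE user_id IN ({marks})", ids
    ).fetchall()
    dst.executemany("INSERT INTO orders VALUES (?, ?)", orders)
dst.commit()
```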

Risks involved in building

If you’ve decided to build your own test data solution, or to leave this to your team to build, be aware of the following risks before you get started.

  1. You don’t know what you don’t know. Data de-identification can be a tricky business, depending on the specific requirements you have. You may be exposing yourself to risks you aren’t even aware of, or you may not reap the benefits of all the approaches that exist.
    • Tip
      Be sure that your teams working on the in-house solution have an in-depth understanding of your data, both in terms of knowing what sensitive data you have and where it lives, and understanding the relationships that exist throughout your database. Knowledge of key de-identification techniques, like format-preserving encryption, is also fundamental.
  2. There will be things you want to do that you won’t have time to do. Take, for example, access controls. Careful access control and auditing are important ways to reduce risk, but teams rarely have the capacity to invest deeply in these capabilities for internal tools.
    • Tip
      If you are unable to put access controls in place, limit the number of people involved in building and managing your data de-identification processes, to minimize data exposure.
  3. Be prepared for defect risk. Internal tools tend to have fewer quality controls than production software. A defect in your de-identification tool may mean sensitive data is exposed inadvertently.
    • Tip
      Incorporate a method to validate your de-identification processes by regularly reviewing your test data for the presence of production data (see the sketch after this list).
  4. Expect incompatible datasets. Running multiple services on top of test data created by individual teams could create problems for testing. Additionally, schema changes can quickly escalate conflicts between test datasets.
    • Tip
      Implement standards for test data generation and regularly validate different teams’ datasets against each other. Determine the cadence with which you expect your schema to change and allocate the necessary resources to refresh your test data accordingly.
  5. Building requires dedicated resources, time, overhead, and maintenance. When your schema evolves quickly, a custom solution will likely carry high ongoing maintenance costs. The resources you’ll need to devote to your in-house solution will eat into your developers’ time and pull focus from core business goals.
    • Tip
      Anticipate these needs by budgeting the staffing and resources required to support in-house development of your de-identification solution, and expect to maintain those allocations over time.
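
Here is a minimal sketch of the validation check mentioned in the third tip above: it samples sensitive values from production and asserts that none appear verbatim in the test copy. The database files, table, and column names are hypothetical:

```python
# Sketch: a leak check that samples sensitive values from production and
# asserts none of them survive in the de-identified test copy. Database
# files, table, and column names are hypothetical.
import sqlite3

def assert_no_leaks(prod_db: str, test_db: str, table: str, column: str,
                    sample_size: int = 1000) -> None:
    prod = sqlite3.connect(prod_db)
    test = sqlite3.connect(test_db)
    sampled = prod.execute(
        f"SELECT DISTINCT {column} FROM {table} ORDER BY random() LIMIT ?",
        (sample_size,),
    ).fetchall()
    test_values = {row[0] for row in test.execute(f"SELECT {column} FROM {table}")}
    leaked = [value for (value,) in sampled if value in test_values]
    assert not leaked, f"{len(leaked)} production values leaked into {table}.{column}"

assert_no_leaks("prod_replica.db", "test_copy.db", "customers", "email")
```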

Build vs Buy checklist

Use the following checklist to determine whether building or buying a test data solution is the best path forward for your organization.

  1. Developer productivity. Tonic.ai's guidance: Buy. Rapid access to quality test data is fundamental to efficient developer workflows. By equipping developers with a robust test data solution, instead of requiring them to build and manage their own in-house tool, organizations accelerate development and ship better products faster.
  2. Data freshness and schema changes. Tonic.ai's guidance: Buy. Buy a solution not only for reliable data utility, but for peace of mind and the ability to seamlessly manage schema changes and refresh your test data on demand.
  3. Customization. Tonic.ai's guidance: Build. If extensive customization is key to your success, you may not find what you need in an existing product.
  4. Scalability. Tonic.ai's guidance: Buy. The rate at which data is growing today requires scalable, performant test data solutions to future-proof your workflows and keep your company’s growth on track with the growth of your data.
  5. Tech stack integration. Tonic.ai's guidance: It depends. Your tech stack will dictate the best fit for your needs.
  6. Compliance and risk. Tonic.ai's guidance: Buy. Sensitive, regulated data requires informed approaches and comprehensive solutions to minimize risk.

How Tonic.ai can help

Tonic.ai offers comprehensive solutions for test data generation and provisioning, including Tonic Structural, the test data management platform built for developers, and Tonic Ephemeral, a platform for spinning up temporary test databases on demand. These solutions equip you to:

  • Automate test data generation with industry-leading data de-identification, synthesis, and subsetting that maintains referential integrity across your databases
  • Set data governance standards and ensure policy enforcement with custom sensitivity detection and generator presets
  • Leverage RBAC, SSO, privacy reports, audit trails, and regulatory-specific data generators to ensure compliance and security
  • Integrate quality test data seamlessly into your workflows with native data source connectors and a full API
  • Keep data fresh and up-to-date with automated schema change notifications and nightly data generation

Test data shouldn't be a burden or a bottleneck to productivity. Quality data, efficiently provisioned, fuels the product innovation that accelerates time-to-market and ensures consistent quality. Tonic.ai exists to enable your developers and your company’s success. What will you build with your data? Connect with our team or visit our product docs to explore the possibilities.
