How to approach the test data challenge
You can’t build quality software without quality test data. With data privacy regulations limiting access to the data developers once relied on, organizations are faced with a choice: generate test data by building a process in-house, or buy a test data solution available on the market.
At Tonic.ai, we don’t believe there’s one right answer; build vs. buy is rarely a trivial question. There are, however, key considerations that serve as waypoints as you navigate the path to a decision. In our work with customers like eBay, JPMorganChase, and CVS Health, we’ve gained a deep understanding of the technical complexities, potential speedbumps, and critical requirements stakeholders face in the realm of test data.
In this guide, we’ll share what we’ve learned over the years to inform your build vs. buy decision. At the end, you’ll find a checklist to help you consolidate your thoughts and choose the solution that’s best for you.
Signs that you should buy
Here are some signals that buying a test data solution is your best bet.
1. Regulatory compliance is a must.
The stringent requirements of today’s data privacy regulations are most easily and securely met by built-for-purpose solutions with capabilities designed to ensure compliance with laws like HIPAA and GDPR. A solution provider brings not only in-depth regulatory knowledge and certified technologies, but also the experience of having supported many other organizations with their compliance needs.
2. You have a low tolerance for risk.
Working with a vendor who has expertise in data de-identification significantly reduces the risks involved in managing sensitive data. What’s more, internal tools rarely get sufficient investment in risk management features such as role-based access control and auditing. These features may be less interesting to build, but they are key tools for reducing risk.
3. You’re working with large-scale data and/or multiple data sources.
Modern service-oriented architectures distribute data across multiple services, which often means multiple teams must create test data. Datasets that must work together are easier to create on a common platform that maintains relationships and ensures cross-database consistency. Very large datasets are also harder to work with; having experts on the job can be critical to creating functional data. Lastly, large-scale data can require advanced capabilities like database subsetting, which robust solutions should provide.
4. You need a standardized approach via a consistent solution.
When multiple teams are de-identifying production data, the ability to standardize and set policies within the de-identification solution significantly reduces the risk of misconfiguration. A platform that allows for decentralizing test data generation while still keeping processes in line with organization-wide standards ensures test datasets across multiple systems are compatible and enforces data governance and protection requirements with efficiency.
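To make the idea of a standardized policy concrete, here’s a hedged sketch: a shared mapping from column-name patterns to masking strategies that every team checks its schema against before generating test data. The patterns and strategy names are invented for the example, not drawn from any particular product.

```python
# A minimal sketch of an organization-wide de-identification policy,
# expressed as regexes on column names mapped to masking strategies.
# All names here are illustrative assumptions.
import re

POLICY = {
    r"(^|_)email$": "consistent_email",
    r"(^|_)ssn$": "redact",
    r"name$": "fake_name",
    r"(^|_)dob$": "date_shift",
}

def resolve_strategy(column: str) -> str | None:
    """Return the mandated masking strategy for a column, or None if no rule matches."""
    for pattern, strategy in POLICY.items():
        if re.search(pattern, column, re.IGNORECASE):
            return strategy
    return None

# Each team validates its schema against the shared policy before
# generating data; unmatched columns are flagged for review.
schema = ["id", "email", "billing_name", "created_at"]
for col in schema:
    print(col, "->", resolve_strategy(col) or "no rule (review required)")
```

Centralizing the rules this way means a misconfiguration is a policy change that everyone sees, rather than a silent divergence in one team’s scripts.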
5. Consistent test data is key to data quality.
The ability to de-identify production data consistently across your database, no matter how complex your data types get (e.g., JSON or regex-defined formats), is critical to creating useful test data. Solutions that offer capabilities like format-preserving encryption and consistent handling of identifiers, including primary keys and foreign keys, provide the tools you need to achieve the level of data quality you require.
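To show what the consistency requirement means in practice, here’s a minimal sketch of deterministic pseudonymization using only the Python standard library. The table and column names are hypothetical, and real format-preserving encryption is more sophisticated, but the core property is the same: equal inputs always produce equal outputs, so joins across tables (or databases) still line up.

```python
# A minimal sketch of deterministic de-identification. Equal inputs
# always yield equal outputs, preserving primary/foreign key joins.
import hmac
import hashlib

SECRET = b"rotate-me-outside-source-control"  # assumption: a per-project secret

def pseudonymize_int(value: int, digits: int = 9) -> int:
    """Map an integer ID to a stable pseudonymous integer of fixed width."""
    digest = hmac.new(SECRET, str(value).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % (10 ** digits)

# customers.id and orders.customer_id mask to the same values,
# so the foreign-key relationship survives de-identification.
customers = [{"id": 42, "name": "Ada"}, {"id": 7, "name": "Grace"}]
orders = [{"order_id": 1, "customer_id": 42}, {"order_id": 2, "customer_id": 7}]

masked_customers = [{**c, "id": pseudonymize_int(c["id"])} for c in customers]
masked_orders = [{**o, "customer_id": pseudonymize_int(o["customer_id"])} for o in orders]

assert {c["id"] for c in masked_customers} == {o["customer_id"] for o in masked_orders}
```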
6. Your data and schema change regularly.
A quickly evolving schema will likely need more frequent updates to the test data, but even slowly moving schemas may need frequent updates. Consider, for example, banking software: stale data will skew your testing as the financial transactions on accounts get older and older. Frequent data updates require a high-performance data de-identification solution, especially if you’re working with large volumes of data.
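Here’s a minimal sketch of the stale-data problem from the banking example, assuming a hypothetical posted_on field: on each refresh, every date is shifted forward by the time elapsed since the snapshot was taken, so transaction ages stay realistic while their relative spacing is preserved.

```python
# A minimal sketch of keeping time-sensitive test data fresh by
# shifting dates on each refresh. Field names are hypothetical.
from datetime import date

transactions = [
    {"id": 1, "posted_on": date(2023, 1, 15)},
    {"id": 2, "posted_on": date(2023, 3, 2)},
]

def refresh_dates(rows, as_of: date, snapshot_taken: date):
    """Shift every date forward by the time elapsed since the snapshot,
    preserving the relative spacing between transactions."""
    delta = as_of - snapshot_taken
    return [{**r, "posted_on": r["posted_on"] + delta} for r in rows]

fresh = refresh_dates(transactions, as_of=date.today(), snapshot_taken=date(2023, 6, 1))
print(fresh)
```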
Risks involved in buying
Buying comes with a bevy of benefits, especially when managing large, complex, sensitive data across multiple teams. However, there are some potential stumbling blocks. Here’s what to look out for when buying a solution, along with some tips for mitigating each concern.
Signs that you should build
Here are some reasons why you might opt to build a test data solution rather than purchase one from a vendor.
1. Your data de-identification needs are simple.
If you’re handling basic types of PII and you know exactly where your sensitive data lives in your database, you may be able to catch it all with a home-grown solution. You can also steer clear of a lot of complexity in the tool you build, so long as you don’t require consistent de-identification and your test data from different sources doesn’t need to form a coherent dataset.
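As a rough illustration of how small such a home-grown tool can be, here’s a sketch that masks two known PII columns in a CSV export. The input data and mask formats are hypothetical stand-ins.

```python
# A minimal home-grown approach, assuming you know exactly which
# columns hold PII. The input stands in for an export of a users table.
import csv
import io

SRC = io.StringIO("id,email,phone\n1,ada@corp.com,555-9823\n2,grace@corp.com,555-1147\n")

# Column name -> function producing a masked value for row i.
MASKS = {
    "email": lambda i: f"user{i}@example.com",
    "phone": lambda i: f"555-01{i:02d}",
}

reader = csv.DictReader(SRC)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
for i, row in enumerate(reader):
    for col, mask in MASKS.items():
        if col in row:
            row[col] = mask(i)
    writer.writerow(row)

print(out.getvalue())
```

Note that this only works because the sensitive columns are known up front; the moment discovery, consistency, or cross-source coherence enter the picture, the tool stops being simple.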
2. Your datasets are small.
This speaks to the complexity considerations in the first item in this list, and also to performance and scalability. Smaller datasets are easier to mimic with simple tooling or internal scripts and faster to spin up without as much need for processing power.
3. The de-identification process needs to integrate with homegrown solutions.
If you have an internal database or CI/CD process that would require you to build extensive customization around a vendor’s solution, you may need to build your own de-identification method to get something that works with your existing tech stack.
4. Your data isn’t changing.
If your schema or production data aren’t expected to evolve much over time, you may get by with infrequent updates to your test data. Consider whether your team needs to refresh their test data a few times a year or a few times a week. If it’s just a few times a year, purchasing a solution may not make sense.
5. Your use cases are niche, requiring per-team solutions.
If each team would be better served with bespoke approaches to de-identification built specifically for their needs, and those approaches don’t need to speak to each other or follow organization-wide standards or policies, having each team build its own solution would make sense.
6. You don't need advanced features like subsetting.
De-identification is just one piece of the test data puzzle. Additional capabilities like subsetting require a comparable investment of resources to build. If the list of features you need is limited, building a solution internally can be more feasible.
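For a sense of what subsetting involves, here’s a toy two-table sketch using SQLite: sample a slice of parent rows, then pull only the child rows that reference them, so the subset keeps referential integrity. Production-grade subsetters must walk arbitrary foreign-key graphs, which is where the comparable investment comes in.

```python
# A minimal sketch of database subsetting: sample parent rows, then
# cascade the selection through a foreign key. Schema is hypothetical.
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
    INSERT INTO users VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d');
    INSERT INTO orders VALUES (10,1),(11,1),(12,3),(13,4);
""")

# Target set: an arbitrary 50% sample of users (real tools support
# richer targeting, such as WHERE clauses or percentage goals).
user_ids = [r[0] for r in src.execute("SELECT id FROM users ORDER BY random() LIMIT 2")]

placeholders = ",".join("?" * len(user_ids))
subset_users = src.execute(
    f"SELECT * FROM users WHERE id IN ({placeholders})", user_ids).fetchall()
subset_orders = src.execute(
    f"SELECT * FROM orders WHERE user_id IN ({placeholders})", user_ids).fetchall()

# Every order in the subset points at a user that is also in the subset.
assert {o[1] for o in subset_orders} <= {u[0] for u in subset_users}
```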
Risks involved in building
If you’ve decided to build your own test data solution, or to leave this to your team to build, be aware of the following risks before you get started.
Build vs. buy checklist
Use the following checklist to determine whether building or buying a test data solution is the best path forward for your organization.
How Tonic.ai can help
Tonic.ai offers comprehensive solutions for test data generation and provisioning, including Tonic Structural, the test data management platform built for developers, and Tonic Ephemeral, a platform for spinning up temporary test databases on demand. These solutions equip you to:
- Automate test data generation with industry-leading data de-identification, synthesis, and subsetting that maintain referential integrity across your databases
- Set data governance standards and ensure policy enforcement with custom sensitivity detection and generator presets
- Leverage RBAC, SSO, privacy reports, audit trails, and regulatory-specific data generators to ensure compliance and security
- Integrate quality test data seamlessly into your workflows with native data source connectors and a full API
- Keep data fresh and up-to-date with automated schema change notifications and nightly data generation
Test data shouldn’t be a burden or a bottleneck to productivity. Quality data, efficiently provisioned, fuels product innovation, accelerating time-to-market and ensuring consistent quality. Tonic.ai exists to enable your developers and your company’s success. What will you build with your data? Connect with our team or visit our product docs to explore the possibilities.