Integrating Tonic Structural into your data refresh and CI/CD pipelines

Author

Nick Rios

March 22, 2024

In today's fast-paced development environments, it is paramount to ensure data consistency and integrity, as well as comprehensive security, across all stages of development, testing, and production. Tonic provides a robust platform, purpose-built for this landscape, offering data de-identification, subsetting, and ephemeral environments to generate realistic and on-demand datasets that are privacy-compliant. This article explores some examples of how customers can leverage this powerful toolset to enhance their workflows.

We’ll cover how to enhance your CI/CD pipelines by:

Automating test data refreshes
Triggering or automating data de-identification
Spinning up and down data environments on demand

What is a CI/CD pipeline?

In programming, a CI/CD pipeline is “a framework that emphasizes iterative, reliable code delivery processes for agile DevOps teams.” This workflow may involve steps like continuous integration, testing, automation, and other stages in a software development life cycle.

For all of these to work, your lower environments must faithfully replicate production. From QA to security to your development sandboxes, you need environments that look, feel, and behave like production.

What happens when your CI/CD stages don’t replicate production accurately? Bugs slip past testing, security scans aren’t complete, and developers build features that never receive the level of scrutiny necessary in order to qualify their code for production release.

Shipping code without proper QA is like a car company selling a car that never received a crash test. That raises the question: what is the crash test dummy equivalent for your CI/CD pipeline? Your test data.

Sourcing test data for CI/CD workflows

Accurate, efficient validation of code changes depends on the test data behind it. But in recent years, quality data has become harder to come by. Data privacy regulations increasingly restrict developer access to production data, and production data itself has grown more complex: sprawling, siloed, messy, and unmanageable for use in lower environments.

When production data isn’t accessible to or usable by developers, test data must be created in some way to keep testing workflows running smoothly. Here are several common approaches to sourcing or generating test data:

Dummy data generation

Technique: Use scripts or specialized tools to generate data from scratch as needed, creating structured but entirely fictional datasets.
Considerations: This approach allows for testing in scenarios where no real data patterns are required, reducing the complexity of creating the test data and eliminating the need for real data. However, that reduced complexity comes at a cost, especially if your resulting test data doesn’t accurately reflect the realities of production. Lacking the nuances of real data, dummy data may not effectively simulate real-world data interactions and edge cases, which can lead to costly oversights in testing and bugs in production.
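As a rough illustration of script-based generation, the sketch below builds fictional user rows using only the Python standard library. The function name, column names, and name lists are hypothetical, chosen for the example rather than taken from any Tonic product.

```python
import random
import string


def make_dummy_users(count, seed=0):
    """Return `count` fictional user rows; the same seed yields the same rows."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    first_names = ["Ada", "Grace", "Linus", "Margaret", "Alan"]
    last_names = ["Lovelace", "Hopper", "Torvalds", "Hamilton", "Turing"]
    rows = []
    for user_id in range(1, count + 1):
        first = rng.choice(first_names)
        last = rng.choice(last_names)
        suffix = "".join(rng.choices(string.digits, k=4))
        rows.append({
            "id": user_id,
            "name": f"{first} {last}",
            # example.com is a reserved domain, so no real mailbox is hit
            "email": f"{first}.{last}{suffix}@example.com".lower(),
            "age": rng.randint(18, 90),
        })
    return rows
```

Note how every value is drawn uniformly at random: nothing here reflects the skew, gaps, or odd formats found in production, which is exactly the limitation described above.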

Data cloning

Technique: Clone the entire database or critical parts of it for testing purposes.
Considerations: This approach can be resource-intensive in terms of both storage and management but provides a high-fidelity environment for testing complex data interactions. It is useful for performance testing under production-like conditions. But, of course, it is not an approach that can be taken when sensitive, regulated data like PII is involved.

Data de-identification or masking

Technique: Use data masking techniques to anonymize sensitive data. This process involves transforming personal identifiers and other sensitive information into anonymized counterparts. Many different approaches can be used to de-identify data, ranging from less secure to fully protected and impossible to reverse engineer.
Considerations: This protects data privacy and allows for the use of realistic data scenarios in testing without the risk of data breaches. When performed effectively, data de-identification ensures compliance while also maintaining data integrity and its utility in development and testing. Depending on the approach taken and the scale of the data involved, it can be a complex process.
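One common masking technique, shown here only as a minimal sketch and not as Tonic Structural's implementation, is keyed pseudonymization: each value is replaced with an HMAC-derived token, so the same input always maps to the same output without the original being recoverable from the token alone. The function names and the `email` column are assumptions for the example.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Map a value to a stable token; the same input and key give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; keep more characters to lower collision risk


def mask_email(email, secret_key):
    """Replace an email address with a token at a reserved example domain."""
    return f"user_{pseudonymize(email.lower(), secret_key)}@example.com"


def mask_rows(rows, secret_key, email_column="email"):
    """Return copies of the rows with the email column masked; inputs are not modified."""
    return [
        {**row, email_column: mask_email(row[email_column], secret_key)}
        for row in rows
    ]
```

Because the mapping is consistent, a masked email that appears in two tables still joins correctly. The secret key (a `bytes` value) has to be protected like any other credential, since anyone holding it can recompute tokens for guessed inputs.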

Data subsetting

Technique: Extract a subset of real data from production environments that is representative of the full dataset but smaller in size.
Considerations: Subsetting reduces the volume of data used in testing, speeding up the test runs while maintaining enough variety for effective validation. It can also be used to craft targeted datasets for software debugging. Note that on its own, it does not provide protection for sensitive data beyond reducing the data footprint involved, but when paired with data de-identification, it can significantly reduce the risk of sensitive data leakage.
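To make the idea concrete, here is a simplified sketch of subsetting two related tables held in memory. It assumes a hypothetical `users` table and an `orders` table whose `user_id` column references it; a real subsetter has to follow many such relationships across a whole schema.

```python
def subset_with_references(users, orders, user_ids):
    """Keep the chosen users and only the orders that point at them."""
    wanted = set(user_ids)
    kept_users = [u for u in users if u["id"] in wanted]
    # Filter on the users actually kept, so no order references a missing user
    kept_ids = {u["id"] for u in kept_users}
    kept_orders = [o for o in orders if o["user_id"] in kept_ids]
    return kept_users, kept_orders
```

Filtering child rows by the parent rows that survived is what keeps foreign keys valid in the smaller dataset.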

Data synthesis

Technique: Generate synthetic data that mimics the statistical properties of real data, without exposing sensitive information.
Considerations: This approach ensures data privacy compliance and can provide scalable, diverse datasets for thorough testing. Given the high degree of fidelity that can be achieved when synthesizing data, it is ideal for generating statistical data, as well as unique identifiers that must be used to map and maintain relationships across a test database. On the flip side, it can be overkill for certain types of test data that need to behave the same as production, but don’t need to represent statistics or relationships at the same time. In those cases, other methods of data de-identification can suffice.
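The simplest form of synthesis fits a distribution to a column and samples new values from it. The sketch below does this for one numeric and one categorical column, under the strong assumption that the numeric values are roughly normal; production-grade synthesis models far richer structure, including relationships between columns. The function names are hypothetical.

```python
import random
import statistics
from collections import Counter


def fit_and_sample_numeric(values, count, seed=0):
    """Sample new numbers from a normal distribution fitted to `values`."""
    rng = random.Random(seed)
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # needs at least two values
    return [rng.gauss(mean, stdev) for _ in range(count)]


def fit_and_sample_categorical(values, count, seed=0):
    """Sample new labels with the same relative frequencies as `values`."""
    rng = random.Random(seed)
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=count)
```

The sampled values share the mean, spread, and category mix of the source column, but no individual source record is copied into the output.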

Of the above approaches, a number of them can be leveraged simultaneously by implementing test data management (TDM) solutions. TDM products like those offered by Tonic.ai combine automated data provisioning, masking, subsetting, and synthesis to manage the full lifecycle of your test data, ensuring that fresh and relevant data is available for each test cycle.

To better understand how test data can be automated within your workflows, let’s explore several processes available when generating and provisioning test data with Tonic Structural and Tonic Ephemeral.

Automated data refresh for development environments

Imagine a scenario where your development team needs fresh, anonymized data every morning. Using Tonic, and the template scripts from the tonic-workspace-automations repository shared below, you can set up a scheduled job (e.g., via cron or CI/CD tools like Jenkins, GitLab CI, or GitHub Actions) that:

Pulls the latest production data into a staging database.
Triggers Tonic Structural’s data anonymization process via the Tonic API.
Updates the development database with the newly anonymized data.

This ensures that your development environment regularly reflects current production trends without exposing sensitive information.
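The three steps above could be wired together with a small driver like the following sketch. Each step is passed in as a callable, so the driver itself makes no assumptions about your databases or about the Tonic API; the function and step names are hypothetical.

```python
def run_refresh(pull_production, anonymize, load_dev, log=print):
    """Run the three refresh steps in order; stop at the first failure."""
    steps = [
        ("pull production data into staging", pull_production),
        ("anonymize staging data", anonymize),
        ("update development database", load_dev),
    ]
    completed = []
    for name, step in steps:
        log(f"starting: {name}")
        step()  # any exception aborts the run before later steps execute
        completed.append(name)
    return completed
```

Stopping at the first failure matters here: if anonymization fails, the development database must not be loaded from the un-anonymized staging copy. Scheduling is left to your tool of choice, for example a cron entry such as `0 6 * * *` to run a script that calls this driver at 06:00 each day.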

Furthermore, capabilities like our patented subsetter or Tonic Ephemeral can drastically reduce infrastructure costs and management headaches, keeping developer enablement at the heart of your strategy without cutting corners with Infosec.

Tonic Structural’s subsetter is a uniquely engineered and patented technology for optimizing test data management by reducing databases to manageable sizes without losing referential integrity, thus enhancing developer productivity, reducing storage costs, and accelerating development timelines.
Tonic Ephemeral is designed specifically for the rapid deployment and management of ephemeral data environments, allowing the instant spin-up and down of fully hydrated test and development databases. This streamlines the development process, significantly reduces costs, and eliminates lengthy ticketing processes, making it an ideal solution for modern development needs.

Integrating Tonic into CI/CD pipelines for automated testing

For teams practicing Continuous Integration, maintaining up-to-date test data is crucial. By integrating Tonic into your CI pipeline, you can:

Trigger a data anonymization process in Tonic Structural as part of your build pipeline.
Once anonymization is complete, automatically fetch the data and populate your test databases.
Run your automated test suites against this data to validate changes with realistic datasets.

This approach reduces manual intervention, enhances testing accuracy, and ensures privacy compliance.
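A build step that triggers anonymization and waits for it to finish might look like the sketch below. Treat every API detail in it as an assumption to verify against the Tonic Structural API reference for your version: the `Apikey` authorization scheme, the `/api/GenerateData/start` and `/api/GenerateData/jobs/{id}` paths, the `id` and `status` response fields, and the status strings. The function names are hypothetical.

```python
import json
import time
import urllib.request


def make_api(base_url, api_key):
    """Build a callable that sends one authenticated request and returns parsed JSON."""
    def api(method, path):
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            method=method,
            # Assumed header format; confirm against your Structural API docs.
            headers={"Authorization": f"Apikey {api_key}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    return api


def run_generation(api, workspace_id, timeout_seconds=3600, poll_seconds=15,
                   sleep=time.sleep, clock=time.monotonic):
    """Start a generation job and block until it completes, fails, or times out."""
    # Endpoint paths, field names, and status values are assumptions for illustration.
    job = api("POST", f"/api/GenerateData/start?workspaceId={workspace_id}")
    job_id = job["id"]
    deadline = clock() + timeout_seconds
    while True:
        status = api("GET", f"/api/GenerateData/jobs/{job_id}")["status"]
        if status == "Completed":
            return job_id
        if status in ("Failed", "Canceled"):
            raise RuntimeError(f"job {job_id} ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

In a pipeline, you would call `run_generation(make_api(url, key), workspace_id)` before the test stage. An uncaught exception exits the script with a non-zero status, which most CI systems treat as a failed step, so tests never run against stale or partial data.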

Dynamic environment setup for feature branches

In a GitFlow-based workflow, developers often need separate environments for new features. Leveraging tonic-workspace-automations, you can create a pipeline that:

Detects new feature branches automatically.
Creates isolated, anonymized datasets specific to each feature.
Tears down or refreshes data when branches are merged or deleted.

This enables a scalable, data-driven approach to testing new features in isolation, improving code quality and reducing cross-feature conflicts.
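The bookkeeping behind such a pipeline can be sketched as a comparison between the feature branches that currently exist and the datasets already provisioned. The naming scheme and function names below are hypothetical; how you list branches and create or drop datasets depends on your Git host and tooling.

```python
import re


def dataset_name(branch, prefix="ds"):
    """Turn a branch name into a lowercase identifier safe for database names."""
    slug = re.sub(r"[^a-z0-9]+", "_", branch.lower()).strip("_")
    # 63 characters is PostgreSQL's default identifier limit; adjust for your database
    return f"{prefix}_{slug}"[:63]


def plan_environments(feature_branches, existing_datasets, prefix="ds"):
    """Return (datasets to create, datasets to tear down), each sorted by name."""
    wanted = {dataset_name(branch, prefix) for branch in feature_branches}
    existing = set(existing_datasets)
    return sorted(wanted - existing), sorted(existing - wanted)
```

Running this plan on every push or branch deletion creates datasets only for new branches and removes those whose branch has been merged or deleted.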

Once again, Tonic’s subsetter can play a pivotal role here in minimizing data size while maintaining utility. In tandem, Tonic Ephemeral can be leveraged to simplify the deployment strategy for these datasets.

Automating data anonymization with Tonic.ai

The tonic-workspace-automations repository contains scripts and workflows designed to automate the process of fetching, anonymizing, and updating data within Tonic Structural workspaces. Building upon the code here can allow you to more easily build out the examples above.

Key components of the repository:

Tonic Configuration: Automate the configuration of Tonic Structural workspaces, including data connections, table selections, and anonymization settings.
Tonic Session: Methods to automate interactions with the Tonic API and to more easily integrate Structural with other services.

Leveraging Tonic Structural example scripts

To start integrating these automations, follow these steps:

Clone the tonic-workspace-automations repository: Access the scripts and workflows designed for Tonic integration.
Customize scripts for your environment: Modify the provided scripts to connect to your databases and Tonic Structural workspace. Set up necessary authentication, endpoints, and data mapping according to your infrastructure.
Incorporate scripts into your existing pipelines: Integrate these scripts into your CI/CD platform of choice. Use conditional triggers, environment variables, and pipeline parameters to tailor the workflow to your team’s needs.
Monitor and refine: After integration, monitor the performance and impact of these automations. Collect feedback from developers and testers to refine the data sets and anonymization rules.
Maintain compliance and security: Regularly review the anonymization settings and data handling practices to ensure compliance with data privacy regulations and internal policies.
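For the customization and pipeline steps above, one common pattern is to read connection details from environment variables set as CI/CD secrets, so that no credentials live in the repository. The variable names in this sketch are hypothetical and are not defined by tonic-workspace-automations.

```python
import os

# Hypothetical variable names; align them with your own pipeline's secrets.
REQUIRED = ("TONIC_URL", "TONIC_API_KEY", "TONIC_WORKSPACE_ID")


def load_settings(environ=os.environ):
    """Read pipeline settings from the environment, failing fast if any are missing."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    return {
        "base_url": environ["TONIC_URL"].rstrip("/"),
        "api_key": environ["TONIC_API_KEY"],
        "workspace_id": environ["TONIC_WORKSPACE_ID"],
        # Optional override with a default of one hour
        "timeout_seconds": int(environ.get("TONIC_TIMEOUT_SECONDS", "3600")),
    }
```

Failing fast with the list of missing variables makes a misconfigured pipeline stop at the first step with a clear message, rather than partway through a refresh.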

Conclusion

Data de-identification, synthesis, and automated provisioning are key to a more accurate and more efficient CI/CD pipeline. Integrating Tonic.ai’s portfolio of products into your existing data refresh and CI/CD pipelines can significantly enhance your development and testing workflows. By expanding on the tonic-workspace-automations repository, teams can automate the creation of secure, realistic datasets, enabling more accurate testing and development processes while maintaining data privacy and compliance. To learn more, book some time with our team’s data transformation experts.
