Developers are taking on new challenges driven primarily by data. Having quick access to high-quality, privacy-safe datasets is mission-critical, especially in environments like software testing, machine learning, and AI model training. Data synthesis has emerged as a practical and scalable solution for generating realistic data for these and other use cases, while mitigating privacy risks and production slowdowns.
Whether you’re debugging a staging environment, training a model, or building out a retrieval-augmented generation (RAG) system, synthetic data gives you the confidence that you're not putting sensitive data at risk. In this article, we’ll define data synthesis, explore some of its more common use cases, and break down the top techniques developers use to generate safe, reliable, useful synthetic data.
What is data synthesis?
Data synthesis is the process of generating artificial data that mimics the structure and statistical patterns of real-world datasets. Unlike techniques such as data redaction or scrambling, data synthesis generates realistic artificial records that are free of personally identifiable information (PII) yet behave like the original data in development, testing, and training workflows.
In the context of software development, data synthesis plays a critical role in enabling secure testing, continuous delivery, and faster innovation. Developers use synthetic data to avoid the delays and bottlenecks that often arise when requesting access to production datasets. At Tonic.ai, we see data synthesis as a foundational piece of modern DevOps pipelines.
Some of the most common use cases for data synthesis include:
Software development and testing
By using synthetic data, development teams can reliably access usable, representative datasets across diverse environments. From frontend validation to performance benchmarking, synthetic data unlocks realistic test environments without the risk of exposing customer data.
AI model training
Synthetic data is often used to either augment real datasets or serve as a replacement in cases where sensitive data is restricted. Synthesized data can be used to train machine learning models, reduce bias, or simulate scenarios that are underrepresented in real-world production data.
LLM privacy proxy
Using real user prompts, conversations, or internal data to train large language models (LLMs) can lead to serious privacy concerns. A synthetic dataset used within an AI workflow as an LLM privacy proxy allows teams to access the massive, diverse datasets required for LLM training or fine-tuning without risking any sensitive information.
RAG systems
In RAG pipelines, synthetic data can be used to test chunking strategies, metadata tagging, or response generation before deploying real documents. This reduces the risk of exposing confidential corpora in early experimentation.
Data privacy and regulatory compliance
Synthetic datasets help organizations comply with regulations such as GDPR, HIPAA, and CCPA by allowing teams to work with data that functions like the original but remains fully de-identified throughout the process.
Unblock product innovation with high-fidelity synthetic data that mirrors your data's context and relationships.
Key techniques for synthesizing data
The right technique for synthesizing data depends on your use case. Are you synthesizing data for new product development, software testing, or AI model training? Let’s compare the most common data synthesis techniques, looking at their strengths and best-fit scenarios to see which would serve your purposes.
Rule-based data generation
This method of data synthesis involves defining a set of explicit rules or constraints that the generated data must follow. These rules are typically based on domain knowledge, statistical properties of real data, or specific requirements for the synthetic dataset. Developers might specify value ranges, string patterns, or dependencies between fields.
Rule-based generation can create data "from scratch," i.e., from the rules alone, or it can derive data from existing records by applying transformational rules. For example, a rule might stipulate shifting dates by +/- three days.
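To make both modes concrete, here is a minimal sketch in Python. The field names and value ranges are illustrative assumptions, not a specific product's API: one function generates order records from rules alone, and one applies the +/- three-day date-shift rule mentioned above to existing values.

```python
import random
import datetime

random.seed(0)  # deterministic output for repeatable test runs

def generate_orders(n):
    """Generate records 'from scratch' based on explicit rules alone:
    a fixed ID pattern, a bounded amount range, an allowed status set."""
    return [
        {
            "order_id": f"ORD-{i:06d}",
            "amount": round(random.uniform(5.0, 500.0), 2),
            "status": random.choice(["pending", "shipped", "delivered"]),
        }
        for i in range(n)
    ]

def shift_date(iso_date, max_days=3):
    """Transformational rule applied to existing data:
    shift a date by a random offset of up to +/- max_days."""
    d = datetime.date.fromisoformat(iso_date)
    offset = datetime.timedelta(days=random.randint(-max_days, max_days))
    return (d + offset).isoformat()

orders = generate_orders(100)
shifted = shift_date("2024-06-15")
```

In practice the rule set would live in configuration rather than code, so non-developers can adjust ranges and patterns without redeploying the generator.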
Rule-based data generation is extremely useful for greenfield product or feature development when no existing data is available to build upon. It is also useful for sales demos, as it allows you to tailor your demo data to exactly the criteria you need, to both best showcase your product and to best mirror your prospect’s needs. And in the realm of model training, rule-based synthetic data can fill the gaps or augment where your existing data is lacking. This holds true for both structured and unstructured data, like customer service conversations or medical notes.
Transformative data generation
In the context of data synthesis, transformative data generation refers to the process of creating new data by modifying or altering existing data points in a systematic way, often to change their format, structure, or content for a specific purpose. For AI model training and software development and testing, this approach is key in transforming or de-identifying existing production data for compliant and effective use by developers and AI engineers.
Transformative data generation can be prescriptive, like rule-based generation, or statistical, like model-based generation, depending on the algorithm in use. Masking, encryption, tokenization, and differential privacy all fall into this bucket. It is the primary approach used to de-identify data for regulatory compliance.
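As a small illustration of two techniques in this bucket, the sketch below shows format-preserving masking and deterministic tokenization. The helper names and the salt value are illustrative assumptions; production tokenization would use a managed secret and a secure token vault.

```python
import hashlib

def mask_email(email):
    """Masking: replace the sensitive local part while preserving
    the format and the domain, so downstream parsers still work."""
    local, domain = email.split("@")
    return "x" * len(local) + "@" + domain

def tokenize(value, salt="demo-salt"):
    """Tokenization: map a sensitive value to a consistent surrogate.
    The same input always yields the same token, which preserves
    joins and referential integrity across tables."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "tok_" + digest[:12]

masked = mask_email("jane.doe@example.com")
token_a = tokenize("123-45-6789")
token_b = tokenize("123-45-6789")
```

Consistency is the key property here: because `tokenize` is deterministic, a customer ID transformed in one table matches the same ID transformed in another.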
Model-based data generation
Model-based data synthesis uses statistical or machine learning models to generate new data points that reflect the distribution of real-world datasets.
This is a solid choice when your priority is synthetic data that closely mirrors real-world data, for example for research or data analysis. Compared to rule-based generation, model-based generation offers much higher fidelity. However, the compromise is evident in scalability: model-based generation isn't yet capable of producing datasets at database scale. It works well at the column or table level, but relational data across tables is not its strength.
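In its simplest form, model-based generation means fitting a statistical model to a real column and sampling from it. The sketch below assumes a single numeric column and a Gaussian fit, the most minimal case; real implementations fit richer models (mixtures, copulas, neural networks) and handle many correlated columns.

```python
import random
import statistics

random.seed(0)

# Stand-in for a real production column of transaction amounts.
real = [random.gauss(100, 15) for _ in range(1000)]

# Fit a simple statistical model: estimate the column's parameters...
mu = statistics.fmean(real)
sigma = statistics.stdev(real)

# ...then sample new, artificial values from the fitted distribution.
# No synthetic value is copied from a real record.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]
```

The synthetic column preserves the real column's distribution (mean, spread) without reproducing any individual real value, which is the core privacy argument for this technique.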
Data augmentation
Data augmentation techniques create new data by transforming existing records: adding noise, scaling, rotating, or otherwise modifying original data points. While most common in image processing, augmentation is also used on tabular and NLP datasets. The core idea is to create new, slightly altered versions of the data that retain their original meaning or labels, increasing the diversity and size of training data for machine learning models.
Data augmentation is best when you have real data and want to increase either volume or variation. It’s especially useful in AI/ML pipelines for reducing overfitting. Unlike model-based generation or Variational Autoencoders (VAEs), data augmentation adds to existing records instead of creating fully synthetic datasets, making it less ideal for test data generation or regulatory use cases, where privacy is paramount.
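For tabular data, the simplest augmentation is jittering numeric features while keeping the label fixed. The feature names below are illustrative assumptions; the point is only that each augmented copy stays close to the original and keeps its label.

```python
import random

random.seed(0)

def augment(row, noise=0.05):
    """Create a slightly altered copy of a record: jitter numeric
    features by up to +/- noise (5% here), keep the label unchanged."""
    return {
        "age": row["age"] * (1 + random.uniform(-noise, noise)),
        "income": row["income"] * (1 + random.uniform(-noise, noise)),
        "label": row["label"],  # meaning/label is preserved
    }

original = {"age": 40, "income": 65000, "label": "approved"}
augmented = [augment(original) for _ in range(10)]
```

Because every augmented record is derived from a real one, this technique increases volume and variation but does not, on its own, de-identify the data.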
Variational Autoencoders (VAEs)
VAEs are a type of generative neural network that learn to create new data, including images and text, that resembles the data they've seen before. They first compress the original data into a "fuzzy" or probabilistic code (not a single, fixed point), then learn to reconstruct similar data from this fuzzy code. This "fuzziness" in their internal representation allows VAEs to be creative and generate diverse, new examples that are similar to, but not exact copies of, their training data.
VAEs are most commonly associated with image synthesis, though they can also be used for NLP or structured data use cases. Compared to model-based generation, VAEs do require more training data and compute power. They can offer deeper learning than rule-based generation, but for structured data, scalability is limited.
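The "fuzzy code" described above is the reparameterization trick: the encoder outputs a distribution (a mean and variance) rather than a point, and new examples are decoded from samples drawn near that mean. The sketch below illustrates only this mechanism with fixed random weights; it is not a trained VAE, and a real one would learn the encoder and decoder weights by minimizing reconstruction loss plus a KL-divergence term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained encoder/decoder; a real VAE learns these.
W_enc = rng.normal(size=(4, 2))   # data dim 4 -> latent dim 2
W_dec = rng.normal(size=(2, 4))   # latent dim 2 -> data dim 4

def encode(x):
    """Encoder outputs a *distribution* over latent codes, not a point:
    a mean and a log-variance for each latent dimension."""
    mu = x @ W_enc
    log_var = np.full_like(mu, -2.0)  # fixed log-variance for the sketch
    return mu, log_var

def reparameterize(mu, log_var):
    """The 'fuzziness': z = mu + sigma * eps. Sampling eps each time is
    what lets a VAE generate diverse examples near its training data."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    return z @ W_dec

x = rng.normal(size=(1, 4))
mu, log_var = encode(x)
# Five decodes of the same input yield five similar-but-different outputs.
samples = np.stack([decode(reparameterize(mu, log_var)) for _ in range(5)])
```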
Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE synthesizes new examples by interpolating between existing data points in underrepresented classes—a technique commonly used to balance classification datasets.
SMOTE is good for building balanced classification models in situations where one class is underrepresented, like fraud detection, for example, or medical diagnosis. While model-based generation and VAEs can preserve the global structure of datasets, SMOTE focuses on rebalancing, making it better suited for improving model fairness and accuracy. It lacks the flexibility of VAEs or the directness of rule-based generation, but it is highly effective in targeted machine learning applications.
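The interpolation at the heart of SMOTE is straightforward: pick a minority point, pick one of its k nearest minority-class neighbors, and place a new point at a random position on the line between them. The sketch below is a simplified, self-contained version of that idea, not the full published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(minority, n_new, k=3):
    """Generate new minority-class points by interpolating between a
    randomly chosen point and one of its k nearest minority neighbors."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest neighbors within the minority class (index 0 is x itself)
        dists = np.linalg.norm(minority - x, axis=1)
        neighbor = minority[rng.choice(np.argsort(dists)[1:k + 1])]
        lam = rng.random()  # interpolation factor in [0, 1)
        new_points.append(x + lam * (neighbor - x))
    return np.array(new_points)

# Toy minority class: 20 points clustered around (5, 5).
minority = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
synthetic = smote(minority, n_new=30)
```

Because every synthetic point lies between two real minority points, SMOTE densifies the minority region rather than inventing data in new parts of the feature space.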
ADAptive SYNthetic (ADASYN) Sampling Method
ADASYN builds on SMOTE by focusing generation efforts on the hardest-to-classify data points (i.e., those that are surrounded by more majority class neighbors or are closer to the decision boundary), improving model generalization.
ADASYN is a good candidate to use when class imbalance is severe and performance on rare outlier cases matters, like identifying edge behaviors in security or predictive maintenance, for example. When compared to SMOTE, ADASYN is more dynamic and nuanced. It’s also not as broadly applicable as model-based generation or VAEs, but it does excel at reducing bias in skewed datasets. Overall, it’s best combined with other synthesis methods when building robust AI systems.
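ADASYN's "dynamic" behavior comes from one extra step over SMOTE: weighting each minority point by how many majority-class neighbors surround it, so harder points receive more synthetic samples. The sketch below shows only that weighting step, under the simplifying assumption of a brute-force nearest-neighbor search.

```python
import numpy as np

rng = np.random.default_rng(0)

def adasyn_weights(minority, majority, k=5):
    """For each minority point, compute the fraction of its k nearest
    neighbors (over the full dataset) that belong to the majority class.
    Normalized, these ratios say how many synthetic samples each
    minority point should receive: harder points get more."""
    all_points = np.vstack([minority, majority])
    is_majority = np.array([False] * len(minority) + [True] * len(majority))
    ratios = []
    for x in minority:
        dists = np.linalg.norm(all_points - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
        ratios.append(is_majority[neighbors].mean())
    r = np.array(ratios)
    if r.sum() == 0:
        return np.full(len(minority), 1 / len(minority))
    return r / r.sum()

# Toy imbalanced dataset: 10 minority points amid 100 majority points.
minority = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(10, 2))
majority = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(100, 2))
weights = adasyn_weights(minority, majority)
```

The generation step itself then reuses SMOTE-style interpolation, allocating `weights[i] * n_new` new samples to minority point `i`.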
Hybrid approaches
Many organizations leverage a combination of the above data synthesis techniques to balance control with realism and to best match the data they generate with the use case at hand. A hybrid approach to synthetic test data generation may use statistical modeling to generate certain data types while applying rule-based transformations to fine-tune edge cases, enforce constraints, or ensure referential integrity. Today’s breakthroughs in synthetic data generation come from innovations that tackle the complexity of multi-table relational data synthesis at scale, including Tonic Fabricate, which provides an industry-leading AI agent for synthesizing complex, hyper-realistic relational databases.
Integrating data synthesis into existing workflows
For synthetic data to be effective, it must integrate seamlessly with existing software development and testing workflows, ensuring that teams can access high-quality, privacy-safe data without disruption.
- CI/CD integration: Synthetic data should be automatically generated and refreshed within continuous integration and continuous deployment (CI/CD) pipelines. By embedding data synthesis into automated testing workflows, teams can ensure their test environments always reflect the latest, most relevant data without manual effort.
- APIs for data generation: Solutions like the Tonic platforms provide APIs that allow development and testing teams to generate and retrieve synthetic data programmatically. This flexibility enables on-demand data provisioning tailored to specific test scenarios, reducing delays in the development lifecycle.
- Scalability and automation: Cloud-based synthetic data platforms eliminate bottlenecks by streamlining database management and data provisioning. Automated workflows ensure that synthetic data keeps pace with evolving application requirements, supporting everything from local development to enterprise-wide testing environments.
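As a hedged illustration of the CI/CD point above, the sketch below shows the shape of on-demand, reproducible provisioning: a generator seeded by a build identifier, so every pipeline run regenerates the same fixture. The function name, fields, and seeding scheme are illustrative assumptions, not any specific vendor's API.

```python
import json
import random

def provision_test_data(seed, n_rows=50):
    """Deterministically generate a synthetic fixture for a CI run.
    Seeding with, e.g., the pipeline run ID makes runs reproducible:
    the same build always sees the same data."""
    rng = random.Random(seed)
    return [
        {
            "user_id": i,
            "plan": rng.choice(["free", "pro", "enterprise"]),
            "mrr": round(rng.uniform(0.0, 500.0), 2),
        }
        for i in range(n_rows)
    ]

# In a CI step, the fixture could be serialized and loaded before tests run.
fixture = provision_test_data(seed="build-1234")
payload = json.dumps(fixture)
```

With a platform API in place of the local generator, the same pattern holds: a setup step requests data keyed to the run, tests consume it, and nothing sensitive ever enters the pipeline.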
Benefits of data synthesis
Here’s a brief summary of the key advantages of using data synthesis in development workflows:
- Safeguarding privacy: Replace real data with statistically accurate stand-ins that meet regulatory and ethical standards.
- Strengthening testing in lower environments: Test thoroughly with rich, realistic data, without the compliance risks of using production records.
- Enabling ML and AI model training: Ensure balanced and diverse datasets while protecting sensitive information.
- Accelerating development: Eliminate wait times for data access requests or manual masking processes.
- Accelerating time-to-market: Remove data dependencies that stall releases and impact delivery timelines.
Data synthesis with Tonic.ai
Tonic.ai delivers flexible, powerful solutions for data synthesis, ideal for supporting real developer workflows. With support for advanced techniques like data synthesis via agentic AI, Tonic helps teams ship faster while staying compliant and privacy-conscious. The Tonic product portfolio addresses all your synthetic data needs:
- Tonic Fabricate provides AI-powered data synthesis from scratch, to generate realistic relational databases and unstructured datasets at scale.
- Tonic Structural is the industry-leading transformative data generation platform for synthesizing realistic structured data based on sensitive production data while ensuring compliance and data utility.
- Tonic Textual secures unstructured data for use in model training and AI implementation by synthesizing realistic replacements for sensitive values found in free text, audio, and video data.
Want to see the product suite in action? Book a demo to learn how Tonic.ai can help you synthesize high-quality test data at scale.
Rule-based synthetic data generation follows predefined rules to create structured, consistent data, making it useful for well-defined formats but requiring manual maintenance. Model-based generation, using statistical and machine learning techniques, learns patterns from real data to produce realistic synthetic versions. While model-based approaches improve realism, they can be computationally intensive and struggle with relational datasets at scale. Both methods have strengths, and choosing the right approach depends on the specific use case.
The latest development in synthetic test data generation is generating synthetic data via agentic AI. By pairing the right LLM with the right toolset, developers can chat their way to synthetic data for any domain in a matter of minutes. This is the approach offered by Tonic Fabricate.
Synthetic data supports software testing, AI model training, and privacy-compliant analytics across industries. In finance, it aids in fraud detection and risk modeling, while healthcare organizations use it to train AI on de-identified patient data. Retail, cybersecurity, and automotive sectors leverage it for simulations, anomaly detection, and predictive modeling. By enabling safe and scalable data use, synthetic data enhances innovation while protecting sensitive information.
Ensuring privacy is a challenge, as poorly generated synthetic data can still expose sensitive patterns. Maintaining data utility requires synthetic datasets to preserve real-world statistical properties without introducing biases. Scaling synthetic data for relational databases remains difficult, as current model-based methods struggle with cross-table dependencies. Continuous validation and refinement are necessary to ensure synthetic data remains accurate and effective.
Understanding the source data’s structure and dependencies helps ensure realistic synthetic outputs. Privacy techniques like differential privacy can protect sensitive information while maintaining data utility. Regular validation against real-world scenarios is essential to confirm data quality. Automating synthetic data generation through CI/CD integration and APIs improves scalability and accessibility for development teams.