Developers are taking on new challenges driven primarily by data. Having quick access to high-quality, privacy-safe datasets is mission-critical, especially in environments like software testing, machine learning, and AI model training. Data synthesis has emerged as a practical and scalable solution for generating realistic data for these and other use cases, while mitigating privacy risks and production slowdowns.
Whether you’re debugging a staging environment, training a model, or building out a retrieval-augmented generation (RAG) system, synthetic data gives you the confidence that you're not putting sensitive data at risk. In this article, we’ll define data synthesis, explore some of its more common use cases, and break down the top techniques developers use to generate safe, reliable, useful synthetic data.
What is data synthesis?
Data synthesis is the process of generating artificial data that mimics the structure and statistical patterns of real-world datasets. Unlike techniques such as data redaction or scrambling, data synthesis produces realistic artificial records that are free of personally identifiable information (PII) and preserve the statistical behavior of the original data.
In the context of software development, data synthesis plays a critical role in enabling secure testing, continuous delivery, and faster innovation. Developers use synthetic data to avoid the delays and bottlenecks that often arise when requesting access to production datasets. At Tonic.ai, we see data synthesis as a foundational piece of modern DevOps pipelines.
Some of the most common use cases for data synthesis include:
Software development and testing
By using synthetic data, development teams can reliably access usable, representative datasets across diverse environments. From frontend validation to performance benchmarking, synthetic data unlocks realistic test environments without the risk of exposing customer data.
AI model training
Synthetic data is often used to either augment real datasets or serve as a replacement in cases where sensitive data is restricted. Synthesized data can be used to train machine learning models, reduce bias, or simulate scenarios that are underrepresented in real-world production data.
LLM privacy proxy
Using real user prompts, conversations, or internal data to train large language models (LLMs) can lead to serious privacy concerns. A synthetic dataset used within an AI workflow as an LLM privacy proxy allows teams to access the massive, diverse datasets required for LLM training or fine-tuning without risking any sensitive information.
RAG systems
In RAG pipelines, synthetic data can be used to test chunking strategies, metadata tagging, or response generation before deploying real documents. This reduces the risk of exposing confidential corpora in early experimentation.
Data privacy and regulatory compliance
Synthetic datasets help organizations comply with regulations like GDPR, HIPAA, and CCPA, for example, by allowing teams to work with data that functions similarly to the original but remains fully de-identified throughout the process.
The best techniques for synthesizing data
The right technique for synthesizing data depends on your use case. Are you synthesizing data for new product development, software testing, or AI model training? Let’s compare the most common data synthesis techniques, looking at their strengths and best-fit scenarios to see which would serve your purposes.
Rule-based data generation
This method of data synthesis involves defining a set of explicit rules or constraints that the generated data must follow. These rules are typically based on domain knowledge, statistical properties of real data, or specific requirements for the synthetic dataset. Developers might specify value ranges, string patterns, or dependencies between fields.
It can generate data “from scratch,” based on the rules alone, or it can derive new data from an existing dataset by applying transformational rules to it. For example, a rule might stipulate shifting dates by +/- three days.
Pros:
- Highly customizable
- Scalable—unlimited rows of data can be generated
- No real data required
Cons:
- Realism depends on the complexity of the defined rules
- Bias in the rules can skew or create gaps in the data
Rule-based data generation is extremely useful for greenfield product or feature development when no existing data is available to build upon. It is also useful for sales demos, as it allows you to tailor your demo data to exactly the criteria you need, both to best showcase your product and to mirror your prospect’s needs. And in the realm of model training, rule-based synthetic data can fill gaps or augment your existing data where it is lacking. This holds true for both structured and unstructured data, like customer service conversations or medical notes.
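To make this concrete, here is a minimal sketch of rule-based generation using only Python’s standard library. The field names, value ranges, and the +/- three-day date shift are illustrative assumptions, not a prescribed schema.

```python
import random
import string
from datetime import date, timedelta

def generate_customer(signup_after=date(2023, 1, 1)):
    """Generate one synthetic customer row from explicit rules alone."""
    # Rule: customer IDs follow the pattern "CUST-" + 6 digits.
    customer_id = "CUST-" + "".join(random.choices(string.digits, k=6))
    # Rule: plan tier is drawn from a fixed set with weighted probabilities.
    plan = random.choices(["free", "pro", "enterprise"], weights=[70, 25, 5])[0]
    # Rule: monthly spend depends on the plan (a cross-field dependency).
    spend_range = {"free": (0, 0), "pro": (20, 99), "enterprise": (100, 999)}[plan]
    monthly_spend = round(random.uniform(*spend_range), 2)
    # Rule: signup date falls within a one-year window.
    signup = signup_after + timedelta(days=random.randint(0, 365))
    return {"customer_id": customer_id, "plan": plan,
            "monthly_spend": monthly_spend, "signup_date": signup.isoformat()}

def shift_date(original: date, max_days: int = 3) -> date:
    """Transformational rule applied to existing data: shift a date by +/- three days."""
    return original + timedelta(days=random.randint(-max_days, max_days))

if __name__ == "__main__":
    rows = [generate_customer() for _ in range(5)]  # scale to as many rows as needed
    for row in rows:
        print(row)
    print(shift_date(date(2024, 6, 15)))
```

Because the rules are explicit, this approach scales to any number of rows, but the realism of the output is only as good as the rules themselves.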
Transformative data generation
In the context of data synthesis, transformative data generation refers to the process of creating new data by modifying or altering existing data points in a systematic way, often to change their format, structure, or content for a specific purpose. For AI model training and software development and testing, this approach is key in transforming or de-identifying existing production data for compliant and effective use by developers and AI engineers.
Pros:
- High-fidelity to production data
- Suited to complex data types
- Preserves referential integrity
Cons:
- Requires access to existing data
- Not suited to generating data patterns that don’t yet exist
Transformative data generation can be prescriptive like rule-based data generation or it can be statistical like model-based data generation, depending on the specific algorithm in use. Masking and encryption are techniques that fall into this bucket, as well as tokenization and differential privacy. It is the primary approach used to de-identify data for regulatory compliance.
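As a rough illustration, the sketch below applies two common transformations to an existing record: consistent tokenization of an email address via a keyed hash, and a small random date shift. The field names and the choice of HMAC-SHA256 are assumptions made for this example, not a description of any specific product’s implementation.

```python
import hmac
import hashlib
import random
from datetime import date, timedelta

SECRET_KEY = b"replace-with-a-managed-secret"  # assumption: key comes from a secrets manager

def tokenize_email(email: str) -> str:
    """Replace an email with a consistent, irreversible token. The same input always
    maps to the same token, which preserves referential integrity across tables."""
    digest = hmac.new(SECRET_KEY, email.lower().encode(), hashlib.sha256).hexdigest()[:12]
    return f"user_{digest}@example.com"

def shift_date(original: date, max_days: int = 3) -> date:
    """Obscure an exact date while keeping it realistic."""
    return original + timedelta(days=random.randint(-max_days, max_days))

record = {"email": "jane.doe@acme.com", "visit_date": date(2024, 3, 9), "balance": 1520.75}
deidentified = {
    "email": tokenize_email(record["email"]),
    "visit_date": shift_date(record["visit_date"]),
    "balance": record["balance"],  # non-sensitive fields pass through unchanged
}
print(deidentified)
```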
Model-based data generation
Model-based data synthesis uses statistical or machine learning models to generate new data points that reflect the distribution of real-world datasets.
Pros:
- High fidelity and realism
- Maintains correlations between fields
Cons:
- Requires a representative training dataset
- Runs into limitations at scale
- May be computationally intensive
This is a solid choice when your priority is synthetic data that closely mirrors real-world data, for example for research or data analysis. Compared to rule-based generation, model-based generation offers much higher fidelity. The trade-off is scalability: model-based data generation isn’t yet capable of generating datasets at the database scale. It can work well at the column or table level, but relational data across tables is not its strength.
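A minimal statistical example, assuming NumPy is available: fit a multivariate Gaussian to a numeric table and sample new rows from it, which preserves pairwise correlations between columns. Real model-based tools use richer models (copulas, deep generative networks), so treat this as a sketch of the idea rather than a production approach.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a real numeric table: the columns might be age and income.
real_data = np.column_stack([
    rng.normal(45, 12, size=1_000),           # age
    rng.normal(60_000, 15_000, size=1_000),   # income
])
real_data[:, 1] += 800 * (real_data[:, 0] - 45)  # inject an age-income correlation

# Fit: estimate the mean vector and covariance matrix from the real data.
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# Sample: draw synthetic rows from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

# The synthetic data should reproduce the correlation structure of the original.
print("real correlation:     ", np.corrcoef(real_data, rowvar=False)[0, 1].round(3))
print("synthetic correlation:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(3))
```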
Data augmentation
Data augmentation techniques create new data by transforming existing records: adding noise, scaling, rotating, or otherwise modifying original data points. While more common in image processing, it is also used with tabular and NLP datasets. The core idea is to create new, slightly altered versions of the data that still retain their original meaning or labels, increasing the diversity and size of training data for machine learning models.
Pros:
- Improves diversity in datasets
- Reduces overfitting in models
Cons:
- Limited in creating new information
- Risk of propagating existing biases
Data augmentation is best when you have real data and want to increase either volume or variation. It’s especially useful in AI/ML pipelines for reducing overfitting. Unlike model-based generation or Variational Autoencoders (VAEs), data augmentation adds to existing records instead of creating fully synthetic datasets, making it less ideal for test data generation or regulatory use cases, where privacy is paramount.
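For tabular data, a simple augmentation pass might jitter numeric features with small Gaussian noise while leaving labels untouched. The NumPy sketch below does exactly that; the noise scale is an arbitrary assumption you would tune per feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(X: np.ndarray, y: np.ndarray, copies: int = 2, noise_scale: float = 0.05):
    """Create `copies` noisy variants of each row, preserving the original labels."""
    # Scale the noise to each feature's standard deviation so units don't matter.
    feature_std = X.std(axis=0, keepdims=True)
    augmented_X = [X]
    augmented_y = [y]
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale, size=X.shape) * feature_std
        augmented_X.append(X + noise)
        augmented_y.append(y)  # labels are unchanged: the meaning is preserved
    return np.vstack(augmented_X), np.concatenate(augmented_y)

X = rng.normal(size=(100, 4))        # 100 rows, 4 numeric features
y = rng.integers(0, 2, size=100)     # binary labels
X_aug, y_aug = augment(X, y)
print(X_aug.shape, y_aug.shape)      # (300, 4) (300,)
```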
Variational Autoencoders (VAEs)
VAEs are a type of generative neural network that learn to create new data, including images and text, that resembles the data they've seen before. They first compress the original data into a "fuzzy" or probabilistic code (not a single, fixed point), then learn to reconstruct similar data from this fuzzy code. This "fuzziness" in their internal representation allows VAEs to be creative and generate diverse, new examples that are similar to, but not exact copies of, their training data.
Pros:
- Generally easier to train than GANs
- Less prone to “mode collapse” as compared to GANs
Cons:
- Not yet capable of generating structured relational data at scale
- Requires large, clean datasets for training
- Can be computationally intensive
VAEs are most commonly associated with image synthesis, though they can also be used for NLP or structured data use cases. Compared to model-based generation, VAEs do require more training data and compute power. They can offer deeper learning than rule-based generation, but for structured data, scalability is limited.
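The sketch below is a minimal VAE in PyTorch (an assumed dependency): an encoder maps each input to a mean and log-variance, the reparameterization trick samples a "fuzzy" latent code, and the decoder reconstructs the input. The loss combines reconstruction error with a KL term that keeps the latent space well-behaved. Layer sizes and the toy data are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=20, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)       # mean of the latent code
        self.to_logvar = nn.Linear(64, latent_dim)   # log-variance of the latent code
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    # KL divergence between the learned latent distribution and a standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 20)                 # toy numeric data
for _ in range(100):                     # tiny training loop for illustration
    recon, mu, logvar = model(x)
    loss = vae_loss(recon, x, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Generate new synthetic rows by decoding samples drawn from the prior.
with torch.no_grad():
    synthetic = model.decoder(torch.randn(10, 4))
print(synthetic.shape)  # torch.Size([10, 20])
```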
Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE synthesizes new examples by interpolating between existing data points in underrepresented classes—a technique commonly used to balance classification datasets.
Pros:
- Helps mitigate class imbalance
- Simple and effective for tabular data
Cons:
- Can create noisy or overlapping classes
- Only applies to classification problems
SMOTE is good for building balanced classification models in situations where one class is underrepresented, like fraud detection or medical diagnosis, for example. While model-based generation and VAEs can preserve the global structure of datasets, SMOTE focuses on rebalancing, making it better suited for improving model fairness and accuracy. It lacks the flexibility of VAEs or the directness of rule-based generation, but it is highly effective in targeted machine learning applications.
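At its core, SMOTE picks a minority-class sample, finds one of its nearest minority neighbors, and places a new point somewhere on the line segment between them. The NumPy sketch below is a simplified, from-scratch illustration of that interpolation; in practice you would likely reach for a maintained implementation such as the one in the imbalanced-learn package.

```python
import numpy as np

rng = np.random.default_rng(1)

def smote(X_minority: np.ndarray, n_samples: int, k: int = 5) -> np.ndarray:
    """Generate n_samples synthetic minority points by interpolating between neighbors."""
    synthetic = []
    for _ in range(n_samples):
        i = rng.integers(len(X_minority))
        base = X_minority[i]
        # Find the k nearest minority neighbors of the chosen point (excluding itself).
        distances = np.linalg.norm(X_minority - base, axis=1)
        neighbor_idx = np.argsort(distances)[1:k + 1]
        neighbor = X_minority[rng.choice(neighbor_idx)]
        # Interpolate: the new point lies between the base point and its neighbor.
        lam = rng.random()
        synthetic.append(base + lam * (neighbor - base))
    return np.array(synthetic)

# Toy imbalanced dataset: 500 majority points, 25 minority points.
X_majority = rng.normal(0, 1, size=(500, 2))
X_minority = rng.normal(3, 0.5, size=(25, 2))
X_new = smote(X_minority, n_samples=475)
print(X_new.shape)  # (475, 2) -- enough new minority points to balance the classes
```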
ADAptive SYNthetic (ADASYN) Sampling Method
ADASYN builds on SMOTE by focusing generation efforts on the hardest-to-classify data points (i.e., those that are surrounded by more majority class neighbors or are closer to the decision boundary), improving model generalization.
Pros:
- More adaptive than SMOTE
- Targeted improvement of classifier performance
- Reduces bias in predictions
Cons:
- Requires careful tuning
- May overfit rare/noisy classes
ADASYN is a good candidate to use when class imbalance is severe and performance on rare outlier cases matters, like identifying edge behaviors in security or predictive maintenance, for example. When compared to SMOTE, ADASYN is more dynamic and nuanced. It’s also not as broadly applicable as model-based generation or VAEs, but it does excel at reducing bias in skewed datasets. Overall, it’s best combined with other synthesis methods when building robust AI systems.
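In practice, both SMOTE and ADASYN are available off the shelf. The sketch below assumes scikit-learn and the imbalanced-learn package are installed, and simply compares the class counts each method produces on a toy imbalanced dataset; ADASYN concentrates its new samples near the hardest-to-classify minority points rather than spreading them uniformly.

```python
# Assumed dependencies: pip install scikit-learn imbalanced-learn
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Build a toy dataset where the minority class is only ~5% of the data.
X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
print("original:", Counter(y))

# SMOTE interpolates uniformly across the minority class...
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_smote))

# ...while ADASYN adaptively generates more samples around difficult minority points.
X_adasyn, y_adasyn = ADASYN(random_state=0).fit_resample(X, y)
print("after ADASYN:", Counter(y_adasyn))
```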
Benefits of data synthesis
To wrap up, here’s a summary of the key advantages of using data synthesis in development workflows:
- Safeguarding privacy: Replace real data with statistically accurate stand-ins that meet regulatory and ethical standards.
- Strengthening testing in lower environments: Test thoroughly with rich, realistic data without the compliance risks of using production records.
- Enabling ML and AI model training: Ensure balanced and diverse datasets while protecting sensitive information.
- Accelerating development: Eliminate wait times for data access requests or manual masking processes.
- Accelerating time-to-market: Remove data dependencies that stall releases and impact delivery timelines.
Tonic.ai delivers the most flexible, powerful solutions for data synthesis, ideal for supporting real developer workflows. With support for advanced techniques like statistical data generation and integrations with AI/ML pipelines, Tonic.ai helps teams ship faster while staying compliant and privacy-conscious. The Tonic.ai product portfolio addresses all your synthetic data needs:
- Tonic Fabricate provides AI-powered rule-based data synthesis from scratch, to generate realistic relational databases and unstructured datasets at scale.
- Tonic Structural is the industry-leading transformative data generation platform for synthesizing realistic structured data based on sensitive production data while ensuring compliance and data utility.
- Tonic Textual secures unstructured data for use in model training and AI implementation by synthesizing realistic replacements for sensitive values found in free text, audio, and video data.
Want to see the product suite in action? Book a demo to learn how Tonic.ai can help you synthesize high-quality test data at scale.