What is synthetic data? A complete guide

A complete guide to synthetic data: what it is, how it differs from de-identified and real data, the main generation approaches, common use cases, and where it fits in AI and software workflows.

By Mark Brocato, Head of Engineering for Fabricate at Tonic.ai

Updated June 2026



Synthetic data is artificially generated data that reproduces the statistical patterns, structure, and relationships of real-world data without copying any real records. Teams generate it two ways: from scratch against a schema, or by modeling an existing dataset to produce new, statistically faithful records. It's distinct from de-identification, which transforms real records in place — synthesis produces net-new records instead — and teams use it to get production-realistic data for development, testing, and AI training, including in scenarios where no usable production data exists.

What synthetic data is

Synthetic data is data a machine creates rather than data a system records. A real dataset is a log of things that happened — purchases, patient visits, sensor readings. Synthetic data is generated to look and behave like that log without being it. Every value in a synthetic dataset is new; none of it is a copy of a row that belongs to a real person or a real event.

When comparing synthetic data versus de-identified data, the key distinction is generated versus transformed. De-identification takes real records and changes them in place — swapping a name, masking an account number, shifting a date — so what comes out is still derived from one specific real person. Synthesis starts from a description of the data, or from a statistical model of it, and produces records that never existed. A de-identified row can, in principle, be traced back toward the individual it came from if the transformation is weak; a from-scratch synthetic row has no individual behind it to trace back to.

Generated vs. transformed: De-identified data is real data with the identifiers changed; synthetic data is net-new data created to match real data's patterns — no real record underneath.

At a high level, synthetic data can be built two ways. You can generate it from scratch, defining the schema and the rules or describing the data you want, with no source dataset involved. Or you can model an existing dataset — learning its distributions, correlations, and structure, then sampling new records that share those properties. The from-scratch path is how you create data when none exists; the modeled path is how you produce more of what you already have, or a safer stand-in for it.

When people call synthetic data "realistic," they mean statistically faithful, not just visually convincing. Take a table of orders. A faithful synthetic version carries the same spread of order values, the same relationship between customer tenure and how often someone buys, the same seasonal lift in December — and the same awkward edge cases: the small fraction of orders refunded twice, the customer whose shipping address changes mid-history, the null in a column that shouldn't allow one but does. It reproduces those distributions, correlations, and behaviors while keeping referential integrity intact, so a generated order still points to a customer that exists in the generated customer table. Fidelity is the point: a synthetic set that drops those correlations might pass a glance but will mislead a test suite or a model, because the patterns it was meant to stand in for aren't there. That faithfulness is bounded by method and source — from-scratch data is as realistic as the rules you give it, and modeled data is as realistic as the dataset it learned from.

How synthetic data is generated

Synthetic data starts from one of two places — from scratch, with no source data, or modeled on an existing dataset — and an agent-driven workflow is now the fastest route to both. Older approaches asked you to hand-configure a generator field by field and maintain it as your schema changed. Today, an AI agent does that work: you describe what you need, and it drafts, generates, and checks the data for you.

With an agentic tool like Tonic Fabricate, you give the agent a starting point in one of three ways:

A natural-language prompt describing the data you need — "a database of transaction records linked to customer and product records, with a few months of seasonal variation."
An uploaded schema — hand the agent your table definitions and let it populate them with realistic, referentially consistent values.
A connection to a live database that the agent profiles and models, reading both the structure and the actual value distributions so the output matches the real thing.

The first two paths are schema-first, from-scratch generation: there's no source data, so the agent builds records from the rules and patterns you describe. Ask for an e-commerce database and it produces customers, products, and orders where each order references a customer and a product that exist in the set, order values land in a believable range, and timestamps spread across the window you asked for instead of clustering unnaturally. This is how you create data for a system that hasn't shipped, or invent scenarios production has never seen.
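As a rough illustration of what schema-first, from-scratch generation involves, here is a minimal Python sketch that builds a small e-commerce dataset with the properties described above. It is not how Fabricate or any particular agent works internally; the function name `generate_shop`, the field names, and the price and quantity ranges are all assumptions made for the example.

```python
import random
from datetime import datetime, timedelta

def generate_shop(n_customers, n_products, n_orders, start, days, seed=0):
    """Build three small tables whose foreign keys always resolve."""
    rng = random.Random(seed)  # seeded so reruns give the same data
    customers = [{"customer_id": i, "name": f"Customer {i}"}
                 for i in range(1, n_customers + 1)]
    products = [{"product_id": i, "price": round(rng.uniform(5, 200), 2)}
                for i in range(1, n_products + 1)]
    orders = []
    for order_id in range(1, n_orders + 1):
        product = rng.choice(products)
        quantity = rng.randint(1, 5)
        orders.append({
            "order_id": order_id,
            # Keys are drawn only from rows that exist, so references stay valid.
            "customer_id": rng.choice(customers)["customer_id"],
            "product_id": product["product_id"],
            "total": round(product["price"] * quantity, 2),
            # Timestamps are spread uniformly across the requested window.
            "placed_at": start + timedelta(seconds=rng.uniform(0, days * 86400)),
        })
    return customers, products, orders

customers, products, orders = generate_shop(
    n_customers=50, n_products=20, n_orders=500,
    start=datetime(2024, 1, 1), days=90,
)
```

Seeding the random generator makes the output reproducible, which helps when the same dataset has to be regenerated later, for example in CI.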

The third path is derived generation. When you point the Data Agent at an existing database, it studies how that data is shaped — the ranges, the frequencies, the relationships between tables — and generates new records that preserve those properties at whatever volume you ask for. Say a sample holds 1,000 orders in which 3% are refunded and high-tenure customers buy more often; the agent can generate 100,000 orders that keep those same proportions and correlations while containing none of the original records. That's the route to volume, and to a production-realistic stand-in you can move into lower environments without moving production data with it.
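One simple way to picture derived generation is to count how often each combination of values appears in the sample, then draw new rows in those proportions. The sketch below does that for a hypothetical 1,000-order sample with a 3% refund rate. The column names, the tenure split, and the helper names `fit_joint` and `sample_joint` are assumptions, and a real generator would learn far richer structure than a frequency table.

```python
import random
from collections import Counter

def fit_joint(rows, columns):
    """Count how often each combination of column values occurs in the source."""
    return Counter(tuple(row[c] for c in columns) for row in rows)

def sample_joint(counts, columns, n, seed=0):
    """Draw n new rows with probability proportional to the observed counts."""
    rng = random.Random(seed)
    combos = list(counts)
    weights = [counts[combo] for combo in combos]
    drawn = rng.choices(combos, weights=weights, k=n)
    # Each generated row gets a fresh synthetic ID.
    return [dict(zip(columns, combo), order_id=f"syn-{i}")
            for i, combo in enumerate(drawn)]

# Hypothetical source sample: 1,000 orders, 30 of them refunded (3%),
# with high-tenure customers placing 60% of the orders.
source = (
    [{"tenure": "high", "refunded": False}] * 588
    + [{"tenure": "high", "refunded": True}] * 12
    + [{"tenure": "low", "refunded": False}] * 382
    + [{"tenure": "low", "refunded": True}] * 18
)
columns = ["tenure", "refunded"]
synthetic = sample_joint(fit_joint(source, columns), columns, n=100_000)

refund_rate = sum(row["refunded"] for row in synthetic) / len(synthetic)
high_share = sum(row["tenure"] == "high" for row in synthetic) / len(synthetic)
```

Because the rows are sampled, `refund_rate` and `high_share` land close to the source proportions rather than matching them exactly.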

What comes out, in each case, is relational and referentially intact: multiple tables and files whose keys line up across the whole dataset. The generated data can span several related databases and files at once, including nested and semi-structured shapes, not just flat tables. For a complex schema, the agent first drafts a generation plan you can review and adjust before anything is generated. Toggle on validation, and Fabricate's Validation Agent reviews what the Data Agent produces, flags unrealistic or incorrect data, and refines the generation approach. Separating the creator and reviewer roles helps keep data quality and integrity high even when the initial prompt is imprecise. To operationalize the output, automated workflows and mock APIs slot into your pipelines, so generated data shows up where your tests and services expect it.

Because the agent works conversationally, refining the data is the same loop as creating it: you point out what's off, and it regenerates the affected tables instead of making you rewrite and re-run a script. This agentic, conversation-driven workflow is where the category has moved. Knowing how to generate synthetic data used to mean writing and maintaining generation scripts; now it means describing the result you want and reviewing what the agent builds.

Types of synthetic data and generation techniques

The technique you choose follows the job, not the other way around. There are a handful of established approaches to synthetic data generation, and they map onto how much you know about the data up front and how much realism you need.

A useful way to frame the output is net-new, derived, or hybrid. Net-new data is generated from scratch against a schema or prompt. Derived data is modeled on an existing dataset. Hybrid data combines the two — modeling the part of a schema you have real data for, and generating the rest from rules. The techniques below are the mechanisms that produce those outputs.

Synthetic data generation techniques compared: how each works, when to use it, and what to watch for
Technique	How it works	Best when	Worth noting
Rule-based	You hand-define constraints per field — formats, ranges, allowed values, simple distributions — and the generator fills them deterministically.	You need precise, rules-bound output and know exactly what each field should contain.	Doesn’t learn real-world correlations on its own; labor-intensive for complex schemas; realism caps at what you specify by hand.
Model-based	A statistical or machine-learning model learns the distributions and relationships in a source dataset, then samples new records from what it learned.	You have a representative dataset and need new records that preserve its statistical relationships.	Needs quality source data; correlations are only as good as the training set; a carelessly trained model can memorize and leak.
Agentic	An AI agent profiles the source (or your prompt or schema), drafts a generation plan, produces relational data, and validates it — orchestrating the other techniques for you.	You want production-realistic, referentially intact data fast, with minimal manual configuration.	The newest approach; output quality still depends on the prompt or source and on the validation step.
Hybrid	Combines approaches — model real distributions where you have data, fall back to rules where you don’t.	Part of your schema has source data and part is net-new.	More moving parts to configure and reason about.

Rule-based synthesis is the original, pre-AI approach. It predates today's models and agents: you write the constraints, and the tool produces output that obeys them, every time, with no learning involved. It's still useful when you need deterministic, precisely bounded data — a fixed set of valid status codes, currency amounts within a known range, IDs in a specific format. But it can't infer the correlations that make data feel real. If you specify a customer's age and their product preferences as independent rules, the output won't reproduce the real-world link between them unless you encode that link yourself, and maintaining hand-written rules across a large schema is slow.
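To make the rule-based idea concrete, here is a minimal sketch in which each field has one hand-written rule and the generator simply applies them. The field names, status codes, amount range, and ID format are illustrative assumptions rather than anything prescribed by a particular tool.

```python
import random

# One hand-written rule per field. Each rule sees only the random source,
# so no field can depend on another unless you encode that yourself.
RULES = {
    "status": lambda rng: rng.choice(["pending", "shipped", "delivered", "refunded"]),
    "amount": lambda rng: round(rng.uniform(1.00, 500.00), 2),
    "order_ref": lambda rng: f"ORD-{rng.randrange(10**6):06d}",
}

def generate_rows(rules, n, seed=0):
    """Apply every rule n times; the same seed yields the same rows."""
    rng = random.Random(seed)
    return [{field: rule(rng) for field, rule in rules.items()} for _ in range(n)]

rows = generate_rows(RULES, 1000)
```

Every row obeys the constraints, but `status` and `amount` are drawn independently, which is the limitation described above: any real-world link between fields has to be written in as another rule.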

Model-based generation closes that gap by learning relationships from a source dataset instead of asking you to state them. Point a model at the same customer table and it picks up that age and preference move together, because that pattern is in the data it learned from. The trade is that it needs a quality source dataset to learn from, and the learning has to be done carefully so the model generalizes rather than memorizes.
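The difference between stating fields independently and learning their relationship can be shown with a toy example. In the sketch below, the source data, the age bands, and the preference labels are all hypothetical, and the "model" is only a lookup of preferences per age band, which is far simpler than the generative methods used in practice.

```python
import random
from collections import defaultdict

# Hypothetical source: younger customers mostly prefer gaming, older ones gardening.
source = ([("18-29", "gaming")] * 80 + [("18-29", "garden")] * 20
          + [("60+", "gaming")] * 10 + [("60+", "garden")] * 90)

def sample_independent(source, n, rng):
    """Draw age and preference separately, as two unrelated rules would."""
    ages = [age for age, _ in source]
    prefs = [pref for _, pref in source]
    return [(rng.choice(ages), rng.choice(prefs)) for _ in range(n)]

def sample_conditional(source, n, rng):
    """Draw an age, then a preference from what was observed for that age."""
    prefs_by_age = defaultdict(list)
    for age, pref in source:
        prefs_by_age[age].append(pref)
    ages = [age for age, _ in source]
    rows = []
    for _ in range(n):
        age = rng.choice(ages)
        rows.append((age, rng.choice(prefs_by_age[age])))
    return rows

def share_gaming_among_young(rows):
    young = [pref for age, pref in rows if age == "18-29"]
    return young.count("gaming") / len(young)

rng = random.Random(0)
independent = sample_independent(source, 20_000, rng)
conditional = sample_conditional(source, 20_000, rng)
```

In the source, 80% of customers aged 18-29 prefer gaming. The conditional sample stays near that figure, while the independent sample drifts toward the overall gaming share of 45%, because the link between the two fields was never learned.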

Agentic generation is the newest approach and the most hands-off: the agent profiles your source or reads your prompt, decides which mechanisms to apply, generates relational data, and validates the result. Under the agentic layer, it uses both rule-based and model-based techniques, applying rules where you need precise bounds and learned models where you need realistic correlations. That turns data generation from a scripting or data science project into a conversation. As a result, production-realistic data is increasingly something developers ask for rather than engineer by hand.

Each approach has a deeper catalog of algorithms beneath it — the model-based family alone spans several generative methods, plus dedicated techniques for tasks like balancing underrepresented classes.

Synthetic data, de-identified data, and real data: how they differ and when to use each

Real data, de-identified data, and synthetic data differ by origin and by the use cases they’re most suited to. Real data is recorded straight from events. De-identified data is real data with its identifiers stripped or masked — the records still derive from real people. Synthetic data is generated net-new, reproducing the patterns of real data without being drawn from any specific person. The practical question is rarely which is "best," but which one fits the task in front of you.

Real, de-identified, and synthetic data compared: origin, what each preserves, and when to use it
Type	Origin	What it preserves	When to use
Real data	Recorded directly from events, transactions, or measurements.	The exact real-world values, relationships, and the real individuals behind them.	When the data isn’t sensitive and regulations let you use it directly.
De-identified data	Real data transformed to remove or mask identifiers.	The original records’ structure and real-world shape, with identifiers removed or replaced.	When you need production’s exact real-world shape and edge cases, minus the identifiers.
Synthetic data	Generated — from a schema, a model of real data, or a prompt.	Real data’s statistical patterns, structure, and relationships, without reproducing any real record.	When production data is unavailable or unusable, or you need more volume, variety, or net-new scenarios than it holds.

The right way to choose between de-identification and synthesis is fit, not safety. De-identification fits when you already have production data and you want its exact real-world shape — the precise correlations, the genuine edge cases, the long tail your business produces — carried into a safer copy. If you're reproducing a bug that only appears against a specific tangle of real records, a de-identified copy of that production data gives you the exact conditions to work against. Synthesis fits when production data isn't available or appropriate: the system hasn't shipped, the data can't leave its environment, you need far more volume than production holds for a load test, or you need scenarios that have never occurred — a fraud pattern you want to catch before it happens, a spike ten times larger than any real day.

Both approaches are privacy-capable when they're done well, and they're complementary more often than they're competitive. A common pattern uses them together: de-identify a production dataset, then synthetically scale it up — treating the de-identified data as the model and generating additional records on top of it for load testing or for provisioning many parallel environments. You get the real-world shape from the de-identification step and the volume from the synthesis step.

Fit also means knowing when synthesis isn't the answer. If a regression only reproduces against the exact tangle of values one real account accumulated over years, a de-identified copy of that account is a faster path than describing the tangle to a generator. The reverse holds when the blocker is access or volume rather than fidelity to a specific real record. Most mature teams keep both approaches within reach and choose per task, rather than treating either one as the default.

Benefits of synthetic data

Synthetic data gives you on-demand access to production-realistic data without the wait. Instead of filing a ticket and waiting on a DBA to provision a copy of production, in an agentic workflow you describe the data you need and generate it. That speed carries over into volume, privacy, development velocity, AI readiness, test coverage, and control over the data's shape and scale.

Unlimited volume. You can generate as many rows as you need, so load and stress testing isn't capped by how much data production happens to hold. Want to know how a service behaves at ten times current scale, or simulate a rare high-volume event before it occurs in the wild? Generate the data and find out, instead of waiting for production to grow into the test.
Privacy by construction. Net-new records don't correspond to real people, so you carry far less sensitive data into development, test, and training environments — and into the hands of contractors, offshore teams, and demo audiences. The data that isn't there can't leak, which removes a whole class of exposure from environments that don't need real records to do their job.
Development velocity. Self-serve generation removes the data-as-a-ticket delay and the cross-team dependency that comes with it. Frontend, backend, and microservice teams can each work against their own isolated dataset in parallel, instead of contending over a shared staging environment or blocking on whoever owns the data pipeline.
AI readiness. You can balance underrepresented classes, control coverage across the scenarios you care about, and attach the ground-truth labels that real-world datasets rarely arrive with. A fraud model that sees too few fraud cases in real data can be trained on a set where the rare class is well represented and every example is correctly labeled.
Test coverage. You can engineer the malformed inputs and edge cases that break software — the double refund, the unicode name, the impossible timestamp, the order with no line items — instead of hoping production happens to contain an example. You design coverage rather than discovering its gaps in an incident.
Control over shape and scale. You decide the schema, the distributions, the volume, and which scenarios appear, instead of taking whatever production happens to hold. That control is what lets you target a specific edge case, model a market you haven't entered yet, or size a dataset precisely to a load test.

There's a cost dimension underneath all of these, too: every hour a well-paid engineer spends sourcing, sanitizing, or hand-mocking data is an hour not spent building, and self-serve generation gives that time back. Each benefit maps to a concrete engineering outcome: fewer escaped bugs, shorter cycles, higher-quality products, and better-performing models.

Synthetic data use cases

Teams reach for synthetic data whenever real data is missing, restricted, or insufficient — and those three conditions show up across software and AI work in recognizable patterns.

Software testing

Software testing is the most common entry point: realistic test data, generated from scratch or modeled on production, with no production-data dependency. Engineers get datasets that exercise their actual business logic — correct types, valid relationships, representative distributions — without waiting on a provisioning pipeline or pulling sensitive records into a test environment. Because each team can generate its own isolated dataset, parallel work doesn't collide in shared state, and the same generation step can run in CI so every pull request tests against fresh, consistent data.

AI model and agent training

Synthetic data for AI training solves the problem of scarce, imbalanced, or unlabeled data. You can augment a small real dataset with additional generated examples, balance classes that production underrepresents, and produce data with built-in ground truth so evaluation isn't guesswork. A classifier that rarely sees a given category in real data can be trained on a set where that category is well represented; a model for a regulated domain can be trained on synthetic records when the real ones can't leave their environment. For domains where no usable corpus exists, generation is often the only way to get a training set at all — synthetic data for machine learning becomes the starting point rather than a supplement. Augmentation is the everyday version of this: a handful of real examples seed a much larger, balanced set, so a model that would otherwise overfit a tiny sample has enough variety to learn from.
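As one hedged illustration of balancing an underrepresented class, the sketch below creates new minority examples by interpolating between pairs of real ones. This is in the spirit of SMOTE, which interpolates toward nearest neighbors; here the partner is picked at random to keep the example short. The feature values and the function name `oversample_minority` are assumptions.

```python
import random

def oversample_minority(minority, target_size, seed=0):
    """Create new points on the line segment between random pairs of minority points."""
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        a, b = rng.sample(minority, 2)  # two distinct real examples
        t = rng.random()                # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical fraud cases as (amount, transactions_in_last_hour).
fraud = [(950.0, 3.0), (1200.0, 1.0), (870.0, 4.0), (1500.0, 2.0)]
new_fraud = oversample_minority(fraud, target_size=100)
```

Every generated example inherits the minority label, so ground truth comes for free. Interpolation only fills in the region between known cases, so it adds variety without inventing fraud patterns the real examples never hinted at.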

Simulated worlds and reinforcement learning

Building simulated worlds and reinforcement learning environments takes data that's rarely available off the shelf: complete, internally consistent worlds with temporal integrity. Real-world data for these environments is usually scarce, restricted, or simply nonexistent — you can't download a labeled record of an entire company's year of activity. Synthetic generation can simulate that activity over a timeline — structured records plus the surrounding free text, emails, messages, and events — with cross-references intact and the difficulty of tasks under your control. On that structured foundation you can define hierarchies of verifiable tasks, from single-hop lookups to multi-hop reasoning chains, each with a known-correct answer. That combination of a coherent simulated world and built-in ground truth is what makes training and evaluating agents tractable without scarce, expensive real-world corpora.
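The idea of verifiable tasks built on a structured foundation can be sketched in a few lines. The records, names, and task wording below are hypothetical; the point is only that each answer is computed from the generated records, so it is known to be correct.

```python
# Hypothetical structured layer of a simulated company.
employees = {
    "ava": {"manager": "raj"},
    "raj": {"manager": "mei"},
    "mei": {"manager": None},
}
emails = {
    "e1": {"sender": "ava", "subject": "Q3 budget"},
}

def single_hop_task(email_id):
    """One lookup: the answer sits directly on the email record."""
    return {
        "question": f"Who sent email {email_id}?",
        "answer": emails[email_id]["sender"],
        "hops": 1,
    }

def multi_hop_task(email_id):
    """Two lookups: find the sender, then follow the reporting line."""
    sender = emails[email_id]["sender"]
    return {
        "question": f"Who manages the sender of email {email_id}?",
        "answer": employees[sender]["manager"],
        "hops": 2,
    }

tasks = [single_hop_task("e1"), multi_hop_task("e1")]
```

Difficulty is controlled by how many records a task has to chain through, and grading an agent reduces to comparing its reply with the stored `answer`.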

Greenfield development

When you're building something that has no data yet, synthetic data lets you start anyway. You generate against the schema you're designing and develop, test, and demo against realistic data from day one, rather than waiting for real usage to accumulate before you can build. This is the cold-start case: there's no production data to model, so generation from scratch is the only option, and it removes the chicken-and-egg problem of needing data to build the thing that produces the data. Designing the schema and generating against it also surfaces data problems early — a relationship that doesn't hold, a field that's always null — while they're still cheap to fix.

Sales demos

Demo environments land better when the data looks like the prospect's own — same industry, same shapes, same scale — and synthetic data lets you produce that without touching real customer records or taking on compliance risk. A sales engineer can generate a dataset shaped to the audience in front of them, so the demo reflects the buyer's world instead of generic placeholder rows, and nothing sensitive is exposed in the process. Because the data is generated rather than reused, a fresh, clean environment can be spun up for each prospect instead of scrubbing and recycling one shared demo database.

How to measure synthetic data quality

Good synthetic data is judged on three properties that pull against each other: fidelity, utility, and privacy. Fidelity is how closely the data matches the statistical shape of real data. Utility is how well it performs for the task you generated it for. Privacy is how little it reveals about any real individual. These trade off — push fidelity to the limit on data modeled from real records and you risk privacy; clamp down on privacy and you can erode fidelity — so the work is balancing the three for your use case, not maxing any single one.

You measure fidelity by comparing the synthetic set to the real one: do the column value distributions line up, do the relationships between fields hold, do the rare edge cases occur at roughly the right frequency? Utility is the more decisive test, because faithful-looking data that doesn't do the job is no use. The standard check is train on synthetic, test on real: train a model (or run a test suite) against the synthetic data, then evaluate it on held-out real data. If it performs about as well as the same approach trained on real data, the synthetic set carries the signal that matters.
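A minimal sketch of the train-on-synthetic, test-on-real check follows. The "model" is a single amount cutoff for flagging fraud, and all of the data is made up for illustration; in practice you would substitute your own model and datasets.

```python
def accuracy(cutoff, rows):
    """Share of rows where 'amount >= cutoff' agrees with the fraud label."""
    return sum((amount >= cutoff) == label for amount, label in rows) / len(rows)

def fit_threshold(rows):
    """Pick the observed amount that best separates fraud from non-fraud."""
    candidates = sorted({amount for amount, _ in rows})
    return max(candidates, key=lambda cutoff: accuracy(cutoff, rows))

# Hypothetical toy data as (amount, is_fraud).
real_train = [(20, False), (35, False), (50, False), (400, True), (520, True)]
real_test = [(25, False), (60, False), (45, False), (480, True), (390, True)]
synthetic_train = [(18, False), (42, False), (55, False), (70, False),
                   (410, True), (450, True), (600, True)]

# Train on synthetic, test on real.
tstr = accuracy(fit_threshold(synthetic_train), real_test)
# Baseline: train on real, test on the same held-out real data.
trtr = accuracy(fit_threshold(real_train), real_test)
```

Here both scores come out at 0.8, which is the outcome you hope for: the synthetic training set carried the same signal as the real one. A large gap between `tstr` and `trtr` would suggest that the synthetic data is missing a pattern the task depends on.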

Privacy is the third corner: how confident you can be that the synthetic set doesn't expose anyone in the real data it was built from. For data generated from scratch, that question is moot — there's no real individual to expose. For data modeled on real records, it's the property that takes the most care to measure, and the one with the most at stake.

Synthetic data and privacy: what it does and doesn't guarantee

Synthetic data reduces exposure because net-new records don't map to real people — but the strength of that guarantee depends on how the data was generated. Data generated from scratch contains no real individuals by definition. Data modeled on a real dataset is safer than the original, but a model trained without the right controls can memorize and reproduce fragments of its training data, which means the output can still carry re-identification risk if that risk isn't accounted for. The risk is concrete: a model that overfits can reproduce a rare real record almost verbatim — an unusual salary, a one-of-a-kind address — so "modeled on real data" is not automatically "free of real data."

The safeguards are the generation approach, validation, and — when you derive data from a real dataset — the privacy controls applied to that process. Generating from scratch sidesteps the question, because there's nothing real to leak. When you model real data, the protection comes from controlling what the model is allowed to learn and reproduce — limiting how closely any single real record can shape the output — and then measuring the result: testing whether any generated record is a near-copy of a source record, and whether the dataset reveals who was in the training data. Those are standard checks — near-copy detection and membership-inference testing — and they're the difference between assuming a dataset is safe and showing that it is. Bounding how much any one real record can influence the model during training is the same principle behind privacy-preserving generation: no single person should leave a recognizable fingerprint on the output. The more sensitive the source, the more that measurement earns its place in the workflow.
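Near-copy detection can be approximated by measuring how far each generated record sits from its closest source record. The sketch below assumes numeric features already scaled to a comparable range; the threshold of 0.05 and the sample points are arbitrary choices for illustration. It does not cover membership-inference testing, which needs a separate procedure.

```python
import math

def nearest_distance(record, source):
    """Euclidean distance from one synthetic record to its closest source record."""
    return min(math.dist(record, s) for s in source)

def flag_near_copies(synthetic, source, threshold):
    """Return the synthetic records that sit suspiciously close to a real one."""
    return [r for r in synthetic if nearest_distance(r, source) <= threshold]

# Hypothetical records with two features scaled to the 0-1 range.
source = [(0.10, 0.20), (0.90, 0.80), (0.50, 0.95)]
synthetic = [(0.11, 0.20), (0.40, 0.40), (0.70, 0.10)]

flagged = flag_near_copies(synthetic, source, threshold=0.05)
```

With these values only the first synthetic record is flagged, because it lies about 0.01 away from a source record. The right threshold depends on the data; one common approach is to compare these distances with the distances between real records themselves.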

The formal version of that idea is differential privacy. When you train a generative model with differential privacy, you add a calibrated amount of random noise during training so that no single record can move the model's behavior beyond a fixed, provable bound. The result is a guarantee you can quantify rather than hope for: even an attacker who already knows everything about every other record gains only a bounded advantage in determining whether one specific individual was in the training data. Think of it like a group photo blurred to a precise degree: you can still read the crowd's makeup, the rough age spread, and the overall mood, but you can't pick out any single face. How much blur you apply is set by a privacy budget, often written as epsilon, and that's the dial that controls how strong the guarantee is: the smaller the epsilon, the stronger the guarantee.

That dial comes with a tradeoff worth understanding. Tighter privacy means more noise, which softens the fine-grained correlations that make synthetic data useful; looser privacy preserves more fidelity but weakens the guarantee. Tuning that balance is much of the real work in privacy-preserving generation — and it's a concern only when you model real data. Generating from scratch carries no such tradeoff, because there's no individual whose presence could be revealed in the first place.
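The effect of the epsilon dial is easiest to see on a much simpler task than model training. The sketch below applies the Laplace mechanism, a standard differentially private technique, to a single count. It is an illustration of the privacy and accuracy tradeoff only: differentially private training of generative models uses different machinery, and the function names here are assumptions.

```python
import random

def laplace_noise(scale, rng):
    """The difference of two independent exponential draws is Laplace-distributed."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1 / epsilon, rng)

def mean_abs_error(true_count, epsilon, trials, rng):
    """Average distance between the released count and the true count."""
    return sum(abs(private_count(true_count, epsilon, rng) - true_count)
               for _ in range(trials)) / trials

rng = random.Random(1)
tight = mean_abs_error(100, epsilon=0.1, trials=2000, rng=rng)   # strong privacy
loose = mean_abs_error(100, epsilon=10.0, trials=2000, rng=rng)  # weak privacy
```

With epsilon at 0.1 the released count is off by about 10 on average; with epsilon at 10 it is off by about 0.1. The same tension applies to a privately trained generator: a smaller epsilon buys a stronger guarantee at the cost of blurrier statistics.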

These same properties are what regulators evaluate. Frameworks like the EU's GDPR, HIPAA in US healthcare, and California's CCPA turn on whether a real individual can be re-identified from the data. Under GDPR, data that can't be tied back to a person falls outside the regulation's scope; under HIPAA, de-identification is judged against a defined standard, whether through Safe Harbor or Expert Determination. So the compliance question for synthetic data reduces to the same engineering question as your privacy validation: can any real person be re-identified from what you generated? Generate from scratch and the answer is straightforwardly no. Model sensitive data, and you establish it the way you'd establish any privacy property — by validating the output, not by trusting the process.

Generating synthetic data with Tonic.ai

Tonic Fabricate is Tonic.ai's product for synthetic data generation. Through its data agent, you generate data from scratch, from an uploaded schema, or modeled on an existing database, and get referentially intact output across multiple databases and files — relational data, free text, and mock APIs alike. For complex schemas, Fabricate drafts a generation plan you control step by step, and its Validation Agent reviews and refines what the data agent produces, keeping quality high even from an imprecise prompt.

The platform is designed for developer use cases spanning greenfield development, production-realistic testing, and training-data generation. All of these can begin from the same starting points: a prompt, a schema, or your own data — and when you connect your own data, Fabricate reads from major database platforms like Postgres, MySQL, SQL Server, Oracle, Snowflake, and BigQuery, so modeling an existing database doesn't mean exporting it first.

How far synthetic data can go is no longer a theoretical question. In a Tonic.ai benchmark, the team used Fabricate to generate a complete synthetic corporate email environment — a fictional 100-person company with roughly 1,964 emails on top of a structured metadata layer of timeline events, threads, and cross-references that supports graded tasks, from single-email lookups to multi-hop reasoning. An open-source model fine-tuned only on that synthetic data improved on the real-world Enron email benchmark from 80.5% to 86%, outperforming o3, without training on a single real email. The structured metadata is a big part of why it worked: it let the team build tasks of known difficulty and verify answers against specific emails, giving training and evaluation a cleaner signal than a raw real-world corpus offers.

Whether you're starting from a blank schema, an existing database, or a prompt, Fabricate turns the wait for usable data into a request you can make in minutes — production-realistic, referentially intact, and ready for development, testing, and AI work. Start generating with Tonic Fabricate today, or book a demo to explore the full product suite.

Frequently asked questions

Is synthetic data the same as anonymized data?

No. Anonymization, or de-identification, transforms real records to strip out identifiers, while synthesis generates net-new records that never belonged to a real person. They solve different problems and are often used together: de-identification when you need a safer copy of real data, and synthesis with Tonic Fabricate when you need net-new data or more of it than production holds.

Can synthetic data replace real data?

For software testing, often yes; for AI training, increasingly so. In a Tonic.ai benchmark, a model fine-tuned only on synthetic data beat o3 on real-world email tasks without seeing a real email. Synthetic data fits best in model training and testing, where realistic data matters more than literally real data.

Is synthetic data compliant with privacy regulations like GDPR, HIPAA, and CCPA?

It depends on re-identification risk. These frameworks turn on whether a real individual can be identified from the data. Net-new data generated with Tonic Fabricate contains no real records, so re-identification isn't possible; data modeled on sensitive source data should be validated to confirm the same before you rely on it.

When should you use synthetic data instead of de-identified data?

When you don't have usable production data, or you need more volume, variety, or net-new scenarios than it holds. De-identified data fits when you already have production data and need its exact real-world shape. For testing and QA, teams often use both: de-identified data for fidelity, synthetic data for scale and coverage.

How realistic is synthetic data?

As faithful as the method and source allow. Generating from a model of an existing dataset yields the highest fidelity, because the agent profiles the source and preserves its distributions, correlations, and relationships. From-scratch generation is as realistic as the rules and patterns you describe.
