Reduce Your Attack Surface by Replacing PII With Synthetic Data | Blog

For years we've argued that the most overlooked part of any company's attack surface isn't the production database. It's everything downstream: staging environments, QA databases, CI/CD pipelines, AI training sets, developer laptops. Most security investment goes to the front door. Most of the PII lives in the side rooms.

Anthropic's Claude Mythos Preview is making that argument impossible to ignore.

Mythos is a frontier AI model capable of autonomously discovering and exploiting software vulnerabilities at a speed and scale no human red team can match. Through Project Glasswing, Anthropic has given AWS, Apple, Microsoft, CrowdStrike, Google, Cisco, and others direct access to the model for defensive security work. Mythos is surfacing genuine zero-day vulnerabilities (flaws no one knew existed) and, more disruptively, chaining together lower-severity weaknesses into novel exploit paths that no human team had identified.

Mythos didn't create new risk. It's revealing how deep the existing risk goes.

The individual bugs Mythos is chaining together may have existed for years. The exploit paths assembled from them are entirely new. The Glasswing partners — AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks — are actively uncovering vulnerabilities in their own infrastructure right now. As of last this week, Anthropic has expanded Glasswing's disclosure rules to let partners share threat intelligence beyond the original group. Remediation is rippling across the industry.

For security and compliance leaders, the conclusion is the same regardless of company size: you can't patch your way to safety when AI is finding zero-days and assembling novel exploit chains faster than any team can remediate them. If you can't defend every surface, reduce the number of surfaces worth attacking.

Your data is your attack surface, not just your infrastructure

Most of the Mythos conversation has centered on network security: patching faster, hardening firewalls, adopting zero trust. Necessary work, but it addresses only half the problem. Your attack surface isn't just your infrastructure. It's your data.

Production databases get the most security attention. The firewalls, the monitoring, the access controls. But production isn't where the bulk of your PII exposure sits. It sits in the copies: staging databases, QA environments, CI/CD pipelines, developer laptops, analytics warehouses, AI training datasets. In our work with enterprise customers, the ratio of non-production copies of customer data to production copies is routinely [EDITOR: insert real ratio, e.g. 10:1, 20:1] — and that's normal, not unusual.

These lower environments are typically less secured, less monitored, and accessible to more people. A staging server running an unpatched framework with a full copy of your customer database is a breach waiting to happen, and Mythos-class AI makes it easier to find. The interesting question isn't whether attackers get in. It's what they find when they do.

Every breach report quotes an average cost. The IBM 2025 Cost of a Data Breach Report puts the global average at $4.44M and $10.22M in the United States, with customer PII compromised in more than half of all breaches studied. Those numbers assume you had real data to lose. The point of what we're describing is to make that assumption no longer true.

Make the data worthless to attackers

The thesis is simple: if you remove PII from every environment that isn't production, you dramatically reduce the damage potential of any breach. A staging server that gets compromised but contains only de-identified or synthetic data is an operational nuisance, not a regulatory event. This isn't about replacing your security stack. It's about reducing the blast radius.

Two complementary approaches make this work:

De-identification starts with real data and transforms it, removing or replacing PII while preserving the structure, relationships, and statistical properties that make the data useful for development and testing.
Synthetic data generation starts from a model or from nothing and creates realistic data that mirrors real-world patterns without containing any real records.

Securing structured data in lower environments

For structured databases, de-identification replaces sensitive values while maintaining referential integrity across tables. Modern de-identification preserves data utility, so developers and QA teams can test effectively against realistic data without any PII exposure.

Securing unstructured data in AI workflows

Free-text data flowing into AI training pipelines, RAG systems, and LLM workflows can contain names, medical records, financial details, and other sensitive information embedded in documents and transcripts. This text needs to be redacted or synthesized before it enters any AI workflow. Otherwise, you're training models on data that could surface real PII in their outputs.

Replacing production data entirely with synthetic data

For some environments and use cases, the most effective approach is to skip production data altogether. Synthetic data generated from scratch can provide the volume, variety, and realism that dev and AI teams need without ever touching a real record. No production data dependency means no breach payload.

"But won't this slow my developers down?"

This is the objection that's killed every previous attempt to lock down non-production data, and it deserves a direct answer.

Ten years ago, yes. Early data masking broke joins, mangled distributions, and left QA teams unable to reproduce production bugs. That isn't the technology we're describing. Modern de-identification preserves referential integrity across tables, statistical properties within columns, and the edge cases that real production data actually exhibits. Modern synthetic data generation matches real-world patterns closely enough that engineering teams don't notice the difference in their day-to-day workflows. The choice is no longer between safe data and useful data. You can have both.

How Tonic.ai shrinks your attack surface

Our product suite maps directly to the approaches above:

Tonic Structural de-identifies production databases for safe use in lower environments, preserving referential integrity and data utility across your schemas.
Tonic Textual detects and redacts sensitive information in unstructured text, so PII doesn't leak into AI training, RAG systems, or any workflow where free-text data is in play.
Tonic Fabricate generates realistic synthetic data from scratch, with no production data dependency at all.

In the age of Mythos, you can't patch your way to safety fast enough. But you can make the data attackers would steal worthless to them. That's what reducing your attack surface actually looks like.

Your attack surface is your data. Mythos is the proof.

Mythos didn't create new risk. It's revealing how deep the existing risk goes.

Your data is your attack surface, not just your infrastructure

Make the data worthless to attackers

Securing structured data in lower environments

Securing unstructured data in AI workflows

Replacing production data entirely with synthetic data

"But won't this slow my developers down?"

How Tonic.ai shrinks your attack surface

Make your non-production data worthless to attackers.

Related blog posts

Generate referentially intact synthetic data across your entire data ecosystem

Benchmarking OpenAI's Privacy Filter: What it gets right, and where PII detection still needs real data

Inference protection for LLMs: Keeping sensitive data out of AI workflows

Make your sensitive data usable for testing and development.