Data de-identification

How to de-identify insurance claims and documents with Tonic Textual

Whit Moses
September 18, 2025

If this headline caught your eye, chances are you’re an innovation leader in insurance facing a familiar challenge: how to put your mountains of internal data to work. Claims forms, customer correspondence, and policy documents are packed with PII, yet also hold the real-world context AI models need to detect patterns, understand language, and improve decision-making in underwriting, fraud detection, and claims automation.

The potential for generative AI to unlock value within the industry is massive – informing underwriter decision-support tools, streamlining claims triage, and powering LLM-based agents that answer policy inquiries are all realistic initiatives that require real-life data to be effective. But there’s a catch: training AI models directly on field documents that contain sensitive customer data risks exposing personally identifiable information (PII) in unintended ways.

We built Tonic Textual for this very challenge – to de-identify sensitive data from unstructured documents while preserving the context AI models need to learn and deliver value.

What does de-identification mean for insurers?

When it comes to protecting sensitive customer information in AI initiatives, insurers have only two viable strategies: redaction and synthesis. Without one – or a combination of both – personal information inevitably risks slipping into training datasets and models, where it becomes nearly impossible to control or remove. These approaches are not just best practices; they are the baseline for keeping customer trust intact and meeting compliance obligations.

  • Redaction: With redaction, sensitive details are removed or masked – such as blacking out names, policy numbers, or Social Security numbers, or swapping them with generic placeholders (e.g., [NAME]). The limitation? While highly effective at eliminating risk, redaction often disrupts context and can strip documents of the narrative flow AI models rely on to learn meaningfully.
  • Synthesis: With synthesis, sensitive values (like “John Doe” or “123-45-6789”) are replaced with realistic but fictitious alternatives. This preserves context, coherence, and consistency across documents, allowing AI systems to operate on natural-looking data. The trade-off is complexity: generating synthetic data that remains both safe and useful requires sophisticated models and fine-tuned workflows. The sketch below shows the two strategies side by side.
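
To make the difference concrete, here’s a minimal, self-contained Python sketch – not Tonic Textual itself – that applies both strategies to one claim sentence. The patterns and replacement values are illustrative only.

```python
import re

claim = "Claimant John Doe (SSN 123-45-6789) filed claim CLM-2024-0042."

# Redaction: strip sensitive values down to generic placeholders.
redacted = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", claim)
redacted = redacted.replace("John Doe", "[NAME]")

# Synthesis: swap sensitive values for realistic but fictitious ones.
synthesized = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "987-65-4321", claim)
synthesized = synthesized.replace("John Doe", "Jane Rivera")

print(redacted)     # Claimant [NAME] (SSN [SSN]) filed claim CLM-2024-0042.
print(synthesized)  # Claimant Jane Rivera (SSN 987-65-4321) filed claim CLM-2024-0042.
```

Note how synthesis keeps the sentence readable and statistically realistic, while redaction guarantees the original value is gone.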

Together, redaction and synthesis provide insurers with practical pathways to make sensitive documents AI-ready. Yet the real challenge isn’t just choosing the right strategy – it’s executing it reliably, accurately, and at scale across the massive volumes of unstructured data that flow into the organization every day.

The technology under the hood of Tonic Textual 

Tonic Textual’s ability to handle the complexity of unstructured insurance documents at scale comes from combining advanced parsing with custom-trained AI models, built to tackle the near-limitless variety of unstructured text. Here’s how the technology works.

  • Smart Document Parsing: Tonic Textual extracts free-text content from a variety of formats – PDFs, DOCX, TXT, and more – using a blend of file parsing and optical character recognition (OCR), a technology that converts documents like claim forms or policy records into searchable, machine-readable text that AI systems can analyze.
  • Custom NER Models: Tonic trains named entity recognition (NER) models – AI systems built to automatically identify and classify sensitive details, or entity types, like names, medical codes, policy numbers, and addresses – so that they can be redacted or transformed before use in AI workflows. The sketch below shows the core pattern.
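
Conceptually, the NER step emits labeled character spans, and de-identification then rewrites each span according to a per-entity-type strategy. Here’s a simplified Python sketch of that pattern, with hand-written spans standing in for real model output:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    start: int   # character offset where the entity begins
    end: int     # character offset where the entity ends
    label: str   # entity type assigned by the NER model

doc = "Patient Eric Smith, policy POL-99812, seen at Mercy General."

# Hand-written spans standing in for real NER model output.
entities = [
    Entity(8, 18, "NAME"),
    Entity(27, 36, "POLICY_NUMBER"),
    Entity(46, 59, "LOCATION"),
]

# Per-entity-type strategy: synthesize some types, redact others.
strategy = {
    "NAME": "Daniel Uva",          # synthesized replacement
    "POLICY_NUMBER": "[REDACTED]", # pure redaction
    "LOCATION": "Lakeview Clinic", # synthesized replacement
}

# Rewrite right-to-left so earlier offsets stay valid as the string changes.
for ent in sorted(entities, key=lambda e: e.start, reverse=True):
    doc = doc[:ent.start] + strategy[ent.label] + doc[ent.end:]

print(doc)  # Patient Daniel Uva, policy [REDACTED], seen at Lakeview Clinic.
```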

Using Textual to de-identify insurance documents 

With a basic understanding of how Textual works, we can now walk through a simple workflow to de-identify a collection of insurance documents. So that you can explore this use case without touching internal documents, we’ve created a collection of true-to-life faux policy documents that you can download and use yourself.

Create a dataset of insurance documents 

Since you will be de-identifying multiple documents, the first step is to create a new Dataset. A Dataset is a collection of documents that share a de-identification strategy. For example, if you were working with a set of insurance claims, all of which contained a patient name and policy number, and you wanted to de-identify those entities across every document, a Dataset would let you do so quickly and consistently.
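
If you prefer to script this step, Tonic Textual also offers a Python SDK (`pip install tonic-textual`). Here’s a minimal sketch; the class and method names (`TextualNer`, `create_dataset`) are stated as assumptions, so verify them against the official SDK documentation before use.

```python
# Sketch using the Tonic Textual Python SDK (pip install tonic-textual).
# Class and method names are assumptions -- verify against the SDK docs.
from tonic_textual.redact_api import TextualNer

textual = TextualNer(
    "https://textual.tonic.ai",  # your Textual instance URL
    api_key="YOUR_API_KEY",      # your personal API key
)

# A dataset groups files that share one de-identification strategy.
dataset = textual.create_dataset("insurance-claims")
```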

When creating the dataset, we specify a name as well as the source of the files we will upload for redaction – in this case, directly from our local device. Likewise, we specify an output that matches the original file format – here, a new PDF – though you could alternatively export the redacted data as JSON.

After you have named your dataset, you can upload the files you wish to de-identify; after a few moments they will populate on the left-hand side, along with a word count and a summary of the sensitive entities detected in each. Once this is complete, you can select which entity types to de-identify as part of a bulk edit, along with your de-identification strategy for each: pure redaction or synthesis.
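
Both actions are scriptable as well. Continuing the hypothetical SDK sketch above – the `add_file` and `edit` calls and the `generator_config` entity names are assumptions here, so check the Textual docs for the exact identifiers your instance expects:

```python
# Continuing the hypothetical SDK sketch; the method names and the
# generator_config keys/values below are assumptions, not verified API.
with open("claims/er_visit_claim.pdf", "rb") as f:
    dataset.add_file(file_name="er_visit_claim.pdf", file=f)

# Bulk strategy: synthesize names and locations, redact numeric identifiers.
dataset.edit(generator_config={
    "NAME_GIVEN": "Synthesis",
    "NAME_FAMILY": "Synthesis",
    "LOCATION": "Synthesis",
    "NUMERIC_VALUE": "Redaction",
})
```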

Previewing initial results

For each redacted file, you can select and view a preview of the de-identified output. For this example, let’s zoom in on the insurance claim for the emergency room visit. As part of our bulk de-identification strategy, we chose to synthesize names (given and family) and locations, and to redact numeric PII. The output is what you see here: on the right-hand side, the patient’s and doctor’s names have been synthesized (“Eric Smith” is now “Daniel Uva”) and all numbers and codes have been replaced with a black box.

Adjusting the de-identification to meet your needs 

For some use cases, this initial output may be sufficient. Often, however, you will need to fine-tune the de-identification so that important context is not lost, since missing context can skew downstream results.

In the example above, we completely redacted all numeric PII, which encompasses policy numbers, claim numbers, and diagnosis codes. Let’s say, for the sake of this exercise, that we only want to redact policy numbers, while synthesizing a new claim number and leaving diagnosis codes untouched. We can do this by making a few modifications directly in the de-identified preview of the document.

Starting with the claim number, which was originally redacted, we can click directly on this entity and adjust our de-identification strategy. The result is a new document with a synthesized claim number instead of just a black box.

Likewise, because diagnosis codes are not considered sensitive for this specific exercise, we can turn de-identification off for that field entirely, ensuring the codes appear consistently throughout this document and the others in the dataset.
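
In configuration terms, this fine-tuning amounts to overriding the bulk strategy for individual entity types. A hypothetical illustration of the resulting per-entity settings – the keys and option names are placeholders, not verified API values:

```python
# Hypothetical per-entity overrides layered on top of the bulk strategy;
# the entity keys and option names are placeholders.
overrides = {
    "POLICY_NUMBER": "Redaction",  # keep policy numbers fully masked
    "CLAIM_NUMBER": "Synthesis",   # replace with a realistic fake value
    "DIAGNOSIS_CODE": "Off",       # leave diagnosis codes untouched
}
dataset.edit(generator_config=overrides)
```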

Exporting the final dataset 

Once you have adjusted and applied your de-identification strategy across the entire dataset, you can push your sanitized files downstream for use by other teams or by the broader initiative (e.g., to fine-tune a model). This is easy – simply click the “Download All Files” button, and you will be prompted to download a .zip file containing the entire de-identified dataset.
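
If you’re handing that archive to another team or a training pipeline, a few lines of standard-library Python can verify its contents first – nothing Textual-specific here, and the archive name is illustrative:

```python
import zipfile

# Inspect the downloaded archive before passing it downstream.
with zipfile.ZipFile("deidentified_dataset.zip") as archive:
    for name in archive.namelist():
        print(name)                      # confirm every expected file is present
    archive.extractall("deidentified/")  # unpack for the downstream pipeline
```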

Why this matters for the insurance industry 

The insurance industry sits at the intersection of sensitive personal data and fast-moving AI innovation. Every claim file, policy note, or customer conversation holds insights that can drive smarter models and more efficient operations – but only if that data can be safely unlocked. With an effective, trusted de-identification strategy that works at scale, the industry is poised to unlock that value without sacrificing the privacy of its policyholders.

  • Boost AI innovation: Power LLMs or claims-processing bots without exposing real customer data.
  • Stay compliant: Keep pace with privacy mandates and regulations like HIPAA.
  • Enable secure collaboration: Share claim narratives or sensitive files with partners, carriers, or reinsurers without risk of PII exposure or abuse.

By embracing scalable de-identification, insurers can finally put their vast stores of unstructured data to work – fueling AI, streamlining claims, and enabling collaboration – while preserving the privacy and trust required by law and essential to a strong policyholder relationship.

If you’re interested in experimenting with the documents in this blog or even using some of your own, Tonic Textual is available for a free trial – set up an account and get started in minutes. You can also book a demo for a more in-depth walkthrough and speak with an expert from our team about your specific requirements.

Whit Moses
Senior Product Marketing Manager