If this headline caught your eye, chances are you’re an innovation leader in insurance facing a familiar challenge: how to put your mountains of internal data to work. Claims forms, customer correspondence, and policy documents are packed with personally identifiable information (PII), yet they also hold the real-world context AI models need to detect patterns, understand language, and improve decision-making in underwriting, fraud detection, and claims automation.
The potential for generative AI to unlock value in the industry is massive: informing underwriter decision-support tools, streamlining claims triage, and deploying LLM-powered agents to answer policy inquiries are all realistic initiatives, and all of them need real-world data to be effective. But there’s a catch: training AI models directly on real documents from the field risks exposing sensitive customer PII in unintended ways.
We built Tonic Textual for this very challenge – to de-identify sensitive data from unstructured documents while preserving the context AI models need to learn and deliver value.
When it comes to protecting sensitive customer information in AI initiatives, insurers have only two viable strategies: redaction and synthesis. Without one – or a combination of both – personal information inevitably risks slipping into training datasets and models, where it becomes nearly impossible to control or remove. These approaches are not just best practices; they are the baseline for keeping customer trust intact and meeting compliance obligations.
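To make the distinction concrete, here’s a purely illustrative example; the sentence, names, and placeholder formats below are invented for this post and are not Textual’s exact output:

```python
# Purely illustrative: a hypothetical line from a claim note, under each strategy.
original = "Claimant Eric Smith was treated at Mercy General Hospital on 03/14/2024."

# Redaction replaces each detected entity with an opaque placeholder.
redacted = "Claimant [NAME_GIVEN] [NAME_FAMILY] was treated at [ORGANIZATION] on [DATE_TIME]."

# Synthesis replaces each detected entity with a realistic but fake value,
# preserving the shape and context of the original sentence.
synthesized = "Claimant Daniel Uva was treated at Lakeside Medical Center on 07/02/2023."
```

Redaction maximizes safety by removing the values outright; synthesis keeps documents readable and statistically useful by swapping in realistic stand-ins.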
Together, redaction and synthesis give insurers practical pathways to make sensitive documents AI-ready. Yet the real challenge isn’t just choosing the right strategy – it’s executing it reliably, accurately, and at scale across the massive volumes of unstructured data that flow into the organization every day.
Tonic Textual’s ability to handle the complexity of unstructured insurance documents at scale comes from combining advanced document parsing with custom-trained AI models built to handle the enormous variety found in unstructured text. Here’s how the technology works.
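For readers who prefer to work programmatically, the same detect-and-replace flow is exposed through Textual’s Python SDK. The snippet below is a minimal sketch based on our reading of the SDK’s free-text API; the URL, API key, and sample text are placeholders, and the exact placeholder format in the output will differ.

```python
# pip install tonic-textual
from tonic_textual.redact_api import TextualNer

# Point the client at your Textual instance (URL and API key are placeholders).
textual = TextualNer("https://textual.tonic.ai", api_key="<your-api-key>")

# Textual parses the text, detects sensitive entities with its NER models,
# and returns a de-identified version of the string.
redaction = textual.redact(
    "Claimant Eric Smith filed a claim after an emergency room visit in Atlanta."
)
print(redaction.redacted_text)
# Expect something along the lines of:
# "Claimant [NAME_GIVEN_...] [NAME_FAMILY_...] filed a claim after an emergency room visit in [LOCATION_CITY_...]."
```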
With a basic understanding of how Textual works, we can now walk through a simple workflow to de-identify a collection of insurance documents. So that you can explore this use case without touching your own internal documents, we’ve created a collection of true-to-life faux policy documents that you can use for yourself:
Since you will be de-identifying multiple documents, the first step is to create a new Dataset. A Dataset is a collection of documents that share a de-identification strategy. For example, if you were working with a set of insurance claims that all contained a patient name and policy number, and you wanted to de-identify those entities across every document, a Dataset lets you do so quickly and consistently.
When creating the Dataset, we specify a name and the source of the files we will upload for redaction – in this case, files uploaded directly from our local device. We also specify the output format, which here will match the original file format (a new PDF); alternatively, you could export the redacted data as JSON.
After you have named your Dataset, you can upload the files that you wish to de-identify; after a few moments they will populate on the left-hand side, along with a word-count analysis and a summary of the sensitive entities detected. Once this is complete, you can select which entity types you want de-identified as part of a bulk edit, along with your de-identification strategy for each: pure redaction or synthesis.
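If you’d rather script this step than use the UI, the SDK exposes Dataset operations as well. The sketch below assumes methods along the lines of create_dataset and add_file, per our reading of the SDK documentation; exact signatures may differ between releases, and the file paths are hypothetical.

```python
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", api_key="<your-api-key>")

# Create a Dataset: a collection of documents that share one de-identification strategy.
# (Method names are our assumption of the SDK surface; verify against the current docs.)
dataset = textual.create_dataset("faux-insurance-policies")

# Upload the documents to de-identify (hypothetical local paths).
for path in ["claims/er_visit_claim.pdf", "policies/auto_policy.pdf"]:
    dataset.add_file(path)
```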
For each redacted document, you can select and view a preview of the de-identified output. For this example, let’s zoom in on the insurance claim for the emergency room visit. As part of our bulk de-identification strategy, we chose to synthesize names (given and family) and locations, and to redact numeric PII. The output is what you see here: on the right-hand side, the patient’s and doctors’ names have been synthesized (“Eric Smith” is now “Daniel Uva”) and all numbers and codes have been replaced with a black box.
For some use cases, this initial output may be sufficient. Often, however, you will need to fine-tune the de-identification so that important context is not lost, since missing context can degrade downstream results.
In the example above, we completely redacted all numeric PII, which encompasses policy numbers, claim numbers, and diagnosis codes. Let’s say, for the sake of this exercise, that we want to redact only the policy number, synthesize a new claim number, and leave diagnosis codes untouched. We can do this by making a few modifications directly in the de-identified preview of the document.
Starting with the claim number, which was originally redacted: we can click directly on this entity and adjust its de-identification strategy. The result is a new document with a synthesized claim number instead of a black box.
Likewise, because diagnosis codes are not considered sensitive for this exercise, we can turn de-identification off for this field entirely, ensuring that these codes appear unchanged and consistently throughout this document and the others in the dataset.
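In SDK terms, these per-entity choices map to a generator configuration in which each entity type is set to Redaction, Synthesis, or Off. The sketch below mirrors the strategy described above against a free-text snippet; the labels for policy numbers, claim numbers, and diagnosis codes are illustrative assumptions and may not match Textual’s built-in entity types, so in practice you would use the labels Textual actually reports for your documents.

```python
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", api_key="<your-api-key>")

# Per-entity strategy. The last three labels are assumptions for illustration only.
generator_config = {
    "NAME_GIVEN": "Synthesis",
    "NAME_FAMILY": "Synthesis",
    "LOCATION_CITY": "Synthesis",
    "POLICY_NUMBER": "Redaction",   # assumed label: keep policy numbers fully masked
    "CLAIM_NUMBER": "Synthesis",    # assumed label: replace with a realistic fake value
    "DIAGNOSIS_CODE": "Off",        # assumed label: leave diagnosis codes untouched
}

redaction = textual.redact(
    "Claim 48-2219 for Eric Smith, policy P-88431, diagnosis code S93.401A.",
    generator_config=generator_config,
)
print(redaction.redacted_text)
```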
Once you have adjusted and applied your de-identification strategy across your entire dataset, you can push the sanitized files downstream for use by other teams or for the broader initiative at hand (e.g., fine-tuning a model). This is easy: simply click the “Download All Files” button, and you will be prompted to download a .zip file containing the entire de-identified dataset.
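If you want the de-identified content programmatically rather than as a .zip from the UI, the SDK can, to our understanding, fetch a Dataset’s de-identified text directly; the method names below (get_dataset, fetch_all_df) reflect our reading of the SDK docs and may differ in your version.

```python
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", api_key="<your-api-key>")
dataset = textual.get_dataset("faux-insurance-policies")

# Pull the de-identified contents of every file in the Dataset as a pandas DataFrame
# (method name per our reading of the SDK docs; verify against the current release).
df = dataset.fetch_all_df()
df.to_csv("deidentified_policies.csv", index=False)
```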
The insurance industry sits at the intersection of sensitive personal data and fast-moving AI innovation. Every claim file, policy note, or customer conversation holds insights that can drive smarter models and more efficient operations – but only if that data can be safely unlocked. With an effective, trusted de-identification strategy that works at scale, the industry is poised to capture that value without sacrificing the privacy of its policyholders.
By embracing scalable de-identification, insurers can finally put their vast stores of unstructured data to work – fueling AI, streamlining claims, and enabling collaboration – while preserving the privacy and trust required by law and essential to strong policyholder relationships.
If you’re interested in experimenting with the documents in this blog, or even using some of your own, Tonic Textual is available for a free trial – set up an account and get started in minutes. You can also book a demo to get a more in-depth walkthrough and speak with an expert from our team about your specific requirements.