Technical deep dive

Redacting sensitive free-text data: build vs buy


December 12, 2023

A large language model (LLM) is only as accurate and relevant as its training data. As pre-trained LLMs approach the limit of publicly available data, the next frontier for maximizing the business value of this technology is to leverage your organization’s private data, which is often stored in unstructured formats such as free text.

Unstructured data can contain a tremendous amount of value: chat transcripts with customers can be leveraged for insights on customer sentiment, customer success team efficiency, and more; your internal process documents can be used to build a chatbot for faster retrieval and timely answers. We’re only beginning to scratch the surface of what’s possible. However, fine-tuning an LLM or sending data to a third-party model’s endpoints can result in accidental exfiltration of your organization’s sensitive and proprietary data, a looming and costly risk to keep in mind. A responsible approach is to redact the sensitive data before using it for LLMOps, MLOps, or building your data pipelines.

Building your own redaction system requires overcoming significant technical challenges and investing considerable resources to develop and maintain the process. I’m going to walk you through some of the challenges you may face if your organization decides to go down the build-it-yourself path.

The Build Option

Depending on what type of PII you need to redact, you might be tempted to try to build a rules-based system for detecting and masking sensitive values, essentially predefining the type and form of your sensitive data.

Let’s consider the seemingly simple example of detecting phone numbers in text. Regular expressions can help you find specific strings or patterns. For example, the regular expression \(\d{3}\)-\d{3}-\d{4} identifies phone numbers such as (415)-555-1234. For well-formatted data, this works; however, depending on the provenance of your data, you often cannot assume that it will be so nicely formatted. Consider a transcription of a customer support call: your transcription model may transcribe the phone number a customer provides perfectly, so that the transcript reads “(415)-555-1234”. What’s more likely, however, is that the transcription model returns something like “four one 5555 one 2 three four”.
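As a minimal sketch of this limitation, the snippet below applies the regular expression from the text to a well-formatted sentence and to a transcribed one. The sample sentences and the helper name find_phone_numbers are illustrative assumptions, not part of any library.

```python
import re

# The strict pattern from the text: matches only "(415)-555-1234" style numbers.
PHONE_PATTERN = re.compile(r"\(\d{3}\)-\d{3}-\d{4}")


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring that matches the strict phone pattern."""
    return PHONE_PATTERN.findall(text)


# Illustrative inputs (assumed for this sketch).
clean = "You can reach me at (415)-555-1234 after noon."
transcribed = "You can reach me at four one 5555 one 2 three four after noon."

print(find_phone_numbers(clean))        # ['(415)-555-1234']
print(find_phone_numbers(transcribed))  # []
```

The second call returns nothing, so the transcribed number would pass through unredacted.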

You can spend a lot of time and brain power trying to formulate enough regular expressions to cover all formatting possibilities, but maybe your text contains difficult edge cases such as the one noted above. Even with this simple example of detecting phone numbers in text, we encounter two fundamental problems:

1. To cover sufficiently many edge cases with rule-based detection models, you will probably end up with a high false positive detection rate.
2. Understanding this inaccuracy requires a representative sample of your data that has been carefully annotated with ground truth labels.
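To make the measurement concrete, here is a minimal sketch of how detected spans could be scored against annotated ground truth. It assumes exact-match scoring on (start, end) character offsets, and the span values are hypothetical; real evaluations often also credit partial overlaps.

```python
def precision_recall(
    predicted: set[tuple[int, int]], truth: set[tuple[int, int]]
) -> tuple[float, float]:
    """Score exact-match (start, end) spans against annotated ground truth."""
    true_positives = len(predicted & truth)
    # Convention assumed here: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical spans: what a rule-based detector flagged versus
# what a human annotator marked as phone numbers.
predicted_spans = {(20, 34), (50, 60), (72, 84)}
annotated_spans = {(20, 34), (90, 104)}

precision, recall = precision_recall(predicted_spans, annotated_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.33 recall=0.50
```

Low precision corresponds to the false positives described in the first problem, and neither number can be computed without the annotated spans described in the second.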

A more powerful technique is to use named entity recognition (NER) models, which consider contextual clues to detect specific types of named entities such as names and locations. Highly accurate NER models are available in open source libraries such as spaCy and NLTK.

While these models can work, using a pre-trained model has downsides. First, and most obvious, they might not be trained to detect the specific entity types that are in your data. A thornier problem is that off-the-shelf models might not generalize well to your data if it is significantly different from the training data used to develop the models. For example, spaCy can struggle with detecting named entities in the presence of newlines or other whitespace characters that don’t occur in the OntoNotes training corpus.

To ameliorate some of these issues, you can preprocess/postprocess the data to align it with the hidden assumptions of the model. Combining an NER model with regular expressions (see Microsoft’s Presidio) extends the model’s detection abilities, but then you need to maintain a complex and brittle codebase to handle redactions. And we still haven’t addressed the crucial second problem: how do you know if the performance of this tool is adequate to ensure it will capture all instances of the sensitive entities in your data?
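The sketch below illustrates the kind of glue code this combination requires: spans from a regular expression and from a model are merged, overlaps are resolved, and the text is masked. It is not how Presidio works internally. The toy_name_detector function is a hypothetical stand-in for a real NER model, and the Span type, the labels, and the overlap policy are all assumptions made for illustration.

```python
import re
from typing import Callable, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


PHONE_PATTERN = re.compile(r"\(\d{3}\)-\d{3}-\d{4}")


def regex_detector(text: str) -> list[Span]:
    """Rule-based detector using the strict phone pattern from the text."""
    return [Span(m.start(), m.end(), "PHONE") for m in PHONE_PATTERN.finditer(text)]


def toy_name_detector(text: str) -> list[Span]:
    """Hypothetical stand-in for an NER model: flags a fixed list of names."""
    spans = []
    for name in ("Alice Chen",):
        for m in re.finditer(re.escape(name), text):
            spans.append(Span(m.start(), m.end(), "NAME"))
    return spans


def merge_spans(spans: list[Span]) -> list[Span]:
    """Drop overlapping spans, keeping the earliest (longest on ties)."""
    merged: list[Span] = []
    for span in sorted(spans, key=lambda s: (s.start, -(s.end - s.start))):
        if merged and span.start < merged[-1].end:
            continue  # overlaps a span that was already kept
        merged.append(span)
    return merged


def redact(text: str, detectors: list[Callable[[str], list[Span]]]) -> str:
    """Run every detector, merge the results, and mask each span with its label."""
    spans = merge_spans([s for detect in detectors for s in detect(text)])
    # Replace from right to left so earlier offsets stay valid.
    for span in reversed(spans):
        text = text[: span.start] + f"[{span.label}]" + text[span.end :]
    return text


sample = "Alice Chen asked us to call (415)-555-1234 tomorrow."
print(redact(sample, [regex_detector, toy_name_detector]))
# [NAME] asked us to call [PHONE] tomorrow.
```

Even in this toy form, the overlap policy is a design decision that has to be tested and maintained, which is where much of the brittleness comes from.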

The inevitable conclusion is that an in-house effort needs resources that are dedicated to manually annotating test data with ground truth labels. Training or fine-tuning your own NER models requires even more annotations and, depending on the sensitivity of your data, it may not be possible to use third-party annotation services.

Doesn’t GPT solve this?

The zero-shot and few-shot learning capabilities of GPTs allow for a great deal of flexibility in expressing the classification task, but in general these models perform worse than fine-tuned, task-specific models on narrow predictive tasks such as NER. Put simply, the publicly available GPT models are not generalizable to this particular problem.
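For illustration, here is a minimal sketch of how a few-shot NER task might be expressed as a prompt and how a model’s reply might be parsed. It deliberately calls no model API. The prompt wording, the example, the entity labels, and the JSON reply format are all assumptions, and real replies may not follow the requested format.

```python
import json

# Hypothetical few-shot example; real ones should come from your own annotated data.
FEW_SHOT_EXAMPLES = [
    (
        "Call Maria at (415)-555-1234.",
        [
            {"text": "Maria", "label": "NAME"},
            {"text": "(415)-555-1234", "label": "PHONE"},
        ],
    ),
]


def build_prompt(text: str) -> str:
    """Assemble an instruction, the worked examples, and the new text to label."""
    lines = [
        "Extract every NAME and PHONE entity from the text.",
        'Answer with a JSON list of objects with keys "text" and "label".',
        "",
    ]
    for example_text, entities in FEW_SHOT_EXAMPLES:
        lines += [f"Text: {example_text}", f"Entities: {json.dumps(entities)}", ""]
    lines += [f"Text: {text}", "Entities:"]
    return "\n".join(lines)


def parse_entities(response: str) -> list[dict]:
    """Parse a reply, returning an empty list when it is not the expected JSON."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [
        e for e in parsed
        if isinstance(e, dict) and all(key in e for key in ("text", "label"))
    ]


prompt = build_prompt("Please phone Sam on (212)-555-0000.")
print(prompt)
```

Changing the task only means editing the instruction and examples, which is the flexibility described above. The defensive parsing hints at the cost: the output has to be validated before it can drive redaction.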

Several preprints have suggested that GPT-3/4 can provide state-of-the-art PII detection. For example, Sparks of Artificial General Intelligence claims that GPT-4 outperforms Presidio at detecting named entities. However, more thorough evaluations on standard benchmarks suggest that lightweight, focused models still dominate.

Fine-tuning is a plausible, although expensive, option, but you often end up with a model that is as accurate as lightweight models like BERT, but 100-1000 times slower:

[Table: LLM performance compared across a number of datasets. Adapted from *UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition* (Zhou, Zhang, Gu, Chen, Poon).]
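To put the 100-1000x slowdown in perspective, here is a back-of-the-envelope calculation. The corpus size and the 10 ms per-document baseline are hypothetical inputs chosen only for illustration, not measurements from any benchmark.

```python
def hours_to_process(
    num_documents: int, seconds_per_document: float, slowdown: float = 1.0
) -> float:
    """Total sequential processing time in hours for a corpus."""
    return num_documents * seconds_per_document * slowdown / 3600


# Hypothetical inputs: 1,000,000 documents at 10 ms each on a lightweight model.
NUM_DOCS = 1_000_000
BASE_SECONDS = 0.01

for slowdown in (1, 100, 1000):
    hours = hours_to_process(NUM_DOCS, BASE_SECONDS, slowdown)
    print(f"{slowdown:>5}x slower: {hours:,.1f} hours")
```

Under these assumptions, a job that takes under three hours on a lightweight model takes roughly 278 hours at 100x and roughly 2,778 hours at 1000x, before any parallelism.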

At the end of the day, using LLMs to detect PII is inefficient: you may get small improvements over open-source models, but the time and cost of using an LLM are much higher. However, as we’ll show in a follow-up article, it’s possible to leverage LLMs to generate synthetic data for training smaller NER models that are cheaper to train and perform well.

The Buy Option

As champions of data privacy, we wanted to give our customers a better option for protecting unstructured free-text data and practicing responsible AI development and data stewardship. We built Tonic Textual to provide highly accurate detection and redaction of sensitive information, adaptable to any flavor and shape of free-text data.

Tonic Textual is a state-of-the-art, extensible text data redaction platform built for enterprise scale. Our models are trained on a large corpus of carefully annotated, diverse training data and can natively detect a wide variety of entity types. You can easily integrate Tonic Textual into your data pipelines and LLM workflows with our Python SDK, manage redaction policies via a streamlined UI, replace redactions with contextually relevant synthetic data, and use our Custom Models workflow to train your own lightweight NER models to detect idiosyncratic forms of sensitive data.

If your organization is committed to safely and responsibly leveraging generative AI, buying an out-of-the-box sensitive text redaction solution reduces the time and training costs required to develop a functional solution in-house. Your development team’s time is freed up to focus on the end-user experience and you can still realize the reduction in exfiltration risk from redacting your sensitive data before sending it through your pipelines.

Get started with a free account of Tonic Textual today and protect your data with just a few clicks or lines of code.


Ander Steele, PhD

Head of AI

Ander Steele, Head of AI at Tonic.ai, is at the forefront of building privacy-preserving data and machine learning pipelines. He tackles complex AI challenges such as synthetic-data generation, evaluation automation, and scalable inference, with a particular focus on making sensitive data safely useful for real-world AI systems. A mathematician by training, Ander holds a Ph.D. in Mathematics and Statistics from Boston University and a B.S. in Applied Mathematics from Georgia Institute of Technology. His diverse background includes academic positions as a Visiting Assistant Professor at UC Santa Cruz and a Postdoctoral Fellow at the University of Calgary, as well as industry experience as a Senior Data Scientist at Fullpower Technologies, where he applied deep learning to analyze sensor data related to sleep and health disorders. At Tonic.ai, he is responsible for the development of the company’s AI models, ensuring secure and compliant data use for AI training. Off the clock, you’ll find him mountain biking or running the redwoods around Santa Cruz.
