
How to de-identify legal documents with Tonic Textual

Adam Kamor, PhD
April 15, 2024

    Picture this: you work at a large legal practice and you want to help your staff quickly access the combined knowledge of decades of legal work and case law. Generative AI is the obvious answer. You could fine-tune a model on a knowledge base crafted from your firm’s past legal work to create an AI assistant that quickly drafts common legal documents in your firm’s tone and style, or answers questions about prior case law without anyone poring over thousands of pages of court transcripts and legal opinions. But, as you probably already know, there’s a catch: training LLMs on your private data comes with the risk of the model accidentally exposing sensitive information from its training data in the responses it provides. We built Tonic Textual to eliminate this risk by de-identifying text documents for use in model training.

    After spending years de-identifying structured data from databases with Tonic Structural, we created Textual to handle the complex process of de-identifying unstructured data from document files, like PDFs and Microsoft Word documents. With Textual, we’ve expanded the realistic de-identification capabilities that we’re known for to a number of interesting new use cases.

    What do we mean by de-identification?

    Before we dive into using Textual, it’s important to understand the flavors of de-identification that Textual offers. At a high level, there are two: redacting sensitive data outright, or synthesizing realistic data to replace it.

    • Redacting data
      • This is the process of removing sensitive data from the original text entirely. For documents like PDFs and .docx files, Textual replaces sensitive text with black boxes to indicate that content was present but may not be seen. For other text-based file types, like .txt or .csv, Textual redacts sensitive data with a token representing the entity type that was removed (e.g., a first name becomes [NameGiven_0842903]).
    • Synthesizing fake data to replace real data
      • In this scenario, Textual replaces sensitive data with fake data of the same entity type. For example, a name like “Abraham Lincoln” would be replaced with a fake name like “John Doe”. Textual keeps track of the data it replaces throughout a document so that all references stay consistent (a minimal sketch of this idea follows below).
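
    To make that consistency guarantee concrete, here’s a minimal sketch of the idea in Python. This is purely illustrative, not Textual’s actual implementation (which isn’t public): it simply remembers which fake value was assigned to each real value, so repeated mentions map to the same replacement.

```python
import random

# Illustrative only: remember which fake value replaced each real value so
# that every mention of "Abraham Lincoln" maps to the same fake name.
FAKE_NAMES = ["John Doe", "Jane Roe", "Alex Smith"]  # hypothetical pool

class ConsistentSynthesizer:
    def __init__(self):
        self.mapping = {}  # real value -> fake value

    def synthesize(self, real_value: str) -> str:
        if real_value not in self.mapping:
            self.mapping[real_value] = random.choice(FAKE_NAMES)
        return self.mapping[real_value]

synth = ConsistentSynthesizer()
print(synth.synthesize("Abraham Lincoln"))  # e.g. "Jane Roe"
print(synth.synthesize("Abraham Lincoln"))  # the same fake name, every time
```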

    Deciding between redaction and synthesis depends on your specific use case and privacy goals. Textual can be configured to use any combination of the two, redacting certain types of data while synthesizing other types of sensitive data within the same document.
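
    As a rough sketch of what mixing the two approaches looks like with the Python SDK: the tonic_textual package exposes a redact call that accepts a per-entity-type configuration. Treat the class, method, and entity-type names below as assumptions and check the current SDK docs for the exact API.

```python
# Sketch only: class, method, and entity-type names are assumptions; verify
# against the current tonic_textual SDK documentation.
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", "<your-api-key>")

result = textual.redact(
    "Call Abraham Lincoln at 555-0123 regarding the Springfield filing.",
    generator_config={
        "NAME_GIVEN": "Synthesis",    # replace with a fake name
        "PHONE_NUMBER": "Redaction",  # replace with a [PhoneNumber_...] token
    },
)
print(result.redacted_text)
```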

    How are custom Named Entity Recognition models built?

    Under the hood, Textual uses custom, in-house machine learning models that perform Named Entity Recognition (NER) to detect dozens of sensitive entity types in free-text data. To extract the text content of various documents, we’ve built tools that parse documents into chunks of plain text. This extraction combines file-parsing techniques with optical character recognition (OCR) to convert disparate file formats into a common plain-text representation.
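
    Textual’s extraction pipeline itself is proprietary, but the general technique is easy to sketch with open-source tools: read the embedded text layer when one exists, and fall back to OCR for scanned pages. The sketch below assumes pypdf, pdf2image, and pytesseract; it is not what Textual runs internally.

```python
# A minimal sketch of OCR-augmented text extraction, not Textual's pipeline.
# Requires: pip install pypdf pdf2image pytesseract (plus poppler & tesseract).
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_text(pdf_path: str) -> list[str]:
    """Return one plain-text chunk per page, falling back to OCR when needed."""
    chunks = []
    reader = PdfReader(pdf_path)
    images = None
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if not text.strip():
            # No embedded text layer: rasterize the page and OCR it.
            if images is None:
                images = convert_from_path(pdf_path)
            text = pytesseract.image_to_string(images[i])
        chunks.append(text)
    return chunks
```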

    Textual has been trained to recognize data from a wide variety of sources. We’ve compiled a large collection of documents from real-world sources, including:

    • Internal Business Documents
    • Medical Studies
    • Invoices & Financial Records
    • Legal Filings
    • Essays

    After collecting all of these documents, we annotated the datasets by hand. By annotating the data with a human in the loop, we’ve created high-quality internal training data that can be used to create models that recognize specific data types for key use cases. While it’s possible to automate this by generating fake training data, we’ve found that the best results come from using real data annotated by real people. In our experience, the same is true for training other machine learning models—Textual enables you to use your real data for model training by removing sensitive information, thereby preventing accidental data leakage, model memorization, and costly compliance violations.
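
    To give a sense of what hand annotation produces, here’s a hypothetical example of a labeled training record: character-offset spans tagged with entity types. The schema is illustrative; Tonic’s internal annotation format isn’t public.

```python
# Hypothetical annotation record: character-offset spans with entity labels.
example = {
    "text": "Abraham Lincoln signed the filing on April 15, 2024.",
    "entities": [
        {"start": 0,  "end": 15, "label": "NAME"},
        {"start": 37, "end": 51, "label": "DATE_TIME"},
    ],
}

# Sanity-check that each span's offsets actually cover the labeled text.
for ent in example["entities"]:
    print(example["text"][ent["start"]:ent["end"]], "->", ent["label"])
```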

    Using Textual to de-identify legal documents

    Now that we have an understanding of what Textual does and how it works, let’s use it to perform a basic de-identification workflow on a collection of legal documents. To start, we need some legal documents that contain sensitive information. For the sake of privacy, we’ll use three publicly available SEC filings from Apple’s website.

    Creating a dataset of our SEC documents

    To get started, let’s sign in to Textual and create a new Dataset. Datasets are collections of documents that share a common de-identification strategy. This strategy is easily configurable via our UI or Python SDK, with real-time feedback in the form of document previews that show the redactions and synthesized data applied. We’re using Apple SEC filings in this example, so we’ll call our dataset “Apple SEC Documents”.
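
    The same dataset setup can also be scripted. A hypothetical sketch with the Python SDK follows; the method names and filenames are assumptions (the SDK docs list the exact API), since this walkthrough uses the UI.

```python
# Hypothetical sketch: method names and filenames below are assumptions.
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", "<your-api-key>")
dataset = textual.create_dataset("Apple SEC Documents")

for path in ["aapl-10-k.pdf", "aapl-10-q.pdf", "aapl-8-k.pdf"]:  # placeholders
    dataset.add_file(file_path=path)
```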

    In our example dataset, we’re only using PDF files, but Textual supports all of the following file types:

    • Microsoft Word Documents (.docx)
    • PDFs (.pdf)
    • CSVs (.csv)
    • Plain Text Files (.txt)

    After we upload our files, Textual processes them automatically. Within a few moments, all of our documents are processed and we can see a list of the entity types that Textual detected. By default, Textual automatically detects dozens of entity types, from general types like names and addresses to more specific types like credit card verification codes and medical licenses.

    Previewing our results

    Now that the documents are processed, we see a list of all the sensitive data types that our NER models detected. In our example, Textual identified sixteen different entity types across our documents. Taking a safety-first approach, Textual redacts all detected sensitive information by default.

    Now we can preview each document’s de-identified state. Let’s open up the file manager by clicking the “Preview & Manage Files” button.

    From this dialog, we can see thumbnails of all of our SEC documents. Let’s click on one and see what the redacted file looks like with the default redaction strategy:

    It looks like quite a bit of data is being redacted, likely more than our use case calls for. Let’s dive into how we can adjust our strategy so that only the data we actually want to de-identify is masked.

    Adjusting our de-identification strategy

    Now that we’ve reviewed Textual’s default behavior, it’s time to define our own strategy. Let’s set some example goals for the data we want to redact, synthesize, and ignore:

    • Redact addresses, numerical values, & money data
    • Synthesize names & phone numbers, replacing their values with fake data
    • Ignore everything else, keeping the original data intact

    All of the fields we want to redact are already marked for redaction by default, so we’ll begin by selecting “Synthesis” for the “Name” and “Phone Number” entity types. Then we’ll turn the sensitivity toggles to “Off” for every entity type we want to ignore, which completely disables redaction of those types. (A sketch of this configuration follows below.)
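
    Expressed as configuration, our goals map to something like the sketch below. The entity-type identifiers are illustrative (the full list lives in the Textual docs); the three modes mirror the UI options described above.

```python
# Illustrative mapping of our goals to per-entity-type modes; the identifiers
# are assumptions. Any type we want to ignore would be set to "Off" so its
# original text is kept intact.
generator_config = {
    # Redact: addresses, numerical values, and money
    "LOCATION_ADDRESS": "Redaction",
    "NUMERIC_VALUE": "Redaction",
    "MONEY": "Redaction",
    # Synthesize: names and phone numbers
    "NAME_GIVEN": "Synthesis",
    "NAME_FAMILY": "Synthesis",
    "PHONE_NUMBER": "Synthesis",
}
```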

    Now we can open up the preview again and see our strategy at work!

    Downloading our redacted files

    Now that we’ve adjusted the sensitivity strategy and verified the results, we can download the redacted files. To do this, simply click the “Download Files” button and a zip file will download containing all of the de-identified files in the dataset.
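
    The download arrives as a single archive, so unpacking it locally is a one-liner with the standard library (the archive filename below is a placeholder):

```python
import zipfile

# Unpack the downloaded archive of de-identified files; filename is a placeholder.
with zipfile.ZipFile("apple-sec-documents.zip") as zf:
    zf.extractall("deidentified/")
```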

    Conclusion

    In this walkthrough, we used publicly available legal documents as our example. Internally, however, we might decide that we want to build an LLM app that lets our team search across the trove of legal agreements we’ve already amassed as a Series B company. We don’t want to expose sensitive information to unintended audiences, so we redact the sensitive information and then fine-tune our LLM. This way, we get the productivity benefits of generative AI without risking leakage of private information.

    We’ve covered what Textual is, how it works, and how you can leverage its custom NER models to protect your legal documents. Textual is capable of much more, including building custom models of your own, creating redaction templates for bulk document redaction, and processing thousands of files via the included Python SDK, as sketched below.
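
    As a taste of that bulk workflow, here’s a hypothetical sketch that redacts a folder of plain-text files via the SDK. As before, the method and attribute names are assumptions to verify against the SDK docs.

```python
# Hypothetical bulk redaction sketch; verify names against the SDK docs.
from pathlib import Path
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", "<your-api-key>")

out_dir = Path("redacted")
out_dir.mkdir(exist_ok=True)

for path in Path("legal_docs").glob("*.txt"):
    result = textual.redact(path.read_text())
    (out_dir / path.name).write_text(result.redacted_text)
```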

    Ready to protect your data? Dive deeper into how Tonic Textual can transform your approach to data privacy and protection. Book time with our team to learn more or sign up for a free trial to start redacting your documents for free today. 

    Adam Kamor, PhD
    Co-Founder & Head of Engineering