Understanding automated data redaction

May 23, 2024

Data redaction is the process of identifying and replacing sensitive values in text. It protects data privacy and data security, which in turn supports compliance with privacy and security regulations. It also allows you to use text data for purposes such as testing, demos, and LLM training.

For example, you might have transcripts of customer calls or PDFs of loan applications. Because they contain sensitive values such as names, identifiers, and other information that you cannot expose, you can't use this content without first replacing those values.

In this guide, we'll provide an overview of automated data redaction and its benefits and applications. We'll also show how Tonic Textual fits into an automated redaction process.

Tonic Textual and data redaction

Tonic Textual offers data redaction and synthesis of free-text and image files. Textual identifies a variety of sensitive value types, and produces new files that replace the detected values with redacted or synthesized values.

Named entity recognition and value synthesis

To identify the values, Textual uses a set of Named Entity Recognition (NER) models. Each model is able to identify specific types of entities. For example, one model might look specifically for names, another for addresses, and yet another for specific types of identifiers.

Text is fed as input into each model, which identifies and returns the relevant values.

Textual then replaces the detected values with either redacted values (for example, replaces John with NAME_GIVEN), or synthesized data (for example, replaces John with Michael).
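To make the difference between redaction and synthesis concrete, here is a minimal sketch in plain Python. It does not use Textual or its models. The regex-based detectors, the `Entity` class, the `SYNTHETIC` lookup table, and the `replace_entities` function are all hypothetical stand-ins that we made up for illustration. Real NER models are far more capable than a regular expression, and this toy version does not handle overlapping matches.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # index of the first character of the detected value
    end: int    # index one past the last character
    label: str  # entity type, for example "NAME_GIVEN"


def detect_given_names(text: str) -> list[Entity]:
    """Toy stand-in for one model: matches a fixed list of given names."""
    return [Entity(m.start(), m.end(), "NAME_GIVEN")
            for m in re.finditer(r"\b(?:John|Maria)\b", text)]


def detect_phone_numbers(text: str) -> list[Entity]:
    """Toy stand-in for another model: matches NNN-NNN-NNNN patterns."""
    return [Entity(m.start(), m.end(), "PHONE_NUMBER")
            for m in re.finditer(r"\b\d{3}-\d{3}-\d{4}\b", text)]


# Each detector looks for one entity type; the text is passed to all of them.
DETECTORS = [detect_given_names, detect_phone_numbers]

# Hypothetical fixed replacements; a real synthesizer generates varied values.
SYNTHETIC = {"NAME_GIVEN": "Michael", "PHONE_NUMBER": "555-010-0199"}


def replace_entities(text: str, mode: str = "redact") -> str:
    """Replace detected values with their label or with a synthetic value."""
    if mode not in ("redact", "synthesize"):
        raise ValueError(f"unknown mode: {mode}")
    entities = [e for detect in DETECTORS for e in detect(text)]
    # Work from right to left so that earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        new_value = e.label if mode == "redact" else SYNTHETIC[e.label]
        text = text[:e.start] + new_value + text[e.end:]
    return text


print(replace_entities("John called from 212-555-0147."))
print(replace_entities("John called from 212-555-0147.", mode="synthesize"))
```

The first call returns the text with entity labels in place of the values. The second returns text that still reads naturally, because each value is swapped for a realistic substitute.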

Custom data models

But what if you have sensitive values that aren't covered by the entity types that are built into Textual? Entities that are specific to your industry or your company?

You can use Textual's custom models option to define your own model. The custom model provides example values and shows how the entities might be found in context.

Textual advantages over conventional methods

Without Textual, you might have to wrangle open-source tools and build all of your own models and algorithms.

Compared with that approach, Textual provides more accurate models, more entity types, faster inference, and better support.

Automating data redaction

With automated data redaction, the file selection, entity detection, entity redaction and synthesis, and redacted file creation are built into an automated process.

For example, a PDF of every processed loan application is added to a repository. You want to make sure that all of these applications are scanned and redacted, so that you can safely use them in other systems.

You can set up an automated process that regularly checks for new files in the repository. Every time a new document is added, the system automatically sends the document for entity detection and redaction. The redacted document is added to a different repository where it is available to use for testing and training.
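As a rough sketch of that loop, the following Python function scans an input folder and writes a redacted copy of each new file to an output folder. It rests on several assumptions: the repositories are local directories, the documents are plain-text files rather than PDFs, and the `redact` argument is a placeholder for whatever detection and redaction service you call. The function name and parameters are ours, not part of Textual.

```python
from pathlib import Path
from typing import Callable


def process_new_files(inbox: Path, outbox: Path,
                      redact: Callable[[str], str]) -> list[str]:
    """Redact every .txt file in inbox that has no counterpart in outbox yet.

    Returns the names of the files that were processed on this run.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in sorted(inbox.glob("*.txt")):
        dst = outbox / src.name
        if dst.exists():
            continue  # already redacted on an earlier run
        original = src.read_text(encoding="utf-8")
        dst.write_text(redact(original), encoding="utf-8")
        processed.append(src.name)
    return processed
```

To check regularly for new files, you could call `process_new_files` from a scheduled job, such as a cron entry or a workflow scheduler. Because files that already have a redacted copy are skipped, repeated runs only process what is new.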

A graphic depicting the workflow of automated data redaction.

Benefits and use cases for automated data redaction

By automating the process, you ensure that all of the documents are protected, which in turn helps you stay in compliance with data privacy and security regulations.

Automated redaction also increases the volume of redacted data that is available for you to use.

Secure data for LLM training

One use case for data redaction is to provide the data needed to train LLMs. In general, the more high-quality data you have for training, the better the resulting model.

By automatically processing new data as it becomes available, you can easily build up and maintain a pipeline of additional content for LLM training.

Bulk document redaction for testing

An automated process makes it easy to redact the high volume of documents that you might need to share, analyze, or use for testing.

Integrating Textual into an automated workflow

Textual includes features and options that allow you to integrate it into an automated redaction workflow.

Textual API to integrate with MLOps

The Textual Python API allows Textual to fit easily into an MLOps workflow.

It includes functions that you can call from your data processing pipelines to access the Textual NER models and redact free-text data at any step in the process.
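One way to picture redaction as a pipeline step is a small function that receives records, redacts selected text fields, and passes the records on. The sketch below is an assumption-laden illustration, not the Textual API: `redact_records` and `placeholder_redact` are hypothetical names, and the `redact` argument stands in for a call to the Textual Python API. For the actual client class and method names, refer to the Tonic Textual documentation.

```python
import re
from typing import Callable, Iterable, Iterator


def redact_records(records: Iterable[dict], text_fields: Iterable[str],
                   redact: Callable[[str], str]) -> Iterator[dict]:
    """Yield a copy of each record with the named text fields redacted."""
    fields = tuple(text_fields)
    for record in records:
        clean = dict(record)  # shallow copy so the input record is not mutated
        for field in fields:
            value = clean.get(field)
            if isinstance(value, str):
                clean[field] = redact(value)
        yield clean


def placeholder_redact(text: str) -> str:
    """Hypothetical stand-in for a real redaction call; masks NNN-NN-NNNN."""
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", text)


rows = [{"id": 1, "note": "SSN 123-45-6789 on file"}]
for row in redact_records(rows, ["note"], placeholder_redact):
    print(row)
```

Passing the redaction call in as an argument keeps the pipeline step independent of any particular client, so you can swap the placeholder for a real call without changing the surrounding code.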

On-premises Textual instance

In addition to using Textual Cloud, you have the option to install an on-premises, self-hosted instance of Textual.

A self-hosted instance allows you to further configure and optimize the installation to work with your compute options.

Textual Snowflake Native App

And if you use Snowflake as your data repository, then you can also use the Textual Snowflake Native App.

The application allows you to deploy Tonic Textual as a containerized application directly within your Snowflake environment. It includes a sensitive data detection service and a redaction service.

Once you have this in place, you can use Textual functions to produce redacted and synthesized versions of text data. The data never leaves Snowflake, which means that it remains protected by your existing Snowflake security.

Getting started with Tonic Textual

To get started with Tonic Textual, sign up for a free account at textual.tonic.ai or book a demo directly with the Textual team. Also, be sure to check out the Tonic Textual documentation for comprehensive overviews of the product’s capabilities.
