Redacting sensitive text data in JSON with Tonic Textual

Author

Lyon Van Voorhis

February 20, 2024

JSON is one of the most popular data interchange formats. Developers use it to package data for API requests, encode app configuration, and store data in databases. Its popularity stems from its flexibility as a storage medium and from being easy for both humans and machines to read.

The Tonic platform for structural data has long supported JSON redaction, so clients can apply Tonic's data protection to semi-structured data in table columns of a SQL database or in collections of a NoSQL solution like MongoDB. This approach works well with predictable JSON structures, because we can specify the configuration we want on a path-by-path basis, but it cannot provide high-quality redaction on unpredictable or unknown JSON structures. The good news? We’ve built a solution to address this.
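
To make the limitation concrete, here is a minimal sketch of path-by-path redaction in plain Python. It is an illustration only: `PATH_CONFIG`, `redact_by_path`, and the token labels are hypothetical names, not Tonic's configuration format. Configured paths are handled reliably, while a property nobody anticipated passes through untouched.

```python
import copy

# Hypothetical path-by-path configuration: each known path maps to a
# replacement token. This is an illustration, not Tonic's configuration format.
PATH_CONFIG = {
    ("firstName",): "[NAME_GIVEN]",
    ("contact", "email"): "[EMAIL_ADDRESS]",
}


def redact_by_path(document: dict, config: dict) -> dict:
    """Replace the value at each configured path; leave everything else as-is."""
    result = copy.deepcopy(document)
    for path, token in config.items():
        node = result
        # Walk to the parent of the target key, giving up if the path is absent.
        for key in path[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and path[-1] in node:
            node[path[-1]] = token
    return result


record = {
    "firstName": "Jane",
    "contact": {"email": "jane@example.com"},
    # An unanticipated property: no configured path covers it.
    "notes": {"freeform": "Call Jane at home"},
}

redacted = redact_by_path(record, PATH_CONFIG)
print(redacted)
```

The `notes.freeform` value still contains a name after redaction, which is exactly the gap that unpredictable structures open up.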

Redacting JSON in free-text data with Tonic Textual

Tonic Textual is designed to help clients handle unpredictable free-text data found in things like chat logs and sensitive documents. The platform identifies what is sensitive from context, then selects an appropriate Generator to redact the sensitive information and synthesize a relevant replacement.

We've extended that capability to JSON, where we can use similar contextual analysis to identify sensitive values in JSON properties, and then apply redactions to them. When identifying the sensitive data within a specific property, our models consider the name and value of both the specific property and surrounding properties to determine the sensitivity of the data. This allows us to redact sensitive data in a way that is both accurate and safe, while preserving the structure of the JSON.
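
As a rough intuition for why surrounding properties matter, consider the toy classifier below. It is a rule-based stand-in, not how Textual's models work: the `classify` function, the hint lists, and the labels are all assumptions made for illustration. The point is that the same property name, `name`, can be sensitive or not depending on its neighbors.

```python
import re
from typing import Iterable, Optional

# Toy rules only; real contextual models are far richer than this.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PERSON_HINTS = ("email", "phone", "firstname", "lastname")


def classify(key: str, value: object, siblings: Iterable[str]) -> Optional[str]:
    """Return a hypothetical sensitivity label for one property, or None."""
    if not isinstance(value, str):
        return None
    # Some values are recognizable on their own, whatever the key is called.
    if EMAIL_RE.fullmatch(value):
        return "EMAIL_ADDRESS"
    # "name" is ambiguous: it could be a person or a product. Sibling
    # property names that look person-related tip the decision.
    if "name" in key.lower():
        if any(hint in s.lower() for s in siblings for hint in PERSON_HINTS):
            return "PERSON_NAME"
    return None


person = {"name": "Jane Doe", "email": "jane@example.com"}
product = {"name": "Widget", "sku": "A-100"}

for obj in (person, product):
    for key, value in obj.items():
        others = [k for k in obj if k != key]
        print(key, "->", classify(key, value, others))
```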

Seeing it in action

Let's take a look at an example. Here's a JSON object that represents a user profile:
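
As a stand-in with made-up values, a profile of this kind might look like the following, built as a Python dictionary so it can be serialized with `json.dumps`. The standard property names follow the description in this post; everything under `metadata` (`supportNotes`, `shipping`, and so on) is hypothetical.

```python
import json

# Illustrative user profile; all values are invented for this example.
user_profile = {
    "firstName": "Jane",
    "lastName": "Doe",
    "email": "jane.doe@example.com",
    "phone": "555-010-0199",
    # Flexible bag of properties whose shape is not known ahead of time.
    "metadata": {
        "supportNotes": ["Customer asked us to call her sister Mary instead."],
        "shipping": {"recipient": "Jane Doe", "city": "Springfield"},
        "loginCount": 3,
    },
}

profile_json = json.dumps(user_profile, indent=2)
print(profile_json)
```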

This object has some standard properties like `firstName`, `lastName`, `email`, and `phone`, but it also has a metadata property, which, for the purpose of this example, is a flexible object that can contain any number of properties containing different kinds of objects.

Our platform for structural data can handle an object that contains only the standard properties, but it cannot handle the metadata property without knowing in advance what it will contain. Here is how it can be handled in Tonic Textual, using our Python SDK:
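
To show the shape of the operation without depending on the SDK, here is a local stand-in written in plain Python. It is a sketch under loud assumptions: `redact_json_value`, `redact_text`, and the regex patterns are hypothetical and much cruder than Textual's models. What it does demonstrate is the structure-preserving walk: every object, array, and key is kept, and only string values are rewritten.

```python
import json
import re
from typing import Any

# Toy patterns standing in for real detection; labels are hypothetical.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact_text(text: str) -> str:
    """Replace every pattern match in a string with its label token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def redact_json_value(node: Any) -> Any:
    """Recursively rebuild the value, redacting strings and keeping structure."""
    if isinstance(node, dict):
        return {key: redact_json_value(child) for key, child in node.items()}
    if isinstance(node, list):
        return [redact_json_value(child) for child in node]
    if isinstance(node, str):
        return redact_text(node)
    # Numbers, booleans, and null pass through unchanged.
    return node


payload = (
    '{"email": "jane@example.com", '
    '"metadata": {"notes": ["Reach Jane at 555-010-0199"], "loginCount": 3}}'
)

redacted = redact_json_value(json.loads(payload))
print(json.dumps(redacted, indent=2))
```

Notice that the regexes miss the name "Jane" inside the note. Catching values like that is where contextual models earn their keep, and it is the part this stand-in does not attempt.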

This small script would output the following redacted JSON:

As you can see, the redacted JSON has preserved the structure of the original JSON, but has replaced the sensitive data with redacted values. Tonic Textual can be used with all kinds of JSON, including deeply nested and complex structures, as well as JSON arrays. You can also specify redaction configurations for different data types to further customize the redaction process.
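
One way to picture per-type configuration is a mapping from each detected type to how it should be handled. The sketch below is hypothetical throughout: the mode names (`"redact"`, `"synthesize"`, `"off"`), the `HANDLING` mapping, and `apply_handling` are assumptions for illustration, not the SDK's actual options.

```python
# Hypothetical per-type handling; unlisted types fall back to "redact".
HANDLING = {
    "EMAIL_ADDRESS": "synthesize",  # swap in a realistic fake value
    "PHONE_NUMBER": "redact",       # replace with a placeholder token
    "ORGANIZATION": "off",          # leave the original value alone
}

# Fixed fake values keep the example deterministic.
SYNTHETIC_VALUES = {"EMAIL_ADDRESS": "user@example.org"}


def apply_handling(value: str, entity_type: str, handling: dict) -> str:
    """Return the value to emit for one detected entity, per the config."""
    mode = handling.get(entity_type, "redact")
    if mode == "off":
        return value
    if mode == "synthesize":
        # Fall back to a token when no synthetic value is available.
        return SYNTHETIC_VALUES.get(entity_type, f"[{entity_type}]")
    return f"[{entity_type}]"


print(apply_handling("jane@example.com", "EMAIL_ADDRESS", HANDLING))
print(apply_handling("555-010-0199", "PHONE_NUMBER", HANDLING))
print(apply_handling("Acme Corp", "ORGANIZATION", HANDLING))
```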

The takeaway

Tonic Textual's JSON redaction capabilities let you quickly redact sensitive data in JSON while preserving its structure. You can apply Tonic's data protection to irregular, semi-structured JSON data in a way that is both accurate and safe. We're excited to see how people use this new capability to protect their data. To try it out today, sign up for a free trial of Tonic Textual.

Lyon Van Voorhis

Engineering

Lyon is a senior software engineer at Tonic.ai. He is currently working full-time on Tonic Textual, with a specific focus on PDFs.