
How to automatically redact sensitive text data in JSON format

Author
Adam Kamor, PhD
February 15, 2024

    Named entity recognition (NER) models are used to identify sensitive tokens in unstructured data. They work best when the text they scan resembles the text on which they were trained. As the text to be scanned diverges from the training set, performance drops. In this article, we ask a simple question that has an interesting answer.

    What happens when you use NER models trained on unstructured text to identify sensitive tokens in JSON?

    Taking a naive approach yields predictably poor results. Passing JSON directly into an NER model gives poor identification and often produces redacted text that is no longer valid JSON. But we will show that there are approaches that yield strong results. The tl;dr is that you need to find clever ways to make your JSON data look like the unstructured text on which the model was originally trained.

    For this experiment, we’ll be using Tonic Textual’s NER models, which we have trained on a diverse dataset of text data spanning domains, contexts, and formats. Importantly, we’ll be using a version of the models that hasn’t been trained on text data in JSON format. If you would like to make use of these models, sign up for a free account at tonic.ai/textual.

    For our tests today, we will use a JSON document that contains both structured and unstructured fields.  That is to say, many of the JSON values are things like name and zip code, but other values contain free-form text.

    Below is the JSON object that we will be passing through our model. The keys which we consider to be sensitive are:

    • $.name.first
    • $.name.middle
    • $.name.last
    • $.location.city
    • $.location.state
    • $.location.zip
    • $.work.company_name
    • $.history

    In total, there are 8 JSON keys which are sensitive and 2 which are not. As a rubric, we will add up all correct fields identified as sensitive and deduct a point for each false positive. This will give us a score between -2 and +8. We then shift the score by +2 to keep things on a 0 to 10 scale.
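
The rubric arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not code from the original notebook: the function name `rubric_score` and the idea of passing in the set of keys an approach flagged are assumptions made for the example.

```python
# The eight sensitive keys listed above.
SENSITIVE_KEYS = [
    "$.name.first", "$.name.middle", "$.name.last",
    "$.location.city", "$.location.state", "$.location.zip",
    "$.work.company_name", "$.history",
]


def rubric_score(sensitive_keys, flagged_keys):
    """Hypothetical helper: correct detections minus false positives, shifted by +2."""
    sensitive = set(sensitive_keys)
    flagged = set(flagged_keys)
    correct = len(flagged & sensitive)          # sensitive keys that were caught
    false_positives = len(flagged - sensitive)  # non-sensitive keys that were flagged
    return correct - false_positives + 2        # shift from [-2, 8] to [0, 10]


# Example: everything is caught except $.name.middle, with no false positives.
flagged = [key for key in SENSITIVE_KEYS if key != "$.name.middle"]
print(rubric_score(SENSITIVE_KEYS, flagged))  # 9
```

With 8 sensitive keys and 2 non-sensitive ones, the raw score ranges from -2 (only the two non-sensitive keys flagged) to +8 (all sensitive keys flagged and nothing else), which is why the +2 shift lands on a 0 to 10 scale.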

    In the cells below, we show different approaches one can take to redact PII in JSON using Tonic Textual's NER models. Each successive approach is more complex but scores higher on the rubric. In the final approach, we show an algorithm which scores a perfect 10/10. A modified version of this algorithm is the basis for the new redact_json() function in the Tonic Textual SDK (available in version 1.0.3).

    Setup

    Some boilerplate before we begin. We define a few functions, import the TonicTextual SDK client, etc.

    Naive approach - Rubric Score: 1/10

    Our first attempt treats the JSON naively and passes the entire contents of the JSON object into the NER model. We expect this to underperform because our NER models are neither tuned nor trained on JSON data. Even worse, this approach does not guarantee that the redacted output is valid JSON, because the model might choose to redact keys, JSON control characters, etc.
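
To see why validity is at risk, here is a toy illustration using only the Python standard library. The character spans are made up for the example and are not real model output; `apply_redactions` and `is_valid_json` are hypothetical helpers, not part of any SDK.

```python
import json


def apply_redactions(text, spans, mask="[REDACTED]"):
    """Replace each (start, end) character span in text with a mask."""
    out = text
    # Apply from the right so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + mask + out[end:]
    return out


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


raw = json.dumps({"first": "Janet", "zip": "98103"})
start = raw.index("Janet")

# Hypothetical span that stays inside the string value: syntax survives.
inside_value = [(start, start + len("Janet"))]
# Hypothetical span that also swallows the closing quote: syntax breaks.
overlaps_quote = [(start, start + len("Janet") + 1)]

print(is_valid_json(apply_redactions(raw, inside_value)))    # True
print(is_valid_json(apply_redactions(raw, overlaps_quote)))  # False
```

A model that sees the raw JSON string has no notion of which characters are structural, so nothing prevents a detected span from crossing a quote, colon, or brace.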

    In this approach, our NER models failed to find any of the structured data components within the JSON. They only identified and redacted sensitive tokens within the $.history field, although the redaction of that field was correct.

    This gives us a score on our rubric (see above) of 1.

    Improved approach - Rubric Score: 9/10

    In this approach, we apply our NER model recursively to every primitive value (string or numeric) within the JSON object, using each value's key to give the NER model additional context.

    In a nutshell, we build a sentence out of each key-value pair, of the form "The <key> is <value>", so that the key gives the model a hint about what the value represents.

    For example, consider the following JSON object:

    In this approach, we would construct 4 sentences and pass each to our models:

    • The zip is 98103
    • The first is Janet
    • The last is Smith
    • The medical_history is ipso lorem...
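
The sentence construction above can be sketched as a small recursive walk. This is a minimal sketch rather than the original notebook cell: `build_sentences` is a hypothetical name, and the shape of `doc` (in particular, nesting first and last under a name object) is an assumption chosen to reproduce the four sentences listed above.

```python
def build_sentences(node, path="$", key=None):
    """Yield (json_path, sentence) for every string or numeric value in node."""
    if isinstance(node, dict):
        for child_key, child in node.items():
            yield from build_sentences(child, f"{path}.{child_key}", child_key)
    elif isinstance(node, list):
        # List items reuse the key of the list that contains them.
        for index, item in enumerate(node):
            yield from build_sentences(item, f"{path}[{index}]", key)
    elif isinstance(node, (str, int, float)) and not isinstance(node, bool):
        if key is not None:
            yield path, f"The {key} is {node}"


# Assumed document shape, for illustration only.
doc = {
    "zip": 98103,
    "name": {"first": "Janet", "last": "Smith"},
    "medical_history": "ipso lorem...",
}

for json_path, sentence in build_sentences(doc):
    print(json_path, "->", sentence)
```

Each sentence would then be sent to the NER model in its own call, and the JSON path kept alongside it tells you which value to overwrite with the redacted result.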

    While this approach gives a better score than the naive approach, it has two drawbacks. It is inefficient, because it makes one call to the NER model per value. It is also not optimal, because each sentence lacks the context of the surrounding sentences and the sentence structure doesn't take the hierarchy of the JSON into account.

    The results of this approach are impressive – we scored a 9/10. No false positives and the only PII missed was $.name.middle. This makes sense because the sentence constructed reads "The middle is A". This provides no hint to the NER model that a part of a name is being referenced.

    More improved approach - Rubric Score: 10/10

    In the previous approach, we only failed on identifying the middle initial in the name object ($.name.middle). In this more improved approach, we try to provide the NER model with additional context such that we can identify middle initials as well. In order to do this, we will no longer send each individual constructed sentence to the NER model, but instead will pass all sentences into the model together so the model can gain context for one sentence by looking at surrounding sentences.

    As we can see from looking at the results below, this approach leads to no false positives and a perfect score of 10/10.

    Note that this cell has a LOT of code in it and that the way we re-assemble the JSON here is hacky, relying on non-conflicting keys (as well as other assumptions). This is just to keep the code simple; the re-assembly is not the interesting part here anyway.
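
For readers without the notebook at hand, the batched idea can be sketched as follows. This is not the original cell: `redact_json_batched` and `flatten` are hypothetical names, `redact_text` stands for whatever call sends text to the NER model, and `fake_redact` is a hard-coded stand-in so the example runs offline. Unlike the cell described above, this sketch re-assembles by line order rather than by key, and it assumes the document is a dict or list, that values contain no newlines, and that the model leaves the "The <key> is " prefix untouched.

```python
import copy


def flatten(node, path=()):
    """Yield (path, value) for every string or numeric leaf in node."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from flatten(item, path + (index,))
    elif isinstance(node, (str, int, float)) and not isinstance(node, bool):
        yield path, node


def redact_json_batched(doc, redact_text):
    """Redact all leaves of doc with a single call to redact_text."""
    leaves = list(flatten(doc))
    # Use the nearest string key in the path as the sentence subject.
    prefixes = [
        "The %s is " % next((p for p in reversed(path) if isinstance(p, str)), "value")
        for path, _ in leaves
    ]
    sentences = [prefix + str(value) for prefix, (_, value) in zip(prefixes, leaves)]

    # One call: the model sees every sentence and can use neighbors as context.
    redacted_lines = redact_text("\n".join(sentences)).split("\n")
    if len(redacted_lines) != len(sentences):
        raise ValueError("redaction changed the number of lines")

    result = copy.deepcopy(doc)
    for (path, value), prefix, line in zip(leaves, prefixes, redacted_lines):
        if not line.startswith(prefix):
            raise ValueError("redaction altered the sentence prefix: %r" % line)
        new_value = line[len(prefix):]
        if new_value != str(value):  # leave untouched values (and their types) alone
            target = result
            for part in path[:-1]:
                target = target[part]
            target[path[-1]] = new_value
    return result


def fake_redact(text):
    """Stand-in for the NER model: masks a few hard-coded tokens."""
    for token in ("Janet", "Smith", "98103"):
        text = text.replace(token, "[REDACTED]")
    return text


# "plan" is a made-up non-sensitive field for the demo.
doc = {"zip": 98103, "name": {"first": "Janet", "last": "Smith"}, "plan": "basic"}
print(redact_json_batched(doc, fake_redact))
```

Keeping the path list in the same order as the sentences avoids depending on unique keys, at the cost of trusting the model to preserve line breaks; a production version would need to handle the cases where that does not hold.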

    Final remarks

    We've shown a few ways that you can redact JSON using NER models not specifically trained on JSON format data. Our last approach, which was the most complex, yielded strong results by converting the JSON into friendly unstructured text, redacting the text, and then converting back to JSON. This is a complex, tedious process with many edge cases. To make this easy for you, you can use the Tonic Textual SDK. In version 1.0.3, we added support for a redact_json(msg: str) function, which takes a JSON string or Python dictionary and redacts your sensitive text according to a modified version of approach 3.

    If you’d like to try this approach yourself, sign up for a free account at: https://www.tonic.ai/textual. Enjoy!

    Adam Kamor, PhD
    Co-Founder & Head of Engineering