In this blog post, Tonic.ai’s Head of AI, Ander Steele, walks through a live demo of how Tonic Textual can be used to automatically de-identify protected health information (PHI) within unstructured data—making it safe and compliant for fine-tuning large language models (LLMs). Whether you're a healthcare organization navigating HIPAA regulations or a business working with sensitive customer data, this demo showcases how Tonic Textual enables responsible AI development without compromising data utility. Watch the full video below or read through the complete transcript to see Textual in action.
Interested in a custom demo to support your specific use case? Book a demo with one of our experts.
I want to give an example of how a customer might use Textual in order to fine-tune LLMs to perform some interesting tasks on sensitive data. The example that I'm starting with is the problem of structuring unstructured medical notes. The goal of this fine-tuning problem is to take unstructured medical notes and produce structured output that's very similar to the HL7 FHIR standard.
You can think of this as a very important problem because most of the medical data we have exists in messy, unstructured forms, so being able to fine-tune LLMs to give the kind of desired structured output here is a great use case. The problem, of course, is that the data we have for fine-tuning these models is extremely sensitive, right?
It's full of PII and PHI, and anytime we are looking to fine-tune a model on such data, we have to be extremely careful, not just because of regulatory and compliance issues, but because of the very real risk of these models memorizing this sensitive data and emitting it at the wrong time.
So if our model is trying to structure unstructured data and it emits, say, medical records or names of patients from our training data, then we have a real problem. The challenge here, of course, is that the PHI in this data is relevant for the task.
We are interested in extracting things like names and ages, and all of these identifiers are governed by HIPAA and, of course, need to be redacted before we can use this data safely for our fine-tuning problem. Now, if we just go ahead and naively redact this data, then we've basically destroyed the task.
If we replace our names, ages, et cetera with just redaction tokens, then it's very unlikely that we will succeed in fine-tuning an LLM here, because that's part of what we want to extract. So what I'll show you is how we use Tonic Textual to create a safe version of this dataset and then fine-tune a model that performs as well as a model trained on the real, sensitive data.
Okay, so let's jump into the notebook where I'm actually doing this work, and I'll describe the dataset for you. This dataset is purely synthetic: it was generated by taking synthetic records and using them to prompt GPT-4 to generate fake medical notes.
And the problem that we're trying to solve is to go the other direction: from fake medical notes to HL7 records. Okay, so let's actually load this dataset. I've put this dataset on Hugging Face.
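To make the walkthrough concrete, here's roughly what that load looks like with the Hugging Face datasets library. The dataset path and column names below are placeholders, since the transcript doesn't spell them out.

```python
# Load the (note, structured record) pairs from Hugging Face.
# "tonic-ai/synthetic-medical-notes" and the column names "note" and
# "encounter" are placeholders for the dataset described in the demo.
from datasets import load_dataset

dataset = load_dataset("tonic-ai/synthetic-medical-notes")
example = dataset["train"][0]
print(example["note"])       # the unstructured clinical note
print(example["encounter"])  # the FHIR-like structured target
```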
Looking at an example, the dataset consists of pairs. We have the note, which we saw an example of before, and alongside it the structured form of that note: the encounter data, which mentions the age of the patient and the type of the encounter. In addition to age, we have other demographics like gender and race.
And then, of course, the patient name, plus the procedures and medications that were relevant to this encounter. So if we start with a clinical note like this one, for Harvey Dan Moore, who apparently has very bad gingivitis, we should be able to extract the structured data from it: the patient demographics and name, as well as the procedures that occurred during that encounter. Okay, so now let's show how we use Textual to generate a safe version of this dataset.
By default, we replace the sensitive entities that we detect in the text with identifiers and tokens that can be reversed. For example, if you were building a RAG application and wanted to use Textual as a privacy layer between a third party and your internal knowledge base, this would allow you to reversibly redact and then un-redact answers as they come back from the LLM. But for the purposes of fine-tuning, we want to replace names and other PHI with contextually relevant, realistic replacements so that our model will generalize well. And so this is just a matter of turning on synthesis.
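As a rough sketch, calling Textual from Python looks something like the following. The client setup, method names, and generator options here are based on the tonic-textual SDK but may differ slightly by version, so treat this as illustrative rather than exact; the note text is just an example sentence.

```python
# Sketch of de-identifying a note with the tonic-textual Python SDK
# (pip install tonic-textual). Exact client and parameter names may vary
# by SDK version; the note below is an illustrative example.
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", "YOUR_API_KEY")

note = "Ander Steele was seen on 3/12/2024 for a routine follow-up."

# Default behavior: detected entities become reversible redaction tokens.
redacted = textual.redact(note)
print(redacted.redacted_text)

# For fine-tuning, switch the generators to "Synthesis" so entities are
# replaced with realistic fake values instead of tokens.
synthesized = textual.redact(
    note,
    generator_config={"NAME_GIVEN": "Synthesis", "NAME_FAMILY": "Synthesis"},
)
print(synthesized.redacted_text)
```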
And so now Ander Steele becomes Matilda Pixler. Obviously, there are lots of other particularly sensitive types of information that we care about, so here I've constructed a list of our sensitive identifiers per HIPAA. We have patient names and locations.
These are considered sensitive by HIPAA. You have contact information like phone numbers and email addresses, and payment information, which we don't expect to find in these notes but which is of course considered sensitive, so we should make sure we detect it. We also have things like URLs and generic PII.
Healthcare entities are all governed by HIPAA and need to be appropriately detected and redacted. And so our configuration here is going to take all of these types and replace them with synthetic replacements that look real.
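For reference, a configuration along these lines might map each of the HIPAA identifier types just mentioned to Textual's synthesis generator. The entity label names below are assumptions for illustration; the authoritative list comes from Textual's documentation.

```python
# Illustrative list of the HIPAA-relevant entity types discussed above.
# These label strings are assumptions; check the Textual docs for the
# exact entity labels your version of the product emits.
HIPAA_SENSITIVE_TYPES = [
    "NAME_GIVEN", "NAME_FAMILY",       # patient names
    "LOCATION", "LOCATION_ZIP",        # addresses and zip codes
    "PHONE_NUMBER", "EMAIL_ADDRESS",   # contact information
    "CREDIT_CARD", "MONEY",            # payment information
    "URL",                             # URLs
    "DATE_TIME", "PERSON_AGE",         # dates and ages
    "HEALTHCARE_ID",                   # generic healthcare identifiers
]

# Replace every detected entity of these types with a realistic synthetic
# value rather than a redaction token.
generator_config = {label: "Synthesis" for label in HIPAA_SENSITIVE_TYPES}
```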
We're even going to make sure that we preserve the gender of these replacements by substituting obviously male names for obviously male names and obviously female names for obviously female names. Preserving gender, of course, is important in the context of medical records: we would ideally like the synthetic replacements to be very realistic, so a medical record describing a woman should of course be associated with a female name. And you can see an example here of how our name and location synthesis works: Bob is going to turn into Herbert, and Bonnie Dune, California is going to be replaced by Santa Clara, California. Here we've essentially done the HIPAA thing, where we consider the last two digits of a zip code sensitive.
So we truncate those, synthesize a new zip code, and then generate a realistic location within that zip code. That's what it looks like in short form. Now let's go through the dataset and look at the encounter data corresponding to it, because these are the answers, and it lays out pretty cleanly what's going on here. Basically, what we see is that there are lots of records associated with individual patients.
These are longitudinal records, so Harvey shows up at various ages, from age three all the way to age 62. Unfortunately for Harvey, that means his data is at higher risk of being memorized by the LLM that we're fine-tuning. One of the key risk factors for model memorization is repetition.
If we see these repeated strings occurring in the fine-tuning data, it is much more likely that they will be memorized. And of course, these strings are highly sensitive, so we need to replace them, and we need to replace them with something realistic. So let's go through and replace all of our real records with synthetic records.
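A sketch of that pass over the training split, reusing the client and generator configuration from the snippets above (column names are still placeholders):

```python
# De-identify every note in the training split. Assumes the `textual`
# client, `generator_config`, and `dataset` from the earlier sketches;
# the "note" column name is a placeholder.
def synthesize_note(example):
    response = textual.redact(example["note"], generator_config=generator_config)
    return {"note": response.redacted_text}

# datasets.map applies the function record by record; in the demo this
# full pass takes on the order of a few minutes.
safe_dataset = dataset["train"].map(synthesize_note)
```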
I won't actually run this because it takes about four minutes, but it's relatively fast to churn through many, many records using Textual, which is extremely performant here. And now we have longitudinal records, but for synthetic patients, so there's no risk of emitting these names because they're not real people.
We've at least mitigated our risk of emitting sensitive information, and we can then fine-tune using the synthetic data. All right, so the rest of this is just the setup for how you would do the fine-tuning problem.
Okay, so let's do some setup to frame the problem. You write the prompts, formatting your data as turns of a conversation between the LLM and the user. All of our training data basically has a system prompt, which describes the problem, then the user content, which is the sensitive medical record that we want to structure, and then the response from the assistant, which should be the structured output containing this sensitive data. But because we are now using synthetic data, everything is safe to use.
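Concretely, assembling one training example might look like this sketch; the system prompt wording and column names are illustrative, not the demo's exact ones.

```python
# Turn each (synthetic note, structured record) pair into a chat-style
# training example. Prompt text and column names are placeholders.
SYSTEM_PROMPT = (
    "You are a clinical data assistant. Extract a structured, FHIR-like "
    "record (demographics, encounter, procedures, medications) from the "
    "medical note provided by the user."
)

def to_chat_example(example):
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["note"]},
            # The target is the structured record, assumed here to already
            # be serialized as a JSON string.
            {"role": "assistant", "content": example["encounter"]},
        ]
    }

train_data = safe_dataset.map(to_chat_example)
```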
Our detection is quite good, but with any of these automated detection systems there will be misses. Because we make our replacements look just like the real text, those misses hide in plain sight: it's difficult to tell which pieces of PHI were actually missed by our system and which are synthetic data.
So here I've trained a model on synthetic data and evaluated it on real data. The evaluation, of course, doesn't change the model weights, so there's no risk of leaking sensitive information by using this validation set. And basically what we see is that training proceeds as you might expect, starting with a reasonably high loss and then quickly converging to something quite low, both in training and validation.
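The transcript doesn't name the training stack, but a run like this could be set up with, for example, TRL's SFTTrainer; the base model, hyperparameters, and the real-data validation set below are placeholders.

```python
# One possible way to run the fine-tune: TRL's SFTTrainer on the
# chat-formatted synthetic data, with real notes used only for evaluation.
# Exact arguments vary by TRL version; the model name and settings are
# placeholders, and `real_validation_data` is assumed to be formatted the
# same way as `train_data`.
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    train_dataset=train_data,                  # synthetic, de-identified notes
    eval_dataset=real_validation_data,         # real notes; evaluation only
    args=SFTConfig(output_dir="notes-to-fhir", num_train_epochs=3),
)
trainer.train()
trainer.save_model("notes-to-fhir")  # used by the inference sketch below
```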
On the real data, you see the same story; the losses are basically indistinguishable. It's easier to see this in the plots: the red curve is a model trained on real data and evaluated on real data, and the blue is a model trained on synthetic data and evaluated on synthetic data. Both of these converge in training and validation. And so we have succeeded in building a model that is capable, at least according to these losses, of performing the structured output extraction. We can see that by actually taking some of our validation data and running it through.
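Running a held-out note through the fine-tuned model might look roughly like this; the model path matches the placeholder output directory above, and `validation_note` stands in for the note discussed next.

```python
# Generate a structured record for a held-out note with the fine-tuned
# model. Paths and the `validation_note` variable are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder, matches the training sketch
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained("notes-to-fhir")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": validation_note},  # a note not seen in training
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```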
And let's see, here we have an example where we're going to structure some patient data that has not been part of the training: a medical note about nasal congestion. If we plug this through our model, this is what the model predicts, which is pretty close to what we want. It's not perfect, because the desired output would have had viral sinusitis as the reason, and instead we got “nasal congestion and cough”. That's justifiable, and also completely independent of our synthesis here; it just means we should train this model for maybe a little bit longer and we'll get the desired results.
But the point is it's safe to do.