Playbook

LLM fine-tuning

The problem

In regulated industries like healthcare, fine-tuning Large Language Models (LLMs) on internal data is limited by strict rules around personally identifiable information (PII) and protected health information (PHI). Although this sensitive data is essential for improving model performance, regulatory and privacy risks make it difficult to use.

For example, suppose you want to fine-tune a model that converts unstructured medical notes into structured output following the HL7 FHIR standard, a vital task since most clinical data is unstructured. Fine-tuning on this data risks HIPAA violations, and the model may memorize and unintentionally reveal sensitive patient details. This is especially challenging because PHI such as names and birthdates is necessary for the task but also highly regulated.

The solution

Tonic Textual addresses this challenge by enabling safe fine-tuning on sensitive data. In this scenario, Textual is used to transform unstructured medical notes by replacing sensitive entities with realistic synthetic substitutes. Tonic Textual leverages proprietary NER models to identify sensitive values at scale, enabling data teams to process large volumes of unstructured data across a variety of file types and formats. 

This preserves the contextual utility of the data while mitigating the risk of memorization and regulatory violations, so you can effectively fine-tune an LLM that generates HL7 FHIR-like structured outputs without compromising patient privacy.
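To make the idea of replacing detected entities with synthetic substitutes concrete, here is a toy sketch. It is not how Textual works internally: the `Span` type, the hard-coded offsets, and the replacement values are all hypothetical stand-ins for what an NER model and a synthesis step would produce.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A detected entity: character offsets plus an entity type label."""
    start: int
    end: int
    label: str


def substitute(text, spans, replacements):
    """Replace each span with the substitute for its label.

    Spans are processed right to left so that earlier offsets stay valid
    even when a substitute is longer or shorter than the original value.
    """
    out = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + replacements[span.label] + out[span.end:]
    return out


# Made-up note and hand-written spans (a real pipeline would get spans from NER).
note = "Patient John Smith, born 1980-02-14, reports chest pain."
spans = [
    Span(8, 12, "NAME_GIVEN"),
    Span(13, 18, "NAME_FAMILY"),
    Span(25, 35, "DOB"),
]
# Hypothetical synthetic values, one per entity type for simplicity.
replacements = {"NAME_GIVEN": "Alan", "NAME_FAMILY": "Reyes", "DOB": "1977-09-03"}

safe_note = substitute(note, spans, replacements)
```

The sentence structure and the clinical content ("reports chest pain") survive, which is what keeps the note useful for fine-tuning; only the identifying values change.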

Watch Ander Steele, Tonic.ai’s head of AI, use Textual to create a privacy-protected dataset from unstructured clinical notes, benchmark its performance, and then feed it to an LLM.

Playbook steps

1

Load the dataset

2

Use Textual to generate a safe version of your data by identifying and replacing sensitive information with realistic synthetic substitutes:

  • Install the Textual SDK and create an API key 
  • Determine relevant entity types required to maintain compliance

Based upon your organization’s needs, you may choose to have an expert determination provider certify compliance. Tonic partners with a trusted expert determination service to support compliance certification when needed.
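As a rough sketch of this step, the snippet below builds a per-entity-type configuration and passes it to the Textual Python SDK. It assumes the SDK is installed with `pip install tonic-textual`, that an API key is available in the `TONIC_TEXTUAL_API_KEY` environment variable, and that `TextualNer.redact` accepts `generator_config` and `generator_default` as described in the SDK documentation at the time of writing; check the current docs before relying on it. The helper names and the choice of entity types are illustrative, not a compliance recommendation.

```python
# Entity types to synthesize, using identifiers from this playbook's entity list.
# Which types you actually need is a compliance decision; this list is an example.
PHI_ENTITY_TYPES = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "DOB",
    "DATE_TIME",
    "HEALTHCARE_ID",
    "PHONE_NUMBER",
    "LOCATION_ADDRESS",
]


def build_generator_config(entity_types, handling="Synthesis"):
    """Map each entity type to a handling option (hypothetical helper)."""
    return {entity_type: handling for entity_type in entity_types}


def deidentify_notes(notes, entity_types=PHI_ENTITY_TYPES):
    """Return a synthesized version of each note (hypothetical helper).

    Requires network access and a valid API key.
    """
    # Imported lazily so the config helper works without the SDK installed.
    from tonic_textual.redact_api import TextualNer

    # Assumption: with no arguments, the client reads TONIC_TEXTUAL_API_KEY.
    textual = TextualNer()
    config = build_generator_config(entity_types)
    return [
        # Assumption: types not listed in the config are left unchanged ("Off").
        textual.redact(
            note, generator_config=config, generator_default="Off"
        ).redacted_text
        for note in notes
    ]
```

Leaving unlisted types at "Off" is only one possible default; a stricter setup might redact or synthesize every detected type and then relax specific ones.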

3

Use the de-identified dataset to fine-tune the LLM
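Many fine-tuning pipelines accept training examples as JSON Lines. The sketch below shows one way to package de-identified notes with their structured targets. The `prompt`/`completion` schema, the instruction text, and the example pair are assumptions made for illustration; use whatever record format your fine-tuning framework expects.

```python
import json

# Hypothetical instruction prepended to every note.
INSTRUCTION = "Convert the clinical note into a FHIR-like Patient JSON object."


def to_finetune_records(pairs, instruction=INSTRUCTION):
    """Turn (de-identified note, structured target) pairs into training records."""
    records = []
    for note, target in pairs:
        records.append({
            "prompt": f"{instruction}\n\n{note}",
            "completion": json.dumps(target, sort_keys=True),
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Made-up, already de-identified example pair.
pairs = [(
    "Patient Alan Reyes, born 1977-09-03, reports chest pain.",
    {
        "resourceType": "Patient",
        "name": [{"family": "Reyes", "given": ["Alan"]}],
        "birthDate": "1977-09-03",
    },
)]

jsonl = to_jsonl(to_finetune_records(pairs))
```

Writing `jsonl` to a file opened with `encoding="utf-8"` gives a training file in which only synthesized values appear.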

4

When possible, benchmark the performance of the model using real data. Although real data should not be used as training data because of memorization risk, it can be used as test data to verify that the model performs as expected.
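One simple way to benchmark structured output is to compare the leaf fields of each predicted object against a held-out reference. The sketch below computes field-level precision, recall, and F1 for a single pair. The metric and the example objects are illustrative assumptions, not the evaluation used in the video.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a {path: leaf_value} mapping."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
    else:
        items[prefix] = obj
    return items


def field_scores(predicted, gold):
    """Return (precision, recall, f1) over exactly matching leaf fields."""
    pred_fields, gold_fields = flatten(predicted), flatten(gold)
    correct = sum(
        1 for path, value in pred_fields.items()
        if path in gold_fields and gold_fields[path] == value
    )
    precision = correct / len(pred_fields) if pred_fields else 0.0
    recall = correct / len(gold_fields) if gold_fields else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up reference and model output; the model gets the birth date wrong.
gold = {
    "resourceType": "Patient",
    "name": [{"family": "Smith", "given": ["John"]}],
    "birthDate": "1980-02-14",
}
predicted = {
    "resourceType": "Patient",
    "name": [{"family": "Smith", "given": ["John"]}],
    "birthDate": "1980-02-15",
}

precision, recall, f1 = field_scores(predicted, gold)
```

Here three of the four leaf fields match, so precision and recall are both 0.75. Averaging these scores over a real held-out test set indicates whether training on synthesized values hurt extraction quality.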

5

If the model performs well on the test set, it is safe to deploy in production.

Try for yourself

Want to test it out? We’ve included all of the assets from the video in the playbook so that you can experiment on your own.

Built-in Intelligence for real-world data

Tonic.ai comes ready with out-of-the-box support for a rich library of entity types—so your data is understood from day one. From names, dates, and locations to nuanced healthcare, finance, and developer-specific fields, our pre-trained models are designed to recognize the structures and semantics that matter most. Whether you're redacting sensitive information or enriching records with entity-level precision, these built-in types form the backbone of smarter, safer data workflows.

Here's a look at the entities Tonic Textual can detect automatically:

  • CC Exp (CC_EXP): The expiration date of a credit card.
  • CVV (CVV): The card verification value for a credit card.
  • City (LOCATION_CITY): The name of a city.
  • Country (LOCATION_COUNTRY): The name of a country.
  • Credit Card (CREDIT_CARD): A credit card number.
  • DOB (DOB): A person's date of birth.
  • Date Time (DATE_TIME): A date or timestamp.
  • Email Address (EMAIL_ADDRESS): An email address.
  • Event (EVENT): The name of an event.
  • Family Name (NAME_FAMILY): A family name or surname.
  • Full Mailing Address (LOCATION_COMPLETE_ADDRESS): A full postal address. By default, the entity type handling option for this entity type is Off.
  • Gender Identifier (GENDER_IDENTIFIER): An identifier of a person's gender.
  • Given Name (NAME_GIVEN): A given name or first name.
  • Healthcare Identifier (HEALTHCARE_ID): An identifier associated with healthcare, such as a patient number.
  • IBAN Code (IBAN_CODE): An international bank account number used to identify an overseas bank account.
  • IP Address (IP_ADDRESS): An IP address.
  • Language (LANGUAGE): The name of a spoken language.
  • Law (LAW): A title of a law.
  • Location (LOCATION): A value related to a location. Can include any part of a mailing address.
  • Medical License (MEDICAL_LICENSE): The identifier of a medical license.
  • Money (MONEY): A monetary value.
  • NRP (NRP): A nationality, religion, or political group.
  • Numeric Identifier (NUMERIC_PII): A numeric value that acts as an identifier.
  • Numeric Value (NUMERIC_VALUE): A numeric value.
  • Occupation (OCCUPATION): A job title or profession.
  • Organization (ORGANIZATION): The name of an organization.
  • Password (PASSWORD): A password used for authentication.
  • Person Age (PERSON_AGE): The age of a person.
  • Phone Number (PHONE_NUMBER): A telephone number.
  • Product (PRODUCT): The name of a product.
  • State (LOCATION_STATE): A state name or abbreviation.
  • Street Address (LOCATION_ADDRESS): A street address.
  • URL (URL): A URL to a web page.
  • US Bank Number (US_BANK_NUMBER): The routing number of a bank in the United States.
  • US ITIN (US_ITIN): An Individual Taxpayer Identification Number in the United States.
  • US Passport (US_PASSPORT): A United States passport identifier.
  • US SSN (US_SSN): A United States Social Security number.
  • Zip (LOCATION_ZIP): A postal code.
