Playbook

LLM fine-tuning

The problem

In regulated industries like healthcare, fine-tuning Large Language Models (LLMs) on internal data is limited by strict rules around personally identifiable information (PII) and protected health information (PHI). Although this sensitive data is essential for improving model performance, regulatory and privacy risks make it difficult to use.

For example, suppose you want to fine-tune a model that converts unstructured medical notes into structured outputs that follow the HL7 FHIR standard. This is a vital task, because most clinical data is unstructured. However, fine-tuning on this data risks HIPAA violations, and the model may memorize and unintentionally reveal sensitive patient details. This is especially challenging because PHI such as names and birthdates is necessary for the task but is also highly regulated.
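To make the target concrete, here is a minimal sketch of a note and a structured counterpart. The note text and all values are invented for illustration. The output uses only a small subset of FHIR Patient resource fields (`resourceType`, `name`, `gender`, `birthDate`) and is not a complete or validated resource.

```python
import json

# Invented example note; not taken from any real dataset.
note = "Pt. Maria Alvarez, female, born 1984-03-12, seen for follow-up."

# A small, hand-written subset of a FHIR Patient resource for that note.
patient = {
    "resourceType": "Patient",
    "name": [{"family": "Alvarez", "given": ["Maria"]}],
    "gender": "female",
    "birthDate": "1984-03-12",
}

# The fields the model must get right include the regulated ones.
phi_fields = [key for key in ("name", "birthDate") if key in patient]

print(json.dumps(patient, indent=2))
```

The name and birth date are both PHI and required output fields, which is why simply deleting them from the training data would defeat the task.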

The solution

Tonic Textual addresses this challenge by enabling safe fine-tuning on sensitive data. In this scenario, Textual is used to transform unstructured medical notes by replacing sensitive entities with realistic synthetic substitutes. Tonic Textual leverages proprietary NER models to identify sensitive values at scale, enabling data teams to process large volumes of unstructured data across a variety of file types and formats.

This preserves the contextual utility of the data while mitigating the risk of memorization and regulatory violations, allowing for effective fine-tuning of an LLM that generates HL7 FHIR-like structured outputs—without compromising patient privacy.
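The replacement idea can be pictured with a small sketch. It assumes entity detection has already produced labeled character spans. The `Span` type, the `synthesize` function, and the substitute values are hypothetical; this is not Textual's API or its synthesis logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "NAME_GIVEN"


def synthesize(text: str, spans: List[Span],
               substitutes: Dict[Tuple[str, str], str]) -> str:
    """Replace each span with a substitute keyed by (label, original value).

    Spans are assumed not to overlap. Values without a substitute fall
    back to a bracketed label such as "[DOB]".
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        original = text[span.start:span.end]
        replacement = substitutes.get((span.label, original), f"[{span.label}]")
        out = out[:span.start] + replacement + out[span.end:]
    return out


# Invented note and hand-written spans, standing in for detector output.
note = "Maria Alvarez was born 1984-03-12."
spans = [Span(0, 5, "NAME_GIVEN"), Span(6, 13, "NAME_FAMILY"), Span(23, 33, "DOB")]
substitutes = {
    ("NAME_GIVEN", "Maria"): "Elena",
    ("NAME_FAMILY", "Alvarez"): "Brooks",
    ("DOB", "1984-03-12"): "1979-11-02",
}
safe_note = synthesize(note, spans, substitutes)
print(safe_note)
```

Because the substitutes have the same type and shape as the originals, the sentence still reads like a clinical note, which is what keeps it useful as training data.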

Watch Ander Steele, Tonic.ai’s head of AI, use Textual to create a privacy-protected dataset from unstructured clinical notes, benchmark it for performance, and then feed it to an LLM.

Playbook steps

Load the data set

Use Textual to generate a safe version of your data by identifying and replacing sensitive information with realistic synthetic substitutes:

Install the Textual SDK and create an API key
Determine relevant entity types required to maintain compliance

Depending on your organization’s needs, you may choose to have an expert determination provider certify compliance. Tonic.ai partners with a trusted expert determination service to support compliance certification when needed.
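The choice of entity types can be written down as a mapping from entity type identifier to handling option. In this sketch, the identifiers come from the entity list later in this playbook. The `Synthesis` and `Redaction` option names, and the idea of passing such a mapping to the SDK, are assumptions to check against the Textual SDK documentation. The selection itself is illustrative, not a compliance recommendation.

```python
from typing import Dict, Iterable

# Entity type identifiers taken from the entity list in this playbook.
# Which ones you need depends on your data and your compliance review.
SYNTHESIZE = [
    "NAME_GIVEN", "NAME_FAMILY", "DOB", "DATE_TIME", "PHONE_NUMBER",
    "EMAIL_ADDRESS", "LOCATION_ADDRESS", "LOCATION_CITY", "LOCATION_ZIP",
    "HEALTHCARE_ID", "MEDICAL_LICENSE",
]
REDACT = ["US_SSN"]


def build_handling_config(synthesize: Iterable[str],
                          redact: Iterable[str] = ()) -> Dict[str, str]:
    """Map each entity type to a handling option (option names are assumed)."""
    config = {entity: "Synthesis" for entity in synthesize}
    for entity in redact:
        if entity in config:
            raise ValueError(f"{entity} is listed for both synthesis and redaction")
        config[entity] = "Redaction"
    return config


config = build_handling_config(SYNTHESIZE, REDACT)
print(config)
```

Keeping the selection in one reviewed mapping makes it easier to show an expert determination provider exactly which entity types were handled and how.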

Ingest the dataset into the LLM for fine-tuning
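One way to picture this step is as writing note and target pairs to a JSONL file, a format that many fine-tuning pipelines accept. The record layout (`prompt` and `completion` keys), the instruction text, and the example pair are assumptions for illustration; use whatever schema your training framework expects.

```python
import io
import json
from typing import Iterable, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, dict]], instruction: str) -> str:
    """Serialize (de-identified note, structured output) pairs as JSONL text."""
    buffer = io.StringIO()
    for note, resource in pairs:
        record = {
            "prompt": f"{instruction}\n\n{note}",
            "completion": json.dumps(resource, sort_keys=True),
        }
        buffer.write(json.dumps(record) + "\n")
    return buffer.getvalue()


# Invented, already de-identified pair standing in for the safe dataset.
pairs = [
    (
        "Elena Brooks was born 1979-11-02.",
        {
            "resourceType": "Patient",
            "name": [{"family": "Brooks", "given": ["Elena"]}],
            "birthDate": "1979-11-02",
        },
    )
]
jsonl = to_jsonl(pairs, "Convert the clinical note to a FHIR-like Patient resource.")
print(jsonl, end="")
```

Only the de-identified notes go into this file, so nothing the model can memorize from training traces back to a real patient.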

When possible, benchmark the performance of the model using real data. While real data should not be used as training data because of memorization risk, it can be used as test data to confirm that the model is performing as expected.
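A benchmark of this kind can be as simple as comparing the fields the model extracts against hand-labeled references for held-out real notes. The sketch below assumes flat dictionaries of extracted fields and a hypothetical field-level exact-match metric; it is not the metric used in the video, and all values are invented.

```python
from typing import Dict, List


def field_accuracy(predictions: List[Dict[str, str]],
                   references: List[Dict[str, str]]) -> float:
    """Fraction of reference fields that the model reproduced exactly."""
    matched = 0
    total = 0
    for pred, ref in zip(predictions, references):
        for field, expected in ref.items():
            total += 1
            matched += int(pred.get(field) == expected)
    return matched / total if total else 0.0


# Invented references (hand-labeled) and model outputs for two test notes.
references = [
    {"family": "Alvarez", "given": "Maria", "birthDate": "1984-03-12"},
    {"family": "Okafor", "given": "Chidi", "birthDate": "1990-07-30"},
]
predictions = [
    {"family": "Alvarez", "given": "Maria", "birthDate": "1984-03-12"},
    {"family": "Okafor", "given": "Chidi", "birthDate": "1990-07-03"},  # wrong day
]
score = field_accuracy(predictions, references)
print(f"field accuracy: {score:.3f}")
```

The real notes appear only on the evaluation side here; they are never written into the training file.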

Assuming the model performs well on the test set, it is safe to deploy in production.

Try it for yourself

Want to test it out? We’ve included all of the assets from the video in the playbook so that you can experiment on your own.

Jupyter Notebook

Sample Dataset

Built-in Intelligence for real-world data

Tonic.ai comes ready with out-of-the-box support for a rich library of entity types—so your data is understood from day one. From names, dates, and locations to nuanced healthcare, finance, and developer-specific fields, our pre-trained models are designed to recognize the structures and semantics that matter most. Whether you're redacting sensitive information or enriching records with entity-level precision, these built-in types form the backbone of smarter, safer data workflows.

Here's a look at the entities Tonic Textual can detect automatically:

| Entity type | Description | Identifier |
| --- | --- | --- |
| CC Exp | The expiration date of a credit card. | CC_EXP |
| CVV | The card verification value for a credit card. | CVV |
| City | The name of a city. | LOCATION_CITY |
| Country | The name of a country. | LOCATION_COUNTRY |
| Credit Card | A credit card number. | CREDIT_CARD |
| DOB | A person's date of birth. | DOB |
| Date Time | A date or timestamp. | DATE_TIME |
| Email Address | An email address. | EMAIL_ADDRESS |
| Event | The name of an event. | EVENT |
| Family Name | A family name or surname. | NAME_FAMILY |
| Full Mailing Address | A full postal address. By default, the entity type handling option for this entity type is Off. | LOCATION_COMPLETE_ADDRESS |
| Gender Identifier | An identifier of a person's gender. | GENDER_IDENTIFIER |
| Given Name | A given name or first name. | NAME_GIVEN |
| Healthcare Identifier | An identifier associated with healthcare, such as a patient number. | HEALTHCARE_ID |
| IBAN Code | An international bank account number used to identify an overseas bank account. | IBAN_CODE |
| IP Address | An IP address. | IP_ADDRESS |
| Language | The name of a spoken language. | LANGUAGE |
| Law | A title of a law. | LAW |
| Location | A value related to a location. Can include any part of a mailing address. | LOCATION |
| Medical License | The identifier of a medical license. | MEDICAL_LICENSE |
| Money | A monetary value. | MONEY |
| NRP | A nationality, religion, or political group. | NRP |
| Numeric Identifier | A numeric value that acts as an identifier. | NUMERIC_PII |
| Numeric Value | A numeric value. | NUMERIC_VALUE |
| Occupation | A job title or profession. | OCCUPATION |
| Organization | The name of an organization. | ORGANIZATION |
| Password | A password used for authentication. | PASSWORD |
| Person Age | The age of a person. | PERSON_AGE |
| Phone Number | A telephone number. | PHONE_NUMBER |
| Product | The name of a product. | PRODUCT |
| State | A state name or abbreviation. | LOCATION_STATE |
| Street Address | A street address. | LOCATION_ADDRESS |
| URL | A URL to a web page. | URL |
| US Bank Number | The routing number of a bank in the United States. | US_BANK_NUMBER |
| US ITIN | An Individual Taxpayer Identification Number in the United States. | US_ITIN |
| US Passport | A United States passport identifier. | US_PASSPORT |
| US SSN | A United States Social Security number. | US_SSN |
| Zip | A postal code. | LOCATION_ZIP |
