
If you're building healthcare applications, you've likely hit the wall where compliance requirements slow down development velocity. When it comes to HITRUST certification, you must meet a comprehensive set of control objectives around protected health information (PHI) and personally identifiable information (PII).
This article walks you through what HITRUST requires, which data elements count as sensitive, the masking techniques that align with HITRUST control objectives, and a five-step workflow to implement masking defensibly.
HITRUST is an organization that manages the Common Security Framework (CSF), a certifiable set of controls combining requirements from HIPAA, ISO, NIST, and other standards. You can pursue HITRUST certification by demonstrating that your policies, procedures, and technical controls meet the CSF’s requirements, typically via a third-party validated assessment.
Key requirements include regular risk assessments, documented control implementation, and continuous monitoring. HITRUST certification isn’t a one-off audit—it mandates periodic reassessment and ongoing evidence collection.
In practice, any data element that can identify, locate, or reveal health details about an individual falls under HITRUST’s scope. That includes direct identifiers (names, Social Security numbers, medical record numbers, insurance IDs), indirect identifiers (dates of birth, ZIP codes, admission dates), and clinical data such as diagnoses, lab results, and treatment records.
Different masking methods address varying needs for realism, reversibility, and performance. You’ll choose static or dynamic approaches based on whether you’re working in non-production, production, or analytics workflows.
For each technique below, we've included validation checkpoints to ensure your implementation meets both privacy requirements (no PHI/PII exposed) and utility requirements (data still works for testing).
Static masking creates a masked copy of your database or file, which you then use in dev/test environments. You replace or redact direct identifiers, shuffle values within columns, and apply format rules to indirect identifiers.
Utility validation: Compare column distributions, class balances, and correlations to the source.
Privacy validation: Run nearest-neighbor checks to detect near-duplicates and confirm no direct identifiers remain.
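The static approach can be sketched in a few lines. This is a minimal illustration, not a production tool: the table, column names, and seed are all hypothetical, and a real pipeline would cover every identifier the discovery step flags.

```python
import random

# Hypothetical patient rows; the columns are illustrative, not from a real schema.
rows = [
    {"name": "Alice Smith", "zip": "02139", "diagnosis": "J45"},
    {"name": "Bob Jones",   "zip": "94107", "diagnosis": "E11"},
    {"name": "Cara Lee",    "zip": "60614", "diagnosis": "I10"},
]

def static_mask(rows, seed=42):
    """Return a masked copy: redact direct identifiers, shuffle an indirect one."""
    rng = random.Random(seed)            # fixed seed => reproducible dev/test copy
    masked = [dict(r) for r in rows]     # never mutate the source data
    for i, r in enumerate(masked):
        r["name"] = f"PATIENT-{i:04d}"   # replace the direct identifier outright
    zips = [r["zip"] for r in masked]
    rng.shuffle(zips)                    # shuffling within a column preserves its distribution
    for r, z in zip(masked, zips):
        r["zip"] = z
    return masked

masked = static_mask(rows)
assert all(r["name"].startswith("PATIENT-") for r in masked)          # privacy check
assert sorted(r["zip"] for r in masked) == sorted(r["zip"] for r in rows)  # utility check
```

Note that the two assertions at the end mirror the privacy and utility validations above: no real name survives, while the ZIP column's distribution is unchanged.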
Dynamic masking intercepts queries or API calls at runtime, masking sensitive fields on the fly. It works well for read-only analytics where users need production access but shouldn't see raw PHI. However, dynamic masking adds latency and doesn't suit development/testing workflows where you need stable, reproducible datasets.
Utility validation: Test response times and query compatibility with your analytics tools.
Privacy validation: Audit mask-rule coverage and monitor logs for unmasked query patterns.
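Conceptually, dynamic masking is a rule table applied at read time based on who is asking. The sketch below is a simplified stand-in for what a database proxy or API gateway would do; the roles, field names, and rules are invented for illustration.

```python
# Hypothetical mask rules applied at query time; field names are illustrative.
MASK_RULES = {
    "ssn":  lambda v: "***-**-" + v[-4:],   # partial mask keeps last four digits
    "name": lambda v: "REDACTED",           # full redaction
}

def fetch(row, role):
    """Return the row as-is for privileged roles, masked for everyone else."""
    if role == "privileged":
        return dict(row)
    return {k: MASK_RULES[k](v) if k in MASK_RULES else v for k, v in row.items()}

row = {"name": "Alice Smith", "ssn": "123-45-6789", "diagnosis": "J45"}
assert fetch(row, "analyst")["ssn"] == "***-**-6789"   # masked on the fly
assert fetch(row, "analyst")["diagnosis"] == "J45"     # non-sensitive field untouched
assert fetch(row, "privileged")["name"] == "Alice Smith"
```

In a real deployment this logic lives in the database or proxy layer, not application code, so it can't be bypassed by a direct query.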
Format-preserving encryption (FPE) encrypts values so that the output looks like a valid input (e.g., a 16-digit credit card number stays 16 digits). FPE is reversible with a key, making it suitable when downstream systems require the original format.
In healthcare, FPE is particularly useful for maintaining valid formats in fields like National Provider Identifiers (NPIs), diagnosis codes, or procedure codes where downstream systems validate format but don't need actual identifiers.
Utility validation: Run format-check scripts and ensure encrypted values pass schema validations.
Privacy validation: Verify key-management policies and confirm encrypted outputs don’t reveal plaintext patterns.
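To make the idea concrete, here is a toy Feistel-style cipher over digit strings. This is a teaching sketch only, not the NIST-standardized FF1/FF3 algorithms you should use in production, and the key shown would live in a KMS, never in source code.

```python
import hmac, hashlib

def _round_f(key, i, value, width):
    """Keyed round function: HMAC-SHA256 reduced to a `width`-digit number."""
    digest = hmac.new(key, f"{i}:{value}".encode(), hashlib.sha256).hexdigest()
    return int(digest, 16) % (10 ** width)

def fpe_encrypt(key, digits, rounds=8):
    """Toy Feistel FPE over an even-length digit string (illustrative only)."""
    assert digits.isdigit() and len(digits) % 2 == 0
    n, mod = len(digits) // 2, 10 ** (len(digits) // 2)
    left, right = int(digits[:n]), int(digits[n:])
    for i in range(rounds):
        left, right = right, (left + _round_f(key, i, right, n)) % mod
    return f"{left:0{n}d}{right:0{n}d}"

def fpe_decrypt(key, digits, rounds=8):
    """Run the Feistel rounds in reverse to recover the original value."""
    assert digits.isdigit() and len(digits) % 2 == 0
    n, mod = len(digits) // 2, 10 ** (len(digits) // 2)
    left, right = int(digits[:n]), int(digits[n:])
    for i in reversed(range(rounds)):
        left, right = (right - _round_f(key, i, left, n)) % mod, left
    return f"{left:0{n}d}{right:0{n}d}"

key = b"demo-key-keep-this-in-a-kms"   # hypothetical key, for illustration only
npi = "1234567893"                     # example 10-digit NPI-shaped value
ct = fpe_encrypt(key, npi)
assert ct.isdigit() and len(ct) == len(npi)   # output still passes format checks
assert fpe_decrypt(key, ct) == npi            # reversible with the key
```

The two assertions capture why FPE earns its place: the ciphertext still satisfies schema validations, yet the original value is recoverable only by key holders.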
Tokenization replaces each sensitive value with a token stored in a secure vault. One-way hashing irreversibly maps inputs to hash outputs. Both techniques prevent direct exposure of real data. Use tokenization when you need reversibility for specific workflows (e.g., customer support looking up a patient by tokenized ID).
Utility validation: Ensure tokens or hashes integrate smoothly with your test suites.
Privacy validation: Confirm hashing salts and token vaults follow separation-of-duties controls.
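A minimal sketch of both techniques, assuming an in-memory vault (a real vault is a separate, access-controlled service) and an invented `MRN-0042` identifier:

```python
import secrets, hmac, hashlib

class TokenVault:
    """Reversible tokenization: tokens map back to real values only inside the vault."""
    def __init__(self):
        self._forward, self._reverse = {}, {}

    def tokenize(self, value):
        if value not in self._forward:                 # deterministic per value
            token = "tok_" + secrets.token_hex(8)      # token carries no information
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        return self._reverse[token]                    # privileged lookup path

def one_way_hash(value, salt):
    """Irreversible mapping; the salt must be stored separately from the data."""
    return hmac.new(salt, value.encode(), hashlib.sha256).hexdigest()

vault = TokenVault()
t = vault.tokenize("MRN-0042")
assert t != "MRN-0042" and vault.detokenize(t) == "MRN-0042"
assert vault.tokenize("MRN-0042") == t                 # stable for joins in test data
h = one_way_hash("MRN-0042", b"per-environment-salt")
assert len(h) == 64                                    # hex SHA-256, no way back
```

The separation-of-duties control mentioned above maps directly onto this structure: the team that queries tokenized data should never hold access to the vault or the salt.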
Synthetic data generates fully fictional, statistically similar records. It’s especially useful for early-stage software development and model training or large-scale analyses without using any real PHI/PII. The Tonic suite of products streamlines synthetic data generation of structured, semi-structured, and unstructured data, using agentic AI as well as the data transformation approaches described above.
Utility validation: Check feature distributions, correlation matrices, and label quality against the production dataset.
Privacy validation: Perform proximity and uniqueness checks to avoid near-duplicates; log generation parameters for reproducibility.
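As a toy illustration of "statistically similar but fully fictional," the sketch below samples each column independently from the production marginals. Real synthetic data tools also preserve cross-column correlations; this naive version does not, and the tiny dataset is invented.

```python
import random

# Hypothetical (sex, age) production sample; no real records involved.
production = [("F", 34), ("M", 51), ("F", 29), ("F", 62), ("M", 45)]

def synthesize(rows, n, seed=0):
    """Naive independent-marginals sampler; real tools also model correlations."""
    rng = random.Random(seed)
    sexes = [s for s, _ in rows]
    ages = [a for _, a in rows]
    return [(rng.choice(sexes), rng.choice(ages)) for _ in range(n)]

fake = synthesize(production, 100)
assert len(fake) == 100
assert all(s in {"F", "M"} and 29 <= a <= 62 for s, a in fake)  # values stay in range
```

Even this naive sampler shows why the privacy validation above matters: resampled combinations can coincide with real records, so proximity and uniqueness checks are still required.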
This five-step process helps you build a defensible masking program aligned to HITRUST control objectives.
You need an accurate inventory of all sensitive fields, including indirect identifiers and rare-value risks. Scan databases, file stores, and logs using schema introspection and sample queries. Tag each column by sensitivity level and risk classification. Tonic Structural for structured data and Tonic Textual for unstructured data automate the sensitive data discovery process via built-in sensitivity scans.
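A pattern-based column scan is one common discovery heuristic. The sketch below is a simplified stand-in for a real sensitivity scan: the regexes, threshold, and table are all illustrative, and production scanners also use metadata, dictionaries, and ML classification.

```python
import re

# Illustrative detection patterns; a real scanner has many more.
PATTERNS = {
    "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_columns(table, sample_size=100, threshold=0.8):
    """Flag a column as sensitive if most sampled values match a known pattern."""
    findings = {}
    for col, values in table.items():
        sample = values[:sample_size]
        for label, pattern in PATTERNS.items():
            hits = sum(bool(pattern.match(str(v))) for v in sample)
            if sample and hits / len(sample) > threshold:
                findings[col] = label
    return findings

table = {
    "contact":     ["a@x.org", "b@y.com", "c@z.net"],
    "tax_id":      ["123-45-6789", "987-65-4321", "555-44-3333"],
    "visit_count": [1, 4, 2],
}
assert classify_columns(table) == {"contact": "email", "tax_id": "ssn"}
```

The majority-vote threshold matters: a column named `notes` with one stray SSN won't trip a per-column classifier like this one, which is exactly why unstructured fields need their own scanning pass.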
Not all fields require the same treatment. Use format-preserving or tokenized approaches when downstream systems depend on input formats. Apply stronger irreversibility (e.g., hashing) for low-utility fields. Document rules per column mapping back to your risk assessments.
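The per-column rules can be captured as a versioned configuration, which makes the mapping auditable. This shape is an assumption for illustration, not a format any particular tool requires; the column names are invented.

```python
# Hypothetical masking plan: one documented rule per sensitive column.
MASKING_PLAN = {
    "patient_name": {"technique": "redact",   "reversible": False},
    "npi":          {"technique": "fpe",      "reversible": True,  "reason": "downstream format checks"},
    "mrn":          {"technique": "tokenize", "reversible": True,  "reason": "support lookups"},
    "ssn":          {"technique": "hash",     "reversible": False, "reason": "low test utility"},
}

def unmapped_columns(discovered, plan):
    """Every column the discovery step flagged must have an explicit rule."""
    return [c for c in discovered if c not in plan]

discovered_sensitive = ["patient_name", "npi", "mrn", "ssn"]
assert unmapped_columns(discovered_sensitive, MASKING_PLAN) == []
```

Running a check like `unmapped_columns` in CI turns "document rules per column" from a policy statement into an enforced gate: a newly discovered sensitive column fails the build until a rule exists.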
Automate high-volume masking through CI/CD pipelines—trigger data generation jobs when production schemas change, and run sanitization against log exports nightly. Reserve manual review for edge cases like clinical notes with ambiguous PHI patterns or small cohorts that risk re-identification.
Run your existing test suites to confirm integrations and business logic stay intact. In parallel, run privacy validation checks including distribution comparisons, nearest-neighbor analysis to detect near-duplicates, and sampling reviews to verify no PHI/PII remains exposed.
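Both kinds of checks can be automated. The sketch below shows one simple metric for each: total variation distance between categorical distributions for utility, and a direct-identifier leak scan for privacy. The metric choice and sample data are illustrative assumptions.

```python
from collections import Counter

def distribution_drift(real, masked):
    """Total variation distance between two categorical columns (0.0 = identical)."""
    rc, mc = Counter(real), Counter(masked)
    n, m = len(real), len(masked)
    return 0.5 * sum(abs(rc[c] / n - mc[c] / m) for c in set(rc) | set(mc))

def leaked_identifiers(real_ids, masked_values):
    """Any real direct identifier surviving in the masked data is a failure."""
    real = set(real_ids)
    return [v for v in masked_values if v in real]

# Shuffled diagnosis codes keep the same distribution: drift is zero.
real_dx = ["J45", "E11", "J45", "I10"]
masked_dx = ["E11", "J45", "I10", "J45"]
assert distribution_drift(real_dx, masked_dx) == 0.0

# No real name should appear anywhere in the masked output.
assert leaked_identifiers(["Alice Smith"], ["PATIENT-0001", "PATIENT-0002"]) == []
```

In practice you'd set a drift tolerance rather than require exact equality, and run the leak scan over every free-text and identifier column, failing the pipeline on any hit.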
For reversible techniques (FPE, tokenization), enforce strict key access controls. Rotate keys periodically and log all decryption requests. Keep your masking configuration repositories versioned and limit who can modify rules.
Tonic.ai's platform addresses HITRUST control objectives by eliminating production PHI from development environments. Tonic Structural generates synthetic data from real data for complex test data management and effective end-to-end testing, Tonic Textual redacts and synthesizes unstructured data where PHI often hides, and Tonic Fabricate generates fully synthetic datasets from scratch.
Integrate all three into your CI/CD pipeline to automate compliant test data provisioning for every build. This satisfies continuous monitoring requirements, generates audit-ready evidence, and accelerates secure development without manual data requests.
Ready to accelerate HITRUST certification? Book a demo to see how Tonic.ai helps healthcare teams eliminate PHI from non-production environments.
Chiara Colombi is the Director of Product Marketing at Tonic.ai. As one of the company's earliest employees, she has led its content strategy since day one, overseeing the development of all product-related content and virtual events. With two decades of experience in corporate communications, Chiara's career has consistently focused on content creation and product messaging. Fluent in multiple languages, she brings a global perspective to her work and specializes in translating complex technical concepts into clear and accessible information for her audience. Beyond her role at Tonic.ai, she is a published author of several children's books which have been recognized on Amazon Editors’ “Best of the Year” lists.
