Building a Data De-Identification Infrastructure

Ian Coe
August 7, 2020

    At the request of some of our customers, we’ve put together this step-by-step overview of the considerations and actions involved in designing and building a data de-identification infrastructure, along with a summary of what the ideal tool should deliver to make it truly worth the work. By the end, you should have a pretty solid idea of whether simply inserting NULL into a few columns will cut it for you.

    From regulatory compliance to streamlining testing, from facilitating data access to advancing external partnerships, or simply a desire to do right by your customers, the reasons for doubling down on data de-identification are many. Here are the resources you'll need to put in, and the considerations to weigh, to get the securely anonymized data your company requires out the other end.

    Step #1 - Identify your target

    The first and most important consideration is the goal and ultimate destination of the data. You’ll likely need more robust privacy protections to share HIPAA-protected data with third parties than to share usage data with a small number of internal folks for testing. That said, most of the regulations have borrowed heavily from one another, so the recommended best practices for HIPAA, CCPA, GDPR, SOC 2, and ISO 27001 all suggest using de-identified data whenever possible, e.g., in test and development, and generally don’t require onerous reporting once the data is considered anonymized. So you can kill a lot of birds (or tofu chickens) with one (carefully considered) stone. Once you’ve determined your goal, it’s relatively easy to identify the best practices that will support your needs.

    Step #2 - Get your ducks in a row

    In most cases, a variety of teams pull from production, whether for testing, development, or analytics, so the simplest approach is to create one or more new versions of production with de-identified data. If you already have a fast follower or multiple copies of production, it may make sense to anonymize in place as opposed to making yet another copy. This depends largely on storage costs and data size. If your data is large (>5TB), you’ll likely want to explore in-place options, as copying the data will be prohibitively time consuming and you’ll miss your refresh schedule targets.

    Now is when you should scan your database for PII. Frequently, PII is buried in text blobs, JSON, or XML, in addition to dedicated columns. A number of tools can assist with this by inspecting both data contents and column headers. Depending on your goals for the data, you may need to go a step further and consider what metadata or general patterns in the data could be used to reverse engineer your anonymization efforts. For example, time series data is often particularly risky because it provides so many touch points for reverse engineering.
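    To make the scanning step concrete, here's a minimal sketch of content-based PII detection. The regex patterns and function names are illustrative assumptions, not any particular tool's API; a real scanner would use far more patterns plus column-name heuristics and NLP for free text.

```python
import re

# Hypothetical patterns for a few common PII types; real scanners
# use many more, plus column-name heuristics and NLP on free text.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_value(value: str) -> set:
    """Return the set of PII types detected in a single cell value."""
    return {name for name, pat in PII_PATTERNS.items() if pat.search(value)}

def scan_rows(rows: list) -> dict:
    """Scan every column of every row and report PII types per column.
    This catches PII buried inside text blobs, not just the columns
    whose names make the PII obvious."""
    report = {}
    for row in rows:
        for column, value in row.items():
            hits = scan_value(str(value))
            if hits:
                report.setdefault(column, set()).update(hits)
    return report
```

    Running this over a sample of rows, rather than relying on column names alone, is what surfaces PII hiding in notes fields and serialized blobs.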

    If you’re aiming to adhere to GDPR or CCPA, you’ll need to take additional steps beyond simply redacting PII as the law still holds companies accountable for breaches if the data is susceptible to reverse engineering. 

    Step #3 - Start the “real” work

    The first step here is to choose an appropriate way to model the data that satisfies both compliance and usability requirements. At the very least, you should hash the identified PII, but it’s often preferable to replace this data with realistic-looking data so it can still be used in downstream workflows. For more secure layers of protection or advanced data modeling, you'll need to build appropriate data generators here as well.
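    The contrast between the two baselines can be sketched in a few lines. Everything here is illustrative: the salt, the tiny name pool, and the helper names are assumptions, not a prescribed implementation.

```python
import hashlib

# Hypothetical pool of realistic-looking replacements; real tools
# draw from much larger dictionaries.
FAKE_NAMES = ["Avery Lane", "Jordan Reyes", "Sam Okafor", "Riley Chen"]

def hash_pii(value: str, salt: str = "rotate-me") -> str:
    """Baseline protection: replace PII with an opaque salted hash.
    Compliant, but useless for UI testing or demos."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def realistic_replace(value: str) -> str:
    """Often preferable: map each input to a realistic-looking value.
    Hashing the input to pick the replacement makes the mapping
    consistent, so the same name always masks the same way."""
    index = int(hashlib.sha256(value.encode()).hexdigest(), 16) % len(FAKE_NAMES)
    return FAKE_NAMES[index]
```

    The realistic replacement keeps downstream workflows (UI tests, demos, analytics smoke tests) working on data that still looks like data.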

    Technical considerations you’ll need to answer: 

    • What is the necessary refresh cadence? Can you copy all of the data in a nightly process?
    • How do you ensure processing is fast enough even on the largest databases?
    • Are there datasets that need to be merged with your data?
    • Do you want all tables to have the same refresh schedule?
    • What does the reliability of that process have to be?
    • How will you handle bad data? 
    • What data sources will you support (databases, flat files, s3 buckets)?
    • Do multiple data sources need to align for join purposes?
    • Are you going to replicate all database objects, such as stored procedures and triggers?
    • Does data generation need to be deterministic or non-deterministic?
    • Do you need to mask JSON or XML within cells?
    • Do you need to mask free text within cells?
    • Do you want to support incremental updates? 
    • Do you want to replicate the full data set or just a subset? (See our post on subsetting)
    • Will you need to scale the data for performance testing purposes?
    • Do you need different versions of the data for different groups/use cases? How will you manage those different transformations?
    • How will you simulate time series data? 
    • What level of privacy do you want to guarantee?
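    One question above, deterministic vs. non-deterministic generation, ties directly to the join-alignment question: a keyed hash (HMAC) maps each input to the same masked output on every run, so a key column masked in two different sources still joins. A minimal sketch, with a placeholder secret key and hypothetical helper name:

```python
import hmac
import hashlib

# Placeholder; real deployments need actual key management.
SECRET_KEY = b"per-environment-secret"

def deterministic_mask(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed, deterministic masking: identical inputs always produce
    identical outputs, so foreign keys and cross-source joins survive,
    while the secret key keeps the mapping non-reversible without it."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:12]
```

    Masking `user_id` this way in both the users table and the orders table preserves the join; a non-deterministic generator would break it. The trade-off is that determinism leaks equality (two rows with the same masked value had the same original value), which matters for your privacy guarantees.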

    And diving more deeply, data usability and modeling requirements to consider: 

    Maintaining foreign key relationships

    • Are there important cross table relationships to preserve?

    Intercolumn relationships and dependencies

    • Are there columns that follow specific rules, e.g., they’re a combination of other columns?
    • Are there columns that relate to each other but don’t follow specific rules, e.g., home value and income?
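    The home-value-and-income case can be sketched by generating one column and deriving the other with noise, so the pair stays plausibly correlated without following an exact rule. The ranges and the 3-5x ratio below are invented for illustration:

```python
import random

def generate_household(rng: random.Random) -> dict:
    """Sketch of correlated column generation: draw income first, then
    derive home value as a noisy multiple of income. The columns relate
    statistically without obeying a fixed formula."""
    income = rng.uniform(30_000, 250_000)
    home_value = income * rng.uniform(3.0, 5.0)  # assumed 3-5x ratio
    return {"income": round(income), "home_value": round(home_value)}
```

    Generating the columns independently instead would produce rows like a $40k income with a $2M home, which downstream analytics would immediately flag as unrealistic.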

    Event sequence relationships

    • Are there events that must happen in a specific order, e.g., diagnosis must come after an office visit?
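    The diagnosis-after-visit constraint can be enforced by construction: generate the earlier event first, then offset the later one from it. A minimal sketch with invented event names and date ranges:

```python
import random
from datetime import date, timedelta

def generate_patient_events(rng: random.Random, start: date) -> list:
    """Sketch of order-preserving event generation: the diagnosis date
    is derived from the visit date with a strictly positive offset, so
    the required sequence can never be violated."""
    visit = start + timedelta(days=rng.randint(0, 365))
    diagnosis = visit + timedelta(days=rng.randint(1, 14))
    return [("office_visit", visit), ("diagnosis", diagnosis)]
```

    Deriving later events from earlier ones, rather than sampling each timestamp independently, is the general pattern for keeping synthetic event sequences valid.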

    Step #4 - Validation

    After performing the necessary transformations, you’ll need to consider how to handle long-term viability and potential reporting requirements, whether to management or for audit requests. To ensure privacy, it’s crucial that you continually track PII and set up alerts for any schema updates, as these are an often-overlooked source of sensitive data leaks.
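    The schema-alerting idea reduces to diffing the last-reviewed schema against the live one. A minimal sketch, with a hypothetical function name and schemas represented as plain dicts of table to column names:

```python
def detect_schema_drift(known: dict, current: dict) -> dict:
    """Compare the last-reviewed schema against the live one and report
    columns that have appeared since. New columns are a common way for
    PII to slip past de-identification unnoticed, so any drift should
    trigger an alert or block the next generation run until reviewed."""
    drift = {}
    for table, cols in current.items():
        new_cols = [c for c in cols if c not in known.get(table, [])]
        if new_cols:
            drift[table] = new_cols
    return drift
```

    Running a check like this before each refresh, and failing closed on drift, is what turns "track PII continually" from a policy into a mechanism.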

    For GDPR and CCPA in particular, it’s helpful to record metrics on the data’s resistance to reverse engineering and to keep tabs on those metrics over time.

    Security/regulatory requirements to consider:

    • What metrics will you use to demonstrate the security of the de-identified data to auditors/management?
    • How will you report on them?
    • Can you make use of the data after Safe Harbor de-identification, or do you need to follow a different process and get sign-off from an expert?

    SOC 2

    • How will you track which systems have production data?

    Reporting requirements

    • PII location
    • Security metrics
    • Data transform logs

    Handling schema updates

    • Automatically redact new columns using a model?
    • Block generation until a human has confirmed the change?
    • Automatically add new columns, but as null?
    • Who has access to generate new transformed sources, and how are those changes managed?

    What Success Looks Like

    If you’ve answered all of the above and built a tool that meets all your requirements, what should it be able to do for you? Here’s a general idea of the features of a quality data de-identification infrastructure that will serve your company well.

    • Advanced data modeling with cross table capabilities to maintain consistency
    • Column linking to preserve relationships
    • Automated PII detection
    • PII reporting
    • Database subsetting
    • Scaling
    • Ability to handle JSON and XML
    • Ability to handle event pipelines

    These are the considerations we’ve weighed and the features we’ve developed in building Tonic. The strength of our platform comes in part from having had to ask these questions across a myriad of situations and databases. Each answer we’ve reached has expanded our capabilities to handle the most complex and unexpected challenges in data de-identification.

    If you’re feeling overwhelmed or have questions about the above process, we’re happy to chat and see how we can make your life easier. Reach out to us, and we’ll get your questions answered.

    Ian Coe
    Co-Founder & CEO
    Ian Coe is the Co-founder and CEO of Tonic. For over a decade, Ian has worked to advance data-driven solutions by removing barriers impeding teams from answering their most important questions. As an early member of the commercial division at Palantir Technologies, he led teams solving data integration challenges in industries ranging from financial services to the media. At Tableau, he continued to focus on analytics, driving the vision around statistics and the calculation language. As a founder of Tonic, Ian is directing his energies toward synthetic data generation to maximize data utility, protect customer privacy, and drive developer efficiency.

    Fake your world a better place

    Enable your developers, unblock your data scientists, and respect data privacy as a human right.