Building a Data De-Identification Infrastructure

Ian Coe
August 7, 2020

    At the request of some of our customers, we’ve put together this step-by-step overview of the considerations and actions involved in designing and building a data de-identification infrastructure, along with a summary of what the ideal tool should deliver to make it truly worth the work. By the end, you should have a pretty solid idea of whether simply inserting NULL into a few columns will cut it for you.

    From regulatory compliance to streamlining testing, from facilitating data access to advancing external partnerships, or simply a desire to do right by your customers, the reasons for doubling down on data de-identification are many. Here are the resources you'll need to put in, and the considerations to weigh, to get the securely anonymized data your company requires out the other end.

    Step #1 - Identify your target

    The first and most important consideration is the goal and ultimate destination of the data. You’ll likely need more robust privacy protections to share HIPAA-protected data with third parties than to share usage data with a small number of internal folks for testing. That said, most of the regulations have borrowed heavily from one another, so the recommended best practices for HIPAA, CCPA, GDPR, SOC 2, and ISO 27001 all suggest using de-identified data whenever possible, e.g., in test and development, and generally don’t require onerous reporting once the data is considered anonymized. So you can kill a lot of birds (or tofu chickens) with one (carefully considered) stone. Once you’ve determined your goal, it’s relatively easy to identify the best practices that will support your needs.

    Step #2 - Get your ducks in a row

    In most cases, a variety of teams pull from production, whether for testing, development, or analytics, so the simplest approach is to create one or more new versions of production with de-identified data. If you already have a fast follower or multiple copies of production, it may make sense to anonymize in place as opposed to making yet another copy. This depends largely on storage costs and data size. If your data is large (>5TB), you’ll likely want to explore in-place options, as copying the data will be prohibitively time consuming and you’ll miss your refresh schedule targets.

    Now is when you should scan your database for PII. Frequently, PII is buried in text blobs, JSON, or XML, in addition to dedicated columns. A number of tools can assist with this by inspecting both data contents and column headers. Depending on your goals for the data, you may need to go a step further and consider what metadata or general patterns in the data could be used to reverse engineer your anonymization efforts. For example, time series data is often particularly risky because it provides so many touch points for reverse engineering.
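    To make the scanning step concrete, here's a minimal sketch of content-based PII detection. The regex patterns and function names are illustrative assumptions, not any particular tool's API; a real scanner would use far more patterns plus column-name heuristics and NLP for free text.

```python
import re

# Hypothetical patterns for a few common PII types; real scanners
# use many more, plus column-name heuristics and NLP on free text.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_value(value: str) -> set:
    """Return the set of PII types detected in a single cell value."""
    return {name for name, pat in PII_PATTERNS.items() if pat.search(value)}

def scan_rows(rows: list) -> dict:
    """Scan every column of every row and report PII types per column.
    This catches PII buried inside text blobs, not just the columns
    whose names make the PII obvious."""
    report = {}
    for row in rows:
        for column, value in row.items():
            hits = scan_value(str(value))
            if hits:
                report.setdefault(column, set()).update(hits)
    return report
```

    Running this over a sample of rows, rather than relying on column names alone, is what surfaces PII hiding in notes fields and serialized blobs.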

    If you’re aiming to adhere to GDPR or CCPA, you’ll need to take additional steps beyond simply redacting PII as the law still holds companies accountable for breaches if the data is susceptible to reverse engineering. 

    Step #3 - Start the “real” work

    The first step here is to choose an appropriate way to model the data that satisfies both compliance and usability requirements. At the very least, you should hash the identified PII, but it’s often preferable to replace this data with realistic-looking data so it can still be used in downstream workflows. For more secure layers of protection or advanced data modeling, you'll need to build appropriate data generators here as well.
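    The contrast between the two baselines can be sketched in a few lines. Everything here is illustrative: the salt, the tiny name pool, and the helper names are assumptions, not a prescribed implementation.

```python
import hashlib

# Hypothetical pool of realistic-looking replacements; real tools
# draw from much larger dictionaries.
FAKE_NAMES = ["Avery Lane", "Jordan Reyes", "Sam Okafor", "Riley Chen"]

def hash_pii(value: str, salt: str = "rotate-me") -> str:
    """Baseline protection: replace PII with an opaque salted hash.
    Compliant, but useless for UI testing or demos."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def realistic_replace(value: str) -> str:
    """Often preferable: map each input to a realistic-looking value.
    Hashing the input to pick the replacement makes the mapping
    consistent, so the same name always masks the same way."""
    index = int(hashlib.sha256(value.encode()).hexdigest(), 16) % len(FAKE_NAMES)
    return FAKE_NAMES[index]
```

    The realistic replacement keeps downstream workflows (UI tests, demos, analytics smoke tests) working on data that still looks like data.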

    Technical considerations you’ll need to answer: 

    • What is the necessary refresh cadence? Can you copy all of the data in a nightly process?
    • How do you ensure processing is fast enough even on the largest databases?
    • Are there datasets that need to be merged with your data?
    • Do you want all tables to have the same refresh schedule?
    • What does the reliability of that process have to be?
    • How will you handle bad data? 
    • What data sources will you support (databases, flat files, s3 buckets)?
    • Do multiple data sources need to align for join purposes?
    • Are you going to replicate all database objects, such as stored procedures and triggers?
    • Does data generation need to be deterministic or non-deterministic?
    • Do you need to mask JSON or XML within cells?
    • Do you need to mask free text within cells?
    • Do you want to support incremental updates? 
    • Do you want to replicate the full data set or just a subset? (See our post on subsetting)
    • Will you need to scale the data for performance testing purposes?
    • Do you need different versions of the data for different groups/use cases? How will you manage those different transformations?
    • How will you simulate time series data? 
    • What level of privacy do you want to guarantee?
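    One question above, deterministic vs. non-deterministic generation, ties directly to the join-alignment question: a keyed hash (HMAC) maps each input to the same masked output on every run, so a key column masked in two different sources still joins. A minimal sketch, with a placeholder secret key and hypothetical helper name:

```python
import hmac
import hashlib

# Placeholder; real deployments need actual key management.
SECRET_KEY = b"per-environment-secret"

def deterministic_mask(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed, deterministic masking: identical inputs always produce
    identical outputs, so foreign keys and cross-source joins survive,
    while the secret key keeps the mapping non-reversible without it."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:12]
```

    Masking `user_id` this way in both the users table and the orders table preserves the join; a non-deterministic generator would break it. The trade-off is that determinism leaks equality (two rows with the same masked value had the same original value), which matters for your privacy guarantees.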

    And diving more deeply, data usability and modeling requirements to consider: 

    Maintaining foreign key relationships

    • Are there important cross table relationships to preserve?

    Intercolumn relationships and dependencies

    • Are there columns that follow specific rules, e.g., they’re a combination of other columns?
    • Are there columns that relate to each other but don’t follow specific rules, e.g., home value and income?
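    The home-value-and-income case can be sketched by generating one column and deriving the other with noise, so the pair stays plausibly correlated without following an exact rule. The ranges and the 3-5x ratio below are invented for illustration:

```python
import random

def generate_household(rng: random.Random) -> dict:
    """Sketch of correlated column generation: draw income first, then
    derive home value as a noisy multiple of income. The columns relate
    statistically without obeying a fixed formula."""
    income = rng.uniform(30_000, 250_000)
    home_value = income * rng.uniform(3.0, 5.0)  # assumed 3-5x ratio
    return {"income": round(income), "home_value": round(home_value)}
```

    Generating the columns independently instead would produce rows like a $40k income with a $2M home, which downstream analytics would immediately flag as unrealistic.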

    Event sequence relationships

    • Are there events that must happen in a specific order, e.g., diagnosis must come after an office visit?
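    The diagnosis-after-visit constraint can be enforced by construction: generate the earlier event first, then offset the later one from it. A minimal sketch with invented event names and date ranges:

```python
import random
from datetime import date, timedelta

def generate_patient_events(rng: random.Random, start: date) -> list:
    """Sketch of order-preserving event generation: the diagnosis date
    is derived from the visit date with a strictly positive offset, so
    the required sequence can never be violated."""
    visit = start + timedelta(days=rng.randint(0, 365))
    diagnosis = visit + timedelta(days=rng.randint(1, 14))
    return [("office_visit", visit), ("diagnosis", diagnosis)]
```

    Deriving later events from earlier ones, rather than sampling each timestamp independently, is the general pattern for keeping synthetic event sequences valid.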

    Step #4 - Validation

    After performing the necessary transformations, you’ll need to consider how to handle long-term viability and potential reporting requirements, whether to management or for audit requests. To ensure privacy, it’s crucial that you continually track PII and set up alerts for any schema updates, as these are an often-overlooked source of sensitive data leaks.
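    The schema-alerting idea reduces to diffing the last-reviewed schema against the live one. A minimal sketch, with a hypothetical function name and schemas represented as plain dicts of table to column names:

```python
def detect_schema_drift(known: dict, current: dict) -> dict:
    """Compare the last-reviewed schema against the live one and report
    columns that have appeared since. New columns are a common way for
    PII to slip past de-identification unnoticed, so any drift should
    trigger an alert or block the next generation run until reviewed."""
    drift = {}
    for table, cols in current.items():
        new_cols = [c for c in cols if c not in known.get(table, [])]
        if new_cols:
            drift[table] = new_cols
    return drift
```

    Running a check like this before each refresh, and failing closed on drift, is what turns "track PII continually" from a policy into a mechanism.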

    For GDPR and CCPA in particular, it’s helpful to record metrics on the data’s resistance to reverse engineering and to keep tabs on those metrics over time.

    Security/regulatory requirements to consider:

    • What metrics will you use to demonstrate the security of the de-identified data to auditors/management?
    • How will you report on them?
    • Can you make use of the data after Safe Harbor de-identification, or do you need to follow a different process and get sign-off from an expert?

    SOC 2

    • How will you track which systems have production data?

    Reporting requirements

    • PII location
    • Security metrics
    • Data transform logs

    Handling schema updates

    • Automatically redact new columns using a model?
    • Block generation until a human has confirmed the change?
    • Automatically add new columns, but as null?
    • Who has access to generate new transformed sources, and how are those changes managed?

    What Success Looks Like

    If you’ve answered all of the above and built a tool that meets all your requirements, what should it be able to do for you? Here’s a general idea of the features of a quality data de-identification infrastructure that will serve your company well.

    • Advanced data modeling with cross table capabilities to maintain consistency
    • Column linking to preserve relationships
    • Automated PII detection
    • PII reporting
    • Database subsetting
    • Scaling
    • Ability to handle JSON and XML
    • Ability to handle event pipelines

    These are the considerations we’ve weighed and the features we’ve developed in building Tonic. The strength of our platform comes in part from having had to ask these questions across a myriad of situations and databases. Each answer we’ve reached has expanded our capabilities to handle the most complex and unexpected challenges in data de-identification.

    If you’re feeling overwhelmed or have questions about the above process, we’re happy to chat and see how we can make your life easier. Reach out to us, and we’ll get your questions answered.

    Ian Coe
    Co-Founder & CEO
    Ian Coe is the Co-founder and CEO of Tonic. For over a decade, Ian has worked to advance data-driven solutions by removing barriers impeding teams from answering their most important questions. As an early member of the commercial division at Palantir Technologies, he led teams solving data integration challenges in industries ranging from financial services to the media. At Tableau, he continued to focus on analytics, driving the vision around statistics and the calculation language. As a founder of Tonic, Ian is directing his energies toward synthetic data generation to maximize data utility, protect customer privacy, and drive developer efficiency.

    Fake your world a better place

    Enable your developers, unblock your data scientists, and respect data privacy as a human right.