Data masking is one of the most common approaches developers rely on to get safe and effective test data when and where they need it. 

And yes, it’s exactly what it sounds like. Data masking is the art of dressing real data up with a compelling disguise to protect its underlying identity. 

What Is Data Masking?

Data masking is a method of data anonymization that performs a one-to-one transformation to replace or “mask” your original data with fake data. The goal of data masking is twofold: first, to protect sensitive data against exposure, and second, to create representative data that can be used effectively in place of real data. Uses for masked data include customer support, sales demos, and, as we’ll explore in more detail below, software development and testing.


You can mask data via any number of methods, including replacing, randomizing, or redacting it. Some of these methods are more effective than others, and which one you use will vary based on your goals. 
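To make that one-to-one idea concrete, here’s a minimal Python sketch (the key and token format are hypothetical, purely for illustration) that uses a keyed hash so each real value always maps to the same fake token:

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it lives outside the masked dataset.
MASKING_KEY = b"rotate-me-and-store-me-elsewhere"

def mask_value(value: str) -> str:
    """Deterministically replace a real value with a fake-but-stable token."""
    digest = hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()
    # One-to-one for a given key: the same input always yields the same
    # output, which keeps joins and foreign keys intact across tables.
    return f"user_{digest[:10]}"

print(mask_value("ada@example.com"))  # a stable token for this key
print(mask_value("ada@example.com"))  # identical to the line above
```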

Why Is Data Masking Important?

Thanks to an increasing focus on and regulations around data privacy, data masking has become a critical process within many organizational workflows, in particular within the data pipelines that fuel software development, testing, and QA. Software developers require high-quality, realistic data throughout their lower environments, but using sensitive production data is no longer an option, legally or ethically. A comprehensive data masking solution ensures that developers have the data they need to build and test their products effectively.

The benefits of masking production data are many. In addition to streamlining data provisioning for lower environments, high-quality masked data boosts developer productivity by accelerating development and release cycles and decreasing the number of bugs that get shipped to production. Data masking can also unblock wider resources by enabling offshore development teams to work with realistic data without infringing data privacy regulations.

These regulations, including GDPR in Europe, CCPA in California, and HIPAA across the United States, have increased the urgency for integrating data masking processes within the software development lifecycle. At the same time, they have also increased the demands on the rigor of those processes. In-house solutions and outdated, legacy tooling aren’t enough to satisfy the complex masking requirements of today’s agile developers who need to keep sensitive data out of their lower environments.

Which Types Of Data Require Data Masking?

Aside from the practical benefits of developer productivity outlined above, data masking may be a legal requirement depending on where you operate, what industry you work in, and what kind of data you work with. 

In the United States, for example, developers whose products run on protected health information must work only with masked data in their staging and QA environments, thanks to HIPAA. Europe’s GDPR and California’s CCPA, meanwhile, cast a much broader net, defining sensitive data to include identifiers, geolocation data, biometric data, education- and employment-related information, and demographic data.

Given the variety of the data types involved, a variety of approaches are required to satisfy all use cases. Let’s look at these approaches from two angles: first, the high-level method of provisioning masked data from its source to its user; and second, the techniques applied to create the masked data along the way.

Types Of Masking

Masking production data comes in many different forms—there isn't one single method known as “data masking”. At a high level, there are four primary approaches to data masking: static data masking, dynamic data masking, on-the-fly data masking, and synthetic data generation (kind of). Each of these types of data masking has specific use cases, and each has key considerations to keep in mind when deciding which approach is best for a given use case.

1. Static Data Masking

Static data masking involves creating a mimicked database or a replica of your original data, which contains your masked data. If you’re in software development, QA, or testing, this is the data masking solution you’re looking for. (Seriously.) By storing your masked data in its own database, this approach generates masked data that is read/write, meaning that you can work with it in your lower environments as naturally as production data flows through production environments. Statically masked data preserves the format, relationships, and referential integrity of its real-world source, to ensure that it looks, feels, and behaves like production data.  

Static data masking involves creating a mimicked database of fake data that realistically mirrors the values, format, and relationships of your original data while protecting sensitive information. 

This mimicked database can be referred to as the test environment or test database in software testing. With this method, your original data stays safe and secure, and your development team has the realistic data they need to get their work done.
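To get a feel for what this looks like in practice, here’s a minimal Python sketch (the table, columns, and fake values are all hypothetical) that copies a source table into a separate test database, masking sensitive columns along the way:

```python
import sqlite3

# A stand-in for your production source; hypothetical schema and data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
src.execute("INSERT INTO users VALUES (1, 'Ada Lovelace', 'ada@example.com')")

# The static replica: a separate database (persistent in real life) that
# only ever receives masked rows.
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
for user_id, name, email in src.execute("SELECT id, name, email FROM users"):
    # Keep ids for referential integrity; fake the sensitive columns
    # while preserving their format.
    replica.execute("INSERT INTO users VALUES (?, ?, ?)",
                    (user_id, f"User {user_id}", f"user{user_id}@example.com"))
replica.commit()

# Unlike a dynamic-masking view, the replica is read/write:
replica.execute("UPDATE users SET name = 'Renamed In A Test' WHERE id = 1")
```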

2. Dynamic Data Masking

Dynamic data masking is a cursory method of masking your data that simply alters information as users access it. It sets up a proxy through which users can (or cannot) view sensitive information. Functionally, this masking system works in real time and, in theory, ensures that only authorized users can view the original data.

Dynamic data masking hides sensitive data from unauthorized users through a real-time proxy.

The downside to this, of course, is that this kind of system relies on authorization protocols, and isn’t actually accessible or functional in lower dev environments. The masked data it creates is read-only: it can’t be written to or altered because it isn’t stored anywhere. In other words, this approach is of little use to developers and offers no option for getting developers access to datasets for their local environments. What’s more, configuring its masking rules can require significant maintenance.

Data that has been masked dynamically is ideal for quick dives into sensitive data in relation to customer service inquiries, reporting and data analytics, and business intelligence needs, but it won’t work for software development and testing that requires read/write access to a stored database.
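As a rough illustration of why dynamically masked data is read-only, here’s a toy Python proxy (the roles and masking rule are invented for the example): the stored value never changes, and unauthorized callers just get a masked projection computed at read time:

```python
def read_email(row: dict, role: str) -> str:
    """A toy stand-in for a dynamic-masking proxy."""
    if role == "admin":  # hypothetical authorization rule
        return row["email"]
    # Computed on the fly and never stored, hence read-only.
    local, _, domain = row["email"].partition("@")
    return f"{local[0]}***@{domain}"

row = {"email": "ada@example.com"}
print(read_email(row, "admin"))    # ada@example.com
print(read_email(row, "support"))  # a***@example.com
```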

3. On-The-Fly Data Masking

The next masking system is on-the-fly data masking, which can to a certain degree be thought of as a combination of the dynamic and static methods. This data masking solution helps ensure that only masked data reaches its target by altering sensitive data in transit. Because the masking happens in motion, unmasked data never needs to be stored at the destination.

On-the-fly data masking alters sensitive data in transit so that only masked data arrives at the destination.

Due to its in-transit mechanism, the on-the-fly method is optimal for large data migrations between different databases or larger entities with data stored on multiple databases. 
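A bare-bones sketch of the in-transit idea in Python (the field names are hypothetical): rows stream out of the source, get masked mid-flight, and only masked records ever reach the target:

```python
from typing import Iterator

def mask_in_transit(rows: Iterator[dict]) -> Iterator[dict]:
    """Mask records as they stream from source to target, so unmasked
    data never lands in the destination database."""
    for row in rows:
        yield {**row, "ssn": "***-**-" + row["ssn"][-4:]}

# Hypothetical migration: extract, mask mid-flight, then load.
source_rows = iter([{"id": 1, "ssn": "123-45-6789"}])
for masked in mask_in_transit(source_rows):
    print(masked)  # {'id': 1, 'ssn': '***-**-6789'}
```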

Pro tip: Some static data masking tools provide (and encourage) on-the-fly use. Tonic, for example, enables masking to be run on demand, whenever users need updated masked data. You can refresh your masked test data when you need it, where you need it, within just a few minutes (depending on the size of your DB). 

4. Synthetic Data Generation 

Now, before we move on, let’s pause to discuss synthetic data generation. This is not a method of data masking, but it is a solid solution for developers who need test data that’s realistic and safe.

Like data masking, data synthesis comes in a variety of flavors. The data synthesis approach that comes closest to data masking is model-based data synthesis. This involves using ML-driven algorithms to train a data model on an existing dataset in a way that captures the relationships and distributions within that data. This model can then be used to generate high-fidelity synthetic datasets that function just like production data, but don’t contain a drop of real data within their records.

Synthetic data generation creates an entirely new dataset that maintains all the relationships and distributions in your old data for statistical purposes, but contains no real data and cannot be tied back to any individual records.


It’s important to note that not all data types require this more intensive approach to capturing the statistical realism found within their values. The vast majority of the data that developers need to protect in their lower environments can be fully protected and realistically mimicked by way of data masking. But an ideal data generation solution for developers will offer both options, to cover the full range of their data types and use cases.
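As a loose illustration only (real model-based synthesis relies on far more capable ML models), here’s a Python sketch that “trains” the simplest possible model, a mean vector and covariance matrix over two hypothetical numeric columns, and then samples brand-new rows that preserve the same statistical relationships:

```python
import numpy as np  # third-party dependency

rng = np.random.default_rng(0)

# Hypothetical "production" columns: age, and spend correlated with age.
age = rng.normal(40, 10, 1000)
spend = 120 * age + rng.normal(0, 500, 1000)
real = np.column_stack([age, spend])

# "Train" the model: capture the columns' means and their covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample entirely new rows: same distributions and correlation,
# but no row corresponds to any real individual.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
```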

Now that we’ve covered the high-level methods to mask your data, let’s look at some of the techniques you can apply within the realm of data masking. 

Data Masking Techniques

Here are some of the most common techniques that you can use to mask your data.

1. Data Encryption

Data encryption is a common data security strategy that protects sensitive information. This method of protecting data uses mathematical algorithms to scramble data, which can then only be unlocked using a cryptographic key. In theory, only authorized users will be able to access encrypted data, but in practice, anybody with the password or key can access your data, regardless of how they received it. 

This method does not exactly mask your data, so much as it hides it behind lock and key—and that key can be found and used by bad actors if not properly protected. To fully mask data by way of encryption, there can be no master key and the encrypted data cannot be reverted to the original.
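For reference, here’s what that lock-and-key property looks like in Python, using the third-party cryptography package; note that the ciphertext is fully reversible for anyone holding the key, which is exactly why encryption alone isn’t masking:

```python
# Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # whoever holds this can decrypt everything
f = Fernet(key)

token = f.encrypt(b"ada@example.com")
print(token)              # unreadable without the key
print(f.decrypt(token))   # b'ada@example.com' -- fully reversible
```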

2. Data Redaction

Data redaction is one of the most famous ways to protect data, and also arguably the least useful, at least for a developer. Redaction involves replacing original data with generic figures like x’s or, famously, blackened bars. Redaction can be a useful way to mask sensitive data in a testing environment if you don’t care about realism. 

One key feature of data redaction is that the replacement values bear no resemblance to real data, which makes the redaction obvious. If the replacement does resemble real data, it becomes substitution, not redaction. (See below.)
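A minimal redaction sketch in Python: every letter and digit becomes an “x,” so the shape of each field survives but nothing realistic remains:

```python
import re

def redact(value: str) -> str:
    """Replace every letter and digit with 'x', keeping punctuation
    so the field's shape survives but nothing realistic remains."""
    return re.sub(r"[A-Za-z0-9]", "x", value)

print(redact("Ada Lovelace"))  # xxx xxxxxxxx
print(redact("123-45-6789"))   # xxx-xx-xxxx
```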

3. Data Substitution

The data substitution method is similar to the data redaction method; however, rather than a blatant redaction of original data, the data is replaced with an alternative value. The purpose of substitution over redaction is realism. The goal is ultimately to provide a dataset that safeguards the original data against exposure, while also preserving its utility in development and testing. 
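Here’s a toy substitution sketch in Python (the replacement pool is hypothetical; real tools draw on large dictionaries or libraries like Faker). It reuses the same replacement for repeated inputs so relationships in the data stay intact:

```python
import random

# Hypothetical pool of realistic replacement values.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]

_substitutions = {}

def substitute(value: str) -> str:
    """Swap a real value for a realistic alternative, reusing the same
    replacement for repeated inputs to keep relationships intact."""
    if value not in _substitutions:
        _substitutions[value] = random.choice(FIRST_NAMES)
    return _substitutions[value]

print(substitute("Ada"))  # e.g. Jordan
print(substitute("Ada"))  # the same replacement again
```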

4. Data Shuffling

Data shuffling is the process of moving the values around within a column. This ensures realism in that the original values still appear within the column, but they are no longer necessarily tied to the same records. This can be useful when working with categorical data, whose values and distribution need to be preserved. That said, it is not the most secure method of data masking.  
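A quick Python illustration with made-up rows: the real ages survive as a set, preserving the column’s distribution, but they’re no longer attached to their original owners:

```python
import random

# Hypothetical (name, age) rows.
rows = [("Ada", 36), ("Grace", 45), ("Edsger", 72)]

ages = [age for _, age in rows]
random.shuffle(ages)  # same values, same distribution...
shuffled = [(name, age) for (name, _), age in zip(rows, ages)]
print(shuffled)       # ...but detached from their original records
```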

5. Data Scrambling

Data scrambling neither redacts, substitutes, nor shuffles data; instead, it scrambles the characters and numbers within each value to change it. The original value does not remain in any field, making it distinct from data shuffling, above.
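And a minimal scrambling sketch in Python: the characters within each value are reordered, so no field retains its original value:

```python
import random

def scramble(value: str) -> str:
    """Randomly reorder the characters within a single value; unlike
    shuffling, the original value survives in no field."""
    chars = list(value)
    random.shuffle(chars)
    return "".join(chars)

print(scramble("4155551234"))  # the same digits, in a new order
```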

FAQ

Okay, we’ve covered a lot. Let’s take a pit stop to review.

What Is The Meaning Of Data Masking?

Data masking is the process of obfuscating your data with a one-to-one transformation to replace or ‘mask’ your original data with fake data. This data can be generated by custom scripts, open source tools, or a comprehensive platform built to optimize data realism and data privacy.

There are three high-level approaches available to mask your data: static, dynamic, and on-the-fly. Data synthesis is often used to create realistic fake data as well, with the key difference being that its output cannot be tied back one-to-one to any records in the real data from which it was created. Within these approaches, there are a number of commonly used techniques to perform the masking: data encryption, data redaction, data substitution, data shuffling, and data scrambling.

While each of these methods and techniques has its own definition and its own way of protecting your sensitive data, they all fall under the same 'data masking' umbrella and are used for data anonymization.

Why Is Data Masking Important?

Data masking is essential for developers who want to work quickly, efficiently, and compliantly. The recent rise in data privacy regulations has restricted access to production data for developers who need to build and test their software. This has placed the burden of provisioning safe, realistic data on development teams. Data masking is a critical solution in the modern developer tech stack to address and satisfy concerns of data privacy without impeding the development and deployment of quality software.

With a quality data masking solution like Tonic, you can have your cake and eat it too. Our platform enables you to mask or synthesize realistic, safe data based on your production data for effective use in development, testing, and QA. All the functionality of production, with none of the risks. 

What Is A Data Masking Solution?

A data masking solution is a platform or software that allows you to protect your sensitive data by enabling data masking via a variety of data masking techniques. A good solution will streamline and empower software development and testing, enable better data analytics and ML training, and, of course, help you achieve compliance. 

Whew, that was a lot…

If you’ve got more questions about data masking, mock data solutions, or just want to hear from the experts on real fake data (us), don’t hesitate to reach out. Drop us a line at hello@tonic.ai, or book a demo to take our synthetic data platform for a spin.

Chiara Colombi
Head of Marketing
A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.
