Data masking is a data transformation method used to protect sensitive data by replacing it with a non-sensitive substitute. Often the goal of data masking is to allow the use of realistic test or demo data for development, testing, and training purposes while protecting the privacy of the sensitive data on which it is based.
Data masking can be done in a variety of ways, both in terms of the high-level approach, which is determined by where the data lives and how the end user needs to interact with it, and in terms of the entity-level transformations applied to de-identify the data. In this guide, we’ll define both the high-level approaches to data masking and the types of data masking techniques used to achieve masked data at the entity level.
Data masking can be achieved by way of three primary high-level approaches. These approaches differ based on where the data is stored and functional requirements of the masked output data. The approach you take will be determined by these requirements.
Static data masking is masking performed on data at rest (that is, data in a data store), and the process is designed to permanently replace sensitive data with non-sensitive substitutes. This approach creates masked data that is read/write, an essential quality for test data for software development and QA teams. Static data masking may be performed on a traditional relational database, such as a PostgreSQL test database, on a NoSQL database like MongoDB, or on file-based data like CSV or JSON files.
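To make the idea concrete, here is a minimal sketch of static masking applied to file-based data. It assumes a CSV with `name` and `email` columns; the `FAKE_NAMES` list and the `mask_csv` function are hypothetical names chosen for illustration, not part of any particular tool.

```python
import csv
import io
import random

# Hypothetical library of non-sensitive substitute values.
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe"]


def mask_csv(source_text, seed=0):
    """Return a masked copy of CSV text.

    Assumes the input has 'name' and 'email' columns. Every other
    column is copied through unchanged.
    """
    rng = random.Random(seed)
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for i, row in enumerate(reader, start=1):
        row["name"] = rng.choice(FAKE_NAMES)   # substitute from the library
        row["email"] = f"user{i}@example.com"  # synthetic placeholder address
        writer.writerow(row)
    return out.getvalue()


original = "id,name,email\n1,Ada Lovelace,ada@real.org\n2,Alan Turing,alan@real.org\n"
masked = mask_csv(original)
```

The masked output is an ordinary CSV that can be written to disk and then read and edited like any other test file, which is the read/write quality described above.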
Dynamic data masking is masking performed on data in transit by way of a proxy. This process is designed to leave the original at-rest data intact and unaltered. The masked data isn’t stored anywhere and is read-only, making dynamic masking appropriate for simple data access management but not a usable approach for software development and testing workflows.
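As a rough illustration of the dynamic approach, the sketch below masks a record at read time and never touches the stored copy. The in-memory `STORE`, the `role` parameter, and the function names are assumptions made for this example; a real proxy would sit between the client and the database.

```python
# Hypothetical at-rest data; in practice this would be a real data store.
STORE = {1: {"name": "Ada Lovelace", "ssn": "123-45-6789"}}


def mask_ssn(ssn):
    """Hide everything except the last four digits."""
    return "***-**-" + ssn[-4:]


def read_record(record_id, role):
    """Proxy-style read: the 'admin' role sees raw data, all others a masked view."""
    record = dict(STORE[record_id])  # copy, so the stored row is never modified
    if role != "admin":
        record["ssn"] = mask_ssn(record["ssn"])
    return record


view = read_record(1, role="analyst")
```

Here `view["ssn"]` is `"***-**-6789"`, while `STORE` still holds the original value. Because the masked view exists only for the duration of the read, there is nothing for a developer to write back to.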
To a certain degree, on-the-fly data masking can be thought of as a combination of the dynamic and static methods. It involves altering sensitive data in transit before it is saved to disk, with the goal of having only masked data reach a target destination.
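One way to picture on-the-fly masking is as a masking step placed inside the pipeline between source and target. The sketch below is a simplified assumption of that flow: `mask_row`, `masked_stream`, and `load` are hypothetical names, and plain Python lists stand in for the source and target systems.

```python
def mask_row(row, n):
    """Return a copy of the row with the email replaced by a placeholder."""
    return {**row, "email": f"user{n}@example.com"}


def masked_stream(rows):
    """Mask each row while it is in transit, one at a time."""
    for n, row in enumerate(rows, start=1):
        yield mask_row(row, n)


def load(rows, target):
    """Write rows to the target; only masked rows ever reach it."""
    for row in masked_stream(rows):
        target.append(row)
    return target


source = [{"id": 1, "email": "ada@real.org"}, {"id": 2, "email": "alan@real.org"}]
target = load(source, [])
```

The source rows are left as they were, and the target receives only masked rows, so unmasked values are never saved at the destination.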
Within each of the above high-level approaches, a variety of transformation techniques can be applied to the data to achieve a masked output dataset. These techniques can be as simple as replacing existing data with random values pulled from a library, or as complex as recreating the statistics of the original values in a new, but still realistic, distribution of values. Here are a few of the more common ways to mask data.
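The two ends of that spectrum can be sketched in a few lines. The first function replaces values with random picks from a library; the second draws new numbers that follow the mean and standard deviation of the originals. Both function names are hypothetical, and the second assumes the column is roughly normally distributed, which real data often is not.

```python
import random
import statistics


def substitute(values, library, rng):
    """Simple technique: replace each value with a random pick from a library."""
    return [rng.choice(library) for _ in values]


def resample_like(values, rng):
    """More involved technique: draw new values from a normal distribution
    fitted to the originals (an assumption; other distributions may fit better)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [round(rng.gauss(mu, sigma), 2) for _ in values]


rng = random.Random(42)
masked_names = substitute(["Ada", "Alan"], ["Alex", "Sam", "Kim"], rng)
masked_salaries = resample_like([50000, 60000, 70000, 80000], rng)
```

The substituted names bear no relationship to the originals, while the resampled salaries are new values that stay in a realistic range for the column.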
By identifying the best high-level masking approach for their use case and combining these data masking techniques within that approach, organizations can ensure that their sensitive data is protected from unauthorized access while also maximizing their teams’ productivity. But which teams need data masking to do their best work? Let’s take a closer look at the data masking use cases to better understand their goals.
Many teams in an organization can benefit from data masking to ensure data privacy while simplifying access to quality data to fuel productivity. Here are several common use cases for data masking in today’s companies.
In order to build a proper staging environment for testing and QA, organizations need usable test data that represents production as closely as possible. For organizations with sensitive data in production, creating realistic, masked data is a key ingredient for building a quality staging environment. Representative data makes it much easier and more reliable for developers and testers to catch bugs in staging before they’re pushed live to production. It also enables them to validate fixes in staging. Without representative data, a staging environment is effectively useless for complex testing, and developers and testers will find themselves validating fixes in production.
Developers typically work in their own ‘sandbox’ or development environment. Often, they’re running their applications on their own computers and need manageable datasets to validate their work. Data masking can be fundamental here, both for ensuring data security and for equipping developers with representative data they can safely and effectively use for software development. When masking is paired with subsetting (scaling a database down to a targeted, coherent slice of representative data), developers get datasets that are both safe and small enough to work with locally, which streamlines their workflows.
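As a small sketch of what pairing masking with subsetting can look like, the function below keeps a chosen set of users, keeps only the orders that belong to them so the slice stays coherent, and masks the names along the way. The `users` and `orders` tables and the `subset_and_mask` function are hypothetical, with lists of dictionaries standing in for database tables.

```python
def subset_and_mask(users, orders, keep_user_ids):
    """Return a masked, referentially consistent slice of two related tables."""
    kept_users = [
        {**u, "name": f"User {u['id']}"}  # mask the name, keep the key
        for u in users
        if u["id"] in keep_user_ids
    ]
    # Keep only orders whose user survived the cut, so no order is orphaned.
    kept_orders = [o for o in orders if o["user_id"] in keep_user_ids]
    return kept_users, kept_orders


users = [{"id": 1, "name": "Ada Lovelace"}, {"id": 2, "name": "Alan Turing"}]
orders = [
    {"id": 10, "user_id": 1},
    {"id": 11, "user_id": 2},
    {"id": 12, "user_id": 1},
]
small_users, small_orders = subset_and_mask(users, orders, keep_user_ids={1})
```

The result contains one masked user and that user's two orders, a slice small enough to run on a laptop while keeping the relationships between tables intact.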
Since software runs on data, software demos and training also require data in order to run smoothly. It goes without saying that demos and training should not run on real-world data. But spinning up realistic demo data has become increasingly difficult as our data ecosystems have grown in scale and complexity. Data masking is a powerful approach for ensuring the availability of quality, representative datasets for sales demos, employee training, and customer onboarding. By creating demo data from real-world production data, teams can craft datasets that best spotlight their products’ features and capabilities. Here, too, pairing masking with subsetting enables crafting tailored datasets for specific use cases, industries, or customer journeys.
Data privacy regulations set limits on the length of time production data can be stored by an organization. In order to perform analytics over time, teams need a way to preserve their historic data that doesn’t infringe on these limitations. Masking production data by way of tokenization eliminates PII, PHI, and other sensitive data and enables compliant long-term data storage, as it allows you to store de-identified data instead of storing real-world data. Tokenized data is compliant with regulations like GDPR, HIPAA, and CCPA, yet still fully preserves the data’s utility for data analytics.
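To show why tokenized data remains useful for analytics, here is a minimal sketch of one possible form of tokenization: a keyed hash (HMAC) that maps each value to a consistent token. This is an assumption for illustration, since tokenization is also commonly done with a lookup vault; the `tokenize` and `visits_per_user` functions and the sample key are hypothetical.

```python
import hashlib
import hmac
from collections import Counter


def tokenize(value, key):
    """Map a sensitive value to a consistent token using a keyed hash.

    The same input and key always give the same token, so counts and joins
    still work. The key must be kept secret (assumed to be managed elsewhere).
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def visits_per_user(events, key):
    """Count events per user without storing any real email address."""
    return Counter(tokenize(e["email"], key) for e in events)


key = b"replace-with-a-secret-key"  # placeholder value for the example
events = [
    {"email": "ada@real.org"},
    {"email": "alan@real.org"},
    {"email": "ada@real.org"},
]
counts = visits_per_user(events, key)
```

The counter holds two tokens, one seen twice and one seen once, which is the same answer the raw data would give, but no email address appears in the stored result.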
Data masking is also used in the field of outsourcing. Many organizations outsource their business processes to third-party vendors. However, sharing sensitive data with these vendors can create a security risk. Data masking allows organizations to share with vendors masked data that is safe to use and does not put the organization at risk.
When implemented effectively, data masking provides a wealth of advantages, though there are a few caveats to consider, as well. The table below provides an overview of data masking pros and cons.
Implementing data masking in your organization is an important step towards ensuring the safety and security of your sensitive data. Not only is data masking a best practice for data privacy, it is increasingly a legal requirement for organizations today. It is essential to implement data masking in a way that fully complies with the regulations your organization is subject to. Keep in mind: GDPR, which regulates the use, rights, and protection of consumer data in Europe; CCPA, which does the same for consumers based in California (essentially setting a baseline data privacy standard for all of the US); HIPAA, which regulates healthcare data in the US; and PCI DSS, which is a critical security standard for payment card data in the financial industry. At the time of this writing, 11 US states had signed privacy bills into law, and another 5 US states had active bills progressing through the legislative process.
Historically, organizations have implemented data masking in a number of ways. Many organizations begin by implementing in-house solutions. These may rely on custom scripts or freely available open source tools like Faker, which, as of 2022, was downloaded 2.4 million times per week, to patch together data masking workarounds. The in-house approach can work for simpler use cases and for smaller teams earlier in their growth, but it quickly becomes ineffective as an organization’s data becomes more complex. In general, given their patched-together nature, in-house scripts aren’t able to guarantee the same level of privacy as purpose-built tools. What’s more, they require endless maintenance as your data changes over time.
Teams that outgrow in-house solutions often turn next to legacy test data management (TDM) tools. By legacy TDM software, we mean earlier generations of data masking software. In addition to data masking, these tools may also offer database virtualization and orchestration. An important characteristic of legacy TDM tools is that they often prioritize data security over data utility in their execution of data masking techniques, meaning that realistic test data is not their end goal. This is where they often fall short in providing useful masked data for software testing and development. In addition, since their technology is older, their UI and underlying architecture often reflect their age, resulting in a less user-friendly experience and more pain points when it comes to working with data at scale. Simply put, they aren’t built to work with today’s complex data pipelines and modern CI/CD workflows.
In response to the gaps of in-house solutions and legacy TDM tools, modern data platforms have entered the market, designed to better manage and scale with the complexity of today’s data. Unlike legacy software, these newer solutions place a stronger emphasis on maintaining data realism to ensure data utility in software development and testing. At the same time, they incorporate more modern data security techniques like differential privacy to guarantee data privacy along the way. These platforms offer today’s teams a streamlined approach to data masking, with native integrations to data stores (from SQL Server to Snowflake), modern workflow automations, and full access by way of API. They are purpose-built for enabling developer productivity by maximizing the quality of, and ease of access to, safe, realistic test data.
When implementing data masking, it is important to consider the type of data being masked and the level of security required. It is also essential to ensure that the masking technique used does not compromise the integrity or quality of the data. Regular testing and auditing should be done to ensure that the masking technique is effective and that the sensitive data remains secure. By implementing data masking techniques, you can protect your sensitive data from unauthorized access and comply with industry regulations, ensuring the safety and security of your organization's information.
The Tonic test data platform is a modern solution built for today's engineering organizations, for their complex data ecosystems, and for the CI/CD workflows that require realistic, secure test data in order to run effectively. To learn more, explore our product pages, or connect with our team.