Data de-identification

The ultimate guide to data masking in SQL Server

Author

Abigail Sims

Author

January 9, 2023

Data is big business these days, and there are significant regulations and legislation in place to protect and regulate how each data type is handled. Some of these require data masking to comply with data protection laws, but data masking can also be useful in other ways—including but not limited to security, efficiency, and data analysis.

Today, we’ll explore the definition and benefits of data masking in SQL Server, both for data security and developer enablement.

What Is Data Masking In SQL Server?

Simply put, data masking in SQL Server is a process that masks sensitive data to protect its privacy in the event of an attack.

The goal is to ensure that data masking of a production database directly protects the database users outside of the production environment. Masking in SQL Server works by masking sensitive data to non-privileged users and non-production environments. This means developers can do their jobs with access to the data as needed, while in cases where the actual data isn’t required, substitute data classification appears instead.

What does this mean? Well, for one thing, it helps cut down significantly on the potential for sensitive data exposure or subsequent data breaches. What information is at risk exactly?

We’re so glad you asked. The most common types of masked data or information include:

Personally identifiable information, or PII: Data objects such as social security number, legal name, date of birth, address or address history, and associated phone numbers.
PHI, or Protected Health Information: Information that would be covered by HIPAA or similar legislation limiting access of private medical information to privileged users only.
Payment card information, or PCI-DSS: Any numeric data type such as credit and debit card numbers, expiration dates, security codes, and more.

So basically, yeah, we need to keep this data safe.

The Importance Of SQL Server Data Masking

The primary importance of data masking SQL Server data is to hide data or restrict unredacted data from access by unauthorized users. Does what it says on the tin, essentially. But that’s not all! Far from it. Data masking is also one of the main components of full compliance with various national data protection laws, including HIPAA and CCPA.

Failure to implement masking rules for sensitive data in SQL Server can result in unintended unmasked data leaks. It can even lead to full-blown catastrophic data breaches — yikes!

The processes that yield masked data often utilize numeric data types and string data types with a random value. This means that masking in SQL Server is a one-way process. It cannot be reversed! This is true of both dynamic data masks and static data masks.

Static vs. Dynamic SQL Data Masking

There are many significant differences between using a static or a dynamic data masking model.

What Is Static Data Masking?

There are countless scenarios in the development or testing environment where realistic and representative data is needed for portions of the software testing timeline. Functional and regression testing are a few of those times.

In these cases, fully synthetic data created from scratch won't work, since it isn't closely representative of the real-life production environment.

This is an instance where static data masking can help to bridge the gap.

Static masking can help fulfill compliance requirements for privacy regulations and legislation like HIPAA, PCI, and GDPR.

Static data masking replaces sensitive data permanently, by changing the at-rest data of database copies that are created specifically for development and operations testing environments. That helps developers to both efficiently and securely do their jobs.

What Is Dynamic Data Masking?

Dynamic data masking is also known as on-the-fly masking. It’s the on-demand masking of sensitive data in transit. This process leaves the at-rest original data unaltered and is only masked to prevent unauthorized database users from viewing sensitive data. So, only the data shown in the query results to the user is masked; other data in the database isn’t masked. That means unmasked data remains visible and accessible in the database.

One of the major downsides to dynamic data masking is that the stored procedure isn't masked dynamically. To mask the stored procedures, the query results need to be rewritten, not the query itself.

Dynamic data masking is used most often for production systems. They use dynamic masking processes to ensure that there is a masked layer between the at-rest database and the user that is viewing the query results, keeping the core data the same.

For this exercise, we’ll primarily focus on static masking, since that’s most relevant to our use case here. (Although if you want to learn more about these data masking options and others, peep this: What is Data Masking? The Best Techniques and Practices You Need To Know About.)

The (Many) Benefits Of Static Data Masking

The biggest benefits of static data masking are in the original data and the fact that the data came from a live production environment, which adds a lot of value. (Like, a lot a lot.)

All the sensitive data stored is copied and the data masking function is applied. That means the result is simply a copy of the original sample data, with confidential or private sensitive data redacted.

This creates a database based on realism, with rich and natural data patterns. The core utility provided by the masked data is vital to production and development departments and personnel being able to run effective tests and simulations.

Increased Accuracy & Savings

With data simulations and forecasting becoming more and more accurate, it's crucial to incorporate solid, reliable, real data in the test data and sample data given to the DevOps teams to inject into their modeling.

As a result, the tests are more reliable and effective in their conclusion data, associated analytics, and defect identification earlier in the software development cycle.

This dramatic savings of time makes things easier for developers, helps drive down the overall cost of development, and allows the average quality to improve as well.

Compliance

Another major benefit to static masking rules is the ease with which organizations can integrate compliance with PCI, GDPR, and HIPAA standards and regulations.

These standards place a rigid framework around masking data in situations where individually identifiable information is used. Effective use of static data masking can result in an incredible reduction, organization-wide, in the volume of sensitive data in the general development landscape.

In turn, these masked columns significantly reduce the risk of data leaks and breaches.

Security Benefits

Since the data is masked when copied, the masked data will have the sensitive data permanently removed from the content entirely. The output will not contain any of the pre-masking sensitive data.

While there are SQL Server security features to help with column level and row level security, a static data mask completely removes the data, which completely removes the need for object-level, granular security measures.

Flexibility

DevOps teams need to be able to handle data, not just look at it. Data writing functionality simply isn't available with a dynamic data masking model.

What’s more, there is no ability to make changes to that data since masking applies exclusively to data-in-transit.

But static data masking in SQL Server or SQL Server Management Studio allows your teams to dig into their own copy of the database, without worrying about potential data use violations.

Your DevOps teams can read and write the data, giving them more flexibility in testing and development. They can use the data to make a fully sandboxed copy of your production environment, which gives them a far wider range of potential testing than with dynamic data masking.

Are There Drawbacks To Static SQL Masking?

Now, we’re sure you’re wondering: What’s the catch here? And while there are a few drawbacks to static data masking, they are easily managed and generally considered to be well worth the benefits.

Batching The Data

The first potential drawback is that the masking process isn't done on-the-fly. Instead, it's a batched process that needs to run periodically.

While individual data masking can take just milliseconds, larger databases with considerable amounts of sensitive data can take several hours to complete masking.

This may sound inconvenient. And yes, if you tried to run the masking process in the middle of the day when everyone and their roommate is using the system, it definitely would be!

However, the most prudent system administrators will arrange for data masking to be batched overnight. That’s when system use is at its lowest point and available processing threads are at their highest.

Storage Space

The other commonly cited drawback to static data masking is that you need additional storage for the resulting masked data set. Keep in mind that when you use static data masking, your database is essentially duplicated, so there will need to be ample storage space for it.

This may sound daunting, but the cost-to-size ratio of storage may surprise you, even for relatively large databases.

Twenty years ago, adding another 100TB of storage was a massive feat. But with current storage media prices, enterprise static data masking can be implemented for consumer storage prices!

Data Masking in SQL Server FAQs

Isn't Dynamic Data Masking Better, Because It's Real-Time?

Tools like Tonic.ai can automatically refresh your data at intervals you specify. You can have data generations run each night, so you have fresh data every day if you choose. Also, an incremental mode can be used to reduce the amount of time that it takes to complete a full masking generation.

Dynamic Data Masking Doesn't Require Any Server Space — Does Static Data Masking?

Since static data masking creates such rich and realistic data for your DevOps teams — data that can be both read and written — some storage space is needed. However, subsetting can help limit the space that data generations need.

Additionally, the cost of storage space per TB is so low that there really isn't any reason not to implement static data masking in SQL Server. In exchange, the premium test data that you can provide to your development team will be well worth it.

A Dynamic Data Masking Feature Like RBAC Is Essential To My Work. Does Static Data Masking Provide RBAC?

Tonic.ai is able to offer RBAC within the platform itself, with user permissions. But if you choose to provide masked databases only to your development partners, you won't need the granular approach of RBAC to manage how the data is accessed anyway.

Since the static data masking process permanently transforms sensitive data into data that’s unable to be accessed by unauthorized users without violation, there are no concerns over data access or potential penalties for things like HIPAA or PCI violations.

This Sounds Awesome, When Can I Start?

Great question.

The bottom line is, if you're not using fake data that looks, acts and feels just like production data—you're not getting the most out of your testing.

If you want to learn more about how we’re enabling teams like yours with efficient, secure data for testing, and beyond, you’re in the right place. Check out our ebook on Fake Data Anti-Patterns to learn more… Or just ping us. Seriously, we’d love to chat.

Want to make your data usable?

Unblock product innovation with safe, high-fidelity data de-identification and synthesis.

Book a demo

Abigail Sims

Marketing

As a reformed writer now deep in the marketing machine, Abigail can (and will) create narrative-driven content for any technical vertical. With five years of experience telling brand stories for tech startups and small businesses, she thrives at the intersection of complex data and creative communication.