Here at Tonic.ai, our bread and butter is empowering developers to test their software with the peace of mind that they will not be putting their customers’ information at risk. Developers use Tonic to easily access high-quality, secure test data. Our platform is industry-leading in the range of techniques it offers, including masking, subsetting, differential privacy, and more. Many of the methods we use are considered data obfuscation methods. What is data obfuscation, you ask? In this post, we will take a deep dive into the privacy-preserving techniques that fall under data obfuscation.
Data obfuscation is a method of hiding data by transforming it into a form that is difficult to understand or interpret while still keeping its fundamental characteristics. This prevents unauthorized users from accessing certain information while maintaining the data’s utility. There can be, however, a trade-off with this data security technique: the more private it makes the data, the less useful the data might become.
There are a number of reasons why one might want to obfuscate sensitive data:
No matter what type of sensitive data you have, data obfuscation can be done by the following steps:
There are many different ways you can obfuscate your data. Some options maintain more usability of the data while others are better at keeping the data more secure. The choice of technique implemented should be based on the type of data, the level of protection necessary, and what it is being used for.
Tokenization
Tokenization involves replacing sensitive data with tokens, or random strings of characters, while the original data is securely stored. This is best for obfuscating personally identifiable information such as credit card numbers and email addresses.
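To make the idea concrete, here is a minimal sketch of tokenization in Python. The `TokenVault` class, its in-memory dictionaries, and the `tok_` prefix are all hypothetical names chosen for illustration. A real deployment would keep the mapping in a hardened, access-controlled store rather than in process memory.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault that maps random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same input always yields the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only code with access to the vault can recover the original value.
        return self._token_to_value[token]


vault = TokenVault()
card_token = vault.tokenize("4111 1111 1111 1111")
print(card_token)                    # a random token such as tok_<16 hex chars>
print(vault.detokenize(card_token))  # the original card number
```

Because the token is random rather than derived from the input, it cannot be reversed without access to the vault, which is why protecting the vault itself matters so much.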
Encryption
Encryption turns data into a format that is unreadable to an attacker. This requires the use of complex algorithms and the maintenance of a decryption key so that the data can be transformed back to its original form. There are several different types of encryption that differ based on the algorithm used to transform the data, such as format-preserving encryption and homomorphic encryption.
Masking
Technically speaking, masking mainly refers to techniques that hide a certain subset of sensitive data. The term is also commonly used interchangeably with data obfuscation to refer broadly to many of these data privacy techniques.
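As an illustration of masking in the narrow sense, the sketch below hides most of a card number and most of the local part of an email address while leaving the overall format intact. The function names and the choice of `*` as the mask character are assumptions made for this example, not a description of any particular product's behavior.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            # Separators and the trailing visible digits pass through unchanged.
            masked.append(ch)
    return "".join(masked)


def mask_email(email: str, mask_char: str = "*") -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        # Not a recognizable email address, so mask everything.
        return mask_char * len(email)
    return local[:1] + mask_char * (len(local) - 1) + "@" + domain


print(mask_card_number("4111-1111-1111-1234"))  # ****-****-****-1234
print(mask_email("jane.doe@example.com"))       # j*******@example.com
```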
Generalization
Generalization involves reducing the granularity and specificity of the data by aggregating it or putting it into a broader format. This is best used for research or analysis that only requires the overall trends.
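A small sketch of generalization might look like the following, where exact ages are replaced by ranges and ZIP codes are truncated to a prefix. The helper names, the ten-year bucket size, and the three-digit prefix are illustrative assumptions. Appropriate granularity depends on your data and the regulations you need to meet.

```python
def generalize_age(age: int, bucket_size: int = 10) -> str:
    """Replace an exact age with the range it falls into."""
    low = (age // bucket_size) * bucket_size
    return f"{low}-{low + bucket_size - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep only the leading characters of a ZIP code and blank out the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


print(generalize_age(37))       # 30-39
print(generalize_zip("94107"))  # 941**
```

Wider buckets and shorter prefixes give more privacy at the cost of analytical detail, which is the privacy/utility trade-off in miniature.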
Randomization
Perturbing the data or adding random noise to it is another way to make it more challenging to identify individuals in the data. The amount and type of noise are controlled based on the privacy/utility requirements of your obfuscation use case. This method is most useful when the data will be used for analysis or machine learning.
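The sketch below shows one way to perturb numeric values by adding zero-mean Laplace noise using only the Python standard library. The function name and the `scale` and `seed` parameters are assumptions for illustration. This is not a calibrated differential privacy mechanism, which would also need a sensitivity analysis and a privacy budget.

```python
import random


def add_laplace_noise(values, scale, seed=None):
    """Return a copy of `values` with zero-mean Laplace noise added to each entry."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    rng = random.Random(seed)
    rate = 1.0 / scale
    # The difference of two independent exponential draws with the same rate
    # follows a Laplace distribution centered at zero with the given scale.
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


salaries = [52000.0, 61000.0, 58500.0, 74000.0]
noisy_salaries = add_laplace_noise(salaries, scale=500.0, seed=7)
print(noisy_salaries)
```

A larger `scale` hides individual values more thoroughly but distorts aggregates more. Because the noise has zero mean, averages over many records tend to stay close to the true value.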
Data Synthesis
In synthetic data generation, new data is created based on the patterns in a real dataset or based on rules defined by a user. When created based on the patterns in a real dataset, the output data cannot be tied back to any individual.
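As a rough sketch of the rule-based variety of data synthesis, the example below generates customer records from user-defined rules rather than from real rows. The name lists, field names, age range, and the `synthesize_customers` function are all hypothetical. Generating data that follows the statistical patterns of a real dataset is considerably more involved and is not shown here.

```python
import random

# Hypothetical value pools that serve as the "rules" for this example.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elena"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Meyer", "Novak"]


def synthesize_customers(n, seed=None):
    """Generate `n` fake customer records from simple user-defined rules."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        rows.append({
            "id": i,
            "name": f"{first} {last}",
            # example.com is reserved for documentation, so it cannot collide
            # with a real customer's address.
            "email": f"{first.lower()}.{last.lower()}{i}@example.com",
            "age": rng.randint(18, 90),
        })
    return rows


for row in synthesize_customers(3, seed=1):
    print(row)
```

Passing a fixed `seed` makes the output reproducible, which can be handy when the same synthetic dataset needs to be regenerated for repeatable tests.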
The methods used to execute these techniques vary depending on the data pipeline architecture. Other factors that would influence your methodology might be tooling, programming languages, and of course the specific privacy regulations of your organization. Some common methods include:
Again, there are many different methods of data obfuscation. It is important to decide whether these methods should be implemented before, during, or after data storage or transfer.
To most effectively obfuscate your data, you must first and foremost understand the types of data you are working with. The obfuscation methods you might want to use would be different for structured numeric data versus unstructured text-based data. It is important to be intimately familiar with the risks associated with using your data in terms of how sensitive it is. This helps you weigh whether to prioritize utility over privacy or vice versa when selecting a technique. Knowing exactly which regulations you are working to comply with will also help guide what techniques and methods you choose to adopt.
When choosing what method or techniques to use, make sure you are only considering proven techniques and not just devising an approach on your own. If you do want to develop your own technique, do so with the understanding that there is potentially a higher risk for your data to be compromised.
To minimize the risk of compromising your data, it is smart to implement strong access controls within your organization. Not everyone at the organization needs to have full access to all of your customers’ information, so making sure your access controls are set on a need-to-know basis is important. Further, educating all employees on the data security policies and procedures at your organization will ensure everyone remains on the same page about how to handle data, including how to implement obfuscation techniques.
Finally, it is always wise to document your procedures and provide a rationale for each step. This ensures transparency in policies and regulations, and it can help you train new people as well as revisit and reform outdated policies.
There are many different tools out there that can assist in executing data obfuscation properly. Choosing the right one will largely depend on your data management architecture, how you transfer data, and who needs access to it. There are generally three groups of tools: legacy test data management (TDM) software, open source tools, and modern data platforms.
Legacy TDM software typically refers to the early generation of data obfuscation tools. These tools offer data masking, simple encryption, and database virtualization. They were often built with an emphasis on data security over data utility, and as such, the approaches they take to data obfuscation aren’t focused on generating realistic data as the output. This can make their obfuscated data less useful in testing and development. Ease of use and the ability to work at scale with today’s data can also be an issue with these tools, given their more dated approach to test data. Simply put, they aren’t built to work with today’s complex data pipelines and modern CI/CD workflows.
Open-source solutions like Faker are freely available for anyone to use, modify, and distribute, and are generally maintained by a community of developers. These solutions can be great for simpler use cases and smaller datasets but are insufficient for teams needing to work across their production data in an efficient and secure way. The privacy guarantees are weaker and the maintenance demands are high. As the old adage goes, nothing is truly free, and the cost of using open-source solutions is the time they take to set up and maintain.
Modern data platforms integrate advanced data obfuscation techniques with expanded data generation, management, and security capabilities. These technologies, such as Tonic, provide well-rounded solutions for intuitively and securely implementing data obfuscation into your data workflows by way of seamless integrations and automations like fully accessible APIs. Since these platforms arose in the modern age of data lakes and cloud data storage, as well as the age of GDPR and CCPA, they are built to handle complex data, scale with your organization, and guarantee data privacy compliance.
Data masking vs obfuscation comes down to scope. These terms are often used interchangeably; however, data masking can be considered a technique of data obfuscation. The primary goal of data masking is to ensure that the masked data resembles the real data and can be used for development, testing, or analysis without exposing sensitive information. This usually involves hiding a certain subset of sensitive data. Data obfuscation, on the other hand, is used as a broader term encompassing various techniques with the goal of balancing the privacy and utility of the data itself, not necessarily maintaining the data’s original format. At the end of the day, both definitions can be seen as subjective and context-specific.
Anonymization involves removing or modifying data elements so that it is difficult or impossible to identify the individuals the data is tied to. Semantically speaking, data anonymization has gained a reputation for being a weaker technique for maintaining data privacy due to several well-known incidents of anonymized data being reverse engineered. It is also often used as an umbrella term for techniques such as aggregation, generalization, or perturbation. These techniques take an extra step of severing a direct link between the data and the individuals, making it more difficult to re-identify someone. So what about data obfuscation vs anonymization? Data obfuscation is focused on making the data less recognizable or understandable in general while still allowing it to be used for legitimate purposes, not always specifically on severing a link to the individual.
Data encryption is the process of transforming data into a scrambled, unreadable format using specific algorithms. The data can only be returned to its original form using a decryption key. The goal of encryption is to ensure confidentiality of the data and actively prevent unauthorized access to sensitive information while data is stored or being transmitted. When comparing data obfuscation vs encryption, obfuscation differs in that it focuses on generally altering the data to make it unrecognizable, not necessarily on preventing unauthorized users from accessing it.
There are several data obfuscation algorithms, each with its own strengths and weaknesses. Some of the popular algorithms used for data obfuscation include the following.
These are just a few of the many algorithms used to execute different obfuscation techniques.
The medical research field is one where you can find many real-world scenarios that serve as data obfuscation examples. Medical records hold a high concentration of sensitive information, from someone’s name and phone number to highly private information about their health. Often these records are shared with research groups for collaboration. Because medical records are so sensitive, there are a lot of restrictions around sharing this data.
This is a classic scenario of when data obfuscation techniques are essential. In this scenario it makes sense for several different obfuscation techniques to be applied:
These techniques make it possible for medical and research organizations to collaborate while still adhering to the strict regulations on medical data such as HIPAA. These obfuscation methods also maintain the validity of the data so it can be used for research purposes and properly analyzed.
Testing applications on production data is a dangerous game. Test data obfuscation techniques create a dataset that is safe for use in software testing of all kinds. For example, say you are a developer at an e-commerce company that needs to test the website’s performance and functionality while at the same time keeping your customers’ data safe. In order to do this, you employ data obfuscation techniques to generate realistic yet de-identified test data to use in your testing environment. You may use the following techniques:
Using these techniques and methods, you can perform rigorous testing on the platform using realistic data that maintains the characteristics of actual customer interactions without exposing their information, all while getting the results you need. For a real-world example of this type of obfuscation in action, check out Tonic’s case study with eBay and learn how we equip their 4,000+ developers with realistic and secure test data.
Data obfuscation is an important technique for protecting sensitive data from unauthorized access. By transforming data into a format that is not easily recognizable or understandable, data obfuscation can help maintain the privacy and confidentiality of sensitive data. Understanding the different methods, techniques, and tools used for data obfuscation is essential for getting the most out of this approach to data protection, especially when realism and utility for software development and testing are key to unlocking your development team’s productivity.
To learn more about the obfuscation capabilities of the Tonic test data platform, visit our product docs or connect with our team.