
Data masking: DIY internal scripts or time to buy?

Author: Ken Tune
May 2, 2025
Considering internal scripts for data masking? Here's what the journey might look like.

It’s hard to avoid the need to use production data in other contexts, but much like King Tantalus[1] we often find it beyond our reach, thanks to the more modern constraints of privacy and security. Fortunately, unlike Tantalus, we can escape via a device well understood by the authors of myths: transformation. Personally identifiable information can be detected and replaced in such a way that the integrity and character of the original data are maintained but the precise details are overwritten. This allows us to use realistic data for application development, quality assurance, machine learning, performance testing and troubleshooting.

But how should this metamorphosis be brought about? The initial impulse is that of the ancient alchemist: to induce the change ourselves by an elaborate process of trial and error, using ingredients sourced from our immediate natural environment. The classical practitioner might have used the limbs of some luckless amphibian, whereas the modern-day prestidigitator[2] will prepare a heady brew of shell scripts and regular expressions. Plus ça change, plus c’est la même chose[3], as the French say.

Although the parallel is fanciful, there is a kernel of truth in it. Many set out on journeys of transmutation, but those journeys are often arduous. Charles Goodyear laboured for years before accidentally discovering the process of vulcanisation, while Hennig Brand, who inadvertently discovered phosphorus, never accomplished his true alchemical goal of turning lead into gold.

DIY data masking: the lure of internal scripts 

It is, of course, simply a small matter of programming. Replace one name with another; magic up some phone numbers; fuzz the dates a little. It sounds like a fun project, and may even be a welcome distraction from some other, seemingly more onerous, activity.
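
By way of illustration, the first version usually looks something like this: a minimal Python sketch of the “few scripts” approach. The name lists and the UK-style mobile prefix are invented for the example.

```python
import datetime
import random

# Invented stand-in values, for illustration only.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
LAST_NAMES = ["Smith", "Jones", "Taylor", "Brown"]

def mask_name(_original: str) -> str:
    # Replace one name with another, chosen at random.
    return f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}"

def mask_phone(_original: str) -> str:
    # Magic up a plausible-looking mobile number.
    return "07" + "".join(str(random.randint(0, 9)) for _ in range(9))

def fuzz_date(original: datetime.date) -> datetime.date:
    # Shift each date by up to 30 days in either direction.
    return original + datetime.timedelta(days=random.randint(-30, 30))
```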

However, the quicksand you were standing on will soon make itself known. “So you want the data in a separate database?”; “I have to recreate the schema?”; “How many tables?”. Followed by the painful realisations: “wait, there are notes fields containing PII”; “do I really need to make the salary ranges ‘a little more realistic’ - and what does that mean?”; “oh, you need some of the edge cases in there, like the O’Connors?”
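
To see why the edge cases bite, consider a naive “two capitalised words” regex applied to a free-text notes field. The note below is a made-up example, but representative:

```python
import re

# A naive "two capitalised words" pattern for names.
NAIVE_NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

note = "Spoke to Mary O'Connor re: salary; call back on 07700 900123."

# The apostrophe breaks the match, so the name is never found,
# never masked, and sails straight into the test environment.
print(NAIVE_NAME.findall(note))  # []
```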

Now it’s a full-time project. You hadn’t really banked on that. Worse still, it’s become a deliverable. Senior management are interested! (They always are.) End of quarter! We have to be seen to be delivering. Maybe we could turn it into a product! This you hadn’t bargained for. There are 300 tables. 2,154 columns! Which ones should I mask? Figure it out for myself? Oh, and make sure there’s no PII in there. We can’t afford the risk of that.

DIY data masking: the reality

This is not what you signed up for. 2,000+ fields, and you have to trawl through them all and figure out, in each case, whether it should be masked, how it should be masked, and then of course write the code, test it, yadda yadda. Suddenly, too, it’s a different sort of problem. With so many tables you need a framework to manage all those masking functions. Of course you’ve tried to make them generic to avoid writing the same thing over and over, but now you need to store the config for each masked column.
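
At this point the “few scripts” tend to grow into a homemade framework, something like the sketch below. The table and column names, and the generator options, are all invented for illustration.

```python
import random
from typing import Any, Callable

# Config mapping each masked column to a generator plus its options.
MASKING_CONFIG: dict[str, dict[str, Any]] = {
    "customers.full_name": {"generator": "name"},
    "customers.phone":     {"generator": "phone", "prefix": "07"},
    "employees.salary":    {"generator": "numeric_noise", "pct": 10},
    # ...plus the other 2,000-odd columns, classified one by one.
}

GENERATORS: dict[str, Callable[..., Any]] = {}

def register(name: str) -> Callable:
    # Decorator so each generator self-registers with the framework.
    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        GENERATORS[name] = fn
        return fn
    return wrap

@register("numeric_noise")
def numeric_noise(value: float, pct: int = 10) -> float:
    # Perturb a numeric value by up to +/- pct percent.
    return value * (1 + random.uniform(-pct, pct) / 100)

def mask_value(column: str, value: Any) -> Any:
    cfg = MASKING_CONFIG.get(column)
    if cfg is None:
        return value  # unclassified column passes through silently
    opts = {k: v for k, v in cfg.items() if k != "generator"}
    return GENERATORS[cfg["generator"]](value, **opts)
```

Note that the unclassified column silently passing through is itself a design decision you now have to defend.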

It’s taking longer than expected, and the passage of time is making itself felt. The schema is changing. You’re going to have to adapt to that. It’s always going to change. You’re going to have to adapt to that too. How are you going to do it? You’ll need to scan and diff, so now you’re coupling versions of your code with versions of the schema - wait, worse, branches of the schema, because it doesn’t evolve linearly and all the developers need this. IT’S A BLOCKER.
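
One way to at least detect the drift is to snapshot information_schema and diff it on every run. A sketch, assuming a PostgreSQL database and the psycopg2 driver; the snapshot file format is invented:

```python
import json
import psycopg2

def current_columns(conn) -> set:
    # Pull every (table, column, type) triple in the public schema.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name, column_name, data_type "
            "FROM information_schema.columns "
            "WHERE table_schema = 'public'"
        )
        return {tuple(row) for row in cur.fetchall()}

def diff_schema(snapshot_path: str, conn) -> None:
    # Compare a previously saved snapshot against the live schema.
    with open(snapshot_path) as f:
        old = {tuple(c) for c in json.load(f)}
    new = current_columns(conn)
    for table, col, typ in sorted(new - old):
        print(f"NEW column, not covered by masking: {table}.{col} ({typ})")
    for table, col, _ in sorted(old - new):
        print(f"DROPPED column, still in masking config: {table}.{col}")
```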

Wouldn’t it be nice if each developer had their own database? In a container. We could version them. Well, yes, except that the database is 2TB in size. Can’t you reduce it? Also, we’ll need different versions - one for QA, one for performance testing, a small one for the devs - and if we could just capture the edge cases for our unit tests, we could plug all this into our CI/CD pipeline.
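
Subsetting by hand is where it gets really painful: you sample one “root” table and then walk every foreign key outwards, table by table. A sketch, again assuming PostgreSQL and invented table names; real foreign-key graphs are rarely this tidy:

```python
import psycopg2

def subset(conn, fraction: float = 0.01) -> None:
    with conn.cursor() as cur:
        # Sample ~1% of the root table into a separate schema.
        cur.execute(
            "CREATE TABLE subset.customers AS "
            "SELECT * FROM public.customers "
            "TABLESAMPLE BERNOULLI (%s)",
            (fraction * 100,),
        )
        # Every child table must follow, or referential integrity breaks.
        cur.execute(
            "CREATE TABLE subset.orders AS "
            "SELECT o.* FROM public.orders o "
            "JOIN subset.customers c ON o.customer_id = c.id"
        )
        # ...and so on, for every table that references orders, and
        # every table those tables reference, across the whole graph.
    conn.commit()
```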

DIY data masking: the compromise

No. I can’t do all of this. I do have my day job, after all. Look, I’m just going to write a few scripts. It will be better than it used to be (at least it’s not a copy of PROD with sensitive columns replaced by constants or gibberish). It will look more realistic. Yes, it might be inconsistent in places. Yes, we will have to apply migration scripts every time we make a schema change. OK, I will maintain the scripts - but I do have my day job to do. No, they won’t apply to any of the other databases we have. Yes, of course that includes our Databricks data lake.

Newsflash: management is very concerned (they always are). You’re using production data in QA. How do they know you’ve desensitised the fields appropriately and that we’re GDPR compliant? Can’t I just show them my scripts? I don’t think so. Could you write a tool that traverses your code and presents a report from a privacy point of view? Could you also write a tool that audits our schema, identifying which fields might contain sensitive data, so the two can be compared? I’m sorry - did I say something to upset you?
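
The audit tool, reduced to its naive core, is a keyword scan over column names compared against the masking config. The hint list below is illustrative and nowhere near exhaustive, which is rather the point:

```python
import re

# Column names that *might* indicate PII. Illustrative, not exhaustive.
PII_HINTS = re.compile(
    r"name|email|phone|address|dob|birth|ssn|salary|iban|passport",
    re.IGNORECASE,
)

def audit(columns: set, masked: set) -> None:
    # columns: {(table, column)} from the schema; masked: {"table.column"}
    for table, col in sorted(columns):
        qualified = f"{table}.{col}"
        if PII_HINTS.search(col) and qualified not in masked:
            print(f"Possible PII, not masked: {qualified}")
```

Of course, this says nothing about PII hiding in free-text fields with innocuous names like “notes” - which is exactly where the trouble started.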

DIY data masking: two years later

Why do we get so many failures in production? Can’t we test with production-like data? Well, we sort of do, but it’s quite out of date. We use an old copy of prod that was redacted with a script library that Charles built. It’s no longer maintained - Charles left six months ago, and it’s so complicated that no-one really wants to try. I don’t think even Charles really understood it any more, and he’d got fed up with no-one else helping. So our main test database hasn’t changed in the last 12 months? No, not really. The whole thing never really worked, if I’m honest. You may remember management being all over it when they discovered there was no real way of auditing it to determine whether there was still PII in there, so the whole thing got canned.

Yes, I know data is the new oil, and our databases (now many times plural) contain priceless value that we can’t unlock. And finding so many bugs in production is embarrassing. Could we try again? I’m sensing no is not the answer you’re looking for…


Time to buy a data masking solution?

In the foundational text of economics, ‘The Wealth of Nations’, Adam Smith presents his observations of workers in a pin factory[4]. Thanks to specialisation and the division of labour, a team of ten workers was able to produce approximately 50,000 pins per day; a single worker, labouring alone, would struggle to produce even one. Specialisation is a fundamental principle of business. If I can develop a product and sell it to ten people without further costs, then I can sell it to you for 10% of what it would cost you to develop it yourself. If I have ten people, I can do it in a tenth of the time. And if it’s already been done, well, it’s there right now for the taking.

Writing a data masking tool is complicated. You may even be starting with a category error: it’s not a data masking tool you need, it’s a data masking platform. The kinds of problems alluded to above are real, and only a subset of what you might encounter - and of what Tonic.ai resolves. There are at least 200 person-years of development invested in the product. It’s battle-hardened. It’s not version 0.1 of a bash script. There are things in there you’ll never think of yourself but will end up needing. Furthermore, Tonic.ai continues to evolve and innovate. When we move forward, you move forward.

Build vs buy is an age-old debate. There are many good resources[5] out there to help with the decision-making process, but there is some consensus that a fundamental question to ask is: ‘if you build, will it provide competitive advantage?’ For you, the reader, almost certainly not. For us, the vendor, yes.

Here are some of the features Tonic Structural offers:

  • Connectors to 17 different data sources, and rising, including Oracle, Postgres, SQL Server, Databricks, Snowflake and Salesforce
  • A library of over 40 highly configurable generators supporting the use of conditionals and regular expressions
  • A privacy scan for any platform we connect to, which makes accurate, immediately usable de-identification recommendations
  • Subsetting for the production of smaller, representative databases
  • A comprehensive RBAC framework allowing tight control over what users can do and what data they can see
  • A comprehensive API allowing integration with development tooling and CI/CD processes (a hypothetical invocation is sketched after this list)
  • Preservation of integrity constraints, both actual and implied
  • Preservation of data distributions
  • SSO integration with 9 different providers
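
To make the CI/CD point concrete, here is a deliberately hypothetical sketch of what such an integration might look like: a pipeline step that triggers a masked-data refresh over REST and polls for completion. The endpoint paths, payload and environment variables are placeholders, not Tonic’s documented API; consult the Structural API documentation for the real routes.

```python
import os
import time

import requests

# All endpoint paths and variable names below are placeholders,
# NOT Tonic's documented API.
BASE = os.environ["MASKING_API_URL"]
HEADERS = {"Authorization": f"ApiKey {os.environ['MASKING_API_KEY']}"}

def refresh_test_database(workspace_id: str) -> None:
    # Kick off a hypothetical "generate masked data" job.
    job = requests.post(
        f"{BASE}/api/workspaces/{workspace_id}/generate",
        headers=HEADERS,
        timeout=30,
    ).json()
    # Poll the hypothetical job-status endpoint until it settles.
    while True:
        status = requests.get(
            f"{BASE}/api/jobs/{job['id']}", headers=HEADERS, timeout=30
        ).json()["status"]
        if status in ("Completed", "Failed"):
            break
        time.sleep(10)
    if status == "Failed":
        raise RuntimeError("Masked data refresh failed")
```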

What our customers say

eBay

“Nothing that we tried in-house is comparable to what we’re doing now with Tonic. It’s a game changer both in terms of the automation we can achieve, as well as on-demand function validation targeting specific use cases with the precise data we need.” - Srikanth Rentachintala, Director of Buyer Experience Engineering at eBay. Read the full case study.

Paytient

“If I think about what it would cost for us to build something even remotely viable for us to solve our test data problem in the way that Tonic has solved it for us, it's orders of magnitude more than what it costs us to run Tonic Cloud.” - Jordan Stone, VP of Engineering at Paytient. Read the full case study.

Everlywell

As an online-only startup in a field full of healthcare giants, Everlywell called on Tonic to help them outmaneuver the competition by ramping up time-to-market on more and better features. 

Flywire

According to Engineering Fellow Felipe Talavera at Flywire, one of the company’s challenges was finding a solution that supported the myriad databases they used, including MySQL, PostgreSQL, and MongoDB.

Pax8

As Pax8 QA Manager and Technical Lead Michael Sounart described, “Provisioning de-identified data was a manual process. We had to sanitize prod with a bunch of SQL injections. But it simply wasn’t working.” It took an inordinate amount of time to locate all the sensitive data in their database, and on top of that, schema changes made the data hard to track, exacerbating the chore and leaving them vulnerable to the risk of data leaks.

Conclusion

You could try building. If what you need is really, really simple, you should be OK - but realistically, even if you’re a company of 50 people, it probably isn’t. So if you need data that is as realistic as possible but doesn’t compromise on privacy or security, Tonic Structural will deliver value for you. The only question is whether you talk to us today, or two years down the line, after Charles leaves and you’re suddenly reminded of that curiously prescient blog…

[1] Having displeased the gods, Tantalus was punished by being placed up to his neck in water with fruit hanging above his head. Whenever he tried to drink from the water it receded, and whenever he reached for the fruit it moved away from his grasp.
[2] A conjurer, especially one skilled in sleight of hand.
[3] “The more things change, the more they stay the same.”
[4] Although he may have taken some liberties with the facts - see https://conversableeconomist.com/2022/08/23/adam-smith-and-pin-making-some-inconvenient-truths/
[5] E.g. https://hatchworks.com/blog/software-development/build-vs-buy/ ; https://www.thoughtworks.com/content/dam/thoughtworks/documents/e-book/tw_ebook_build_vs_buy_2022.pdf
Ken Tune
Senior Solutions Architect

Ken Tune is a Senior Solutions Architect at Tonic.ai. He advises major companies across the EMEA region on the unique value Tonic can bring to their business, guiding them from introduction to adoption. Prior to that he held Solution Architect positions at Aerospike and Aiven and was Senior Principal Consultant at MarkLogic, being responsible for guidance and implementation of over 30 separate deployments in total. Additionally, he has a wealth of experience in finance, having worked for Hambros Bank, HSBC and Markit Group with experience including risk management and major system integration. He has a BA in Maths from Cambridge University and an MSc. in Computer Science from Imperial College, London.
