How to create realistic, safe, document-based test data for MongoDB

Author

Chiara Colombi

August 12, 2021

An Overview of MongoDB

MongoDB is a NoSQL database platform that uses collections of documents to store data rather than tables and rows like most traditional Relational Database Management Systems (RDBMS). It derives its name from the word 'humongous' — 'mongo' for short. It is an open source database with options for free, enterprise, or fully managed Atlas cloud licenses.

Development on MongoDB began as early as 2007 with plans to release a platform as a service (PaaS) product; however, the founding software company 10gen decided instead to pursue an open source model. In 2013, 10gen changed its name to MongoDB Inc. to unify the company with its flagship product, and the company went public in 2017.

MongoDB was built with the intent to disrupt the database market by creating a platform that would ease the development process, scale faster, and offer greater agility than a standard RDBMS. Before MongoDB's inception, its founders — Dwight Merriman, Kevin P. Ryan, and Eliot Horowitz — were founders and engineers at DoubleClick. They were frustrated with the difficulty of using existing database platforms to develop the applications they needed. MongoDB was born from their desire to create something better.

As of this writing, MongoDB ranks first on db-engines.com among document stores and fifth overall among database management systems.

Being document-based, Mongo stores data in JSON-like documents of varying sizes that mimic how developers construct classes and objects. MongoDB's scalability can be attributed to its ability to define clusters with hundreds of nodes and millions of documents. Its agility results from intelligent indexing, sharding across multiple machines, and workload isolation with read-only secondary nodes.

Challenges Creating Test Data in MongoDB

While the ease of creating documents to store data in MongoDB is valuable for development purposes, it poses significant challenges when you try to create realistic test data for Mongo. Unlike traditional RDBMS platforms with predefined schemas, MongoDB works with JSON-like documents that are self-contained, each carrying its own definition. In other words, it's schema-less. The elements of each document can evolve without having to conform to earlier documents, and the overall structure can vary from one document to the next. A field that holds a string in one document may hold an integer in another.
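To make the type-drift problem concrete, here is a minimal sketch in Python. It assumes the documents have already been read out of a collection into plain dictionaries; the `customers` sample data and the `profile_field_types` helper are hypothetical names invented for illustration, not part of MongoDB or any other tool.

```python
from collections import defaultdict

# Hypothetical sample documents standing in for one MongoDB collection.
customers = [
    {"_id": 1, "name": "Ada", "zip": "94107"},
    {"_id": 2, "name": "Grace", "zip": 10001},
    {"_id": 3, "name": "Linus", "zip": None, "email": "linus@example.com"},
]


def profile_field_types(documents):
    """Map each top-level field name to the set of Python type names seen."""
    profile = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            profile[field].add(type(value).__name__)
    return dict(profile)


type_profile = profile_field_types(customers)

# Fields whose values appear with more than one type across documents.
mixed_fields = sorted(f for f, types in type_profile.items() if len(types) > 1)
print(mixed_fields)
```

In this sample, only `zip` shows up with more than one type, so any masking rule written for it has to handle strings, integers, and missing values alike.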

The JSON file format itself introduces its own level of complexity. JSON documents have great utility because they can be used to store many types of unstructured data from healthcare records to personal profiles to drug test results. Data of this type can come in the form of physician notes, job descriptions, customer ratings, and other formats that aren't easy to quantify and structure. What’s more, it is often in the form of nested arrays that create complex hierarchies. A high level of granularity is required to ensure data privacy when generating test data based on this data, whether through de-identification or synthesis. If that granularity isn’t achieved, the resulting test data will, at best, fail to accurately represent your production data and, at worst, leak PII into your lower environments.
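As a rough illustration of the granularity involved, the sketch below walks a nested document and reports every path whose value looks like an email address. It is only a sketch under stated assumptions: the `patient` document, the `iter_leaves` and `find_email_paths` helpers, and the deliberately simple email pattern are all hypothetical, and real PII detection needs far more than one regular expression.

```python
import re

# Deliberately simple pattern, for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def iter_leaves(value, path=""):
    """Yield (path, leaf_value) for every scalar in a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_leaves(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            # All elements of an array share one path, marked with [].
            yield from iter_leaves(child, f"{path}[]")
    else:
        yield path, value


def find_email_paths(doc):
    """Return the sorted, de-duplicated paths whose string values contain an email."""
    return sorted(
        {p for p, v in iter_leaves(doc) if isinstance(v, str) and EMAIL_RE.search(v)}
    )


# Hypothetical document: one email hides in free text, another in a nested array.
patient = {
    "name": "Jane Roe",
    "visits": [
        {"date": "2021-03-02", "notes": "Follow up via jane.roe@example.com"},
        {"date": "2021-04-11", "contacts": [{"email": "j.roe@example.org"}]},
    ],
}

email_paths = find_email_paths(patient)
print(email_paths)
```

Note that one of the two hits sits inside a free-text `notes` field rather than a field named `email`, which is exactly the kind of leak a field-name-only approach would miss.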

A high degree of privacy paired with a high degree of utility is the gold standard when generating test data based on existing data. Already it can take days or weeks to build useful, safe test data in-house using a standard RDBMS. The variable nature of MongoDB's document-based data extends that in-house process considerably. It's the wild west out there, and you’d need to build a system capable of tracking every version and format of every document in your database to ensure that nothing is missed—a risky proposition.

It’s also worth noting that there aren’t many tools currently available for de-identifying and synthesizing data in MongoDB. This speaks to the challenges involved—challenges we’re gladly taking on.

Solutions for Mimicking Document-based Data with Tonic

Safely generating mock data in a document-based database like MongoDB requires best-in-class tools that can detect and locate PII across documents, mask the data according to its type (even when that type varies within the same field across different documents), and give you complete visibility so you can ensure no stone has been left unturned.

De-identifying a MongoDB collection in Tonic

At Tonic, we provide an integrated, powerful solution for generating de-identified, realistic data for your test environments in MongoDB. For companies working with data that doesn't fit neatly into rows and columns, Tonic enables aggregating elements across documents to realistically anonymize sensitive information while providing a holistic view of all your data in all of its versions. Here are a few ways we accomplish this goal:

Schema-less Data Capture: For document-based data, Tonic builds a hybrid document model to capture the complexity of your data and carry it over into your lower environments. Our platform automatically scans your database to create this hybrid document, capturing all edge cases along the way, so you don’t miss a single field or instance of PII.
Granular NoSQL Data Masking: With Tonic, you can mask different data types using different rules, even within the same field. Regardless of how varied your unstructured data is, you can apply any combination of our growing list of algorithm-based generators to transform your data according to your specific field-level requirements.
Instant Output Preview: After applying the appropriate generators to your data, you can preview the masked output directly within the Tonic UI. This gives you a complete and holistic view of the data transformation process across your database.
Cross-database Support: Achieve consistency in your test data by working with your data across database types. Tonic keeps input-to-output mappings consistent across your databases, from MongoDB to PostgreSQL to Redshift to Oracle. Our platform connects natively to all of your database types to consistently and realistically de-identify your entire data ecosystem.
De-identify Non-Mongo NoSQL Data Too: You can use Tonic with Mongo as a NoSQL interface to de-identify NoSQL data stored in Couchbase or your own homegrown solutions. By using MongoDB as the go-between, Tonic is able to mask a huge variety of unstructured/NoSQL data.
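For readers who want a feel for the general ideas behind type-aware masking and cross-database consistency, here is a minimal Python sketch. It is not Tonic's implementation or API. It assumes a keyed hash (HMAC) as the masking function, and `SECRET_KEY`, `mask_value`, `mask_fields`, and the sample records are all hypothetical names made up for this example.

```python
import hashlib
import hmac

# Hypothetical demo key; a real key would come from a secret store.
SECRET_KEY = b"demo-only-key"


def _digest(value):
    """Keyed hash of a value's repr, so equal inputs always give equal digests."""
    return hmac.new(SECRET_KEY, repr(value).encode("utf-8"), hashlib.sha256).hexdigest()


def mask_value(value):
    """Mask a scalar with a rule chosen by its runtime type."""
    # bool is a subclass of int in Python, so it is checked first and passed through.
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, str):
        return "str_" + _digest(value)[:12]
    if isinstance(value, int):
        # Stay within the range implied by the original number of digits.
        digits = len(str(abs(value)))
        return int(_digest(value), 16) % (10 ** digits)
    return value  # other types are left untouched in this sketch


def mask_fields(record, sensitive_fields):
    """Return a copy of the record with the named top-level fields masked."""
    return {k: mask_value(v) if k in sensitive_fields else v for k, v in record.items()}


# The same field holds a string in one document and an integer in another.
mongo_doc = {"_id": 1, "zip": "94107", "plan": "pro"}
mongo_doc_int = {"_id": 2, "zip": 10001, "plan": "free"}
# A row from a different (hypothetical) relational source with the same value.
postgres_row = {"customer_id": 1, "zip": "94107"}

masked_a = mask_fields(mongo_doc, {"zip"})
masked_b = mask_fields(mongo_doc_int, {"zip"})
masked_c = mask_fields(postgres_row, {"zip"})
print(masked_a, masked_b, masked_c)
```

Because the mask is a deterministic function of the input and the key, the string `"94107"` maps to the same masked value whether it came from the document or the relational row, while the integer variant of the field gets its own integer-preserving rule. A keyed hash is only one of several ways to get this property; it trades realism of the output for simplicity.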

We’re proud to be leading the industry in offering de-identification of semi-structured data in document-based databases. Are you ready to start safely creating mock data that mimics your MongoDB production database? Check out a recording of our June launch webinar, which includes a demo of our Mongo integration. Or better yet, contact our team, and we'll show you the ropes live.

Chiara Colombi

Director of Product Marketing

A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.