At Tonic, we’ve been mimicking structured data since day 1. But let’s face it: not all data belongs in columns and rows. Which is why we’re very excited to announce our newest database integration: Tonic can now mimic your document-based data in MongoDB.
This latest integration is joined by a full parade of additional databases Tonic now natively connects with, including Amazon Redshift, Databricks, BigQuery, Spark on Amazon EMR, and Db2.
Join us for a live demo of our MongoDB integration as Software Engineer Emily Ritter introduces our approach to handling the unpredictability of semi-structured data and shows you the tool in action. Following that, Product Manager Kasey Alderete will provide an overview of our capabilities for mimicking data warehouses.
From Db2 to Postgres to Redshift to MongoDB, our customers work in multi-database environments, and our technology equips them to work across those databases, regardless of the type, to create a true mimic of their entire data ecosystem.
Chiara Colombi (00:07):
Hello everyone, and welcome to our June launch event. We've got some really exciting news to share with you today, but first I'd like to take some time to introduce my awesome coworkers, who will be providing more details on our latest capabilities. Joining us are Software Engineer Emily Ritter, who has led the development of today's headline integration, and Product Manager Kasey Alderete, who keeps all of our teams on track as we navigate the goals of our product roadmap.
Chiara Colombi (00:34):
In case this is the first webinar you've joined us for, I'm Chiara Colombi. I'm on the marketing team at Tonic and I'm really glad to connect with everyone today. Like I mentioned earlier, drop us a line in the chat to say hello. Drop your questions there as well, or put them into the Q&A at any time, and I'll jump in to ask them of Emily and Kasey during the presentations. As a quick agenda for today's talk, I'm going to highlight the focus of this launch, which is databases. Specifically, the many new databases that Tonic now connects to directly to help you create safe, useful mimics of your production databases.
Chiara Colombi (01:10):
I say databases pointedly because we've realized that most organizations work in a multi-database environment. That's why we're really pushing to support as many database types as we can, and ultimately every database type there is. Here are our latest integrations. We are extremely excited to now support document-based data in MongoDB. This is Tonic's first NoSQL integration, and it requires a significantly different approach from relational databases. I'm going to leave all of those details for Emily to explain in her demo. We're also focusing heavily on data warehouses, including Amazon Redshift, Databricks, BigQuery and Spark. And we've recently added support for Db2 by popular demand. With that, I will pass the mic over to Emily for an overview and demo of our MongoDB integration.
Emily Ritter (01:58):
Sounds good. Thanks Chiara. Mongo has been an interesting integration for us to build because it's NoSQL. As the first NoSQL database we've implemented, it has brought a lot of fun challenges. Chief among them: NoSQL has no schema. It could be anything. Each document can be completely different, and a collection can change over time, giving very different views of what a document should look like. With unstructured and semi-structured data, it can be a little harder to mask, and a little harder to figure out what exactly should be masked. What is private? What isn't? Along with that, most customers who use Mongo don't only use Mongo. They have other databases too. For example, one of our first clients needed Postgres and Mongo to work together, so we built a solution for that. Let me show you what we've been doing.
Chiara Colombi (02:59):
Go ahead and stop my share.
Emily Ritter (03:12):
Okay. This is Tonic. The first thing you'll want to do after connecting to your Mongo instance is come over to Privacy Hub, where we can scan one of your collections, or all of them. As you can see, it's actually very fast. What this does is comb through all of your documents and build an overall schema of what every document in your collection could look like. This is really helpful, especially for sparse fields, for example, if birth date only appears in one in a hundred documents. Also, if you have any documents that were created in an unexpected way, or the way you create documents has evolved over time, you still want to cover all of those edge cases and make sure no data leaks through so that you don't expose any PII.
Chiara Colombi (04:13):
You mentioned that you can trigger the scans. Do the scans also happen automatically when you first connect, so you don't necessarily have to trigger them?
Emily Ritter (04:23):
Yes. Once you first connect with the database, then it will scan all of your collections so that you have something you can work with.
Emily Ritter (04:31):
And when you actually run a generation to mask, we'll also scan again so that we can keep track of anything that changes. Here we have what we actually see for your schema. As you can see, for city inspections there's a field called certificate number, which can be an int or a string. You might want to mask these differently, or you might want them to be the same, and you can also see more values. As I said, this is a hybrid amalgamation of all of your documents.
Emily Ritter (05:10):
For masking data, we want to cover it in a way that makes sense for your data, that hides all of the personally identifiable information and anything else that shouldn't come through, while keeping it usable. We have a number of generators that each do different things. One of them, the sequential integer generator, is just as simple as it sounds: it starts from zero and counts up. One helpful thing here is that even though these are two different fields, we keep in mind that they're linked. Though you have an int and a string, you may not want an int of one and then a string that's also one. Even though we break the field down by type, we still keep the understanding that the certificate number is the same thing. We also have-
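As a rough illustration of the linking idea (a sketch, not Tonic's actual implementation), a sequential integer generator that keeps an int field and its string twin in step might look like this:

```python
from itertools import count

class SequentialIntegerGenerator:
    """Hypothetical sketch: hands out increasing integers, reusing the same
    output for repeated inputs so an int field and its string twin stay linked."""

    def __init__(self, start=0):
        self._counter = count(start)
        self._assigned = {}  # original value -> masked integer

    def mask(self, value):
        # Normalize so the int 42 and the string "42" share one masked value.
        key = str(value)
        if key not in self._assigned:
            self._assigned[key] = next(self._counter)
        return self._assigned[key]
```

The `str` normalization is the assumption doing the linking here: certificate number `42` and `"42"` both map to the same masked integer rather than being masked independently per type.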
Chiara Colombi (06:10):
[crosstalk 00:06:10]. For anyone who may be familiar with how we work with relational data, are these the same generators that you'd find for relational databases?
Emily Ritter (06:20):
Yeah, that's the idea. Anything that you can do in a relational database that also makes sense in JSON, we support.
Emily Ritter (06:29):
We also have some more complicated generators for more complicated use cases. One thing we found is that, to link Mongo and other databases, people sometimes have complex IDs. For example, here we have a regex mask generator, where we can break an ID into its separate parts and then mask each part individually in a different way. That way we can maintain consistency across different fields. Speaking of consistency, you want to be able to mask your Mongo and also mask your Postgres and have them work together. One of the things we have is what we call consistency, which means that the same value will always go to the same value. For example, here we have a company name. If I turn consistency on, then this business name will always map to this same masked business name, and that way, for things like IDs, you'll be able to mask them the same way each time.
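To make the regex idea concrete, here's a minimal sketch (the ID format and the masking functions are invented for illustration): a compound ID is split into capture groups and each group is masked by its own function while the separators stay put.

```python
import re

# Hypothetical compound ID like "NYC-2021-00042": region, year, serial.
ID_PATTERN = re.compile(r"(?P<region>[A-Z]+)-(?P<year>\d{4})-(?P<serial>\d+)")

def regex_mask(value, maskers):
    """Masks each named capture group with its own function, preserving
    the literal layout of the ID."""
    match = ID_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized ID format: {value!r}")
    masked = {name: fn(match.group(name)) for name, fn in maskers.items()}
    return "{region}-{year}-{serial}".format(**masked)
```

You could, for instance, keep the year intact while scrambling the region and zeroing out the serial, so downstream joins on the year still work.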
Chiara Colombi (07:39):
Is that across other databases that you're working in?
Emily Ritter (07:42):
Yes. That will work across Mongo and Postgres so that you can mask your keys the same way.
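One common way to get that kind of cross-database consistency (a sketch of the general technique, not necessarily how Tonic does it) is to derive the replacement deterministically from the input with a keyed hash, so the same secret yields the same mapping in Mongo and Postgres alike, and across separate runs:

```python
import hashlib
import hmac

# Illustrative replacement pool for company names.
FAKE_COMPANIES = ["Acme Corp", "Globex", "Initech", "Stark Industries"]

def consistent_mask(value: str, secret: bytes, choices=FAKE_COMPANIES) -> str:
    """Deterministically picks a replacement: the same input plus the same
    secret always yields the same output, across runs and across databases."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    return choices[int.from_bytes(digest[:8], "big") % len(choices)]
```

Because the mapping is a pure function of the value and the secret, no shared state needs to be synchronized between the two databases.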
Emily Ritter (07:50):
It will also work across generations, so you can run different databases at different times without breaking those links. Once you've taken your document and applied everything you want to it, you can run a generation, which will go ahead and apply all of the masking rules you set up. I'm going to share a different screen now. This is MongoDB Compass, for anyone familiar with it, which is MongoDB's own UI for interacting with documents.
Emily Ritter (08:41):
As you can see, this is our unmasked data. It has all of our IDs unmasked, and the same with our business names. Then we set it to go to this output. As you can see, we now have the endings we were expecting, and I'll share Tonic again. We got the endings we expected because, as you can see, it will always replace this ending with "test". We've been talking with one of our earliest customers on Mongo, eBay, about what they like about using Mongo with Tonic and what makes it most useful for them. They brought up a few pieces of interesting feedback. One is that they found it very important to be able to see everything in one place: the ability to see what all of their documents look like as a hybrid amalgamation.
Emily Ritter (09:46):
Also the ability to then see samples of that data. Instead of having to go through one masking tool, then to Compass, and then potentially other places, they have one view where they can look at it all. Along that same note, they also liked that users can protect the data without actually needing any access to it. This is especially helpful if you have a security team: you can share a workspace with them in Tonic without having to jump through all of the hoops you'd need to grant them database access. This speeds things up while also keeping things much safer. And now that you've shared this workspace, people can collaborate on it and figure out what makes the most sense. People can discuss, "Does this field need to be masked? Is this generator really the right one? Or do we want something different?"
Emily Ritter (10:45):
Another thing we found very interesting is that they're actually using Tonic with Mongo as a NoSQL interface. They use Basecamp as well as Mongo. They're able to push everything from Basecamp into Mongo, then Tonic masks it, and then they push it back into Basecamp. We're able to mask a huge amount of non-Mongo data just using Mongo as an in-between.
Kasey Alderete (11:10):
Yeah, I think it's Couchbase and maybe even some other homegrown systems they have. Anything that's just represented in a document, they're able to do their own translation, let us treat it as a document and then roll that back out.
Chiara Colombi (11:27):
In that way, Mongo may be Tonic's first NoSQL integration, but it's enabling work with even more NoSQL data.
Kasey Alderete (11:34):
Yeah. It's really a stepping stone to us understanding how to work with documents, going beyond just rows and columns.
Chiara Colombi (11:42):
That's awesome. We do have a quick shout-out from eBay about how nothing they've tried in-house is comparable to what they're doing now with Tonic. It's great to hear that. It's been fantastic working with them in particular because we have so many use cases with them, including Mongo.
Kasey Alderete (11:59):
Yeah, actually that's a good segue. When we think about eBay, we think about what's interesting about their use cases: the scale at which they're working with Tonic. If you go onto the next slide, I can transition into talking a little bit more about how we're mimicking structured data at scale. I want to introduce a whole new group of relational databases that Tonic now natively supports, and that's data warehouses. These are still relational, SQL-based systems. If you want to click ahead, I think there's a little title here. We're still talking about relational databases, which we've always supported, but what we've seen lately is a shift in the users and the use cases they're coming to us with for their privacy needs. If you go onto the next one: part of what we're seeing in terms of users is that we're starting to serve a lot of data engineering teams, data operations, and centralized data groups, because once you move into the realm of data warehouses, you're not necessarily working on just one application but at massive scale across your organization.
Kasey Alderete (13:03):
This introduces some different nuances to working with the data. Starting with Amazon Redshift, we're really excited that we can now enable customers working on one of the fastest-growing cloud data warehouses. Amazon Redshift actually doesn't enforce foreign key constraints, but Tonic allows you to define foreign key relationships in your data even if they're not declared or enforced at the database level. You don't have to go back upstream and configure the database schema or be reliant on that; you can do it from our UI or from our API. Why is that important? Foreign keys often represent relationships in your data. Defining them helps us, as we mask and mimic your data, preserve those relationships so that you're not changing a foreign key and the primary key it references independently. We don't want to get those out of sync. We want to keep them in sync so that the test data on the other side still makes sense. It's still referentially intact.
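The sync point can be sketched like this (the table and column names are made up for illustration, and `mask_fn` stands in for whatever generator you pick): route the primary key and every foreign key that references it through one shared mapping.

```python
def mask_linked_keys(users, orders, mask_fn):
    """Routes users.id and orders.user_id through one shared mapping so the
    masked data stays referentially intact."""
    mapping = {}

    def remap(key):
        # First occurrence fixes the masked value; every later reference reuses it.
        if key not in mapping:
            mapping[key] = mask_fn(key)
        return mapping[key]

    masked_users = [dict(u, id=remap(u["id"])) for u in users]
    masked_orders = [dict(o, user_id=remap(o["user_id"])) for o in orders]
    return masked_users, masked_orders
```

If the two columns were masked independently instead, an order's `user_id` would no longer point at any masked user, and joins in the test environment would silently return nothing.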
Kasey Alderete (14:05):
The other part of this that's interesting is some of the underlying technologies we've been taking advantage of. Once you're part of the Amazon ecosystem, we're making use of Amazon Lambdas to make that really performant, again at that massive scale. We've got you covered on the Amazon Redshift front. The other aspect I'm really excited about is our Databricks support. Again, it's a lot about massive scale. We see huge data sources, often coming from different lines of business and different upstream application systems, all feeding data into a data lakehouse or data warehouse, whatever you want to call it. We've had to come up with ways of doing incremental data loads; for example, we have one customer adding 50 terabytes of data per day to their data lakehouse.
Kasey Alderete (15:04):
We had to find ways to not go back and re-transform the entire database, but to do it on more of an incremental basis. Another thing I want to highlight here is the last one on the page, and it's really interesting: we now provide a Java SDK so that you can integrate and access Tonic functions natively in your data pipeline. You can use the Java SDK to access Tonic functions and transform your data in place, where it lives in the Databricks storage layer.
Chiara Colombi (15:39):
That actually speaks to a question that recently came through about how automated customers are able to make these processes for data warehouses, given the sheer amount of data coming in. How much of that data management can Tonic alleviate?
Kasey Alderete (15:53):
Yeah, we think about automation in a few different ways. We take on a lot of the hard technical challenges around performance and make that work really well under the hood, but we give our customers different ways of accessing it. Everything you can do in the UI, you can do in the API. Most of our sophisticated customers go through the API: they have scheduled jobs and really rich infrastructure where they call out to Tonic as part of a new-environment process.
Kasey Alderete (16:23):
Continuing the cloud data warehouse theme is our continued support for Google BigQuery, which has gotten even better recently. We've been working with customers operating at massive scale, doing more under-the-hood work to make sure it works for their use cases and the different things that come up there. I actually like what Emily said about collaboration, because that's really important: once you move to data warehouses, you're probably moving beyond a single person who knows the data and interacts with it. We have role-based access control, commenting, things like that to support a workflow where one person maybe knows what's sensitive, another person knows the generators, and together they can put those pieces together to get a really usable data set on the other side.
Chiara Colombi (17:06):
And that's both for Mongo and within the UI generally, yeah.
Kasey Alderete (17:10):
Yep. It's again a single pane of glass, which is really helpful; everybody knows their to-do list in terms of data privacy across the landscape. We're also continuing to invest in Spark, which is really great for accessing flat files, so Parquet and Avro files in Amazon S3. And again, that Java SDK I mentioned: a lot of what makes it possible is Spark functionality. Finally, not to be forgotten, is our commitment to enterprises and our vision for protecting all data. We want all data everywhere to be a part of it. It goes beyond just modern technologies and the new cloud warehouses; a lot of companies, namely banks and healthcare companies, have sensitive information that lives in traditional systems like Db2. Based on popular request, we've added support for the LUW and iSeries lines of Db2.
Chiara Colombi (18:13):
I think something worth pointing out there is that one way Db2 and MongoDB are similar is that there aren't a lot of tools out there for them. What Tonic is offering is pretty novel in being able to create a mimic of production data in either of those systems. There just aren't a lot of other options.
Kasey Alderete (18:32):
Right. I think that's all we have. That's our whirlwind tour of updates across our database ecosystem.
Chiara Colombi (18:43):
Awesome. Thanks to both of you, Emily and Kasey. We do have a couple of questions that I'd like to go through now. Let's see what we have. How common are multi-database environments? What sorts of combinations have we seen?
Kasey Alderete (18:54):
Yeah, I like this question. It's easy to think, "Oh, one database can meet all your needs, if you can just find that one database." But the reality is that different applications have different needs, and companies also grow in ways like M&A activity: they acquire a company, and now they have a new product line. You'll see different SQL databases, like MySQL and SQL Server, in the same environment, or Postgres and MySQL. You'll also see something like Amazon Redshift and Postgres together. Another customer has Oracle and Mongo. We see all the combinations: it could be multiple SQL or relational databases, or it could be a NoSQL solution alongside a data warehouse and an application database.
Chiara Colombi (19:41):
All right. And maybe this is a good follow-up question to that. It seems like a lot of teams stand to benefit from this unified environment. Could you speak to which teams you think are benefiting?
Kasey Alderete (19:54):
Yeah, I think Emily touched on it a little bit; I'll just expand from there. One of the teams that most benefits from having this unified environment for data privacy is the security team, because the security team has all of the knowledge and responsibility for protecting the data, but they're not necessarily DBAs. They don't need access to the production database to go in there and view and manipulate data. Now they only have to get access to Tonic so they can apply that policy right there, in one place. That's the first team. The next is the downstream consumers of the mimicked data: the development teams and the data science and data engineering teams. Instead of having to think, "Oh, I need to protect this data before I use it, or request it from some different tool."
Kasey Alderete (20:41):
We actually operate in the background. We're just this layer sitting between production and lower environments. They just see the staging environment and have no idea about what happens behind the scenes. They don't have to change their behavior; they just know that they have really great, usable data in their staging environments. The third team that benefits from this unified environment is the DevOps and SRE teams, these centralized technical teams who provision and manage Tonic. Now they have just one place. They don't have to install it alongside each of their databases; it's not a database-specific tool. It's just part of their deployment environments. We see them managing it in the same place they have their Docker or their Jenkins, or whatever their infrastructure is.
Chiara Colombi (21:31):
Yeah, we learn from our customers how they're using Tonic and how it's impacting their teams, and it's really expanded our use cases over time. I do have a question for you, Emily, about Mongo: could you explain the hybrid document more fully? I guess it's something that builds up as older versions of a given document or schema have to be aligned with newer ones.
Emily Ritter (21:54):
Yeah, the hybrid document is, in its simplest form, the combination of every document that exists in a collection. If you have a collection with documents that don't exactly match, or a field that only exists in a single document, all of that will show up in the hybrid document. It's there for the user to view so they can see every possibility of what they need to cover and what they need to mask.
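Conceptually, building that hybrid document is a union over every document's fields. A minimal sketch (simplified to one level of nesting, with Python type names standing in for BSON types):

```python
def hybrid_schema(documents):
    """Unions every field seen across all documents, recording each value
    type, so sparse fields and type drift show up in a single view."""
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema
```

Run over the city-inspections example, a certificate number stored sometimes as an int and sometimes as a string would surface as one field with two types, and a birth date present in only one document would still appear in the schema.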
Chiara Colombi (22:26):
Awesome. Thank you for clarifying that. Let's see if we've got any other questions. I think we've touched on all of them. Well, there was a question about whether Tonic has an API, and we've mentioned that a couple of times, but just to reiterate: yes, there is an API. Oh, one more question for you, Kasey: what else is on the roadmap? What can we look forward to?
Kasey Alderete (22:50):
That's a good question. We'll continue to push the boundaries of the databases we support; there are some exciting new ones we're going to be adding in the next couple of months. The other thing I'm really excited about, that I'll give a teaser for, is our machine learning capabilities. A little bit of Tonic background: the basic unit of Tonic is a generator. A generator is where you take raw, sensitive data, apply some masking or mimicking strategy to it, and get desensitized, anonymized output. The way that works today is that you understand your data, you know the data types and the data needs, and you apply the right generator. Well, using machine learning and neural networks, true deep learning in the product, we can now discover relationships in your data so that you don't have to declare them, and mimic them in a way that preserves the statistical characteristics of your data, right in the product. I'm really excited to share more about Smart Linking. Stay tuned.
Chiara Colombi (23:56):
Awesome, further automation in the works. All right, with that, I think we can wrap things up. We've got all our questions covered. Thanks to everyone for joining us. I'll quickly mention a couple of ways you can contact us. You can visit the website, sign up for more webinars coming soon, reach out at firstname.lastname@example.org, or connect with us on Twitter @tonicfakedata. If you have any follow-up questions, or if you'd like to get a demo for your team of Mongo or of how we work with relational data, we would love to connect. Thank you again, Emily. Thank you, Kasey. This has been great. I will see everyone next time.