Useful, realistic, safe—this is the holy grail of modern test data. But what does it take to generate quality data that checks off each of those boxes?
In other words, what does it take to get your developers the data they need?
To coincide with the launch of our ebook Test Data 101, we’re deep diving into all things test data generation, from anonymization to synthesis to what is fast becoming the gold standard for testing and development—data mimicking.
Join our CTO and co-founder Andrew Colombi for a live conversation and Q&A, as he explores the tools and integrations you need to create high quality, compliant data without draining engineering resources.
We’ll cover:
- Why teams generate test data instead of using production data
- Historical approaches: dummy data, mock data, pseudonymization, and subsetting
- Modern approaches: anonymization, synthesis, and data mimicking
- Pros, cons, and the build-versus-buy question
Be sure to download the companion ebook Test Data 101 for the fullest experience.
Chiara Colombi (00:06):
Hello again, everyone, and welcome to today's conversation on test data generation. My name is Chiara, I'm on the marketing team at Tonic, and today I have the great pleasure of introducing you to our speaker, Andrew Colombi. Andrew is a co-founder of Tonic, he is Tonic's CTO, and he is the co-author of the companion ebook to today's presentation. He also holds a PhD in computer science from the University of Illinois at Urbana-Champaign.
Chiara Colombi (00:31):
Before I let Andrew take the stage, I'd like to quickly remind everyone that you can download the companion ebook Test Data 101 on our website, tonic.ai. The ebook contains some of the content that you'll see today in the presentation, if you'd like to share it more easily with others on your team. We'll also leave plenty of time for questions: you can either ask them midway through the presentation or save them for the end. Just drop them in the Q&A or the chat and I'll keep my eyes out for them there. With that, I will hand things over to Andrew.
Andrew Colombi (01:00):
Great, thank you for that introduction. And yeah, let's get started and jump right into Test Data 101. To start things off, I'll give you an overview of what we're going to talk about today. We'll begin with definitions and background context: what have people historically done with test data, what approaches have they taken, and why are people building test data in the first place? That gives everyone in the audience a level playing field, a shared starting point, before we go into some of the more modern approaches to generating test data, which we'll cover in the next section.
Andrew Colombi (01:43):
And then at the very end, we'll do a little bit of analysis: what are the pros and cons, when are different approaches more applicable, and we'll also touch on the question of build versus buy. And just to reiterate what Chiara said earlier, I do love questions, so if you have any, feel free to just drop them in the chat, and Chiara will interrupt me. And I'll hopefully come up with a smart answer to your question.
Andrew Colombi (02:15):
Okay, so let's talk about why; I always like to begin my presentations with why. Why do we generate test data instead of using production data? Let's give a little example: say you're a QA manager, and you're responsible for setting up the staging infrastructure at your company. One of the most important things you're going to have to provide to make a quality staging environment is quality staging data.
Andrew Colombi (02:46):
And so you might call out to your colleagues on the other side of the chasm and say, "Hey, we would like some of your data." And of course, they're going to respond with, "Well, that's not possible." There's a myriad of reasons why that's true. GDPR specifically states that you really shouldn't be using production data in staging. CCPA, the California law that recently came into effect and is modeled largely after GDPR, has the same stipulations.
Andrew Colombi (03:13):
SOC 2, same stipulations. There are a lot of reasons why, in that scenario, you can't use production data in your staging environment. And that's not the only kind of application, right? You might be trying to enable outsourced developers to contribute to your project, and they're going to need data too, of course. There are a number of reasons why that might become a problem: there are data sovereignty issues, and the Schrems II court decision makes it difficult to share data outside the EU, especially with the US.
Andrew Colombi (03:51):
So those are a lot of the reasons why it's difficult to use production data in your test environments. And there are a lot more applications than just the two I've mentioned here, outsourcing and staging. Let's quickly review some of them. We talked about QA environments just now, but there's also the customer success side of things. If you have a specific customer experiencing a specific issue, being able to create test data that reflects their exact scenario is super valuable, most immediately in debugging the issue, but also going forward, by creating data that can be integrated with your automated testing to prevent that bug from coming back.
Andrew Colombi (04:38):
All those things are super valuable to have test data for. CI/CD is another great example: you're running automated tests, whether they're integration tests, functional tests, or regression tests, and all of them want high-quality test data to provide the maximum value to your organization. And then product demos, switching gears completely: product demos and sales organizations have a need for data just as much as your QA teams do. The best demos are the ones that look real, and they look real when they are real, right?
Andrew Colombi (05:17):
But of course you can't use your customer data in your product demos, and that's another application for creating test data. The final application, which has a lot of different sub-applications, is machine learning. Machine learning is a really broad topic, but there are two particular applications I like to bring up when we're talking about test data. The first is the machine learning pipeline: machine learning is a huge pipeline, and at the very end you have some complicated neural net that trains on data.
Andrew Colombi (05:51):
But before you get to that point, there's often a lot of data cleaning and data joining that needs to happen to create that perfect data set that can then be used for your machine learning process. That whole pipeline is software unto itself; it has problems, it has challenges, it needs to be tested and regression tested. Having test data to put the pipeline through its paces can be very valuable.
Andrew Colombi (06:14):
And then the second use case I like to mention: there have been some recent studies showing that using mocked data, or noisy data, in the machine learning training process itself can improve results. They've shown this especially with images, and we think it could be applicable outside the image domain as well, in other contexts of structured data.
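[Editor's note: as a rough illustration of that idea, here's a minimal sketch of noise-based augmentation on tabular features. The array names and noise scale are hypothetical, not taken from any study cited in the talk.]

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))  # stand-in for real training features

def augment_with_noise(X: np.ndarray, scale: float = 0.05, copies: int = 2) -> np.ndarray:
    """Append noisy copies of each row; a common regularization trick."""
    noisy = [X + rng.normal(scale=scale, size=X.shape) for _ in range(copies)]
    return np.vstack([X, *noisy])

X_augmented = augment_with_noise(X_train)  # 3x the rows, slightly perturbed
```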
Andrew Colombi (06:43):
Okay, so those are some applications of test data. Now I want to talk quickly about the evolution of test data and how it's changed over time. These are stages of maturity that you'll see, and they can be applied to companies as well as industries. In the early days of data collection around health data, there wasn't much in terms of regulations or processes. Then over time, things like HIPAA came about, and that happened a good while ago. But look at a different industry, like social media, and they're in the Stone Age compared to where healthcare is, even though social media came along much later than healthcare data collection.
Andrew Colombi (07:29):
But every industry goes through this process. At first it's the Wild West, and people are just using production data everywhere, or maybe starting to use some dummy data created in-house. Then as the organization matures, as the industry matures, as regulations mature, and as customers become more knowledgeable about the risks associated with their data, they move along this path toward anonymized production data sets or early data synthesis. And just to be clear, we're going to start defining these terms shortly: what is dummy data, what is anonymized data, and so on.
Andrew Colombi (08:08):
But before we get there, let me finish out this thought: toward the end, you get to advanced test data. That's what we're calling mimicking production data sets, and we'll get into that, as well as subsetting data and creating data bursts. These are all techniques that come toward the end of the maturity lifecycle, when you've exited adolescence and reached the adulthood of test data.
Andrew Colombi (08:37):
Okay, so I promised we would talk about definitions, so here we are. We'll start with dummy data, which is probably the simplest. Often, when a company decides to embark on this process and says, "Okay, we need better test data," the first thing the engineers think of is, "Well, why don't we create this ourselves?" We call this dummy data. What is it? It's looking simply at the data type of your field. In this case we've got first name, we know that's a text field, and we're replacing the data in there with data that fits that data type, and really nothing more.
Andrew Colombi (09:19):
That's how you end up with nonsense text replacing the original text. The next level beyond dummy data is the next step in the engineer's thinking: the engineer builds dummy data, tries to test the application with it, and realizes, "Wow, the application's really hard to use. I don't understand the interface, I don't know who the people are, it's just very confusing, and my colleagues can't work with this data for testing." So they come up with the next step, which is mock data. With mock data, you're looking at not just the data type but also the semantic type of the data.
Andrew Colombi (09:55):
In this case, a first name is a first name. We know what a first name is, so we can create fake first names: David, Jane. They're real first names, but these aren't real people. Job titles are the same, companies are the same. These are different ways we can look at the semantics of the data and preserve those semantics as we go. And this leads you to the next step in the phase, which is pseudonymization.
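[Editor's note: before moving on, here's a minimal sketch of the dummy-versus-mock distinction just described. The name list and field lengths are illustrative, not from the talk.]

```python
import random
import string

FIRST_NAMES = ["David", "Jane", "Priya", "Marcus", "Aiko"]  # mock values

def dummy_text(length: int = 8) -> str:
    # Dummy data: respects only the data type (text), nothing more.
    return "".join(random.choices(string.ascii_lowercase, k=length))

def mock_first_name() -> str:
    # Mock data: respects the semantic type; a first name stays a first name.
    return random.choice(FIRST_NAMES)

print(dummy_text())        # e.g. "qzkvwmah" -- confusing to testers
print(mock_first_name())   # e.g. "Jane" -- readable, but not a real person
```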
Andrew Colombi (10:22):
So in the previous approach, we replaced all the fields with made-up values. What pseudonymization says is, "Okay, certain fields are identifying, and we're going to replace just those." So we replace first name, last name, and date of birth, and after replacing those, we leave the rest of the data intact. In theory, that should be enough to prevent anyone from reidentifying the person or gleaning harmful information from the data set.
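[Editor's note: in code, pseudonymization might look like the sketch below. Deciding which fields count as "identifying" is the crux, and this hardcoded list is purely illustrative.]

```python
import random

FIRST_NAMES = ["Jane", "Priya", "Marcus"]
LAST_NAMES = ["Smith", "Okafor", "Lindqvist"]

def pseudonymize(record: dict) -> dict:
    out = dict(record)
    out["first_name"] = random.choice(FIRST_NAMES)
    out["last_name"] = random.choice(LAST_NAMES)
    out["date_of_birth"] = "1980-01-01"  # placeholder birth date
    # Everything not flagged as identifying (job title, zip, etc.) is left
    # as-is, which is exactly the weakness discussed next.
    return out

print(pseudonymize({"first_name": "David", "last_name": "Jones",
                    "date_of_birth": "1975-03-02",
                    "job_title": "Director of Security", "zip": "94105"}))
```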
Andrew Colombi (10:54):
And if you're at this presentation right now, you probably know that's not true, that this isn't enough. GDPR specifically calls out pseudonymization as a mechanism that is not sufficient. And there are historical examples. There's the Netflix Prize, which you may have heard of, where Netflix released data after doing some pseudonymization, in an effort to create buzz around the company and present an interesting computer science challenge to the public.
Andrew Colombi (11:26):
And it backfired a little bit, maybe a lot: people were able to deduce where the actual data came from, and who the particular people inside that data were, just based on the fields that were not anonymized. It makes intuitive sense: if I tell you there's a director of security who was hired on this date, is a man, and lives in this zip code, well, how many male directors of security hired on that date can possibly live in that zip code? That's how pseudonymization breaks down.
Andrew Colombi (12:02):
From here, I want to talk briefly about some other approaches. These aren't directly anonymization approaches, but they're related to test data. We'll talk about subsetting first. Subsetting is the process of taking a slice of your data set: a completely logically intact portion of your database. In this case, we've got three tables, a users table, an orders table, and a likes table, and they're all linked together with foreign keys.
Andrew Colombi (12:37):
When we extract one row from the users table, it automatically extracts the relevant rows from orders and likes. So how is this related to test data? In two ways. First, your test infrastructure is often not as robust or as well resourced as your production infrastructure, so it's helpful to be able to create smaller datasets, so that your staging infrastructure isn't overly taxed by the quantity of data you have.
Andrew Colombi (13:14):
The other way it's related to test data is that it does provide a degree of protection. If you're reducing the size of your data, then as a natural consequence you're reducing your exposure if that data is lost; only a subset of your users would be compromised. So that's subsetting. The reverse of subsetting is data bursts, which is creating additional data: expanding your data scale rather than reducing it. That can be very valuable for scale testing, or for finding bottlenecks in your application.
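[Editor's note: here's a minimal sketch of both ideas over in-memory tables. The table and column names mirror the users/orders/likes example; everything else is invented.]

```python
import copy
import random

users  = [{"id": i, "name": f"user{i}"} for i in range(100)]
orders = [{"id": i, "user_id": random.randrange(100)} for i in range(500)]
likes  = [{"id": i, "user_id": random.randrange(100)} for i in range(800)]

def subset(users, orders, likes, fraction=0.1):
    # Keep a slice of users, then follow the foreign keys so the
    # extracted portion of the database stays logically intact.
    kept = {u["id"] for u in random.sample(users, int(len(users) * fraction))}
    return ([u for u in users if u["id"] in kept],
            [o for o in orders if o["user_id"] in kept],
            [l for l in likes if l["user_id"] in kept])

def burst(users, copies=10):
    # Data burst: expand the data for scale testing by cloning rows with fresh ids.
    out = list(users)
    next_id = max(u["id"] for u in users) + 1
    for u in users:
        for _ in range(copies):
            clone = copy.deepcopy(u)
            clone["id"], next_id = next_id, next_id + 1
            out.append(clone)
    return out
```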
Andrew Colombi (13:52):
Okay, so those are some historical approaches to generating test data. Now we'll go into some more modern approaches, the first of which is anonymization. When you peel back the covers on how anonymization can be done, there are several techniques. The most primitive is redaction, which is still used today; the CIA uses it when releasing public information, and you'll see those reports where things are X'd out. So David becomes David S, and we lose the job title and the credit card. You can do one better than that with scrambling, which keeps the form of the data, the shape of the data.
Andrew Colombi (14:46):
You can tell there are spaces in the value without revealing the data itself. That's one step better than redaction, which just removes the field altogether; scrambling leaves you some sense of the form of the data. And this is actually quite valuable for test and development, because it answers questions like: how big does this field need to be on my web UI? How many characters do I need to fit?
Andrew Colombi (15:15):
Well, the average user might have a job title that's 30 characters long, so I'll make the field big enough to fit 30 characters. You can get that information from scrambled data; you can't get it from redacted data. In our platform, and we're not going to talk too much about Tonic in this presentation, but just as a point: we don't actually support redaction, only scrambling, because we think it's more valuable.
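[Editor's note: a minimal sketch of character scrambling. Each letter or digit is replaced at random within its class, so length and shape survive. This is a generic illustration, not Tonic's implementation.]

```python
import random
import string

def scramble(value: str) -> str:
    out = []
    for ch in value:
        if ch.isupper():
            out.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(random.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(random.choice(string.digits))
        else:
            out.append(ch)  # keep spaces and punctuation, preserving the shape
    return "".join(out)

print(scramble("Director of Security"))  # e.g. "Qwkrmgot bc Xvplqefh"
```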
Andrew Colombi (15:40):
The next level beyond scrambling is format-preserving encryption. You'll notice in these two slides that the outputs don't look that different, and that's because, in terms of the final output, they do look about the same. The big key difference is that scrambling takes a truly random approach to generating the new data, whereas format-preserving encryption harnesses encryption technology to create the fake data.
Andrew Colombi (16:13):
And by using encryption technology, you can actually reverse it and recover the original field values. It's the combination of the two: the format-preserving part creates data that looks similar to the input in size and shape, while the encryption gives you that reversibility. If you just applied regular AES encryption to this, the first name would look nothing like a first name in the output; it would be completely blown away into binary output.
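[Editor's note: to illustrate just the reversibility property, here's a toy keyed, format-preserving transform. To be clear, this is not a real FPE scheme like FF1; it's a sketch of the idea that the same key can undo the transformation while the output keeps the input's shape.]

```python
import hashlib

def _shifts(key: str, n: int) -> list[int]:
    # Derive a repeatable stream of shift amounts from the key.
    digest = hashlib.sha256(key.encode()).digest()
    return [digest[i % len(digest)] for i in range(n)]

def fpe_toy(text: str, key: str, decrypt: bool = False) -> str:
    out = []
    for ch, s in zip(text, _shifts(key, len(text))):
        if decrypt:
            s = -s
        if ch.isupper():
            out.append(chr((ord(ch) - 65 + s) % 26 + 65))
        elif ch.islower():
            out.append(chr((ord(ch) - 97 + s) % 26 + 97))
        elif ch.isdigit():
            out.append(chr((ord(ch) - 48 + s) % 10 + 48))
        else:
            out.append(ch)  # punctuation and spaces keep the shape
    return "".join(out)

secret = "demo-key"
token = fpe_toy("David", secret)
print(token)                                 # same length, still letters
print(fpe_toy(token, secret, decrypt=True))  # "David" -- recoverable with the key
```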
Andrew Colombi (16:50):
But the format-preserving part enables you to fit the data into the fields as they originally were. Okay, so with that, let's return to pseudonymization. We already talked about it; it's the example of replacing just a few fields. But I wanted to bring it up again because there's an improvement we can make, which we call statistical data replacement. With statistical data replacement, you do the same thing as pseudonymization, taking your first name, last name, and date of birth and replacing them with mock values, but then, rather than leaving the other fields untouched, you replace them with statistically plausible data.
Andrew Colombi (17:35):
Let's take the salary example here. We take the salary, look at the rest of the data throughout the database, and figure out what a statistically probable salary is for this record. This protects against having the exact salary in there, while preserving the utility of the data. For example, if you want to do some analysis and ask, what is the average salary in my database? Statistical data replacement preserves that capability, so it's an improvement on pseudonymization. And these... Go ahead.
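[Editor's note: a minimal sketch of the salary example, fitting a simple distribution to the real column and drawing replacements from it. As Andrew notes later, a naive mean/standard-deviation fit like this carries no formal privacy guarantee; it only illustrates the idea.]

```python
import random
import statistics

real_salaries = [72000, 85000, 91000, 88000, 130000, 65000, 99000]

mu = statistics.mean(real_salaries)
sigma = statistics.stdev(real_salaries)

def replacement_salary() -> int:
    # Draw a statistically plausible salary instead of keeping the real one.
    return max(0, round(random.gauss(mu, sigma)))

masked = [replacement_salary() for _ in real_salaries]
print(statistics.mean(masked))  # close to the real average, in expectation
```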
Chiara Colombi (18:18):
Can I just jump in with a couple quick questions?
Andrew Colombi (18:18):
Yeah.
Chiara Colombi (18:18):
The first is: when we're talking about anonymization, we also mean de-identification, masking, obfuscation; these are kind of synonyms in the way that we're talking about them?
Andrew Colombi (18:28):
Yeah, that's right. I picked one term just to tie it in with GDPR more directly. But yeah, all of these terms have a similar ring to them; they're similar ideas. That's exactly right.
Chiara Colombi (18:49):
And then a question about the security of different approaches: it seems like encryption would be a stronger form of anonymization compared to, say, statistical data replacement. In terms of privacy, how secure can we consider statistical data replacement to be?
Andrew Colombi (19:04):
That's an interesting question. Actually, I would not say that format-preserving encryption is necessarily safer; in some senses, it may be less safe. With format-preserving encryption, depending on exactly how you apply it, and without going too deep into the weeds here, there are different ways you can apply encryption, called cipher modes, and depending on the cipher mode you use, you can be open to frequency attacks, for example.
Andrew Colombi (19:33):
So what's a frequency attack? A frequency attack is, let's say I know that all Davids turn into Jkjls. (I can tell someone's right hand was used for typing most of this example, but yeah, Jkjls.) If all Davids become Jkjls, and I know the frequency of David in my original dataset, or I just know the frequency of David in the world, it's a pretty common name, then I can make a guess as to what Jkjls is in the original data set. So with format-preserving encryption, I'd say its most useful property is its reversibility, not necessarily its protection; it's a way to get a degree of protection while also preserving that reversibility.
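[Editor's note: here's a minimal sketch of that frequency attack against a deterministic mapping. The name frequencies are invented for illustration.]

```python
from collections import Counter

# Deterministically pseudonymized column: every "David" became the same token.
masked_column = ["Jkjls", "Xqwpo", "Jkjls", "Ttyur", "Jkjls", "Xqwpo"]

# Public knowledge: rough popularity of names in the population.
public_frequency = {"David": 0.50, "Maria": 0.33, "Ezra": 0.17}

masked_ranked = [tok for tok, _ in Counter(masked_column).most_common()]
public_ranked = sorted(public_frequency, key=public_frequency.get, reverse=True)

# Match by rank: the most common token is probably the most common name.
guesses = dict(zip(masked_ranked, public_ranked))
print(guesses)  # {'Jkjls': 'David', 'Xqwpo': 'Maria', 'Ttyur': 'Ezra'}
```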
Andrew Colombi (20:23):
In fact, if you don't care about reversibility, scrambling is more protective, because the output is totally random. Then there's the question of statistical data replacement and how it compares. That's also a good question, and you can go down a rabbit hole here; we do, that's what I do every day, my job is to go down this rabbit hole. But for brevity, let's just say you have to be careful. You can't just say, "I'm going to figure out the mean and standard deviation of this data, and use that to create new data." You can certainly do that, and it has a degree of protection associated with it. But if you really look at the mathematical rigor of it, and we'll talk about this a little later, you don't get as many privacy guarantees as you might expect.
Andrew Colombi (21:13):
And I want to focus on that word "guarantees." I mean it in a mathematical sense: what can you prove about the privacy of this? Depending on how exactly you do it, it can be weaker than you might expect. As a simple example, if you just pick uniformly between the min and the max salary in your database, well, you've kind of revealed the min and max of your database. And then you might say, "Okay, well, I know the best paid employee is the CEO, so now I know the CEO's salary."
Andrew Colombi (21:47):
So you need to protect against that kind of thing if you're going to do it right. Good questions. So let's talk about synthesis. This is a good segue, because statistical data replacement has shades of synthesis in it; we're going to revisit it as statistical data generation. Statistical data generation is essentially the same idea as statistical data replacement, where you're collecting statistics about the underlying database, but rather than replacing data, you're generating data from whole cloth. You're just creating new data.
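[Editor's note: a minimal sketch of statistical data generation: collect per-column statistics from the real table, then generate brand-new rows from those statistics alone. The column names, and the simplifying assumption that columns are sampled independently, are illustrative.]

```python
import random
import statistics
from collections import Counter

real_rows = [
    {"job_title": "Engineer", "salary": 95000},
    {"job_title": "Engineer", "salary": 105000},
    {"job_title": "Designer", "salary": 88000},
    {"job_title": "Manager",  "salary": 120000},
]

# Step 1: learn statistics from the real data.
titles = Counter(r["job_title"] for r in real_rows)
salaries = [r["salary"] for r in real_rows]
mu, sigma = statistics.mean(salaries), statistics.stdev(salaries)

# Step 2: generate new rows from whole cloth; no real row is copied.
def generate_row() -> dict:
    title = random.choices(list(titles), weights=list(titles.values()))[0]
    return {"job_title": title, "salary": round(random.gauss(mu, sigma))}

synthetic = [generate_row() for _ in range(10)]
```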
Andrew Colombi (22:28):
And this is a pretty powerful technique as a starting point for creating synthetic data, but you can level it up, and the next level up is deep generative models. There's a lot of excitement around this approach right now in academia and in industry; people are trying it out, we're trying it out, as a way to increase the utility of the data. The basic approach is to look deeply at the data with a machine learning technique. In the same way that Google and other companies are creating image detection algorithms, we can use similar techniques for creating synthetic data.
Andrew Colombi (23:17):
And there's a variety of models that people are using. Variational autoencoders are a really popular technique, as are GANs, generative adversarial networks. People are trying different approaches; it's a very new area, bleeding-edge technology at this point, so there's a lot going on here. But this can be a way of leveling up from basic statistics. And if you look at what generative models are, they're basically more statistics: the progression of statistics to the next level of what our modern computing resources can do.
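[Editor's note: for a flavor of the GAN idea on tabular data, here's a toy generator/discriminator pair trained on a single numeric column, assuming PyTorch is available. Real tabular models of the kind Andrew alludes to handle mixed types and cross-column correlations; this is only the skeleton.]

```python
import torch
import torch.nn as nn

# Stand-in "real" column: salaries drawn from a normal distribution.
real = torch.randn(1000, 1) * 15000 + 90000
mu, sigma = real.mean(), real.std()
real_n = (real - mu) / sigma  # standardize for stable training

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real_batch = real_n[torch.randint(0, len(real_n), (64,))]
    fake_batch = G(torch.randn(64, 8)).detach()
    # Discriminator learns to tell real rows from generated ones.
    loss_d = bce(D(real_batch), torch.ones(64, 1)) + bce(D(fake_batch), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator learns to fool the discriminator.
    loss_g = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

with torch.no_grad():
    synthetic = G(torch.randn(500, 8)) * sigma + mu  # back to salary scale
```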
Andrew Colombi (23:56):
And then, going back a bit, this is worth mentioning; it's a swerve from what we were just talking about, where we used statistics to generate data. Rule-based data generation is still a form of synthesis, you're creating data from scratch, but in this case, rather than examining the underlying database to learn what the data should be, a human inputs what they think the rules of the data are.
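[Editor's note: a minimal sketch of rule-based generation: a human writes the rules, and the generator never looks at real data. These particular rules are invented for illustration; this is not how Mockaroo is configured.]

```python
import random

# Hand-written rules: no real database is consulted.
RULES = {
    "first_name": lambda: random.choice(["Jane", "Marcus", "Priya"]),
    "job_title":  lambda: random.choice(["Engineer", "Analyst", "Designer"]),
    "salary":     lambda: random.randrange(60_000, 160_000, 5_000),
}

def generate(n: int) -> list[dict]:
    return [{field: rule() for field, rule in RULES.items()} for _ in range(n)]

for row in generate(3):
    print(row)
```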
Andrew Colombi (24:24):
And we actually partner with a company called Mockaroo; I definitely suggest you check it out if you're interested in doing some rule-based data generation, they have some great tools for that. So that's a different approach to creating synthetic data. And I promised we'd talk a little more about how to be careful when using statistical data generation, and deep learning too, by the way. How do you formalize that idea of being careful while creating data? Differential privacy is a mathematical framework for evaluating the privacy of a process by which fake data is created.
Andrew Colombi (25:12):
And it's a very powerful framework; we could do a whole hour-long discussion on it, and even that wouldn't be enough to really dive in. But I'll leave it as: it's a way of making guarantees about the privacy of the data, ensuring that when you take some records and include them in a process that creates synthetic data, you haven't compromised the privacy of any of the people participating in that process.
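[Editor's note: the classic building block of differential privacy is the Laplace mechanism, sketched below for a mean query. The epsilon value and clamping bounds are arbitrary examples.]

```python
import random

def dp_mean(values: list[float], epsilon: float, lo: float, hi: float) -> float:
    # Clamp each record so one person's contribution is bounded.
    clamped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clamped) / len(clamped)
    # Sensitivity: the most one record can move the clamped mean.
    sensitivity = (hi - lo) / len(clamped)
    b = sensitivity / epsilon
    # A Laplace(0, b) sample is the difference of two exponential samples.
    noise = random.expovariate(1 / b) - random.expovariate(1 / b)
    return true_mean + noise

salaries = [72000, 85000, 91000, 130000, 65000]
print(dp_mean(salaries, epsilon=1.0, lo=0, hi=200000))  # noisy, private mean
```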
Andrew Colombi (25:48):
So with that, I'd like to switch gears. That's a summary of the most modern techniques for test data creation; now I'd like to talk about data mimicking, which is our approach, Tonic's approach, and what we think is the right approach to test data creation. I want to start the way I started this whole presentation, with why. Why did we go after this? Why did we coin the term data mimicking? Why do we push for it?
Andrew Colombi (26:25):
Well, we knew we needed a system that could preserve production behavior. By that I mean the data would look like production data, feel like production data, have the same properties as production data. We also knew it had to operate at production scale. If you look at academic research in this area, it's predominantly focused on the one-table case: imagine I have one table and I need to create new data for it. And I don't mean to diminish that in any way; it's important research, important work, difficult work.
Andrew Colombi (27:00):
But when you look at the QA use case, which is really where Tonic began, and we spread from there to the other use cases I described earlier, one table is just not going to cut it for an application, right? One database probably isn't going to cut it. You need multiple databases, you need thousands of tables, you may need petabytes of data. So we knew from the start we needed an approach that could handle realistic industry scale. And that, of course, meant being able to handle all the use cases.
Andrew Colombi (27:35):
What do I mean by all the use cases? I mean functional testing, functional verification, integration testing, demo use cases, customer debugging; all of these were use cases we wanted to support in our approach. And this led us to the conclusion that it wasn't going to be one-size-fits-all. There was no magic bullet, no single perfect data synthesis or data anonymization algorithm that, if we could just uncover it, everything else would follow.
Andrew Colombi (28:12):
Rather, it's best-of-breed: taking different approaches, historical as well as modern, and bringing them all together into a platform where you can use format-preserving encryption on one table and a deep generative model on another. That's a real challenge. You can't bring those two very different ideas together in one spot without solving some real technical problems along the way.
Andrew Colombi (28:40):
So that's what we mean when we say data mimicking: starting with the goals, asking what approaches and technologies we need to get there, and maximizing the utility/privacy trade-off to give our customers, and anyone who needs this kind of data, the highest quality data at the highest level of privacy. So let's look at some... Do you have a question?
Chiara Colombi (29:09):
Yeah, just to clarify: does differential privacy only apply to synthesis, or does it also apply to anonymization techniques?
Andrew Colombi (29:18):
It can apply to anonymization techniques, if you apply it broadly. Let me step back and say: you can apply the concept of differential privacy to any randomized algorithm. So if you have a randomized algorithm participating in some anonymization, you can think about the differential privacy of that algorithm, which is a really cool thing, by the way. The fact that differential privacy can be applied to essentially any randomized anonymization algorithm makes it very powerful and useful.
Andrew Colombi (29:47):
But I would say the main way to think about it is probably through the lens of data synthesis; at least, that's how I mostly think about it. So with that, let's talk about some pros and cons. If we start with anonymization: the utility of anonymization can certainly be very, very high. But as we talked about, GDPR, CCPA, SOC 2, these various regulations and standards frequently call out anonymization specifically as not being sufficient.
Andrew Colombi (30:27):
So you need to be careful about how you apply it. I think it can be sufficient in certain circumstances, if you take it far enough, all the way down to statistical replacement of any field that could possibly be used to reidentify a person. But it can be very difficult to create sufficiently private data using anonymization alone.
Andrew Colombi (30:51):
With synthesis, it can be very difficult to preserve the utility of the data. You can get the privacy up there, but the utility is hard to preserve. And that's why we arrived at mimicking, which helps us maximize utility and privacy together. But it's extremely difficult to build, because it's no longer one-size-fits-all: you've got to build the format-preserving encryption, the synthesis, the subsetter, all these different pieces, and put them together.
Andrew Colombi (31:24):
We're getting close to the end here; I just wanted to talk about why mimicked data is safer and more useful than traditional test data. The first reason is regulatory compliance: with mimicry, you can bring production-like data over to staging and satisfy CCPA and GDPR. Likewise, data security is much improved: if you use mimicked data, you're reducing your exposure, because less production data is exposed to a broader environment.
Andrew Colombi (32:04):
My slides are... Here we go. The next one I wanted to talk about is data sharing: with mimicked data, it's much easier to share data with outsourced teams, which solves the data sovereignty problem. Similarly with distributed teams: if you want to take data out of the EU into other countries, mimicked data can really help there. And then the build-versus-buy question. If you're looking at this, you're definitely thinking, "Maybe it makes more sense for me to build this in-house; I can make it more customized to my needs."
Andrew Colombi (32:40):
But the challenges are very real. There's a lot to implement if you go down this route, and ultimately that mass of software will require massive support and maintenance over time, which will be very expensive. And you have to get it right. Format-preserving encryption is not easy to implement, differential privacy is not easy to implement, statistical data replacement is not easy to implement; many of these techniques are quite challenging. You need experts, people who are very, very familiar with these techniques, to do it in a way that provides the privacy guarantees you need.
Andrew Colombi (33:18):
And just looking at what Tonic has done over the years, here's a list, and this isn't even a complete list, just to give you a sense of all the different things we've incorporated into our platform to enable people to get what they need out of their production data without compromising the privacy of their users. And I'll end with a quote from one of our customers that I think is amazing. I'll just read the first sentence, because I think it summarizes it: "Nothing we tried in-house is comparable to what we're doing now with Tonic."
Andrew Colombi (33:57):
So with that, I'll close and open the floor to questions, if anyone has any. I do love answering questions; hopefully that came through with the questions I already answered. So much fun.
Chiara Colombi (34:08):
It did. It was great to dive deeper into a couple of those subjects during the presentation with those questions. And I do have some more for you here; please keep them coming in the Q&A or the chat, either way. One came up when you were talking about rule-based data generation: how specific can you get, and how much of a lift is it to create truly representative data using the rule-based approach?
Andrew Colombi (34:31):
It's very hard. It depends on how representative you want to get, but at some point you end up finding yourself recoding the behaviors of your users in your application: you're coding up what it looks like for a user to log in, what it looks like for a user to create an order, and all these things become too much.
Andrew Colombi (34:55):
And my view is that ultimately, with rule-based, you can't get there. To give a little history of Tonic: the four founders, myself included, all worked at Palantir. And we had this problem of having sensitive data that we wanted to share with other people in the company so that we could get our jobs done. Our first attempt was a rules-based attempt, and we quickly figured out that it was going to be basically impossible to recreate the richness that exists in real data.
Andrew Colombi (35:38):
So we quickly moved over to statistical data replacement, mock data, and synthetic data as a way of doing it. So yeah, I view it as not really possible in the long run, if you really want very, very high quality data out of it. But it's still an approach that can be useful; it just has its limits.
Chiara Colombi (36:00):
Then a question that just came through: would you mind providing a high-level walkthrough of the process of using Tonic to generate data?
Andrew Colombi (36:06):
Yeah, sure. Really, what you're doing with Tonic as a user is teaching Tonic about your data, and you're doing that broadly throughout your data. The first thing Tonic does is scan your data looking for what it thinks is sensitive information: names, addresses, IP addresses, that kind of thing. And...
Chiara Colombi (36:37):
How does it scan your data?
Andrew Colombi (36:37):
How does it scan it?
Chiara Colombi (36:40):
What is the setup?
Andrew Colombi (36:40):
The first thing you do is connect Tonic to your production data, or a copy of it; frequently people use a backup or a disaster-recovery copy of their production data. And by the way, Tonic is installed on-prem, so our systems are not connecting directly to your systems. Rather, you install Tonic in your own VPC or on your own metal, then give Tonic the creds to your database and connect it. Once connected, it scans the database by sampling rows from each table.
Andrew Colombi (37:18):
And we have a variety of algorithms, some simple regexes, some machine learning as well, that scan your data for potentially sensitive information and flag it for you. Between what the scan finds and your own knowledge of the database, you tell Tonic where the sensitive information is and how it needs to be addressed, building up what we call the model of your database. Now, when people talk about a model in machine learning, they usually mean weights that have been learned.
Andrew Colombi (37:54):
Tonic's model is more of a meta-model, an aggregate model, a model of models. This fits into the mimicry idea of building a best-of-breed solution: we bring together a lot of different sub-models that each model individual parts of your database. So you might have one part of your database addressed with a deep generative network, and another part generated with format-preserving encryption.
Andrew Colombi (38:24):
You configure that together, and once it's configured, you run Tonic on a weekly or daily basis to create new data for your staging environment, or even more often than daily; we have some customers creating data every couple of hours. It just depends on your needs: how quickly is your database changing, and how often do you want updated data?
Chiara Colombi (38:42):
I think a good follow-up question to this: you talk about generators, and you say Tonic has 40-plus generators. What do you mean by a generator in that context? To give people an idea of what it'll look like in the UI.
Andrew Colombi (38:57):
Yeah. So what a generator is... Chiara, do you think I should fire up the UI and actually show it?
Chiara Colombi (39:05):
Just a quick image might help.
Andrew Colombi (39:06):
Yeah, sure. Let's do that. Let me stop sharing for a second. Okay, now let me share my screen. Can you see that?
Chiara Colombi (39:19):
Yeah.
Andrew Colombi (39:20):
Great. So here's Tonic. This is actually the result of a scan, on a test database of course. The scan has identified these various fields that contain potentially sensitive information. And a generator is one piece of the overall model, a way of working with a field. So a generator might be: how do we deal with this first name? We can say, let's make that another first name. So here's an example of mock data.
Andrew Colombi (40:00):
We talked about mock data; we also talked about scrambling data, so let's scramble some data. What does that look like? We can use our character scramble, and here it is. We talked about statistical data replacement: we may want to do some replacement of these genders, so we can use what we call the categorical generator, which samples the underlying database to create new data in the same distribution as the original data.
Chiara Colombi (40:29):
That's very helpful, thank you. Okay, another question that came in: how do you handle repeatability in mimicking for a demo environment, where you need to update data due to product releases and your salespeople always want repeatable data?
Andrew Colombi (40:44):
Yeah, that's a good question. We do have a capability called consistency, which enables you to create data the same way each time. Here we've got names; the first name is Alexandra, and I can turn on consistency. What that does is always create the same output for the same input. Maybe a better example is Melissa: rows 20 and 21 both have Melissa, and if I toggle the preview, you can see they both map to Rosella, because consistency is enabled.
Andrew Colombi (41:31):
You might not use consistency for names; you might want it for a key, like the customer key, for example, so that new data gets created with the same key each time. That can be a useful way of doing it. In terms of having a guarantee that the data will be exactly the same, it won't be, and that's kind of the point: the reason you're creating a refreshed copy is that you want the data to be updated over time. So the story of a given user will change a little as the input data changes, but overall we think you can still create pretty compelling demos with this approach.
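[Editor's note: consistency of this kind is commonly implemented as a keyed, deterministic mapping. Here's a minimal sketch of the concept; this illustrates the idea, not Tonic's implementation.]

```python
import hashlib
import hmac

NAMES = ["Rosella", "Jane", "Priya", "Marcus", "Aiko"]
SECRET = b"per-workspace-secret"

def consistent_first_name(original: str) -> str:
    # Keyed hash: the same input always maps to the same output,
    # across rows and across runs, without storing a lookup table.
    digest = hmac.new(SECRET, original.lower().encode(), hashlib.sha256).digest()
    return NAMES[int.from_bytes(digest[:4], "big") % len(NAMES)]

print(consistent_first_name("Melissa"))  # same result every time
print(consistent_first_name("Melissa"))
```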
Chiara Colombi (42:22):
Great. Another question just came through: can I manage models and generate data through APIs?
Andrew Colombi (42:27):
Yeah. Everything I do in the UI is also available through the API, and we have customers using that regularly. The most common API call corresponds to this generate data button over here: I can click this button and it'll create my data. But people rarely want to click a button every day, so instead you can use the API. We have a way of generating API tokens, which you can use to authenticate against the API. But yeah, all of this is available through the API as well.
Chiara Colombi (43:04):
Great, thank you. You mentioned a GANs generator in Tonic; could you explain more about how that works?
Andrew Colombi (43:14):
There's a lot there, but essentially what it's doing is training a model on specific columns in a table: you take those columns and train the model on them. The specific approaches we're using are still very much an R&D effort; I can say we're using variational autoencoders as well as generative adversarial networks. But going much deeper would be a treatise unto itself.
Chiara Colombi (44:04):
So it's taking multiple columns at once and applying that generator?
Andrew Colombi (44:09):
Yeah, exactly. One of the whole advantages of using these techniques is that you can recreate structures in the data that you're not aware of as a human. A machine can pick up on really subtle patterns in the data that a human may not see, but that are important to being able to use your data later on. For example, if you want to do BI, business intelligence, on the data created by Tonic, meaning some analysis, you need all the subtle relationships in the output data that were in the input data. Being able to operate across multiple columns, and find the relationships among those columns automatically, is one of the key features of using GANs through Tonic.
Chiara Colombi (45:09):
Okay. Another question that came through, one I feel like we hear a lot: how does a system like Tonic's work with data that has already been encrypted?
Andrew Colombi (45:18):
Interesting. That comes up a lot; people like to encrypt their data, and that's great. We actually have a specific plug-in infrastructure for this. I don't know if I can demo it in this demo environment; no, I cannot. But you can plug your encryption approach into Tonic, and we frequently help customers with this. Essentially, you tell Tonic how to decrypt a column, and there's an extra dropdown in this UI for the encryption technique, and you just select whatever yours is. You can basically treat it as a pre- and post-processor for Tonic: Tonic pre-processes the data to decrypt it, and then after the anonymization is done, re-encrypts it in a post-processing step.
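[Editor's note: conceptually, the pipeline described here looks like the sketch below. The function names are hypothetical stand-ins, not Tonic's plug-in API.]

```python
from typing import Callable

def process_encrypted_column(
    values: list[bytes],
    decrypt: Callable[[bytes], str],    # customer-supplied plug-in
    anonymize: Callable[[str], str],    # the chosen generator
    encrypt: Callable[[str], bytes],    # customer-supplied plug-in
) -> list[bytes]:
    # Pre-process (decrypt), transform the plaintext, post-process (re-encrypt).
    return [encrypt(anonymize(decrypt(v))) for v in values]
```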
Chiara Colombi (46:09):
This is on the opposite end: what happens in data mimicking if your source data contains inaccuracies, or is broken in some way?
Andrew Colombi (46:17):
Well, that's a good test case now, isn't it? We like to keep that stuff. If you have broken data, and what is broken data? It's data in your database that you didn't expect, and you've got to make sure your application can handle it. Here we've got a gender with null; that's not broken data, maybe the field just wasn't filled out. The way Tonic treats that: it seems like a good opportunity to mess with your system, so if it exists in your data, Tonic is going to try to recreate it. If you want to clean up your data before you pass it to Tonic, you're welcome to do that. But we view it as our job to recreate those idiosyncrasies rather than smooth them over.
Chiara Colombi (47:05):
That makes sense. I think the only other question I see coming through right now is: how can we get in touch? Maybe you could put up the last slide of the...
Andrew Colombi (47:16):
Yeah, sure. Great, how about that as an answer?
Chiara Colombi (47:22):
That's perfect. Oh, we do have another question that came through: to deal with encrypted data, do I need to share the decryption and encryption keys with Tonic?
Andrew Colombi (47:31):
Not with us directly. Like I said earlier, we're on-prem, so you're going to share the keys with our application installed on your premises, but we as a company will never see your encryption keys.
Chiara Colombi (47:48):
Okay, so none of the data that is accessed by Tonic is accessed by us.
Andrew Colombi (47:54):
Correct, none of the data accessed by our application is accessed by us. You can run Tonic completely isolated in an air-gapped environment; in fact, several of our customers do. So you really can be confident that Tonic the company is not seeing your data or your encryption keys or anything like that.
Chiara Colombi (48:11):
I don't see any more questions in the Q&A or the chat, but if you do have one, please go ahead and ask; if not, you can also reach out to us later if you come up with something. We're at hello@tonic.ai, we're on Twitter @tonicfakedata, and our website is tonic.ai. Excellent. Well, thank you, Andrew, this was a really great presentation. We had some awesome questions come through, and thanks to everyone for joining us. Join us next time; we've got more webinars in the pipeline as well. Thank you.
Andrew Colombi (48:46):
Thank you. Yep, thanks, bye.