We've identified the 9 most common ways your synthetic data can fail you — and the solutions you need to ensure safety and utility in your test data.
There are wrong ways to fake your data. These data generation pitfalls can break your testing or, worse, leak sensitive data into unsecured environments.
Join us for this live webinar to gain insight into generating realism in:
- time series data and event pipelines,
- categorical data distributions,
- consistency in JSON blobs,
- outliers at risk of re-identification,
- and working across SQL and NoSQL databases.
Don't fail at faking it. We're here to help.
Be sure to download the companion ebook Fake Data Anti-patterns for the fullest experience.
Omed Habib (00:00):
All right. I think we have a pretty good turnout. Should we jump in and get started?
Chiara Colombi (00:05):
Yeah. Sounds great. Hi everyone. Thank you for joining us today. Today, we're exploring a subject that isn't covered nearly enough, I think, in the synthetic data space, and that's fake data anti-patterns, by which we mean how your synthetic data can fail you. The content of this event is based on an ebook that we recently published by the same name. So, if you haven't already downloaded that, be sure to hop on over to tonic.ai; you'll see kind of a news flash banner toward the top of the homepage, and there's a hyperlink there that says, "Access it here." That's where you can get your copy. If you've tuned into our events before, you'll kind of know how this goes. I'm your host, Chiara Colombi, and I'm happy to voice your questions. You can put them in the chat or the Q&A. I'm also stepping into the role of speaker today, and I'm very happy to be joined by my colleague, Omed Habib.
Chiara Colombi (00:51):
He is our VP of marketing and he will be co-presenting today's session. So, like I said, we'll both be speaking. We'll both be looking out for your questions. Post them in the Q&A, in the chat, either way and we will get them answered as they come along. All right. I will pass the mic over to Omed.
Omed Habib (01:09):
All right, cool. Thank you for that warm intro, Chiara. So, a little bit of history behind this ebook: we obviously are Tonic, we are the fake data company. We are in the business of fake data, and in doing so, we saw some patterns of how you don't want to create fake data. So, this is not an overview of our product; it's more an overview of different practices, or what we call anti-practices or anti-patterns, how you don't want to do it. So, we have the anti-pattern, and then we also have, here's how you actually want to do it, how to solve for the anti-pattern. Before we jump in, there are four assumptions that we're making about you in the audience. The first assumption is that you are an engineer or a software developer within some context. The second assumption that we're making here is that, as a function of being a software engineer, you are probably working within one of the stages of the SDLC.
Omed Habib (02:05):
So, whether you're a production programmer building features on your local laptop or a sandbox, or you're a QA engineer working in the QA stage, or staging, capacity testing, or pre-production, you are involved at some part of a pre-production environment. Which gets me to my third assumption: you're working on an environment that has to mimic production as much as possible. Which gets me to my fourth assumption, which is that better test data makes your life a lot easier. Okay. Those are the four assumptions. It's a long way of saying you're probably a software developer. All right, cool. Without further ado, let's go ahead and jump in here. Let's hop into not the overview, but our first anti-pattern. What do you think, Chiara?
Chiara Colombi (03:05):
Yeah, sounds great.
Omed Habib (03:09):
Okay. So, I'm going to set up the anti-pattern here. Let me just kind of zoom in here so I don't distract you all with all that content. Again, if you haven't downloaded the ebook, we'll include a link in the chat. It's also on the website; you can download the PDF. Okay. This is the anti-pattern. So, imagine you're a software developer and you are working with test data, and in doing so you have dates that are attributed to your customers or your fake users. Now, you can create arbitrary test data, in this case, arbitrary dates. The problem here is that sometimes the dates are correlated. So for example, on my screen here, you have a date of birth. If you create test data that's arbitrary and random, you're going to have a story that doesn't really make sense. So, the date of birth is 2029, but the ID actually expires in 1995. That does not follow a logical, chronological order of things in real life.
Omed Habib (04:13):
So, the anti-pattern here is that you're creating arbitrary test data with dates that don't actually make sense in the real world. That can break your testing. You could possibly have some kind of integration test that checks to see if someone is 18 years or older. So, you're checking today's date against the date of birth, or something like that, to see if the math adds up. Regardless of whether you're de-identifying production data or creating test data from scratch, you're going to want dates that make sense even in an arbitrary environment like your pre-prod testing environments. Chiara's going to chat about how you do want to do this.
Chiara Colombi (04:56):
Right. So, at the bare minimum, you want the data within a single column to look like the original, right? So, your algorithm needs to be able to replicate your real-world distribution of dates. Let's say the oldest timestamp in your data is from November 22nd, 2001, and your most recent is yesterday. That's the range that you're dealing with. Your fake data is going to feel more legitimate and be more useful if it falls within that same range. Not only that, but if it falls within the range mirroring the same distribution. So, that's the first step. The next step is that you want this to work across columns. So, you want to be able to link column A of timestamp data with column B and ideally with columns C, D, E and so on. So, you have a user being created, taking an action, taking a follow-up action.
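To make that concrete, here's a minimal Python sketch of the kind of date logic being described, with the source range and column names assumed for illustration. It samples uniformly for simplicity, where a real generator would mirror the source distribution, and it keeps the linked date columns in chronological order:

```python
import random
from datetime import datetime, timedelta

# Range observed in the (hypothetical) source data.
OLDEST = datetime(2001, 11, 22)
NEWEST = datetime.now()

def random_timestamp(start, end):
    """Sample a timestamp between start and end (uniform here for simplicity;
    a real generator would mirror the source distribution instead)."""
    seconds = random.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=seconds)

def fake_user_dates():
    """Linked date columns that respect real-world ordering:
    created <= first action <= follow-up action, all inside the source range."""
    created = random_timestamp(OLDEST, NEWEST)
    first_action = random_timestamp(created, NEWEST)
    follow_up = random_timestamp(first_action, NEWEST)
    return {"created_at": created, "first_action_at": first_action, "follow_up_at": follow_up}

print(fake_user_dates())
```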
Chiara Colombi (05:43):
And the order is such that, like Omed was saying, if you run a test, your program's not going to throw any errors. Where things get really interesting, though, is when you go down to the row-by-row level. And the use case I'm talking about here is true event pipelines. So, let's say you're gathering Fitbit data from hundreds of users and it's logging the heart rate once a minute. What you're going to see is that same user appear in multiple rows as the data comes in. So, it's no longer a column-to-column relationship that you need to mimic, it's a row-to-row-to-row relationship, and that is a very tough nut to crack. I'm just going to leave it there as something to consider if that is the type of data you're working with. I'm sure our data scientists would have a lot more to say about it, but it's also enough to fill an entire webinar of its own.
Omed Habib (06:32):
Yeah. Awesome. You're going to start to notice that in order to create test data that mimics the real world, there's a lot of engineering effort and almost an infinite number of nuanced use cases or scenarios that you didn't think about. Yeah. Synthetic data is a lot more complicated and complex than even I originally thought it was before I joined Tonic. All right. Let's jump into anti-pattern number two. Okay. This one's interesting. All right. Here's the anti-pattern. Okay. So, in your database, you're going to have data that relates to other data. So, you could have, in this scenario, this example, let me actually scroll over so you see what we're talking about here. So, this is a manager and employee relationship. The hierarchy usually is: an employee has one manager; one manager has many employees, or directs, okay?
Omed Habib (07:29):
So, that's the natural hierarchy or topology or taxonomy. If you're creating random, arbitrary test data, you could run into a scenario where you have, in this case, the anti-pattern: a data set that has the reverse dependencies. So, an employee has multiple managers, right? That's the anti-pattern. Chiara will [crosstalk 00:07:59] jump into how you do want to do this correctly.
Chiara Colombi (08:02):
Yeah. It's a question of the distributions being totally off. And there are many instances when this becomes important and more complex than it initially may seem. On the surface, it's similar to the time series data. You've got a distribution of data, in this case categories, and you want to recreate it, and wherever the ratios of those categories are related to another set of categories, you want to be able to link them. So, like Omed was saying, if you're looking at employee titles across a department, maybe the ratios of account execs to sales reps within a sales organization, it needs to make sense in your output. And in order to do that, you've got to be able to look at the distributions and recreate them. But here we're not just talking about the failings of simply randomized data and randomized distributions. Defining ratios and linking columns is critical in realistically mimicking categorical data like this, but we can also take things a step further.
Chiara Colombi (08:56):
There may be instances when you want to create custom categories within your fake data. So for example, you want to include a specific set of medical codes that doesn't currently exist in your data. The ideal categorical generator here would allow you to define the categories. So, add categories that aren't already in your source data and define the ratios at which they occur as well. In this way, you can customize your categorical data, even build logic around when certain categories should be used in relation to other data in your dataset. And this enables you to really test out specific use cases or scenarios more realistically in your product and cover use cases outside of what your data currently represents.
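As a rough illustration of that idea, here's a hedged sketch of a categorical generator; the titles, codes, and ratios below are made up, and a real generator would measure the source ratios from the data rather than hard-code them:

```python
import random

# Ratios measured from the (hypothetical) source data.
source_ratios = {"Account Executive": 0.25, "Sales Rep": 0.60, "Sales Manager": 0.15}

# Custom categories that don't exist in the source yet, with user-defined ratios.
custom_codes = {"CODE-A12": 0.7, "CODE-B34": 0.3}

def categorical_generator(ratios, n):
    """Sample n values whose proportions roughly mirror the defined ratios."""
    categories = list(ratios)
    weights = list(ratios.values())
    return random.choices(categories, weights=weights, k=n)

fake_titles = categorical_generator(source_ratios, 1_000)
fake_codes = categorical_generator(custom_codes, 1_000)
print(fake_titles[:5], fake_codes[:5])
```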
Omed Habib (09:38):
Yes. So, I'm just going to zoom in really quick here. I'm just going to give you guys a really dumbed-down example. A common example might be, let's say, gender. I'm totally geeking out here with my iPad, but you have millions of rows and you have two variables, or you could have multiple variables; for simplicity's sake, we'll just use two, male and female, right? And so, these are different rows. Let's say you have a million rows, and the mathematical distribution of this is, say, 40% male and 60% female. Now, the anti-pattern here is not just the example that I used earlier, where you actually invert the relationship. In this case, you have an incorrect proportion in your test environment. Let's say your fake data creates 100% male, 0% female. That doesn't exactly represent how your production environment works. And so, you could have integration tests that end up not actually executing on a female gender.
Omed Habib (10:37):
And that's just one of millions of examples. The closer your test data is to your production data, the more accurate your integration tests are, and the sooner you can catch those bugs in pre-prod rather than having them go into production, where you're scrambling to figure out what actually broke. Cool.
Chiara Colombi (10:57):
Yeah. There are so many scenarios we could deep dive into for each of these anti-patterns.
Omed Habib (11:03):
All right. Jumping into anti-pattern three, unmapped relationships. Okay. So, this one, I'm just going to zoom in on the anti-pattern really quick so I don't distract you. All right. So this one's interesting. You have two columns. One column is bonus, one column is income. So, let's say, for example, you're a finance company or healthcare. For some reason, you're keeping track of people's income and people's bonus. Now, those two numbers are usually supposed to correlate, right? So, it's like, our bonus structure is 10%, easy math. If you make $100,000, your bonus is $10,000. You make a million dollars, your bonus is $100,000. So, if you plot this on a graph, you can actually see a pattern start to emerge, okay? If you were to create random, arbitrary data out of this, you lose that correlation.
Omed Habib (11:47):
So, your data now is all over the graph. You want to maintain the mathematical fidelity of this data because, again, your tests could depend on that correlation. And so, if there's a bug that can occur based off of that type of data, you're going to miss it and it's going to slip through into production. The way that you want to solve for this, the pattern, would look like this, right? This is the actual distribution of your data. Let me hand it over to Chiara.
Chiara Colombi (12:22):
Yeah. And I mean, it sounds like we're saying a lot of the similar things over and over, because this is really part of what we've been talking about all along: few columns in a data set stand alone, and realistic data is all about the relationships. So, you need to be able to tell your data generator what is related, but I'm going to up-level that. Better yet, you need your data generator to automatically recognize relationships within your data. And I'm going to talk about these two approaches one and then the next. So, approach number one: you define the rules. In this case, you want to be able to define which columns are related. The way we do that in Tonic is by column linking. So, you can link as many columns as you need to that are making use of the same underlying data generation algorithm, what we call a data generator.
Chiara Colombi (13:04):
So, this might be categorical data, like the employee title to department to city. Yes, you could use a city as a category in order to mimic relationships. Or, like the case of our anti-pattern here, you might have continuous data, numbers that need to be recreated along the same distribution, and you're working with multiple sets and distributions. So, income linked to bonus. We might also want to link that to years of employment, as something that would impact both of those numbers. And that's what really enables you to make realistic data here. So, I said there are two approaches. You define the rules; approach number two is your data generator automatically recognizing these relationships, and this involves bringing your data scientists to the table. The reason I say that is it takes a much greater degree of complexity to build a data generator that automatically recognizes and recreates relationships across columns in your data. You most likely want to make use of neural networks, whether that's a generative adversarial network, a.k.a. GAN, or a variational autoencoder, a.k.a. VAE. We're getting into the nitty gritty of data science here.
Chiara Colombi (14:08):
I like to geek out about this stuff, and we should probably bring on a data scientist to really geek out about it. But anyway, once you've got that type of generator in place, it works beautifully. Instead of having to manually say, this column is linked to this column using this generator, and this column is linked to that column using this generator, you simply point your network generator to all of the columns of interest and let it do all the relationship mapping for you. This can be built to work across numeric, categorical, and location data, so you're not using a different generator for each of these. In Tonic, just by way of example, we use VAEs, and our generator is called the Smart Linking generator. So, you may have seen some articles on the blog about that as well. I'm just going to quickly pause. I don't see any questions coming through yet, but do feel free to ask those at any time.
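By way of a toy illustration only, not Tonic's Smart Linking generator or a VAE, here is one simple way a generator could "learn" cross-column relationships automatically: fit a joint distribution over the linked numeric columns and sample from it. The columns and numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linked source columns: income, bonus (~10% of income), years employed.
income = rng.normal(100_000, 30_000, 10_000)
bonus = 0.10 * income + rng.normal(0, 2_000, 10_000)
years = np.clip(income / 20_000 + rng.normal(0, 1, 10_000), 0, None)
source = np.column_stack([income, bonus, years])

# "Learn" the relationships by fitting a joint distribution over the linked columns...
mean, cov = source.mean(axis=0), np.cov(source, rowvar=False)

# ...then sample synthetic rows that preserve the cross-column correlations.
synthetic = rng.multivariate_normal(mean, cov, size=10_000)

print(round(np.corrcoef(source[:, 0], source[:, 1])[0, 1], 3))      # correlation in source
print(round(np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1], 3)) # preserved in output
```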
Omed Habib (14:59):
Cool, awesome. Thank you for that, Chiara. That was a fantastic explanation. I definitely do think you explain Tonic and just the overall science of synthetic data far better than I do. I'm just always impressed with your ability to understand the science under the hood. All right, let's jump-
Chiara Colombi (15:22):
All those nitty gritty things, I love it.
Omed Habib (15:23):
Yeah.
Chiara Colombi (15:25):
I'm like a wannabe data scientist, I guess. Thank you.
Omed Habib (15:28):
All right. Anti-pattern number four. So, this one is inconsistent transformations. I'm going to explain to you what that means. Let me just kind of zoom in here on what the anti-pattern looks like. All right. You have a table here: customer_id, then columns called first_name, last_name, and form_submission. You'll notice, though, that the form_submission is a JSON object. And so, if you're de-identifying from production, or maybe even creating data randomly from scratch, the challenge you have here is, let's say Jon Stevens is the random name that you're creating. Well, that name is going to occur in more places in that row than just in the first name column. So, in the form submission, your generator somehow created a first name that's also random and arbitrary, but it doesn't match the name that you're using in that column.
Omed Habib (16:26):
And that first name could repeat itself multiple times in that row depending on what kind of columns you have. What you don't want is a scenario where the same first name isn't used consistently throughout that test row. Here you have Ali, but that same row is also using a different first name, Tony. So, what does the pattern look like?
Chiara Colombi (16:54):
Yeah. So, the fact of the matter here is... is that a good time for me to jump in, Omed?
Omed Habib (17:01):
Yes. That was my transition to you, which you caught pretty well.
Chiara Colombi (17:06):
Cool. Next time I won't ask. So, the fact of the matter here is that high quality fake data requires consistency. And by consistency, we mean that the same input will always map to the same output throughout your database and even across your databases. Practically speaking, your data de-identification infrastructure needs a way of tagging a particular input so that it can find it in all instances of your source data and consistently map it to the same output every time. So, in the case of our anti-pattern example here, we've got that form. Consistency is ensuring that your customers' fake names are faked in the same way wherever they appear, whether that's in the first_name column, in the JSON blob, in a first_name column in another table, or in another database entirely. We use first name as an easy example to show what we mean here, but consistency might not be critical when it comes to names; it is critical when it comes to unique identifiers like customer IDs or email addresses. Primary keys? Obviously.
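Here is one minimal way consistency could be implemented, as a sketch rather than a description of Tonic's internals: derive a deterministic seed from each input with a keyed hash, so the same input always produces the same output. The key and the name list are hypothetical:

```python
import hashlib
import hmac
import random

SECRET_KEY = b"keep-this-out-of-source-control"  # assumed per-workspace secret
FIRST_NAMES = ["Ali", "Maya", "Jon", "Priya", "Lucas", "Sofia"]

def consistent_fake_first_name(real_value: str) -> str:
    """Deterministically map the same input to the same fake output, wherever it
    appears: a first_name column, a JSON blob, another table, another database."""
    digest = hmac.new(SECRET_KEY, real_value.encode(), hashlib.sha256).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).choice(FIRST_NAMES)

# The same source name always gets the same replacement.
assert consistent_fake_first_name("Jon") == consistent_fake_first_name("Jon")
print(consistent_fake_first_name("Jon"), consistent_fake_first_name("Tony"))
```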
Chiara Colombi (18:08):
It's what enables you to fully anonymize these fields and still use them in a join, or maybe match duplicate data across databases. And like much of what we've already discussed, it's a pretty complex problem to solve for. And I kind of highlight that because it's funny: in Tonic, all you see is this little toggle that says turn consistency on or off. It's in the model when you're configuring your generator. It looks really simple, it looks trivial, but it's doing serious, heavy lifting on the back end, which is worth calling out. Oh, and I do have a question that came through about Smart Linking that I want to cover before we jump to the next anti-pattern. The question is: does Smart Linking work across all columns in a table, or only the columns you choose, and can it work across tables? So, the way it works is you decide which columns you point it to.
Chiara Colombi (19:02):
And currently there are some limitations on what types of columns you can point it to. I believe it's categorical, continuous, and, like I mentioned, location data as well. And Smart Linking works on a table by table basis. So, you can use it for one set of columns within a table. That's my understanding of it to date, but of course it's constantly being developed. Omed, is there anything you would add to that?
Omed Habib (19:27):
There's a fourth data type, which I want to say is date. I can double check that; it's in the product docs if anyone is interested. All right. Fifth anti-pattern: sensitive data leakage. Okay. So, let's say, I'm just going to zoom in here, let's say that you were to create your own test data generator, and you have an environment that has hundreds or thousands of tables. Each table could have an average of, I don't know, 10, 15, 20 different columns. So, it's going to be difficult for you to go through every single column, through every table, through every single database, to figure out, all right, what actually is sensitive data. Going through it manually is probably the most thorough way to do it, but short of that, you're going to write some kind of automation script that's going to scan through all of this. And in your automation script, you're going to be looking for things like first name.
Omed Habib (20:28):
You're going to be looking for things like date of birth, credit card number, right? So, it's an easy way for you to catch what is supposed to be sensitive data. This is pretty important if you are using production data in your pre-prod environments but masking that data. The problem here, though, is that you may have a column that isn't predictably sensitive. So, the example we're using here is student_db, right? That could be like student database, it could be whatever, but it's not intuitively date of birth, right? So, what ends up happening is you caught the student first name, student_fn, which is also kind of not super predictable, but let's say you caught it. You caught the student last name, but student_db ended up slipping through. So, while you de-identified the first name, the last name, the email, and the phone number, the date of birth is actually passing through.
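To show how that kind of scan can go wrong, here's a naive, name-based sketch; the column names follow the student example from the ebook, and the patterns are illustrative only:

```python
import re

# Naive scan: flag columns whose names match known sensitive patterns.
SENSITIVE_NAME_PATTERNS = [
    r"first.?name", r"last.?name", r"(^|_)fn($|_)", r"(^|_)ln($|_)",
    r"date.?of.?birth", r"(^|_)dob($|_)", r"email", r"phone", r"credit.?card", r"ssn",
]

def flag_by_column_name(columns):
    flagged = set()
    for col in columns:
        if any(re.search(p, col, re.IGNORECASE) for p in SENSITIVE_NAME_PATTERNS):
            flagged.add(col)
    return flagged

columns = ["student_fn", "student_last_name", "student_email", "student_phone", "student_db"]
print(flag_by_column_name(columns))
# student_db (actually the date of birth) slips straight through the scan.
```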
Omed Habib (21:32):
So, why is this a challenge, right? It's like, "Well, I de-identified everything else. Why is date of birth such a big deal?" The problem is that if hackers obtain your test data, you want it to not be reverse engineerable, right? But what they can do is use reference data to figure out exactly who a person in your test database is. So, if the original data set contains reference data, then your test data is not 100% protected. This is only relevant if you're using production data and de-identifying it. All right. So, there's a common, very famous story about how this actually happened. The company was Netflix.
Omed Habib (22:33):
I don't know if anybody here recalls, there was a contest that Netflix ran. I think it was 10, 15 years ago, where they offered a large sum of money to anybody who could create a better prediction algorithm than what they had. And so, what they did is they took their entire database of all their users and viewing activity and they de-identified it. Okay. So, this is actual production data and it was de-identified, right? This is a famous company with practically unlimited computer science resources. They de-identified the data and said, "No, the data's de-identified. We're protecting privacy." Which they did, I mean, they did their best in doing so. Two researchers got together and said, "We think we could probably crack this." And using publicly available reference data, they were able to reverse engineer Netflix's public data set, which is pretty scary. I mean, this was intentionally de-identified to give to the public, right?
Omed Habib (23:34):
Most companies, including yours, are probably not assuming that the public is going to see their data anyway, right? So, it's not like you're intentionally exposing it, but your data may not be as protected as you think. Anyway, so that's the anti-pattern, and a pretty scary story that comes with it.
Chiara Colombi (23:50):
Yeah. It just goes to show how careful your de-identification practices have to be, and how just a small amount of reference data within the data that you've de-identified or anonymized can be used to link it all back and uncover everybody in your data set. So, it seems like a pretty obvious solution: you need a system in place that identifies and flags PII/PHI, whatever is deemed sensitive in your data. But it's actually not just that one system. You really need two systems: one that automatically flags the sensitive data, and one that also allows your users to define rules around what should be flagged as sensitive, to make sure that data that doesn't seem sensitive, but is in your case, is always de-identified thoroughly.
Chiara Colombi (24:35):
So again, I'm going to talk about these two approaches, first one and then the other. So, first, your system needs to automatically flag sensitive data. Like we've discussed, looking at the column name alone isn't always enough. Your system should make use of machine learning to examine the data at the field level, in order to understand what may or may not be PII. So, machine learning can identify a social security number, or what looks like an address or a phone number; gender is probably a pretty easy thing to flag, but it has to be set up with machine learning on the back end, regardless of what the column names say they are. ML offers a more thorough, granular examination of your data, looking at both column names and field-level values to really flag everything that is sensitive. But what about that not-so-obvious data?
Chiara Colombi (25:22):
So, I'm not talking about social security numbers and addresses and things like that, but data that isn't universally accepted as sensitive. There may be information within your data that is sensitive in the context of either your particular industry or your use case. So, you might have proprietary business information or data that is subject to an industry-specific regulation. That's when you need that second system in place, a system to enable your users to define the rules around what is automatically flagged as sensitive. So, with those rules in place, all your not-so-obviously sensitive data will be identified and flagged going forward, and you'll know that you'll have to de-identify it properly. Some examples of what might not be obviously sensitive could be geolocation data in the form of lat long values. Maybe you've got warehouses whose locations need to be kept private, or maybe revenue amounts tied to individual users.
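Here's a small sketch of what such user-defined rules might look like; the rule names, tables, and regexes are hypothetical, and in practice these rules would sit on top of the automatic scanner rather than replace it:

```python
import re

# Rules a team might define for data that's only sensitive in their context
# (hypothetical examples: warehouse coordinates, per-user revenue figures).
CUSTOM_RULES = [
    {"name": "geo_coordinates", "column_regex": r"lat|long"},
    {"name": "user_revenue", "column_regex": r"revenue", "table_regex": r"users"},
]

def apply_custom_rules(table, column):
    """Flag columns matching team-defined rules, on top of the automatic scan."""
    for rule in CUSTOM_RULES:
        if "table_regex" in rule and not re.search(rule["table_regex"], table, re.I):
            continue
        if re.search(rule["column_regex"], column, re.I):
            return rule["name"]
    return None

print(apply_custom_rules("warehouses", "latitude"))    # -> geo_coordinates
print(apply_custom_rules("users", "annual_revenue"))   # -> user_revenue
print(apply_custom_rules("orders", "revenue"))         # -> None (rule scoped to users)
```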
Chiara Colombi (26:19):
Anything that you deem to be sensitive, you should be able to flag. If it's in your data, catching it and stopping it from leaking into lower environments shouldn't be this gargantuan, manual, risk-heavy task each time you're running a generation. It should be something that your system is designed to do for you. And I see another question that I want to get to as well, so I'm going to ask that quickly. Do you apply different fake data generation rules for records used in a staging or dev environment than you would in QA? It seems like there may be instances where you want to treat the data differently. Interesting question.
Omed Habib (26:56):
All right. Yeah, I can take that one. I've definitely seen that happen. So, you have the same source database, and then you have different teams that work on the data differently. So, you have software developers who are like, "Look, I just need data that can match production as much as possible. I don't really care about the nuances of the data. Just make sure it's de-identified and I have the same behavior." You have release managers who are like, "Yeah, I just need a lot of data. So, in production we have a million rows; give me a synthesized data set that has 10 million rows." So, they don't care at all about the nuances of the data, they care about mass scale. Again, it's the exact same source data. Then you also have organizations where you have data scientists who are like, "Yeah, we're running analysis on the data, but we don't want to give production access to all of these interns or associates or maybe even contractors, right? So, we're giving them data where the statistics of the data are absolutely imperative, because that's what they're running analysis on," right?
Omed Habib (27:58):
And so, they're really focused on it. With Tonic, you can create different workspaces, exact same source data, but the output data is all configured differently depending on the needs of the team. I think that answers the question if that was-
Chiara Colombi (28:12):
Yeah. Another thing I was thinking of is subsetting. Sometimes a team says I don't need all the data de-identified. I just want a small chunk of it so I can run a test locally. So, subsetting is a great example of when one team creates their own version of the de-identified data.
Omed Habib (28:31):
Yeah. Well, we do have a customer in eBay. I mean, you can imagine eBay's massive scale of data. So, I believe it's eight petabytes or more, right? So, it's a level of scale that I don't even know if most computer science books have answers for. These are the types of companies, eBay, Netflix, Google, Amazon, that are pushing the limits of computer science every day. So, if you're a software engineer at eBay, which has over 4,000 software engineers, mind you, you can't possibly, even if it's a synthetic data set, fit eight petabytes of data on your local laptop. So, how do you solve that? You take a subset of it, and a subset, for those of you who don't know, is essentially a micro version of your entire database that maintains all the tables and the foreign key relationships and the parent-child dependencies. It's essentially a smaller version. With eBay, they take eight petabytes of source data and they create a one gigabyte synthesized subset test data file. That's a lot of scale.
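As a rough sketch of the subsetting idea, assuming a toy in-memory database and hypothetical table names, the core is: keep a slice of a root table, then keep pulling in referenced rows until every foreign key resolves. A real subsetter also handles composite keys, cycles, and target sizing:

```python
def subset(db, fk_graph, root_table, root_ids):
    """db: {table: [row dicts with an 'id' key]}.
    fk_graph: list of (child_table, fk_column, parent_table) tuples."""
    keep = {root_table: set(root_ids)}

    # Pull in child rows that reference the rows we're keeping (e.g. a user's orders).
    for child, fk_col, parent in fk_graph:
        if parent in keep:
            keep.setdefault(child, set()).update(
                row["id"] for row in db[child] if row[fk_col] in keep[parent]
            )

    # Then repeatedly pull in parent rows referenced by anything we keep,
    # until nothing new is added, so every foreign key still resolves.
    changed = True
    while changed:
        changed = False
        for child, fk_col, parent in fk_graph:
            for row in db[child]:
                if row["id"] in keep.get(child, set()) and row[fk_col] not in keep.setdefault(parent, set()):
                    keep[parent].add(row[fk_col])
                    changed = True
    return keep

# Example: keep user 1, plus their orders, plus the products those orders reference.
db = {
    "users": [{"id": 1}, {"id": 2}],
    "orders": [{"id": 10, "user_id": 1, "product_id": 100},
               {"id": 11, "user_id": 2, "product_id": 101}],
    "products": [{"id": 100}, {"id": 101}],
}
fk_graph = [("orders", "user_id", "users"), ("orders", "product_id", "products")]
print(subset(db, fk_graph, "users", [1]))   # {'users': {1}, 'orders': {10}, 'products': {100}}
```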
Chiara Colombi (29:41):
Yeah.
Omed Habib (29:42):
Fun stuff. Okay. So, going back, you mentioned something interesting earlier about latitude and longitude. We don't have that in this book, and I believe it's coming out in the next version, but I'm going to tease it anyway. Imagine actually keeping latitude and longitude coordinates, and then your test system has these random numbers that aren't even actually latitude and longitude coordinates, or if they are, they're scattered all over the globe, right? So, that's another challenge here: how do I make sure that my customer is protected and de-identified while still maintaining a geographically relevant coordinate, so my tests look like they're doing what they're supposed to be doing? Yeah. The solution to the anti-pattern, obviously, is you want some kind of radius where you can't reverse engineer the location, but at the same time, somebody in Detroit, Michigan doesn't end up in the test environment in, I don't know, Antarctica or something.
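A minimal sketch of that kind of geographic fuzzing, with the radius chosen arbitrarily: shift each point a random distance and direction within a bounded radius, so the result stays plausible without pointing back to the exact address:

```python
import math
import random

def jitter_coordinate(lat, lon, max_km=25.0):
    """Move a point a random distance (up to max_km) in a random direction,
    so it stays geographically plausible but can't be mapped back exactly."""
    earth_radius_km = 6371.0
    distance = max_km * math.sqrt(random.random())   # uniform over the disc
    bearing = random.uniform(0, 2 * math.pi)
    angular = distance / earth_radius_km
    d_lat = angular * math.cos(bearing)
    d_lon = angular * math.sin(bearing) / math.cos(math.radians(lat))
    return lat + math.degrees(d_lat), lon + math.degrees(d_lon)

# A customer in Detroit stays somewhere around Detroit, not in Antarctica.
print(jitter_coordinate(42.3314, -83.0458))
```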
Chiara Colombi (30:47):
Yeah, that's a great one that I'm looking forward to adding to the book.
Omed Habib (30:51):
Yeah, cool. Great question. All right. Let's keep going here. Missed schema changes. Okay. So, I mentioned eBay has like 4,000 software engineers, okay? So, how do you maintain that level of contribution and that level of software development without having somebody introduce a new feature that introduces a new column, and that column is sensitive? So, if you're creating test data from your production environment, you have some kind of software, possibly Tonic or something you built in-house, hopefully Tonic, that takes your production data set and de-identifies it. Well, said developer could introduce a new feature that introduces a new column that goes into production and that's sensitive. The example here is a student social security number. If you don't keep track of schema changes, that column could be missed. And your test data generator is now pulling sensitive student information into your test data environment. And now your QA engineers are testing against data that has a column that should have been de-identified. That's the anti-pattern.
Chiara Colombi (32:14):
Yeah. And it sounds similar to the solution for PII because it is. You need to set up your system to flag something automatically, and in this case it's simpler than PII, obviously; you're just flagging a schema change. That said, we do have some tips for kind of up-leveling this capability. So, essentially, if you've created a de-identification infrastructure that automatically runs regular generations based on an input database, you want it to also automatically alert you anytime the schema of your input database changes. The reason being that if a column or table has been added, like Omed said, you need to define how that new data should be de-identified before you run any new generations or rehydrate your output database. So, that's kind of table stakes, basically. Just let me know when my schema changes so I can avoid pushing data through that isn't yet de-identified.
Chiara Colombi (33:04):
The up-level is: not only alert me that my schema has changed, but stop me from running any generations until I've addressed the issue. So, block generations from taking place until the model that you've built to generate data has been updated to account for any new data that's flowing through. At Tonic, many of our customers run nightly generations, so their test data is refreshed each day before they come in in the morning. Some even run multiple generations a day. So, you can see how quickly production data could end up in staging if schema change alerts and blocks on data generation were not part of the platform's core capabilities, because you want this to just be running in the background doing the work for you, but you also don't want it pulling in data when it shouldn't.
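A small sketch of that table-stakes check, assuming the schema can be snapshotted as a dict of tables to columns (for example, read from Postgres' information_schema); the column names reuse the student example:

```python
import sys

def new_columns(approved, live):
    """Compare a previously approved schema snapshot against the live schema.
    Both are plain dicts of {table_name: [column_name, ...]}."""
    diffs = {}
    for table, columns in live.items():
        added = set(columns) - set(approved.get(table, []))
        if added:
            diffs[table] = added
    return diffs

def guard_generation(approved, live):
    """Block the nightly generation until new columns have generators assigned."""
    diffs = new_columns(approved, live)
    if diffs:
        print(f"Schema change detected: {diffs}. Assign generators before the next run.")
        sys.exit(1)

# The live schema could be read from e.g. information_schema.columns;
# it's hard-coded here to show the check itself.
approved = {"students": ["id", "student_fn", "student_last_name", "student_email"]}
live = {"students": ["id", "student_fn", "student_last_name", "student_email", "student_ssn"]}
guard_generation(approved, live)   # exits: student_ssn has no generator assigned yet
```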
Omed Habib (33:46):
Yeah. Awesome. All right. Anti-pattern number seven: outliers revealing TMI, too much information. Okay. Let's zoom into the anti-pattern here. Here is your table: first_name, last_name, city, state, and household_net_worth. Notice something here sticks out like a sore thumb: $500 million in household_net_worth. Joe Schmoe, you probably could guess who this is. If you are a Stephen King fan, I just gave the answer away. It's Stephen King. Let's say, for example, you have an environment where you're keeping track of net worth. All right. Well, okay, I de-identified Joe Schmoe, and let's say I even de-identified the location, right? I mean, in this case, I didn't, it's Bangor, Maine, but let's say I did. If the test data ends up in the wrong hands and somebody happens to know that Stephen King is a customer of this particular company, they could probably look at that and be like, "Oh, I'm willing to bet that that record there, that $500 million household_net_worth, which is a pretty obvious outlier, is probably Stephen King," okay?
Omed Habib (34:58):
And by the way, his net worth could be, say, $700 million, and your test data generator ended up de-identifying it to $500 million. Well, yeah, but mathematically that still sticks out like a sore thumb relative to all of the other data points in that data set. So, that's the anti-pattern.
Chiara Colombi (35:21):
Yeah. And here's another one of my favorite topics, because we're talking data science again. The answer here is differential privacy. We can't seem to have a webinar without bringing it up. And like I said, I love it because it's really powerful stuff. In the simplest terms, differential privacy is a property of an algorithm that provides mathematical guarantees of privacy. And the way it works is, after ingesting your source data, differential privacy will add noise to your de-identified output to create a more tempered version of the original data. And along the way, it will obscure outliers that would otherwise leave your fake data at greater risk of re-identification. So, for a data point like Stephen King's net worth, instead of causing a similar data point along the lines of $500 million, whatever it is, to appear somewhere in your output data, that kind of one-off, obvious outlier in your data would be softened by differential privacy.
Chiara Colombi (36:17):
And you might find a data point more like $500,000 somewhere in your output. So, that's still above average, but it's not this dead giveaway of, we've got some big bucks in this one place, that could be re-identified. Now, it's worth pointing out that not all algorithms can easily be made differentially private. Even for the simplest of algorithms, like the one behind our categorical generator, it requires a really nuanced understanding of the formula behind differential privacy to toggle the dials, if you will, to the right degree of privacy versus utility. Because the thing is, the more you lean in to making your data differentially private, the more you lean away from making your data useful when it comes to analytics and preserving your data's realism, in the sense of being very true to the original.
Chiara Colombi (37:06):
Think of it this way: the more noise you add to your data, the harder it's going to be to hear the truth within your data. So, these dials of differential privacy have to be really carefully fine-tuned, which is why we're very grateful for the data scientists on our team who manage this for us.
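For intuition only, here's the classic Laplace mechanism applied to a single aggregate query; making an entire synthesis pipeline differentially private is far more involved, and the dollar figures and epsilon values below are arbitrary:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Classic Laplace mechanism: add noise scaled to sensitivity / epsilon.
    Smaller epsilon = stronger privacy = more noise = less utility."""
    return true_value + np.random.laplace(0.0, sensitivity / epsilon)

# Hypothetical query: average household net worth over 10,000 customers, with each
# person's contribution clipped to $1M so a single outlier can't dominate the answer.
clip, n = 1_000_000, 10_000
sensitivity = clip / n          # how much one person can change the average
true_average = 250_000

print(laplace_mechanism(true_average, sensitivity, epsilon=0.1))   # noisy, very private
print(laplace_mechanism(true_average, sensitivity, epsilon=10.0))  # closer to the truth
```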
Omed Habib (37:27):
Okay. Yeah. Very well explained. I want to say differential privacy was a standard created at Microsoft. Am I-
Chiara Colombi (37:31):
I don't know that. I do know one of the lead data scientists behind it; Dwork is her last name, I can't remember her first. But there's an article on our blog about differential privacy. I can find that and put it in the chat.
Omed Habib (37:47):
Yeah. I am equally thankful for the multiple PhDs that we have on our team here at Tonic who understand this. All right, anti-pattern number eight. So, I think there are two more left here and then we can jump into some more questions. Please keep the questions coming through. The anti-pattern here is no API, no webhooks, no integration. So, the idea here is that you use some sort of software, or maybe even a tool that you built in-house, but there's no API. There's no way to integrate it into your existing workflow or automation. If you're a software team, you probably have some kind of CI/CD orchestration system, whether it's GitLab or Codefresh or Harness or even Jenkins, for example. And every stage in the workflow, whether it's storing an artifact or kicking off automated QA tests or security tests, is probably going to depend on some sort of environment that's going to be using test data, right?
Omed Habib (38:52):
So, if you're running integration tests, you're going to need an environment that looks and feels and acts and behaves like production. So, both the infrastructure of that ephemeral mimicked environment and the test data in that environment have to be as close as possible to production. Okay. So, let's say you solve for that. Great. Well, every time you run a pipeline or a new job or a new release, you're going to have to run those tests. And you want that test data to either be regenerated, so if you're spinning up a whole new environment, you want to output and saturate that test database with test data, or you want to update an existing static test database. Regardless, you want to have some sort of integration into your existing workflows. So, if you've already adopted DevOps, or you're thinking about adopting better, modern CI/CD practices, you're going to want some sort of integration with your CI/CD environment. So, the anti-pattern here is no API, no webhooks, no integration. That's the anti-pattern.
Chiara Colombi (39:55):
Which makes the pattern pretty obvious: API, webhooks, and integration. It's one of the first questions we actually always get when introducing Tonic to someone new; whether it's on a webinar or elsewhere, it always comes through: does Tonic have an API? Yes, it does, and your solution should too. Another handy feature is webhooks. Webhooks allow you to fire HTTP POST requests after specific events take place during your data generation workflows. So for example, those schema changes that we were talking about earlier. Sure, Tonic can send you an email about them, but we all know a Slack notification is going to get the message across faster. So, webhooks are the tool that makes that possible. The point here is, like Omed said, the more your workflows integrate, the faster and more effectively you can get your work done. It's also worth mentioning the maintenance required to keep all your integrations up and running. It's probably clear by now that building these data de-identification infrastructures is no simple task, but maintaining them once they're built is not a walk in the park either.
Chiara Colombi (40:58):
Your data is moving quickly. Your ecosystem is constantly changing and your de-identification solution should be able to keep pace with all of those changes.
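Here's a generic sketch of firing such a webhook, not Tonic's built-in feature; the URL is a placeholder and the payload follows a Slack-style incoming webhook format:

```python
import json
import urllib.request

# Placeholder URL for an incoming webhook (e.g. a Slack incoming webhook).
WEBHOOK_URL = "https://hooks.example.com/services/REPLACE/ME"

def notify_generation_event(workspace, status, rows_written):
    """Fire an HTTP POST when a data generation event completes."""
    payload = {
        "text": f"Data generation for '{workspace}' finished with status {status} "
                f"({rows_written:,} rows written)."
    }
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# notify_generation_event("staging-nightly", "SUCCEEDED", 1_250_000)
```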
Omed Habib (41:08):
Thank you for that. All right. Our last one. Last but definitely not least, vendor lock-in. Okay. So, imagine that you're building some sort of script or tool in-house, okay? Hopefully you're not, but imagine that you are. The anti-pattern here is that you are solving for a MySQL database that you're using in production. The larger your environment and the more distributed your teams, the more you've probably gone from some sort of monolithic architecture to microservices, and you're going to be introducing different stacks, right? Your PHP stack might be using MySQL. Your .NET stack is using SQL Server. You might have some kind of Node.js stack somewhere, with a team in Europe who just adopted Mongo. So, what's going to happen is now you have a production environment with multiple different types of databases.
Omed Habib (42:01):
That's very common with most modern software deployments that we see today. I'd actually be surprised if you have a monolithic environment in production; it's almost nonexistent now. So, the challenge here is that you created a test data generator that takes production data from your MySQL instance, and let's just say that you were able to achieve this Herculean effort of actually solving for this. Well, now you're stuck with just one vendor, MySQL. Okay. So, you introduce another database, PostgreSQL. Now, not only do you have to continue to build adapters and connectors and understand the nuances behind every type of database engine, but then your Mongo team goes, "Hey, we need test data for our Mongo." Now you have a NoSQL database, right? You have to start from scratch with a whole new database type. So, the anti-pattern here, essentially, is that what you don't want to do is build this in-house, which I think is pretty obvious by now from this conversation.
Omed Habib (43:00):
You probably don't want to build this in-house, and if you do, you want to solve for this anti-pattern and not be locked into a certain vendor. And if you do use a third party, hopefully you're using Tonic, shamelessly plugging us, so that you have support for an extremely wide array of database types.
Chiara Colombi (43:21):
Yeah. Just to add to that, it all comes back to the integration component. You want to be able to easily integrate with the tools in your tech stack and easily integrate with all the sources of your data. So, we like to say Tonic works as effectively with Postgres as it does with Redshift, as it does with a NoSQL database like MongoDB. Whatever data source you're working with, you want your solution to be working with it too. And in rattling off all these vastly different databases, we're in no way trying to downplay the complexity of the task at hand. It sounds like we're saying it's a case of a one-size-fits-all solution, but it's really a case of one size that has been meticulously designed and built to work seamlessly with all of these data sources.
Chiara Colombi (44:04):
It's a huge lift. And part of the reason we include this, and I think Omed has pretty much covered this, the reason we include this among the anti-patterns is because even if you have a system in place right now that more or less works for your current data ecosystem, there's really little to no guarantee that it's going to work a year from now. It's partly a maintenance issue, but it's also just the nature of today's expanding data landscapes. I would wager that for a good chunk of you listening right now, your team is either considering adding a data warehouse or has recently added a data warehouse to your setup. Are you prepared to manage de-identification at that scale? It's a big scale. So yeah, you need a solution that works with your data now, and no matter what your data looks like in the future.
Omed Habib (44:52):
Awesome. Thank you for that, Chiara. I believe that covers all of it. Am I right? Yeah.
Chiara Colombi (44:55):
Yeah.
Omed Habib (44:55):
There we go. So, that was number nine. We are ready for some questions.
Chiara Colombi (45:02):
Yes. And I was also going to quickly put a link to the ebook where you can download it right here. It's in the chat. So, some questions let's see. So, here's a high level question for you. What have teams done historically to solve this problem and why isn't that working anymore?
Omed Habib (45:27):
Yeah, that's a great question. So, a lot of that actually was answered during the presentation. The obvious answer that we see most commonly, and I'd probably say this is our number one competitor, if you will, is people trying to solve for this themselves. There are third-party libraries, and you can take open source projects and extend them to try to solve for it. But almost every conversation that we come across has one or more of these anti-patterns, which is actually where we sourced this content from. It comes from hundreds of conversations with large enterprise software deployments out there. That's one avenue: trying to solve for this by doing it yourself. If you do that, you're going to come across these anti-patterns, and this is the least of it; there are many more anti-patterns that you're going to come across, not to mention the ROI challenge, which is, okay, well, how many engineers do you have to put on this problem, A, to build the software, and then on top of that, maintain it?
Omed Habib (46:26):
So, it becomes very, very expensive, very quickly. The second avenue is third-party software; this is not a new space. People have been creating test data since the beginning of time. And so, the challenge there is, well, you can use a third-party product to create random, arbitrary data, and you're 100% safe. And if you're a brand new software company or a startup, or a new project at a larger company that doesn't have any production data, it will serve your purpose. The problem, and this is kind of where Tonic steps in, is that some of the larger deployments in the world are going to have a lot of data, a lot of tables, a lot of columns. But probably more importantly, they have a lot of nuanced behavior in both the complexity of and the relationships between the data. And so, Tonic can solve for that. And so, that's where you begin to realize that, all right, as your organization and your software start to get more and more complex, you're probably going to need a more advanced solution for it.
Chiara Colombi (47:36):
All right. A question about consistency: does consistency work across database types? And I kind of breezed over that, but I'll just confirm: yes, it does. So, if you've got data that appears in your Mongo database and it's also in your Postgres, consistency can be set up to work across them, to make sure that however the data is de-identified in Postgres, it's doing the same thing and getting the same output in whatever other database you're using. Anything you'd add to that?
Omed Habib (48:05):
Can I just admit that I actually didn't know that and that's pretty impressive. I probably obviously should have known this, but I am admitting that that is actually pretty impressive.
Chiara Colombi (48:13):
Yeah. It's again one of those things; that's why I like consistency. It's just this little toggle. No, it's not just this little toggle, it's doing so much. It doesn't get the bells-and-whistles treatment as a feature because all you need to do is toggle it on. All right.
Omed Habib (48:30):
Great question.
Chiara Colombi (48:31):
Let's see. Oh, what is a generator in the context of Tonic? How do we define it, since we talk about a categorical generator, a continuous generator? Omed, do you want to take that, or do you want me to take that?
Omed Habib (48:45):
I could take that, but if I do, I actually want to present something if I have time. So, why don't you take it and then I'll bring up a slide that can visually show you what a generator is.
Chiara Colombi (48:58):
Sure. So, in the context of Tonic, a generator is basically an algorithm that has been developed to handle a specific type of data. So, it can be something as simple as a first name generator, where we're pulling from a library of first names, or our continuous generator, which looks at a distribution of numbers and recreates the distribution, or the categorical generator, which recreates a distribution in terms of the ratios of the categories. But it's basically a customized algorithm for a specific type of data. Generators are used on a column-by-column basis, and you can mix and match, using all the generators that you need for the different data types that you have. They're kind of the building blocks of how you create the model that Tonic uses to generate your data.
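To make the "building blocks" idea concrete, here's an illustrative sketch of a generator interface; these are not Tonic's actual classes, just the general shape of the idea:

```python
import random
from abc import ABC, abstractmethod

class Generator(ABC):
    """One building block: an algorithm that fakes one kind of column."""
    @abstractmethod
    def generate(self, original_value):
        ...

class FirstNameGenerator(Generator):
    NAMES = ["Ali", "Maya", "Jon", "Priya"]
    def generate(self, original_value):
        return random.choice(self.NAMES)

class CategoricalGenerator(Generator):
    def __init__(self, ratios):
        self.categories, self.weights = zip(*ratios.items())
    def generate(self, original_value):
        return random.choices(self.categories, weights=self.weights, k=1)[0]

# Mix and match generators on a column-by-column basis to build the model for a table.
column_generators = {
    "first_name": FirstNameGenerator(),
    "department": CategoricalGenerator({"Sales": 0.6, "Engineering": 0.4}),
}
fake_row = {col: gen.generate(None) for col, gen in column_generators.items()}
print(fake_row)
```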
Omed Habib (49:52):
Yeah, that's exactly correct. And that's a visual, in case you're interested. You can take a screenshot if you'd like, but you can also browse through our docs, docs.tonic.ai; there's a section called Generators. Chiara said it very well: they're the building blocks of your synthetic data sets. So, you pass any of your columns in production through Tonic using any combination of these building blocks, to solve for hypothetically an infinite number of scenarios and create your synthesized data set. Great question.
Chiara Colombi (50:27):
Yeah. Kind of a follow up question to that is again, Tonic specific. Would you mind providing a high level walkthrough of the process of using Tonic to generate data?
Omed Habib (50:39):
A high level walkthrough of using Tonic. Okay. I'm going to try to do this in 30 seconds.
Chiara Colombi (50:44):
Yeah.
Omed Habib (50:45):
You have production data, and you pass that data through Tonic. The data doesn't get stored in Tonic; it keeps something like 100 records in memory. But using Tonic, and using any combination of these generators, you flag which columns are sensitive. So, say, date of birth: you pass it through the date of birth generator, and Tonic then does that across all of your columns and tables. You also connect an output database. All you need is the connection string, that's it, input source and output, okay? And Tonic will then connect to your output database and automatically create the schema, create the tables, and create the records for a synthetic data set that looks, acts, and feels like production, because it was actually created from production. You can create macro sets.
Omed Habib (51:35):
So, you can go from, say, 1 million records in production to simulating a 20 million record environment, in case you want to test against Black Friday scenarios. But you can also do a micro version using our subsetting feature, which is like, hey, I just need a smaller version of this, or, I do need actual production data to diagnose this bug, the user ID is 145, just give me a schema that represents user ID 145. And so, you can give that to a software developer to go diagnose it without having to give them all of production, for example. That's it in a nutshell, very high level. I hope that captured it.
Chiara Colombi (52:13):
Yeah. I like to just think source, build a model on Tonic, output. That's just how it all works. Yeah. All right. I don't see any other questions. Any other questions feel free to put them in the chat or the Q&A.
Omed Habib (52:30):
Cool. Hope everyone enjoyed it. We'd love to hear from you, feel free to reach out. If you want to register for a sandbox, we are offering that; I highly recommend you take advantage of it. You can go to the website and request an onboarding. You get a full two weeks, you get onboarded for an hour, your own login and password with all the features enabled. If you do want to get a demo, you can also request one and schedule it live right there on the website, or if you just want to reach out to us, it's hello@tonic.ai.
Chiara Colombi (53:03):
Yeah. And we also have a bunch of events that are working their way onto the calendar for the next year. So, keep your eyes on the webinars page for more to come. Some exciting stuff. And that should be, oh, go ahead.
Omed Habib (53:16):
Yeah. And since we're on this topic, sorry, keep an eye out for our next ebook. It'll be coming out probably Q1 of next year.
Chiara Colombi (53:23):
Yeah.
Omed Habib (53:24):
Test Data Patterns: Best Practices.
Chiara Colombi (53:28):
Yeah. Awesome. Great. Well, thank you to everyone [crosstalk 00:53:32] for joining us. Thank you, Omed. This was a lot of fun, kind of tag-teaming the problem and solution. Yeah. This has been great. Thanks to everyone, and we look forward to the next event. Thanks. Thank you.