With big data getting bigger by the minute, subsetting is fast becoming a crucial step in representative data synthesis. But it isn’t easy.
Many synthetic data tools don’t even offer subsetting capabilities.
Preserving referential integrity across hundreds (or thousands) of tables is a challenge. Today, we’re taking that challenge one step further.
Staff Software Engineer Johnny Goodnow will shine a light on multi-database subsetting and how this impactful process fuels targeted test data generation while strengthening data security.
In this webinar, you'll learn:
- What subsetting is and why it's useful across domains
- The challenges of subsetting, and why it's hard to find off the shelf
- Tonic's approach to multi-database subsetting
Chiara Colombi (00:07):
Hi, everyone. Welcome. Thank you for joining us today for our session on Multi-Database Subsetting. My name is Chiara. I'm on the Marketing and Developer Relations team here at Tonic, and I'm excited to introduce you to our speaker, Johnny Goodnow. Johnny is a Staff Software Engineer here at Tonic, and he's been largely responsible for the development of our subsetter. He has a fantastic presentation planned for you and we're excited to get into that. We'll also leave plenty of time for questions at the end of Johnny's presentation, so feel free to drop those in the Q&A and I will cover them at the end. With that said, I will hop off now and pass the mic over to Johnny.
Johnny Goodnow (00:45):
Awesome. Thanks Chiara. Hello, I'm Johnny. I'm an Engineer at Tonic, and the goal today is for me to talk about subsetting. Really, what this is is shrinking your database down in size, and a lot of utility comes from that. So we're going to talk a little bit about how it's useful, how to do it in a performant manner, and Tonic's unique approach to handling this problem. We're going to split this into a couple of segments. First, we are going to talk about what subsetting is, why everyone needs it, and why it's useful in all kinds of domains. We're going to talk, in addition, about the challenges of subsetting and why this isn't an easy tool to find off the shelf, including when we originally started developing it. And then, of course, Tonic's unique approach to subsetting and some of the things that we've learned along the way.
Johnny Goodnow (01:43):
So let's start off with what is subsetting, right? Subsetting, very simply, is intelligently shrinking your data. Or, to have a stronger definition here, it's taking a representative sample of your data. You want to do this in a way that preserves the integrity of the database. You want related data to still maintain those relationships, especially if you're using a more relational style database, SQL, et cetera. You want to be able to ask a question like, give me 5% of my users, and not just 5% of the users from the users table, but all of the things that are related to those users. That way you can have the full story, but at a much smaller scale compared to a production dataset.
Johnny Goodnow (02:36):
And so here we have a really basic schema. This could be for an e-commerce company: you have products, purchases relate to products, products can have reviews for them. And then there could also be favorites that are tied to specific reviews or products, et cetera. At a very simple level, we decide, "Okay, the core of our database is products in this particular case." So we want to grab some data from the products table, and then we want to also grab related data from these other tables based off of that initial core of the database. In this case, let's say I want 10% of the products that my company sells and all of the other things related to them. And this includes both data that is necessary as well as data that is optional, as far as how databases are typically designed.
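To make the example concrete, here is a minimal sketch of that schema written out as foreign-key relationships, plus the kind of query that would pull the 10% core. The table and column names are illustrative assumptions, not taken from Tonic or from any particular database.

```python
# A rough sketch of the example schema as foreign-key relationships.
# Each entry reads: (child_table, fk_column) references parent_table.
FOREIGN_KEYS = {
    ("purchases", "product_id"): "products",
    ("reviews",   "product_id"): "products",
    ("favorites", "product_id"): "products",
    ("favorites", "review_id"):  "reviews",
}

# The "core" of the subset: roughly 10% of the products table, sampled at random.
# (Sampling syntax varies by database; this form works on PostgreSQL.)
TARGET_QUERY = "SELECT * FROM products WHERE random() < 0.10"
```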
Johnny Goodnow (03:37):
And so, why is this thing useful? First off, production data tends to grow in size and complexity over time. It's just one of those things: it won't be long before it's far outside the realm of one person's head. Likewise, testing and analyzing that data can be a bit of a mess, right? You can run into problems with security, you can run into problems with causing downtime for other users who are utilizing this data, and you can have access concerns. You can also have regulatory requirements, especially if your industry has very specific rules around who can access production data, particularly if it has information about individual users, and you need to make sure you respect their privacy.
Johnny Goodnow (04:30):
And finally, sometimes modifying the data is very risky and sometimes impossible, because of many of the other concerns I mentioned, but also because you can't necessarily disrupt people who are utilizing the core production dataset. A common solution that you might think would work is, well, what if we just make a full copy of this thing and throw some masking on top of it? It actually turns out it takes a lot of money and time to continuously do this, especially if you're refreshing this data often, or if different teams end up having their own copies and need specific details about that data. It just becomes very expensive and very time consuming, especially if your dataset is already really large.
Johnny Goodnow (05:21):
Likewise, continuing down the road of being cheaper, it's also just cheaper to store data. Whether you're storing it in the cloud or on your own hardware, it's less bandwidth necessary to retrieve and share that data with others. And it's less time spent just packaging and building that data for different environments, especially if you want to have different complete versions of your dataset for a test environment versus a staging environment versus an analytics environment. It can simply cost a lot of time to package all these things and to have automation around building these different datasets based off of your original production dataset.
Johnny Goodnow (06:07):
And then finally, it's just a lot easier to use a smaller dataset. It's easier to share. You can load it up on your own laptop and not need to utilize servers. You can have different teams and services use different subsets of data, and this becomes really important when you're thinking about things like microservice architectures, or teams that only really care about a certain slice of data and only certain tables or objects within the schema. And finally, it's also easier to tune it, such that you can curate a more specific data environment for reproducing issues, testing, analysis, et cetera.
Johnny Goodnow (06:51):
And many different contributors on the team are going to be able to utilize this. Developers can more easily build and test features and bug fixes with a smaller dataset. QA can more easily reproduce bugs with specific conditions. Data science can know that they're working with a representative dataset and still have faster queries, or just be able to do some spitball analysis a little easier. And it's simply better for the majority of teams, in general, to have something that is going to be a lot less unwieldy compared to a large production dataset.
Johnny Goodnow (07:34):
So we talked a little bit about what it is and when we need it. I want to go a little bit into the challenges of subsetting. Why isn't this something that you can do easily? Well, first off, it's hard. It's simple to take 5% of a single table, say a users table, but it's much harder to get all of the related data. And even if you get all of the related data from a referential integrity standpoint, making sure that all the necessary relationships are maintained, that doesn't necessarily mean that the subset is going to be useful. This is because much of the valuable data is optional from a referential integrity standpoint. We'll get into some specific examples showing how optional data really comes into play.
Johnny Goodnow (08:29):
And likewise, what a good subset is can be hard to define. Different organizations and domains have very different needs, including what data is included, what data systems they're using to store that data, and ultimately how things are related. Sometimes that can be represented in the data system itself, as foreign keys in a SQL database. Other times, it lives in the app logic itself, where individuals at an organization decided how they're going to link related data, especially across different data systems. Most organizations usually just try taking a simple cut, and if you don't actually try to make it useful, people just don't use it. They would rather move back to production, or they just feel a lot of frustration and decreased productivity from trying to utilize data that isn't high quality.
Johnny Goodnow (09:29):
So if we look back at our original example, we have reviews, favorites, purchases, et cetera. The thing is, these tables could often be in many different kinds of domains. There could be other related tables. It's never just four tables; it could be on the order of hundreds of tables. On top of that, they could be stored in different databases but still have identifiers that link between them. It's never just three databases; it can be many databases. It really depends on the scale that you're looking at and how your organization is con…
Johnny Goodnow (10:13):
... well, really leading to this complication of how do we actually traverse all of these relationships and make sure that we grab all of the data necessary to build this statistically representative subset.
Johnny Goodnow (10:29):
Now, we can also take a look at the micro-level of this particular problem. Much of the issue with dealing with multiple databases is often just how we query that data: just tell us which servers to point at, and we'll grab that data as long as the identifiers can actually be matched across the different systems. But there are also other problems that exist within just a set of tables. So for example, let's say we grab some data in this particular table in the middle, and we start looking at some of its neighbors. Okay, let's start grabbing related data: we grab the table next to it, and then we grab some more data in another table that relates to that one. And if you look at the arrows in the center, you may already notice that what we actually have is a cycle, right?
Johnny Goodnow (11:28):
And cycles are very common in data schemas out in the wild. Inherently, the problem becomes: how much data should I continue to collect if I start at a certain point? If you don't actually account for that cycle, you could end up just copying the full table for all three of these tables. And often the cycle is much larger than three tables; it can exist across 30 tables. In that case, you don't actually end up with a subset. You just end up with a copy of your original data, and we haven't solved anything.
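To illustrate why cycles have to be handled explicitly, here is a small standalone sketch (not Tonic's implementation) that detects a cycle in a table-reference graph before any rows are pulled. Without a check like this, a naive "follow every reference" crawl over tables a, b, and c that reference each other in a loop would keep pulling rows until it had copied all three tables.

```python
def find_cycle(graph):
    """Depth-first search over a {table: [referenced_tables]} graph.
    Returns one cycle as a list of tables, or None if the graph is acyclic."""
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for neighbor in graph.get(node, []):
            if neighbor in visiting:            # back-edge: we found a cycle
                return path[path.index(neighbor):] + [neighbor]
            if neighbor not in done:
                cycle = dfs(neighbor, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for table in graph:
        if table not in done:
            cycle = dfs(table, [])
            if cycle:
                return cycle
    return None

# Example: a three-table cycle like the one in the diagram.
print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
```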
Johnny Goodnow (12:06):
And so all of these problems really come into play when trying to even build an algorithm to intelligently do this and take different systems into account. A lot of that is at the core of what Tonic really does. So I want to talk a little bit about Tonic's approach to subsetting. Also, I just wanted to check if anyone had any questions up to this point? Otherwise, I will keep going. Let's take a look.
Johnny Goodnow (12:43):
Okay, cool. So I will continue going. The way that Tonic handles all of these problems works across a variety of different contexts. The first thing we do is keep the configuration simple. A user specifies a specific target, which could be a users table or a products table, something that they feel is the core of their database where most of the relationships end up tying back to, and then some condition, say 5% of the users who are also all in the state of California.
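As a hypothetical illustration of how little that configuration needs to be (this is not Tonic's actual configuration format), the whole specification boils down to a target and a condition:

```python
# Hypothetical subsetting configuration, just to show the shape of the input:
# one target table plus a condition describing which slice of it to keep.
subset_config = {
    "target_table": "users",
    "percent": 5,                 # keep roughly 5% of the target
    "where": "state = 'CA'",      # optional extra condition on the target rows
}
```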
Johnny Goodnow (13:22):
One thing that we noticed early on is that a lot of open... Or actually, I should say the few open source tools that existed for this were often very difficult to configure in order to get a subset that you actually wanted to use. We want to really abstract away a lot of the challenges that exist in producing a good subset. From there, we iteratively explore all the connected objects, and we make sure to grab optional data where it's useful, and then necessary data to ensure referential integrity. And likewise, we handle a lot of complex data schemas, including ones that have referential cycles, but also more complicated relationships like composite keys, cases where tables or objects have multiple different unique identifiers that different tables point to, and many others that we've come across along the way working with many of our customers.
Johnny Goodnow (14:26):
And so, notably, at a very high level the algorithm works like this: we sort the tables according to their dependencies. This is done using a topological sort, which is often used to ensure that tasks, for example, are done in dependency order; you never work on something until everything it depends on has already been finished. We do the same for the tables and their relationships. We then subset the target table or objects (we'll talk a little bit about NoSQL at the end). Then we add optional data, which we consider upstream of the target, and we'll show some pictures to explain that. And finally, we make a downstream pass, which grabs all the required data that is necessary in order to have referential integrity. And so, keeping with our e-commerce theme, here's what the schema looks like after a topological sort.
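Here is a minimal sketch of that first sorting step, assuming the schema's relationships have already been collected into a table-to-dependencies map. It only shows the ordering idea, not Tonic's implementation, which also has to cope with cycles that a plain topological sort would reject.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each table maps to the tables it depends on (its foreign keys point at them).
dependencies = {
    "favorites": {"reviews", "products"},
    "reviews":   {"products"},
    "purchases": {"products"},
    "products":  {"skus"},
    "skus":      set(),
}

# static_order() yields dependencies before dependents, so the target end of
# the graph ("skus" here) comes out first and the leaf tables come out last.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # e.g. ['skus', 'products', 'purchases', 'reviews', 'favorites']
```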
Johnny Goodnow (15:27):
All of the dependency arrows are now going from left to right. We've decided at this point that, "Okay, SKUs are going to be the core of the database. This is going to be our direct target. And let's say we decide to take 10% of them," or something like that. From there, what we want to do is go upstream. What this means is we're going to grab auxiliary information, optional data for a specific row. So looking back at our picture, we have products in the center. Reviews point to products, which means that a review can exist for a product, but a product doesn't necessarily need to have any reviews. That makes sense in the context of reviews for an e-commerce product. And it's auxiliary because it's not required. Technically, a valid subset could just grab no reviews for any product in the database, and that would still work.
Johnny Goodnow (16:32):
You wouldn't have any complaint at the end that any foreign keys were violated, or break any app logic downstream. But it's not necessarily as useful. Maybe you're interested in reviews, maybe you want to do analysis on those reviews based on those products and what categories those products are in. So we find it essential to actually go through and pull optional data as well. And we do this by following the relationships between these objects, so for every product we grab the reviews associated with it. But we also have options to filter this data. Because it's optional, you can say, "Hey, only grab reviews that have been published in the last 30 days," so that we're not grabbing reviews from all time. And there's a variety of ways to do this, either by grouping or by having conditions like this where we look at the published date of the review.
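As a rough idea of what a single upstream step amounts to in SQL terms, assuming the product ids already chosen for the subset have been staged in a working table called subset_products (a made-up name, as is the published_at column), the optional filter from above just becomes part of the join:

```python
# One upstream step: pull only the optional reviews we care about for the
# products already selected. The 30-day filter is the example condition above.
UPSTREAM_REVIEWS_QUERY = """
    SELECT r.*
    FROM reviews AS r
    JOIN subset_products AS p ON r.product_id = p.id
    WHERE r.published_at >= now() - interval '30 days'
"""
```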
Johnny Goodnow (17:30):
This process generally continues upstream. So we look at products; purchases is also directly upstream of SKUs, so we grab some data from that. This continues all the way up until we've reached what I like to call the top of the table graph. These are going to be tables that have nothing referencing them, and they're a safe place to essentially start the downstream pass. It's okay if these tables have no data in them, but when we go from left to right, we have to make sure every subsequent table has the data it needs.
Johnny Goodnow (18:14):
This is the downstream aspect of it. What we do is gather the necessary information to support a row. Looking back at this, a product must have an SKU associated with it, and this is the arrow pointing from left to right, from products to SKUs. Without an SKU, particularly if this relationship had a NOT NULL constraint, the row wouldn't actually be valid. You need to make sure that SKUs are imported for every product that you import.
Johnny Goodnow (18:47):
And so this is one of the reasons why we go upstream first to grab any optional data, and then go downstream after that. We basically fill in the gaps: if any data we grabbed previously references something that isn't already in the destination, we go ahead and grab it along the way.
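Here is a very stripped-down sketch of that downstream idea in code, assuming in-memory dictionaries stand in for the subset and the source, and that a foreign-key map like the one earlier is available. It is only meant to show the gap-filling logic, not Tonic's actual implementation.

```python
def downstream_pass(tables_children_first, subset, source, foreign_keys):
    """Ensure every row already in the subset has the rows it references.

    subset / source:  {table_name: {primary_key: row_dict}}
    foreign_keys:     {(child_table, fk_column): parent_table}
    Processing children before their parents lets newly imported parent rows
    get their own references filled in when their table comes up later.
    """
    for child in tables_children_first:
        for row in list(subset.get(child, {}).values()):
            for (table, fk_column), parent in foreign_keys.items():
                if table != child:
                    continue
                parent_key = row.get(fk_column)
                if parent_key is None:        # optional reference left empty
                    continue
                # Required parent row missing from the subset: copy it from the source.
                subset.setdefault(parent, {})
                if parent_key not in subset[parent]:
                    subset[parent][parent_key] = source[parent][parent_key]
```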
Johnny Goodnow (19:08):
Once all of these requirements have been met, the subset is complete and has referential integrity. So we start at the top, with favorites and this other unnamed table. And you can see when we get to reviews, it's in a different color. What this could mean is that at some point we may have grabbed rows in the favorites table that reference reviews that weren't already in the reviews table.
Johnny Goodnow (19:36):
This is all from the perspective of the destination, the subset that we're creating. So what we do is check the source: oh, were there reviews that were not already imported? Let's go ahead and import them now, in order to ensure that we're actually maintaining this referential integrity. This continues downward, and so you could have some cases where, for example, the purchases table may already have all of the records it references present, and you don't need to grab additional data to satisfy these referential checks. But for the products table in this case, we didn't have all the data, so we grabbed a little bit more data from the source to fill in those gaps.
Johnny Goodnow (20:21):
And we continue this down all the way to the targets, and actually all the way through the rest of the connected tables in the database. This is to ensure that, again, at every single relationship, we go ahead and import the data that needs to be imported, and no foreign key dependencies are broken at the end.
Johnny Goodnow (20:44):
Some of our differentiators in making this work: we're able to create a single subset from multiple different data stores. You can have your database be highly sharded, you can have it on different hosts, et cetera. As long as the identifiers can generally be matched between each other, we can go ahead and subset off of those. We also allow users to have multiple targets in their database. It doesn't necessarily just have to be SKUs or products. We could have 10% of SKUs, and we also want reviews from the United States, because both of those use cases are useful for us. And Tonic handles the logic to, again, both grab optional data upstream and the necessary data to fill everything out downstream.
Johnny Goodnow (21:38):
And we also provide a bunch of other advanced features, including handling for reference tables. We have the ability to filter data so that you can control how large the subset ultimately ends up being. And we also have the ability to add additional data relationships. Not everyone has their relationships expressed as foreign keys in the database, and you could have two tables where the relationship between them only exists in app logic. So we have the ability to let Tonic know that a relationship exists between two tables and how to go ahead and follow it.
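For a sense of what "letting the tool know about a relationship" amounts to conceptually (the structure below is hypothetical, not Tonic's configuration), it is just one more edge added to the same relationship map the subsetter already walks:

```python
# Relationships the database itself declares as foreign keys...
relationships = {
    ("reviews", "product_id"): "products",
}

# ...plus a relationship that only exists in app logic: orders are joined to
# skus on a column with no declared foreign key (hypothetical table/column).
relationships[("orders", "sku_code")] = "skus"
```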
Johnny Goodnow (22:21):
Ultimately, I think probably the biggest thing that Tonic provides is that our entire privacy-preserving feature set is available for the subsets. That means that not only do you have the benefits of shrinking this thing down, but you also have all the benefits of removing any private information that exists. So you can mask identifiers, you can mask email addresses, names, et cetera, all while having a smaller subset. It's very useful for creating a good test and staging environment where you don't want users to be able to trace records back to the original production dataset, for user privacy, purchase privacy, et cetera.
Johnny Goodnow (23:07):
Just to give a specific example of this, we actually do a lot of subsetting over at our customer, eBay. Notably, eBay has eight petabytes of data across a vast number of data sources. It's really important for them to be able to have high quality staging data, and it's something that they've struggled with for a really long time across all of their dev and QA teams. And so Tonic is essentially used there to aggregate data, create subsets, and then dynamically run these off of continuous integration systems, kicking off jobs regularly when they want to import more data into staging.
Johnny Goodnow (23:49):
It saves a significant amount of time, and for all of the domain teams that work there, it has generally made their lives a lot better when it comes to testing, without needing to worry about a lot of the security requirements that they have for production. They're just free to use the staging data as is, and they don't have to worry about having anything sensitive in it. This includes their entire 4,000-plus developers worldwide. We're continuing to build this out at eBay to this day.
Johnny Goodnow (24:23):
You can also see this in various places. I would say that subsetting is probably one of Tonic's big core capabilities. Recently, we were listed in Fast Company's most innovative small companies, and a huge part of that is our ability to subset, in addition to some of our data synthesis work as well, in case you want to grow the database. I would say it's a key differentiator compared to a lot of products that we've seen. Notably, one of the things that caused us to build this in the first place was our own need to actually work with subsets, in addition to the fact that we had a really hard time finding other tools that did this in a way that was useful and easy for folks. Yeah.
Johnny Goodnow (25:09):
So, what's next? We're definitely looking to make the feature better for users. We want it to be faster, and we want it to be better suited for our customers' use cases. All datasets can be highly contextual based on whatever domain that data is in, so we're really making sure that we can support a wide variety of interesting schemas and designs and have a very flexible tool, while also minimizing the amount of configuration necessary to get it to work. We want this to be pretty easy to use. We're also looking to continue to expand our technology to even more of our supported data stores. This definitely gets more interesting in the NoSQL world, but we've already done a ton of work in SQL at this point.
Johnny Goodnow (25:58):
I'm happy to talk about specific databases. And yeah, that's pretty much it. If you have any interest in booking a demo, want to look into this more, or want to hear more about how this can work for your use case, you can visit our website, where there's a large "Book a Demo Now" button, and you can also email us as well.
Johnny Goodnow (26:26):
Oh, one more thing. I want to show a very brief demo of what this looks like in Tonic. Just for context, this is a very simple database. And really, all the subsetting comes down to is what we've explained already, where you say, "Hey, I would like 1% of the products table." Those are going to be the tables that everything is ultimately referenced against. After running a data generation, you can see we have some of the upstream and downstream tables here, and you'll find that the output dataset is a lot smaller. Products, for example, is a good example, where originally we had many more products than exist here.
Johnny Goodnow (27:27):
In fact, in the original dataset, you can see that the original count of products was 2,500, and wholesale, a related table to products, had 7,000 rows. If we query the output, let me run this saved query again, it went down to 29 and 56, respectively. Tonic is really good at letting people know if something went wrong, or if the referential integrity wasn't able to be maintained for one reason or another, but in this case it worked out just fine. So yeah, happy to show more of this in future demos, et cetera. Feel free to sign up for one.
Chiara Colombi (28:24):
Johnny, could you provide a little bit of context, quickly for people who might not be so familiar with how Tonic is set up? You were showing us Tonic and then you jumped into your database. Can you just explain where Tonic sits in the process?
Johnny Goodnow (28:36):
Sure. Absolutely. So the way that Tonic works is that you provide a source database, which is often going to be the production database. In this case, I have this subset test set up. And then you provide an output database, where it's ultimately going to send the subset. In the database view here, on the left, we have all of the tables that are present as well as all of the columns. This makes it very easy to configure things if I want to protect certain types of information. You can actually see here, for example, that I've set up a bunch of generators. So you have customers' first name, last name, gender; these are all masked by default. Likewise, for products, the product name has been changed. And what I jumped to earlier was just the view for it.
Johnny Goodnow (29:31):
And so this is what the original data looks like for products. You can see there's Nike, Adidas, et cetera, but after the masking is performed, these all end up being scrambled. So the table view lets you see a bit of a snippet of what the table looks like. The second tab that I have showcases the output database. This has no generators on it right now, and you can see this is the output: all of the product names that were originally here are now masked. So even though they had all these names, Nike, Adidas, et cetera, in the subset we came out with it's all masked as well, and with a low number of rows, so you get the best of both worlds.
Chiara Colombi (30:18):
That's great. Thank you. That was a fantastic presentation. We do have a bunch of questions. So I want to jump over to those. The first question was, just a simple one, to talk a little bit about the databases that we support.
Johnny Goodnow (30:31):
Yeah, certainly. Out of the box, we support most SQL-based databases. This includes Postgres, MySQL, SQL Server, and Oracle. I believe we support Db2 now. I might be missing another. I think we'll have Redshift very shortly. And this is an ongoing thing. We're continuously building this out, and we're also looking into supporting this with Mongo in the future as well.
Chiara Colombi (31:05):
Yep, great. Then another question, how does our commercial subsetter differ from our open source subsetter? For those who might not know about it, we have an open source subsetter called Condenser, if you could share about that.
Johnny Goodnow (31:19):
Yeah, certainly. So the main differences are, one, Condenser doesn't handle the scale that Tonic does. It's useful when you have a smaller database on the order of a couple of gigabytes, but it gets a little unwieldy in how long it can take to actually generate the subset if you have a lot of data. The second thing is that it doesn't include any masking, so the ability to get rid of all this private information. The open source tool just shrinks down the database, so you have to handle that yourself as well. In addition, it also doesn't reflect some of the latest innovations in Tonic with regard to the algorithms it uses. Notably, this means that in certain cases it doesn't actually grab all the data that you might be expecting, especially with regard to optional data.
Chiara Colombi (32:16):
Yeah. That's probably a good segue into this other question. Someone says: I saw the announcement in your product news email yesterday about how your subsetter now uses a Full Algorithm approach. Could you explain what that means?
Johnny Goodnow (32:31):
Yeah, sure. So this is actually one of the differences between Condenser and the subsetter in Tonic: the original algorithm really tried to optimize for minimizing the amount of data grabbed, especially the optional data, because we didn't want to grow the subset too much. But some of the characteristics of this algorithm meant that certain tables might end up being empty, because that data is optional and the subset is still technically valid.
Johnny Goodnow (33:02):
And so after working with a lot of our customers and really seeing some of these cases come up over and over again, we developed a new algorithm. It's called the Full Algorithm, which essentially means that all connected tables, following all of the dependencies, will actually end up getting subsetted. Some of them may pull very little data, but what we found is that it has a lot more utility to make sure that there's a little bit of data everywhere it's actually necessary for the use case. It doesn't necessarily grab data from every table in the database, but in pretty much most cases that we've run into where people expect to see data, there actually ends up being data in the subset.
Chiara Colombi (33:57):
As a follow-up to that, I think it'd be good to highlight the way that Tonic allows you to make sure that a table is not in the subset. Could you talk about truncating tables, so people know that you can still customize your subset to exclude things?
Johnny Goodnow (34:14):
Yeah, certainly. So even in the product, there are a couple of interesting things. If we're looking at, for example, the database view here: by default, the table mode is masked, which means that it's going to use our generators to remove a lot of the private information that exists. But you also have the ability to set these to, for example, truncated. What that means is that the table schema will be copied over, but none of the data will be present. So for this customers legacy table, that might be a good reason to go ahead and truncate it. Likewise, we also have some other options available. We have, for example, this option to include tables that fall outside of the subset. What this means is that if you decide to subset the products table, the customers legacy table doesn't actually have a relationship to it.
Johnny Goodnow (35:13):
And so by default, that table is going to be truncated. But let's say for whatever reason you wanted it to just be copied over directly and you wanted to still include that data in the subset, you have the ability to just check this button here. Likewise, if you change the target to customers legacy, et cetera, this preview is updated to give you an idea of which tables are connected and will actually end up in the subset, and which ones will just end up getting truncated. So there are a lot of options to ensure that you're getting the subset that you want, especially if you include multiple targets.
Chiara Colombi (35:55):
Great. Thank you. I've got one more question here. Someone asks, I would imagine foreign keys play an important role in subsetting using your approach, what if I don't have foreign key relationships set up in my database?
Johnny Goodnow (36:08):
Yeah, this is a really good question. By default, Tonic allows any user to augment the relationships that exist in the database. It will use both the relationships in the database as well as anything else you tell it about. And so, for example, what we have here is a foreign key tool. This can be done manually by hand, but we find it's a lot easier to use the tool in the product. The idea is that you can add additional relationships. So for example, you could find that there is a birthday column that exists in three tables, customers legacy, customers marketing, et cetera. You could decide that these are all foreign keyed to the employee birthday, or something like that.
Johnny Goodnow (37:01):
And that would essentially allow you to add those relationships to the database as well. We've had some customers where they've had to add a ton of them very quickly, and we find that matching by name honestly just saves a ton of time. It's usually an easy way to go about it. And one of the benefits is that Tonic has a system for handling cycles, so even if you introduce a cycle this way, we still make sure that the subset is properly processed.
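Just to illustrate why matching by name saves time (this is a toy heuristic, not the tool's actual logic), a pass over the schema can propose a relationship wherever a column name looks like another table's key:

```python
def suggest_relationships(schemas, key_column="id"):
    """schemas: {table: [column, ...]}. Propose (table, column) -> referenced_table
    wherever a column looks like '<other_table_singular>_id'. Illustrative only."""
    suggestions = {}
    for table, columns in schemas.items():
        for column in columns:
            if not column.endswith("_" + key_column):
                continue
            candidate = column[: -len("_" + key_column)] + "s"   # crude pluralization
            if candidate in schemas and candidate != table:
                suggestions[(table, column)] = candidate
    return suggestions

print(suggest_relationships({
    "products": ["id", "name"],
    "reviews": ["id", "product_id", "body"],
    "favorites": ["id", "product_id", "review_id"],
}))
# {('reviews', 'product_id'): 'products', ('favorites', 'product_id'): 'products',
#  ('favorites', 'review_id'): 'reviews'}
```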
Chiara Colombi (37:35):
Okay. That's all the open questions I have, currently. Does anyone else have any questions, comments, anything at all?
Chiara Colombi (37:46):
Okay. If not, I guess we'll wrap things up for today. Like we said, feel free to reach out to us if you have more questions, hello@tonic.ai. Hop on our website, you can book a demo. We've got those Amazon gift cards ready to go and thank you for joining us.
Johnny Goodnow (38:06):
Yeah. Thanks so much, everyone.
Chiara Colombi (38:07):
Thank you, Johnny.
Johnny Goodnow (38:07):
Yeah, thank you.