Join us for a live demo/Q&A of Tonic's latest feature releases.
Tonic has expanded its list of integrations to include DB2, BigQuery, Databricks, and Spark on Amazon EMR, further enabling you to operationalize our technology in a multi-database environment. Not only does Tonic integrate with roughly a dozen databases now, it allows you to work across them, regardless of the type, to best reflect real-world data landscapes and maintain consistency between input and output when needed.
The database-agnostic nature of Tonic and the ability to connect directly to your database, as opposed to a data-upload approach, are two of our key differentiators as compared to other tools currently available.
Our subsetter now features a Full Algorithm approach to better capture the nuances of your subset targets across your database. It’s worth noting that our subsetter is fully integrated within our data generation platform; it isn’t a separate product or add-on. We’ve purposefully designed subsetting and data de-identification to work side-by-side in Tonic so that they take place in the same job. Why?
Chiara Colombi (00:04):
Hello everyone. Thank you for joining us for our spring product launch. I'm Chiara, I'm on the marketing team, and with me are Omed, also on marketing, and Kasey from product management. Today, we're going to be demoing some of the latest features from Tonic. We've got some new capabilities around subsetting, we've got some enhanced workflow features. Yeah, a lot of great stuff we're excited to show you. And we really want today's session to be more of a conversation, so interrupt us with your questions. You can put them in the Q&A, you can raise your hand, we'll unmute. We just want to kind of chat about what's going on at Tonic and share it all with you. So with that, I'm going to pass the mic over to Kasey.
Kasey Alderete (00:38):
Sure. So just to introduce myself, I'm Kasey Alderete, I'm the product manager at Tonic, and I'm always interested in what's coming next. So I'm super excited about what we have, but also curious to hear your questions, like where we're pushing the product, anything else that you'd like to see the next time we come out with a launch. So today, the features that we are going to highlight are right here for you. Our two biggest areas of headline features that we're excited about are our expanded database support, which I'll get into later, as well as how we're continuing to advance our subsetting capability, which really sets us apart from a lot of other tools on the market. And then as we have time, we'll go through some of the other generators and privacy and utility features that we've added to make your data more realistic.
Kasey Alderete (01:36):
So to start, wanted to highlight where we've added additional database support. I probably can't do this from memory, but the relational databases that we support now include Postgres, MySQL, SQL Server, Oracle, DB2. We've expanded beyond relational databases to data warehouse use cases. So we do Spark on Amazon EMR, that's for S3 and Hive. We also support Databricks and BigQuery. On the relational side, that means that we support, out of the box, over 80% of the relational database market. So these databases that we support are most likely going to fit your use case and your environment. I will add as a sneak peek preview that we are adding database support for documents. So in MongoDB, as well as some expanded data warehouse use cases that are going to be really exciting, that we'll talk about in a separate launch.
Omed Habib (02:37):
And it also might be worth noting that we do get requests all the time for new and other unique side case databases that you might be using internally, so if you have a request for a database, you can reach out to us, primarily Kasey, or you could also chat in the Zoom chat panel if you'd like. Regardless, we do take requests for new databases all the time.
Kasey Alderete (03:04):
Yeah. And part of the advantage of this is, again, just mimicking your actual environment. So Tonic natively connects to your databases, which is a real benefit because we become just part of your CI/CD pipeline. We're just a part of your overall development process, so that for data going from production to staging, it's automatically de-identified as a part of your normal course of refreshing your data. It's already available to your developers who have access to staging. It's a single pane of glass, no matter your database, that you're de-identifying. So if you have Postgres over here, you have many Postgres instances, you also have some Oracle, you can do all of that from one pane of glass within Tonic. And not only do we de-identify to keep that whole database intact, so that web of relationships between your tables, your columns, but we can also keep that consistency across databases, and even databases of different types, so we can have consistency where one input always equals one output, like a customer ID. So of course you want that within the same database, so that the lookup tables stay intact, but we can also do that across the databases.
Chiara Colombi (04:24):
Yeah, you just answered a question that came through, which was "What if I have different types of databases? Can I work with Tonic across those different types, keeping the data consistent," so that's great. Thanks.
Kasey Alderete (04:34):
Yeah. And so we offer... Tonic operates in place, so if we're in the Postgres environment, we're going to be de-identifying from Postgres to Postgres, but then we can use these seeded values to keep consistency as you move across more of your architecture.
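Editor's note: the "one input always equals one output" idea can be illustrated with a keyed hash. This is a minimal sketch under stated assumptions, not Tonic's implementation: the `SEED` value, the `consistent_id` function, and the sample customer IDs are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret that plays the role of the "seed" in this sketch.
SEED = b"example-seed"


def consistent_id(value: str, seed: bytes = SEED, digits: int = 9) -> str:
    """Map an input to a stable numeric surrogate using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(seed, value.encode("utf-8"), hashlib.sha256).digest()
    number = int.from_bytes(digest[:8], "big") % (10 ** digits)
    return str(number).zfill(digits)


# The same customer ID as it might appear in two different databases.
postgres_customer_id = "C-1001"  # value read from a Postgres table (made up)
oracle_customer_id = "C-1001"    # same customer in an Oracle table (made up)

masked_pg = consistent_id(postgres_customer_id)
masked_ora = consistent_id(oracle_customer_id)

# Same seed and same input give the same output, so joins across systems survive.
print(masked_pg == masked_ora)
```

Because the mapping depends only on the seed and the input, it does not matter which database the value was read from.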
Omed Habib (04:52):
Yeah. It's also cloud agnostic. So if you're deploying across different clouds, Azure, GCP, AWS, or even your own private cloud, all you really need is just the server, database, port, username, password, and you can connect to your databases.
Kasey Alderete (05:14):
Okay. So I'm going to move on to our subsetting. One of the biggest upgrades that we've made to our subsetting is to the algorithm that's used under the hood, so there's not a whole lot to show you here. I will get into showing some of the features, but what's really cool is that the algorithm that we're using now better matches the use cases that we're seeing for subsetting. So if you take a step back and think about what subsetting is good for, it's for reproducing a use case, an end-to-end scenario, but in a local environment. And so we found that, while subsetting is all about getting a smaller version of your dataset, we still want a very broad view, a very broad slice, so we want it to be wide across all tables, but we don't want it necessarily very deep. So what our full algorithm does is it will actually traverse all of the relationships, no matter whether it's what we call an upstream or a downstream relationship.
Kasey Alderete (06:13):
So you take an example like customers. Maybe orders points to customers, and customers points to addresses. So in the classic algorithm that we used before, we would get pieces of all these tables. But you might also have orders points to products, and products has an input table of suppliers. And you might end up with no data in the suppliers table, because this is too far from your target. What we're now doing is in any kind of relationship, we will traverse and grab the relevant piece of data, and that helps your legacy application actually work or function properly. Another improvement that we've made in subsetting is allowing for reference tables. So you can imagine a table of states. And if we grab a portion of our customer database, and we only grab the related part of states, we might only get 30 states instead of 50. And your application just might not make sense. There might be things that aren't working well if you don't have that full list populated. So we also have added the ability to pass through reference tables in full, as a part of creating a meaningful subset.
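Editor's note: the traversal described here can be sketched on a toy schema. This is an illustration under assumptions, not Tonic's actual algorithm: the tables, the `FOREIGN_KEYS` list, and the two-phase strategy (first pull rows that point at the target, then pull everything those rows point at) are hypothetical choices made to keep the example small.

```python
from collections import defaultdict

# Hypothetical in-memory schema; every row has an "id" primary key.
TABLES = {
    "customers": [{"id": 1, "address_id": 10}, {"id": 2, "address_id": 11}],
    "addresses": [{"id": 10}, {"id": 11}],
    "orders": [
        {"id": 100, "customer_id": 1, "product_id": 500},
        {"id": 101, "customer_id": 2, "product_id": 501},
    ],
    "products": [{"id": 500, "supplier_id": 900}, {"id": 501, "supplier_id": 901}],
    "suppliers": [{"id": 900}, {"id": 901}],
    "states": [{"id": "CA"}, {"id": "NY"}, {"id": "TX"}],
}

# (child_table, child_column, parent_table): the child column references parent.id
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers"),
    ("customers", "address_id", "addresses"),
    ("orders", "product_id", "products"),
    ("products", "supplier_id", "suppliers"),
]

# Reference tables are copied in full regardless of the target.
REFERENCE_TABLES = {"states"}


def subset(target_table, target_ids):
    selected = defaultdict(set)
    selected[target_table] = set(target_ids)

    # Phase 1: pull in child rows that point at already-selected rows.
    changed = True
    while changed:
        changed = False
        for child, column, parent in FOREIGN_KEYS:
            for row in TABLES[child]:
                if row[column] in selected[parent] and row["id"] not in selected[child]:
                    selected[child].add(row["id"])
                    changed = True

    # Phase 2: pull in parent rows that any selected row points at.
    changed = True
    while changed:
        changed = False
        for child, column, parent in FOREIGN_KEYS:
            for row in TABLES[child]:
                if row["id"] in selected[child] and row[column] not in selected[parent]:
                    selected[parent].add(row[column])
                    changed = True

    # Pass reference tables through in full.
    for name in REFERENCE_TABLES:
        selected[name] = {row["id"] for row in TABLES[name]}

    return {name: sorted(ids) for name, ids in selected.items() if ids}


result = subset("customers", [1])
print(result)
```

In this toy run, targeting customer 1 also brings along that customer's order, the ordered product, and the product's supplier, while the states table is copied whole.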
Kasey Alderete (07:39):
One thing I wanted to point out that I think will kind of help us transition, is the foreign key tools. So I'm going to actually... Oops. Yeah, I'm going to skip ahead here to the foreign key tool, because I think this follows on from our subsetting capabilities really well. And what this does, is it allows you to declare foreign key relationships in your database that don't exist in a schema. So why does this happen? Sometimes application logic will be using data in a particular way that is an implicit foreign key, but your database is not being used to enforce that. Sometimes it's just the nature of working in a distributed environment. You have a DBA team that's different from your application developers, and they may not always stay in sync about what is the application doing, and how do we make this data meaningful? So our foreign key tool-
Omed Habib (08:37):
I think there was an example of a customer that you spoke to recently, right? Where they actually said that, "Yeah, we actually don't want to enforce the foreign key on some tables." And I think the reason why, was because then you would have to... If you enforce the foreign keys, then your queries are going to also have to follow suit, whereas you have the freedom to not have to write queries that are enforcing the foreign keys. So you have a little bit of freedom. The downside here then, is that if you don't have the foreign key stated in your database, then Tonic wouldn't know, or before it wouldn't know, which tables are related. Now it does, and I'll let Kasey show you what it looks like.
Kasey Alderete (09:18):
Yeah. And I think because there's such a de-centralization of power, away from maybe centralized DBA functions, developers are empowered in the applications to do what they want with the data, and they're kind of used to that freedom. And now with Tonic, we're giving you the ability without being a DBA and having DBA access, you're not changing anything in production, you can create these foreign key relationships. So I'm going to-
Omed Habib (09:43):
Not to go on a tangent here, but it's... Sorry to cut you off, it's kind of a cultural thing too. You always have developers trying to do one thing, and then you have the operators trying to do something else, right? There's this change control. This control to the chaos and the chaos to the control always happening. Anyway, just thinking about how developers always want freedom. They always want access to X, Y, and Z. Obviously they can't have access to production, and Tonic enables you to be able to provide that access. A good example here is, if I have a bug and I want to diagnose the bug, the subsetting capabilities give me the ability to be able to replicate my database environment with just a subset of the data that I need to be able to diagnose that bug without full access to my millions of production records, which obviously is probably illegal in some cases, and unrealistic in others. Anyway, I digress, but there's a common theme here between developers using Tonic and getting value from it, but also operators using Tonic and getting value from it.
Kasey Alderete (10:52):
Well, I think it just speaks to the messiness of schemas, and how a lot of times, people don't even know their schemas, and we get comments all the time that when they're using Tonic, they actually realize what the web of relationships was. And that was part of the reason for that subsetting update, was that people didn't necessarily know how these tables were related without Tonic. And so Tonic kind of exposed that they didn't know how the traversal of keys would work. So we're giving them that visibility and that access for those use cases.
Kasey Alderete (11:24):
So the way that we earlier populated foreign keys was purely based on your schema. So you can see these are locked because they're being inferred from the schema. These are foreign key, primary key relationships that exist in the database. So now, I can point and click, become a DBA by adding a key here. So let's say I want the customer key and the marketing key, let's say. Actually, let's do the legacy and the marketing, and I'm going to relate them to the customer key. So I'm actually going to create two foreign keys right here. And if I go back over here, you can see these two new relationships that I've added. And there are two primary pieces of functionality that this drives in Tonic. So one is that, now that we have this customer key and marketing key linked up, we want to make sure that as we de-identify them and apply generators to protect the privacy of these keys, maybe that key is a social security number, that we still... We do that consistently, so that they're still linked, and those relationships can be traversed.
Kasey Alderete (12:34):
So when we apply generators... So if I go to this customer key, let's say I just apply an integer key here. When I go now to the marketing table, you can see this customer key column is grayed out because there's a foreign key here now. So I'm not going to accidentally de-identify these in different ways that no longer preserve that referential integrity. So this is one thing that Tonic's doing under the hood. The other thing, as I mentioned with subsetting, is that when I go get the customer table, you can see that one of the tables coming out in that customer subset is marketing, because there's a relationship with marketing. And part of that's because of that key that I've identified that keeps that relationship. So it knows to go grab the relevant data from the marketing table per the customers that we've selected in the subset.
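Editor's note: the "grayed out" behavior, where a foreign key column inherits whatever generator is applied to the key it references, can be sketched as follows. This is an illustration under assumptions rather than Tonic's internals: `mask_key`, `VIRTUAL_FOREIGN_KEYS`, `apply_key_generator`, and the sample rows are hypothetical.

```python
import hashlib


def mask_key(value):
    """Hypothetical key generator: a stable surrogate derived from a hash of the value."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big")


TABLES = {
    "customers": [
        {"customer_key": 1, "name": "A"},
        {"customer_key": 2, "name": "B"},
    ],
    "marketing": [
        {"customer_key": 1, "campaign": "spring"},
        {"customer_key": 2, "campaign": "summer"},
        {"customer_key": 1, "campaign": "fall"},
    ],
}

# Relationships declared by the user, not present in the database schema.
# (child_table, child_column, parent_table, parent_column)
VIRTUAL_FOREIGN_KEYS = [("marketing", "customer_key", "customers", "customer_key")]


def apply_key_generator(tables, parent_table, parent_column, generator, foreign_keys):
    """Apply one generator to a key column and to every column that references it."""
    columns = {(parent_table, parent_column)}
    for child_table, child_column, p_table, p_column in foreign_keys:
        if (p_table, p_column) == (parent_table, parent_column):
            columns.add((child_table, child_column))
    return {
        name: [
            {col: generator(val) if (name, col) in columns else val for col, val in row.items()}
            for row in rows
        ]
        for name, rows in tables.items()
    }


masked = apply_key_generator(TABLES, "customers", "customer_key", mask_key, VIRTUAL_FOREIGN_KEYS)
parent_keys = {row["customer_key"] for row in masked["customers"]}

# Every marketing row still points at a customer that exists after masking.
print(all(row["customer_key"] in parent_keys for row in masked["marketing"]))
```

Since the referencing column never gets its own independent generator, the two columns cannot drift apart.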
Omed Habib (13:34):
Hey, Kasey, we just had a question that just came through, someone's asking, "What's a generator?"
Kasey Alderete (13:37):
That's a good question. So a little bit about Tonic just to go back, the simplest way to understand generators is that that's the algorithm that you're using to de-identify your data. So in this example, we have last name right here. I have a last name generator turned on. What this will do is, actually look... takes the input last name, it looks up in a dictionary that Tonic has provided, to get a different last name that's completely made up. It's not in your data set at all. So if I turn on and off preview here, you're going to see that change. So right here we have Cregger, that's the de-identified protected version. But my raw data, this customer was Alexandra Sanford. So when I turn on preview, you can see what generators I have applied, and how the data will be scrambled.
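Editor's note: a dictionary-lookup generator of the kind described can be sketched in a few lines. This is a simplified illustration, not Tonic's last name generator: the `LAST_NAMES` list and `fake_last_name` function are hypothetical, and a real dictionary would be far larger and would need handling for the case where the replacement equals the original.

```python
import hashlib

# Hypothetical replacement dictionary; a real one would hold many more names.
LAST_NAMES = ["Cregger", "Alvarez", "Okafor", "Lindqvist", "Tanaka", "Moreau"]


def fake_last_name(real_name: str) -> str:
    """Pick a replacement name from the dictionary, stable for a given input."""
    digest = hashlib.sha256(real_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(LAST_NAMES)
    return LAST_NAMES[index]


raw = "Sanford"
print(raw, "->", fake_last_name(raw))
```

Deriving the index from a hash of the input means the same raw name always maps to the same replacement, which is what makes a preview toggle meaningful.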
Omed Habib (14:33):
Yeah. And if you have more questions about what different data types we support, you can always check out docs.tonic.ai. I think we're at over 40 different generators now. And if there's a particular data type that you have in your organization that we don't support, it takes us anywhere from a couple of days to a couple of weeks to build one out for you.
Kasey Alderete (14:58):
Yeah. And I'll highlight that a lot of times, the simplest way to understand generators is within a column, but a lot of the power of Tonic comes from linking generators across columns. So we know in our data, there's a relationship between, let's say, annual income and your largest bill amount. I mean, for one, you don't want a largest bill amount that's higher than someone's annual income. They're probably not paying more than a year's salary on just an order with you. But also the people with higher incomes are probably spending more at your store. So they're going to have higher largest bill amounts. So we know there's a relationship there, and we can do a lot of linking of generators too. So I'm going to come back over here...
Chiara Colombi (15:44):
I think that we can jump into some of the new generators that we've released.
Kasey Alderete (15:48):
Yes. So as Omed said, not only do we customize for customers, given your particular data set and your policies and requirements that you have to adhere to, but we're continuously adding to our out-of-the-box set too. So some interesting ones, I think, are the HIPAA address and HIPAA birth date as well. So there are some allowances in the HIPAA regulations that allow you to preserve some of the realness of the data, without preserving the granular detail. And so, in birth date, what that does is it truncates to the birth year, except for people above a certain age, who need to be removed from the dataset because they're so infrequent that they're identifiable. On the address side, it's very similar, and that's what we've added more recently, is that you can be precise to the first three digits of a U.S. zip code, while putting zeros for the last two digits. And so that allows you to keep that similar geographical distribution of your patients, of your customers, for example, without exposing exactly where they reside.
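Editor's note: the two generalizations described here can be sketched as plain functions. This is a minimal sketch, not Tonic's HIPAA generators and not compliance guidance: the function names are hypothetical, and `AGE_CUTOFF = 90` is an assumption chosen for illustration, since the speaker does not state the threshold.

```python
from datetime import date
from typing import Optional

AGE_CUTOFF = 90  # assumed threshold, for illustration only


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits of a U.S. zip code and zero out the last two."""
    return zip_code[:3] + "00"


def generalize_birth_date(birth_date: date, today: date) -> Optional[int]:
    """Return only the birth year, or None for people at or above the cutoff age."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    if age >= AGE_CUTOFF:
        return None  # the caller would drop or otherwise suppress this record
    return birth_date.year


print(generalize_zip("94107"))
print(generalize_birth_date(date(1985, 6, 15), today=date(2021, 4, 1)))
print(generalize_birth_date(date(1925, 1, 1), today=date(2021, 4, 1)))
```

Passing `today` explicitly keeps the function deterministic, which makes the age rule easy to test.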
Omed Habib (17:05):
Yeah, and it's probably worth noting that the generators aren't creating the data from scratch, and this is one of the biggest differentiators that Tonic has on the market, is that Tonic is designed and built to mimic your production database. So what you ultimately end up getting is a dataset that you can test with, that looks, acts, and feels like production, because it's essentially made from production.
Kasey Alderete (17:32):
Right, and it's... I kind of like to think about it as being safe enough, you don't need it to be completely just X's, but you do need to comply with whatever rules and privacy protections you have for your data. But then you also want it real enough to make sense and to work for your development use cases. So another one that we've added is expanding our differential privacy support. So differential privacy is a mathematical standard. That means it's provably unable to be re-identified from the data set. So we previously had it in generators, like our categorical generator. The categorical generator shuffles values, while preserving the overall ratio. You can think of males and females, or job titles, and it's just going to kind of shuffle which rows those values are in. Differential privacy on categorical means that, if there's an outlier, if I have a bunch of job titles in my customer data, but there's only one CEO, that might be able to be re-identified, someone would know that that was in the dataset. And so, differential-
Omed Habib (18:43):
Or you might have income, right? And then you have Elon Musk in there with a billion dollar salary, and that stands out.
Kasey Alderete (18:50):
Right, and that's where we've expanded it beyond just categories, but actually into these amounts like you're talking about. So we can change $1 million, or $10 million, $100 million, to a different number that preserves the statistical relevance of the dataset without that single value being able to be identified.
Kasey Alderete (19:13):
So maybe instead of having $100 million salary, now we have two $50 million, so we're still getting maybe some of the overall kinds of values or statistical meat, I guess, of it.
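Editor's note: two ideas from this exchange can be sketched with the standard library. The first function shuffles a categorical column so the overall ratio is preserved. The second adds Laplace noise to a category count, which is a textbook differential privacy mechanism and is shown here only as an assumption about the general approach; it is not a description of how Tonic implements differential privacy. All names and data are hypothetical.

```python
import random
from collections import Counter


def shuffle_category(values, rng):
    """Reassign values to rows at random; the count of each category is unchanged."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def laplace_noise(scale, rng):
    # The difference of two exponential draws with rate 1/scale is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(values, category, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1, so scale = 1/epsilon)."""
    true_count = sum(1 for value in values if value == category)
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)
job_titles = ["Engineer"] * 5 + ["Analyst"] * 4 + ["CEO"]

shuffled = shuffle_category(job_titles, rng)
print(Counter(shuffled) == Counter(job_titles))  # ratios are preserved by the shuffle

# A noisy answer to "how many CEOs?" hides whether the single CEO is present.
print(noisy_count(job_titles, "CEO", epsilon=0.5, rng=rng))
```

Note that the shuffle on its own still reveals that a lone CEO exists, which is the outlier problem the speakers describe; the noise step is what addresses it in this sketch.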
Omed Habib (19:27):
So the obvious use case here is that if your test data unfortunately ends up in the hands of someone that it shouldn't, they can't reverse engineer and go, "Okay, well, I can see that there's one outlier here, and I also know that Elon Musk is a customer of yours. That probably is Elon Musk."
Kasey Alderete (19:47):
Right, and it's like, a lot of times when we think about breaches, we think about external, like malicious actors hacking into your data. But it's also important internally, that you're not sharing with your own employees. Sometimes we've got cases where customers' own employees are also customers. And so it's a little bit of a risk to have that data visible internally as well. Format-preserving encryption, I think that's pretty straightforward. This is an industry standard, but a lot of applications might expect a particular length, or alphabetic text versus numbers, and we have the ability to detect what those are, and to preserve that even as we scramble and make it de-identified.
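Editor's note: the idea of keeping length and character class while scrambling can be sketched as below. To be clear about the assumption, this is not real format-preserving encryption (standardized schemes such as NIST FF1 are reversible ciphers); it is a one-way, keyed substitution that only illustrates what "preserving the format" means. The function name and key are hypothetical.

```python
import hashlib
import hmac
import string


def scramble_preserving_format(value: str, key: bytes) -> str:
    """Replace digits with digits and letters with letters; keep everything else."""
    stream = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for position, char in enumerate(value):
        byte = stream[position % len(stream)]
        if char in string.digits:
            out.append(string.digits[byte % 10])
        elif char in string.ascii_uppercase:
            out.append(string.ascii_uppercase[byte % 26])
        elif char in string.ascii_lowercase:
            out.append(string.ascii_lowercase[byte % 26])
        else:
            out.append(char)  # separators such as "-" stay where they are
    return "".join(out)


original = "AB-1234-xy"
scrambled = scramble_preserving_format(original, key=b"example-key")
print(original, "->", scrambled)
```

An application that validates the field as two uppercase letters, four digits, and two lowercase letters separated by hyphens would accept the scrambled value just as it accepts the original.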
Kasey Alderete (20:45):
Yep. And then this one is more about... Just that ease of use with getting to your purpose-built data set. So whether it's a subset, or whether you're replicating your production database in full, it's really an iterative process. As you identify new fields that need to be de-identified, as you maybe are collaborating with colleagues on what kinds of generators would be appropriate here, the commenting facilitates that. Undo/redo is right there in your environment as you're adding really complex logic into generators. If you accidentally delete one, we wanted you to be able to retrieve that logic, so that really facilitates that.
Kasey Alderete (21:31):
And then job warnings, it might sound maybe a little bit boring, but it's really helpful. Sometimes your data is actually pretty messy. As I mentioned, you might have, for example, in a Postgres database, you have a year over 11,000. You have a birth date that got fat fingered, and it's in the many thousands. And Tonic previously, we might have to fail a whole job because we didn't know how to handle that data type. But with warnings, we're actually able to still replicate that entire dataset, apply all the generators, and we'll actually fail just that single row, and we'll let you know. So we won't copy that row over. We'll hold it, it'll be flagged, the job will complete, you'll just have a warning there.
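Editor's note: the "fail the row, not the job" behavior can be sketched as a loop that collects warnings. This is an illustration under assumptions, not Tonic's job runner: `copy_rows` and the sample rows are hypothetical. It relies on one real fact, which is that Python's `datetime.date` rejects years above 9999, a convenient stand-in for a target type that cannot represent the source value.

```python
from datetime import date


def copy_rows(rows):
    """Copy rows to the output, skipping and flagging any row that cannot be converted."""
    copied, warnings = [], []
    for index, row in enumerate(rows):
        try:
            year, month, day = row["birth_date"]
            birth = date(year, month, day)  # raises ValueError when year > 9999
        except ValueError as exc:
            warnings.append(f"row {index} skipped: {exc}")
            continue  # hold this row back, keep processing the rest
        copied.append({**row, "birth_date": birth})
    return copied, warnings


source_rows = [
    {"name": "row-a", "birth_date": (1985, 6, 15)},
    {"name": "row-b", "birth_date": (11000, 1, 1)},  # fat-fingered year
    {"name": "row-c", "birth_date": (1990, 2, 28)},
]

copied, warnings = copy_rows(source_rows)
print(len(copied), "rows copied,", len(warnings), "warning(s)")
```

The job finishes with the good rows written and a list of flagged rows to review, instead of aborting on the first bad value.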
Omed Habib (22:23):
Can we show how the commenting looks?
Kasey Alderete (22:25):
Yeah. Yeah, and this is actually used by some of our larger customers. So this one I can actually see, there's already a comment right here, so I can log in as an auditor, and maybe I don't have as much knowledge about the generators, but I do know that there's a particular policy that needs to be adhered to, and I can reference that here. Or if I am doing kind of a check, and I think that something's leaked through, I can comment on any particular value here.
Chiara Colombi (23:01):
We've got a question about notifications, both for the job failures and for commenting. How do the notifications come through?
Kasey Alderete (23:09):
That's a good question. We... On commenting, I know that we have settings for notifications, so you can decide if you want to be notified for all of the comments in a particular workspace, or just when you're mentioned. So that's something we do email notifications for. On the job notifications, I think that's something you can set up, but I'd actually have to check. I'm not positive on that.
Chiara Colombi (23:34):
But it would just... It would pop up in the UI, when you're running the job.
Kasey Alderete (23:39):
Yes. I actually have a window right here with some jobs completed. I'm going to show that, pop that in here. So the warnings that I mentioned, those would be available here in the job details, and that would also pop up with an alert that the job completed with a warning. So I don't think I have any warnings here ready, but you can see where I would get notified of the row that failed and was not copied over, but I would not have to put a hold on populating and passing through the rest of the data. I think that's actually most of the features that I want to cover today that we're really excited about.
Omed Habib (24:27):
Yeah. So I wanted to mention earlier, if you're interested in receiving more product updates, we do have an email that goes out, so please visit tonic.ai. We have a subscribe form in the footer you can subscribe to, you can also, of course, request a demo, or request a sandbox. There's a bit of a waiting list right now on the sandbox, but we'll send you an email with some questions about your environment, but you can always request a demo if you want to get immediate access to seeing all the cool features in what Tonic has to offer.
Kasey Alderete (24:59):
Yeah, and everything that I talked about today is already live in the product. Some of the areas that we're continuing to expand in subsetting, we're adding more preview capabilities so that you can see exactly what's going to come back in terms of what slice of my database am I getting back with this subset, and we're providing more of those breadcrumbs to you, kind of upfront, as a part of your workflow.
Chiara Colombi (25:24):
One more thing-
Omed Habib (25:24):
If you're a current customer, obviously, you can check out the change log, which is probably going to be your most bleeding edge source of truth right now. And that's on docs.tonic.ai, but we do roll up the product releases once a quarter. So for example, this one's spring. If you're subscribed, you'll get an email to let you know that we have one coming up, or you can keep an eye out for a blog release, and keep an eye on the Webinars page for upcoming product releases. Our next one's going to, I guess, be summer 2021. Thank you, Kasey, thank you, Chiara, and thank you everyone who joined.
Chiara Colombi (26:07):
Yep. Thanks, that was great Kasey.
Kasey Alderete (26:08):
All right, thank you, bye-bye.