Fake data is about to get VAE smarter. And by VAE, we mean Variational Autoencoder. Intrigued?
Join us for our August launch event as we unveil Tonic's newest features, from Smart Linking (our VAE generator) to webhooks to Google SSO.
Data Scientist Ander Steele, PhD, will be on hand to demo Smart Linking and answer your questions about the latest advances in data synthesis, and Product Manager Kasey Alderete will introduce a number of automation tools to further streamline your integrations with Tonic.
If you've been looking for a data de-identification solution that works for both your development and your data science teams, this is the launch for you. Don't miss it.
Chiara Colombi (00:07):
Awesome. Well, I think we can kick things off. I see a pretty good turnout. Hello everyone. And thank you for joining us for our August launch today. We've been super busy here at Tonic. We have a lot of great new features and UI to share, but before we get into those, I'd like to quickly introduce my colleagues. Joining us today are Ander Steele, our data scientist, who will be introducing Smart Linking, which if you've read the content on our blog, it's our new Variational Autoencoder generator, which basically means it makes use of deep neural networks to generate really high fidelity data for both development and data science use cases. And then we also have our product manager, Kasey Alderete, who will be going over a number of new releases in Tonic, from webhooks to Google SSO. And I'm Chiara Colombi. I'm on the marketing team, always happy to connect with all of you today. I'll also be your question asker.
Chiara Colombi (00:59):
So if you have a question you'd like to ask, you can drop that in the Q and A at any time, you can also just drop it in the chat. I'll look out for them there as well. And I'll ask them of our speakers as they come up. And I've already talked about some of the features that we will be going over today, but as a quick agenda, like I mentioned, first off we'll be talking about Smart Linking, making use of deep neural networks in data synthesis. Then we've got a new Workspace view, which is great for getting full visibility into your Workspaces across Tonic. Webhooks is a new feature that enables triggering actions outside of Tonic. Privacy Hub, which many of you may be already familiar with in our UI, has a brand new look with some great new features in there. And we also now offer Google SSO integration. So now that you've got the agenda, I will pass things over to Ander for Smart Linking.
Ander Steele (01:47):
Great. Thanks Chiara. All right. So here's just a quick view of this new feature. And let me tell you more about this. So Smart Linking is our new generator, which uses deep neural networks to learn the relationships or links between columns in your tables. And the rich expressive power of deep neural networks results in synthetic or mimicked data that preserves the statistical properties of the data. This Variational Autoencoder, in some detail, learns probabilistic encoder and decoder models, which translate your complicated data distribution to a simpler distribution. And now we can sample that simpler distribution, pass the samples through the decoder and get high fidelity synthetic data. And so this data is really aimed at your data teams, which might use this for prototyping dashboards or prototyping machine learning models, or debugging these things, and we'll explore, primarily, the machine learning model use case in this brief case study. So one of the standard benchmarks for synthesizing tabular data is the UCI Adult dataset. This is extracted from the 1994 census and consists of demographic information as a mix of numeric and categorical data.
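Editor's illustration: the encode, sample, decode flow Ander describes can be sketched in a few lines. This is a toy under stated assumptions, not Tonic's implementation: the functions `encode`, `reparameterize`, and `decode` and all of their weights are hypothetical stand-ins for trained neural networks, and the latent space is one-dimensional to keep the example readable.

```python
import math
import random


def encode(row):
    """Toy probabilistic encoder: mean and log-variance of q(z | row).

    Hypothetical fixed weights; a real encoder is a trained network.
    """
    mu = 0.5 * row[0] - 0.25 * row[1]
    log_var = -1.0
    return mu, log_var


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (used during training)."""
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * rng.gauss(0.0, 1.0)


def decode(z):
    """Toy decoder: maps one latent value back to a two-column row."""
    return [2.0 * z + 40.0, -1.0 * z + 38.0]


def sample_synthetic_rows(n, rng):
    """Generation step: sample z from the simple prior N(0, 1), then decode."""
    return [decode(rng.gauss(0.0, 1.0)) for _ in range(n)]


rng = random.Random(0)
rows = sample_synthetic_rows(5, rng)
```

Because both output columns are decoded from the same latent value, they stay linked to each other, which is the property the rest of the talk evaluates.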
Ander Steele (03:16):
This data set consists of 14 features and one label. This label is whether an individual's income in the census is above or below a certain threshold; here, it's 50 K. And this dataset comes pre-split, if you will, into a train and test data set, approximately 32 K records and 16 K records respectively. And so what we've done for this case study is we've run this training data only through the Smart Linking generator, which you can now do in your environments. I'll present you a summary of these results, comparing the synthetic data from Smart Linking with the holdout dataset, which Smart Linking hasn't seen. And just to point out and be clear, most of this analysis was done in Python, outside of Tonic, but we'll just walk through this quickly. And so the first thing I want to show is, how well do we preserve the statistics of individual columns? I hope this graph communicates that our distributions of synthetic data match very well with our distributions of real data.
Ander Steele (04:26):
The continuous variables, which include things like age and this numeric feature of final weight and education number, these all look quite good. And the categorical features show similar distributions. And so great. I mean, this is important, but maybe not the most exciting thing about this data, because we could achieve the same results by essentially randomly sampling each column from your real data, which destroys the links or the relationships between the columns. So let me try and convince you that we've actually preserved the relationships and learned the relationships in the synthetic data, in these next few slides. And so in this first slide, what I've done is drawn the distribution of one of the continuous features here, age, grouped by a categorical feature, in this case relationship. And so on the left we have the real data and on the right, we have the synthetic or Smart Linking data. And you can see from this distribution that we capture the shape of each of these conditional probability distributions, which is great. That's what we'd like to see.
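Editor's illustration: the point about per-column sampling can be checked on toy data. The sketch below shuffles each column independently, which keeps every column's distribution identical but wipes out the relationship between columns. The data is simulated and the column meanings (age, hours per week) are assumptions for illustration, not the UCI Adult data.

```python
import random


def shuffle_columns_independently(rows, rng):
    """Shuffle each column on its own: marginals survive, links do not."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(values) for values in zip(*columns)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
# Simulated rows: the second column depends on the first.
real = []
for _ in range(2000):
    age = rng.randint(20, 60)
    hours = 20 + 0.5 * age + rng.gauss(0, 2)
    real.append((age, hours))

shuffled = shuffle_columns_independently(real, rng)

real_corr = pearson([r[0] for r in real], [r[1] for r in real])
shuffled_corr = pearson([s[0] for s in shuffled], [s[1] for s in shuffled])
```

Here `real_corr` is strongly positive while `shuffled_corr` is close to zero, even though a histogram of either column looks the same before and after shuffling.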
Ander Steele (05:44):
And if you look at just the numeric features, we also see the preservation of correlation. Here these correlograms are showing the Kendall's Rank correlation. So in particular, if you look in the left hand side, we see increases in age, corresponding to increases in hours per week. And we see that relationship preserved in the Smart Linking dataset, which is all great. So we've preserved relationships between the categorical and numeric features, and among the numeric features themselves. But what I want to communicate is how this data can be used to train machine learning models. And so recall, again, this data set consists of 14 features and one label. The label is whether or not the person represented by that row has an income greater than or equal to 50 K. And so we train models to predict this flag on, one, the real dataset and, two, the mimicked Smart Linking dataset. And then we evaluate these two models on the held out data set, which, reminder, Smart Linking has not seen. None of these models have seen it. Okay?
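Editor's illustration: Kendall's rank correlation, used in the correlograms mentioned above, counts how often pairs of rows are ordered the same way in both columns. The sketch below is the simple tau-a variant, which ignores tie corrections; the talk does not say which variant was used, so treat that choice as an assumption.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


perfect = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])
reversed_order = kendall_tau([1, 2, 3, 4], [40, 30, 20, 10])
one_swap = kendall_tau([1, 2, 3], [1, 3, 2])
```

Computing this for every pair of numeric columns on both the real and the synthetic table, then comparing the two matrices, is one way to reproduce the kind of side-by-side view described here.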
Ander Steele (07:04):
So this is an honest evaluation of the performance of these models on new data. So we evaluate the performance of several models using these ROC curves. And we see that while the real model has an ROC AUC of 0.9 for the random forest classifier and 0.921 for the gradient boosting classifier, Smart Linking has a comparable, if lower, score of 0.861 for random forest and 0.884 for the boosted trees. And this is robust across different types of models. So if you do this for logistic regression or a multilayer perceptron, what you see is that the Smart Linking model performs well on the real data. Not as well as the model trained on real data. But it's enough to understand if there's signal in this problem. And clearly there is. But even more, if you look at feature importance for these models.
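Editor's illustration: the scores quoted above summarize each ROC curve as the area under it. A minimal sketch of that number, assuming binary labels of 0 and 1: the area equals the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row, with ties counted as half. The labels and scores below are made up and are not the case study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Held-out labels and the scores a model assigned to those same rows.
holdout_labels = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(holdout_labels, model_scores)
```

In the setup described in the talk, this would be computed twice on the same held-out rows: once with scores from the model trained on real data and once with scores from the model trained on synthetic data.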
Ander Steele (08:11):
You can take, for example, the random forest models that we trained on the real data and the synthetic data, and you can see which features are most important, and you can compare those between the real and synthetic model. And we see the most important features for both the real and synthetic model are the final weight or class and age. And of course, it's not perfect, but it's a very good indication of what's going on. And so we think of this as a powerful tool for being able to prototype or debug machine learning models on sensitive data where, for example, your off-shore data team can't access the real data, but they can access synthetic data in order to do this deeper thought process.
Chiara Colombi (08:59):
Can I jump in with a quick question before the...
Ander Steele (09:00):
Chiara Colombi (09:02):
Because I think you're already starting to answer that. We're looking at ML training use case data, analysis use case. Who really needs this tool? Who stands to benefit from it?
Ander Steele (09:12):
Yeah. Great question. And I think this tool is really great for data science teams in particular, or data engineering teams, if you have to prototype or debug analytics dashboards. Make sure that the data that you're pushing through actually looks real. This is great. Or developers that just need higher fidelity data than what's typically available.
Chiara Colombi (09:38):
There's a follow up to that question, how are we seeing our customers using Smart Linking?
Ander Steele (09:43):
Yeah. So the use case I mentioned is essentially debugging models, for example, a logistic regression model that was just trained on a large sensitive dataset, which can't be shipped overseas to your data team, but they need to be able to understand what's going on with the model. Train a synthetic model, and you can debug on Smart Linking data. And so what's great about this tool is it integrates naturally with your existing Tonic workflows. I'll show you how that looks, for example, with this table. And so here we have our customers table, which actually looks very similar to the UCI Adult dataset. And so we already have protected a few of these columns with name and integer key generators. But now in order...
Chiara Colombi (10:46):
Sorry. Could you just toggle the preview so people can see if they're not familiar with the UI and how it works?
Ander Steele (10:51):
Oh yeah, definitely. So here we have our raw data and then here we have the synthetic data, or a preview of the synthetic data. And so now in order to train a Smart Linking model, all you do is click on the columns in a table that you want to feed through the Smart Linking generator. It will automatically guess the type of the data. And so for example, this column is categorical. This column is also categorical. It's going to create a model of this data and we can continue linking these things. And you can select whether the data is numeric, categorical or location, which are the three data types that we currently support. And then once you click generate, it's going to go off and train the variational autoencoder and sample that decoder model to generate your high-fidelity data and integrate seamlessly with your existing work.
Chiara Colombi (12:01):
Awesome. Thanks for the demo. I do have a question that came through, you mentioned a little bit earlier how you performed your analysis of the data. What's Tonic's role in the evaluation of the data's fidelity?
Ander Steele (12:12):
Yeah. So Tonic right now is just responsible for producing the data. The evaluation of the fidelity is currently done entirely outside of the product. But in the future, we'll certainly have some of these analyses in the product for you.
Chiara Colombi (12:32):
Excellent. And another question that just came through. Can you set a random seed or equivalent to deterministically repeat a particular synthetic dataset?
Ander Steele (12:41):
Once the model is trained? Yes.
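Editor's illustration: the general idea behind seeded, repeatable sampling from an already trained generative model looks like the sketch below. This shows the concept only; `sample_rows` and its stand-in decoder are hypothetical, and nothing here describes how the setting is exposed in Tonic.

```python
import random


def sample_rows(seed, n):
    """Sample n values from a fixed (already trained) toy decoder.

    The decoder here is a hypothetical linear map from a latent draw to an age.
    """
    rng = random.Random(seed)  # all randomness flows from this one seed
    return [round(40.0 + 10.0 * rng.gauss(0.0, 1.0), 3) for _ in range(n)]


first_run = sample_rows(seed=7, n=3)
second_run = sample_rows(seed=7, n=3)
other_seed = sample_rows(seed=8, n=3)
```

With the model weights fixed, the same seed gives the same synthetic rows, and a different seed gives a different draw from the same learned distribution.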
Chiara Colombi (12:48):
We do have another follow up question. VAEs have knobs to adjust how you sample from a distribution, e.g. temperature. Are those exposed in Tonic?
Ander Steele (12:58):
Chiara Colombi (12:59):
Are there any plans for that in the future?
Ander Steele (13:03):
We'd love to talk.
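Editor's illustration: for readers unfamiliar with the temperature knob raised in the question, one common convention is to scale the standard deviation of the latent prior before decoding. The sketch below shows that general VAE idea only; it is not a Tonic setting, and `sample_latents` is a hypothetical helper.

```python
import random
import statistics


def sample_latents(n, temperature, seed=0):
    """Draw latent values from N(0, temperature^2) instead of N(0, 1)."""
    rng = random.Random(seed)
    return [temperature * rng.gauss(0.0, 1.0) for _ in range(n)]


cool = sample_latents(5000, temperature=0.5)
standard = sample_latents(5000, temperature=1.0)

cool_spread = statistics.pstdev(cool)
standard_spread = statistics.pstdev(standard)
```

A lower temperature keeps samples closer to the center of the latent space, which tends to produce more typical and less varied rows; a temperature of 1.0 recovers ordinary sampling from the prior.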
Chiara Colombi (13:08):
Okay. Excellent. Great. Thank you Ander.
Ander Steele (13:12):
I'll stop sharing and pass it over.
Kasey Alderete (13:17):
Okay. I think I got all the controls, mute, share. Okay. Everything is in the right spot. Yeah. So I think as a transition, I really, I love listening to Ander talk about everything because every time I listen, I get a little bit smarter about what's happening under the hood. And I think just thinking about Smart Linking doing the distribution, not only preserving the distribution, but also those relationships across columns. And I think the other really important thing is that, it's always getting smarter. We're continuing to improve the model and how it works under the hood. And you don't have to build the model or know how it works in order to use it in your Workspace. So in order to preserve those relationships, I don't necessarily have to know about building the model and training the neural networks. That's all what Tonic is doing under the hood. And I, as far as transition, think about that as being the brains of Tonic of just being really smart about how we create safe and useful data. And then I think a lot of these other features are the other parts of the body helping out.
Kasey Alderete (14:16):
So this Workspace view is the eyes. It's the visibility across everything that you're doing in Tonic. It's not necessarily a new tool, a new way to protect data, but it's ways of supporting our customers whose footprint of Tonic has grown over time. So in this example, you've got several different Workspaces we're using for testing and demo purposes, but it's really pretty similar to how our customers do it. You'll often see different database types. So you'll have your MongoDB and your PostgreSQL. I don't know the numbers, but I would say probably at least half of our customers have at least two database types. All of your data does not fit into a MySQL database. You also have a Redshift instance or data in Snowflake or what have you. But there's also different destinations, different use cases. So I might be creating really realistic data for dashboards like Ander was talking about. But I also might want to be creating local subsets for developers to troubleshoot or develop features on their laptops.
Kasey Alderete (15:13):
So each of those destinations, each of those different policies that you might be applying to your data, maybe there's different tiers of privacy that's needed, depending on the internal stakeholder who's consuming that data. And as I mentioned with subsets, you can have different where clauses. So I could have one Workspace where I'm only pulling the data for a particular region. Again, this fits well with the offshore use case. I could keep my customer support team who's troubleshooting European customers separate and get that slice of the data. See that here in the Workspace field. So we're not really creating new capabilities with this Workspaces view. But we're really enabling those centralized dev ops teams who are often one or two people who are maybe supporting hundreds of developers and making sure that they can see, where are my failures? Where are the schema changes? Where do I need to drill in and bump this data along, or where do I need to refresh it? And so that's what this new Workspace view gives you.
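Editor's illustration: the region-specific subset described above comes down to a where clause on the source data. The sketch below uses an in-memory SQLite table to show the idea; the table name `customers`, the `region` column, and the helper `subset_ids` are assumptions for illustration and say nothing about how Tonic's subsetter is configured or how it follows foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO customers (id, region) VALUES (?, ?)",
    [(1, "EU"), (2, "US"), (3, "EU"), (4, "APAC")],
)


def subset_ids(connection, where_clause, params=()):
    """Return the ids of rows that satisfy a trusted, operator-written where clause."""
    query = f"SELECT id FROM customers WHERE {where_clause} ORDER BY id"
    return [row[0] for row in connection.execute(query, params)]


# One Workspace-style slice: only European customers.
eu_ids = subset_ids(conn, "region = ?", ("EU",))
```

A second Workspace with a different where clause would produce a different slice of the same source.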
Chiara Colombi (16:18):
Quick question. Within Workspaces, would you see all of the Workspaces in your organization or would you only see yours?
Kasey Alderete (16:26):
Yes. So we're not changing how sharing works. And the way that Tonic works is that each creator of a Workspace owns their Workspace. You can see I'm the owner of a lot of these. But I will also see all of the Workspaces that are shared with me. And this is part of our role-based access control, which is part of our enterprise offering. See some of the roles there. There's editor, auditor. There's also viewer. Viewer is the one that has the least privileges. I cannot see the data that's there, but I can see that the Workspace exists and see maybe some of the settings, schema columns without seeing the actual data. So this Workspaces view will continue to respect those permissions. And we'll show you everything that has been shared with you or created by you.
Chiara Colombi (17:09):
Kasey Alderete (17:10):
Also looking new is the Privacy Hub. I think of this as the heart of Tonic. This is really core to what we do. This is the landing page when you log in, to see what is my progress? What has Tonic identified as sensitive or what have I identified as sensitive and how am I doing against protecting those and figuring out the right strategy for how to mimic that particular piece of data? So really what's new here is the facelift and the counts, which are a summary of some of the counts that you could've gotten before, across your columns and all the tables for a particular database schema. Again, this is specific to a Workspace context. So I'm in a Workspace right here, looking at my progress. This next one is the sleeper hit, I think. I think Webhooks is a very simple solution, but it creates these really robust workflows. And it brings the focus onto whatever it is that you're using your data for. So we realize that creating safe data is not something you're probably selling. That's just an input to your larger value chain.
Kasey Alderete (18:21):
So think about your developers that you're enabling. Your data scientists that you're enabling. Those digital experiences and websites and products that you're building. Webhooks helps you kick off that next step in the process. So again, recognizing that you maybe don't want to log into Tonic to see whether your job and refresh are done. So we already have a lot of customers doing some interesting things here, with both of these examples that we've mentioned: Slack or GitHub Actions, to notify what's the next step in the pipeline. If it's part of your CI/CD pipeline, maybe you're generating other environments as a result of the Tonic data being ready or completing. So definitely an area we're excited to expand, to support any other use cases that aren't currently supported. So I'd love to hear what people are using Webhooks for and how Tonic fits into your bigger, overall workflow.
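Editor's illustration: on the receiving end, a webhook is just an HTTP endpoint that accepts a POST and decides what to do next. The sketch below uses Python's standard library. The payload shape (a JSON object with a `status` field) and the action names are assumptions made for this example, not Tonic's documented webhook format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def next_action(payload):
    """Pick the next pipeline step from a hypothetical job-status payload."""
    status = payload.get("status")
    if status == "completed":
        return "refresh_downstream_environments"
    if status == "failed":
        return "notify_on_call"
    return "ignore"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body sent by the webhook caller.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"action": next_action(payload)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the receiver; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

In practice the returned action would be replaced by a real call, such as posting to a chat channel or triggering a CI job.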
Chiara Colombi (19:14):
I do have a question there too. Some folks are already familiar with Post-Job Scripts in Tonic. Users can make use of both. Both are still available. Right?
Kasey Alderete (19:23):
Yeah. So that's a good question. So Post-Job Scripts was the first Post-Job action that we introduced. So it's just SQL scripts that can run after Tonic writes your new data. You might want to do some changing of the data or running a job using SQL code. And that will complete before Webhooks kicks off. So you can think of it as, Tonic writing your output data. The Post-Job Scripts, which are SQL scripts, will be applied to your output database. And then the webhook will be triggered.
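Editor's illustration: the ordering Kasey describes (write the output data, run the post-job SQL, then fire the webhook) can be sketched as below. SQLite stands in for the output database, and `run_job`, the `output` table, and the payload are hypothetical names used only to show the sequence.

```python
import sqlite3


def run_job(connection, rows, post_job_sql, send_webhook):
    """Run the three steps in order and record when each one finishes."""
    events = []

    # 1. Write the generated output data.
    connection.execute("CREATE TABLE output (id INTEGER, email TEXT)")
    connection.executemany("INSERT INTO output VALUES (?, ?)", rows)
    events.append("data_written")

    # 2. Apply the post-job SQL script to the output database.
    connection.executescript(post_job_sql)
    events.append("post_job_script_done")

    # 3. Only then notify the outside world.
    send_webhook({"status": "completed"})
    events.append("webhook_sent")
    return events


received = []
conn = sqlite3.connect(":memory:")
events = run_job(
    conn,
    rows=[(1, "A@EXAMPLE.COM")],
    post_job_sql="UPDATE output SET email = lower(email);",
    send_webhook=received.append,
)
```

Whatever listens to the webhook can therefore assume the post-job changes are already in place.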
Kasey Alderete (20:01):
And again, with the reaching out aspect of Webhooks, thinking again about this body analogy, moving from Smart Linking to the Workspaces view, Privacy Hub, Webhooks, and then now Google SSO. So connecting to external identity management systems. And I have a lot of customers making use of this. Who again, aren't managing Tonic as the only tool. That's not their whole job. This is one of many tools that they support to enable their teams. And so we can respect group membership. So maybe you want to share a Workspace with a particular compliance team so that they can audit those Workspaces and those rules. Then they change teams. They no longer need access. You don't need to come to Tonic to manage that role update. We can respect that and use your centralized identity provider.
Kasey Alderete (20:54):
And another way to think about it, I was using the body analogy, but another way is just to think about your workflow. And Workspaces view is allowing our customers to really grow their footprint and see and manage all of that and understand what Tonic is doing. Smart Linking and the Privacy Hub are helping you on your journey to protect all of that data and do that in new and better ways. And then, with Webhooks, it's all about thinking about what's next. How is this data going to be used downstream?
Chiara Colombi (21:27):
Awesome. Thank you Kasey. We do still have a couple questions that came through. So if folks would like to stick around for Q and A. One question is, can you blend stochastic and rules-based generation. For example, generate a random city field subject to the constraint that it's in the randomly generated state field?
Ander Steele (21:48):
Right. I assume this is in the context of Smart Linking, that's a great question. In that particular example, the solution is feed everything to the Smart Linking generator and indicate that as location. But in general, like, not yet. But I have to think about that.
Kasey Alderete (22:19):
Well we'd probably be interested to hear what's driving the need there.
Ander Steele (22:21):
Kasey Alderete (22:24):
Certainly interesting combination. Because I know, I assume when we talk with the rules-based, of course we're talking about the non-Smart Linking side, right? And so we do support this in ways of linking categorical generators, custom categorical, things like that.
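Editor's illustration: the constraint in the question (a random city that must belong to the randomly chosen state) is a case of conditional sampling, where the second field is drawn only from values allowed by the first. The sketch below is a generic example with a made-up lookup table; it does not show how Tonic's linked categorical generators work internally.

```python
import random

# Hypothetical lookup of valid cities per state.
CITIES_BY_STATE = {
    "CA": ["Los Angeles", "San Francisco"],
    "NY": ["Albany", "New York"],
    "TX": ["Austin", "Houston"],
}


def sample_location(rng):
    """Sample a state first, then a city restricted to that state."""
    state = rng.choice(sorted(CITIES_BY_STATE))
    city = rng.choice(CITIES_BY_STATE[state])
    return state, city


rng = random.Random(0)
locations = [sample_location(rng) for _ in range(100)]
```

Every generated pair is consistent by construction, because the city is never sampled without knowing the state.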
Chiara Colombi (22:38):
Ander Steele (22:38):
Chiara Colombi (22:40):
Another question that just came through. Any plans to add text files, CSV text masking, in Tonic?
Kasey Alderete (22:47):
We do support flat files, largely through our Spark connector. So a lot of times we'll see like Parquet, Avro files, things like that. We have at different times supported just CSVs or file upload. But it wasn't really used in a way that customers expected, so we removed that. But the product's pretty flexible to probably support your use case depending on the file type.
Chiara Colombi (23:13):
Thank you. Another question about Smart Linking. Do Smart Linking features only use other Smart Linking features to determine values, or is the value of a Smart Linking feature based on all features in the table?
Ander Steele (23:25):
Yeah, currently, the model is only seeing the features, which are the columns, which are using the Smart Linking generator. So the model is only responsible for those. But again, this is related to the previous question. So we'd love to learn more about this use case.
Chiara Colombi (23:48):
And another thing I think, a good follow-up to that is, what types of data is Smart Linking most useful for?
Ander Steele (23:54):
Right. This I mean, this is primarily for numeric data, categorical data and location data, combinations of these.
Chiara Colombi (24:06):
Right. I think I only see one more question left in the chat. Is it possible for users to try Tonic out on their own, including Smart Linking? Anybody want to take that or should I? I can take that too.
Kasey Alderete (24:20):
I can go for it. You can add on. Yeah. So we do have a special opportunity right now. We have opened our sandbox program. So let us know if you'd like to try out the sandbox. There is a button to request, I think access today, that'll be getting some updates this week, but we do have a limited two week trial on our hosted environment. It's got all of the features of Tonic in it. So you would get to play with things like Smart Linking as well. And yeah, we're really excited to be able to make that available to a lot of our technical customers, who are trying to evaluate Tonic. And it's hard to know without trying it out yourself.
Chiara Colombi (25:03):
And the one thing I would add to that is, we also have a workshop coming up in two weeks time to help you learn the ropes of Tonic. So if you do sign up for a sandbox, you can also get a sandbox at that event as well. And we'll just be able to help you onboard as well. And I think another question that came through. As of now, clustered PKs are not supported in the GUI; you need to edit the JSON file to add multiple keys. Will the UI support this feature in the future? Oh, I mean, to set foreign keys across tables.
Kasey Alderete (25:41):
Yeah. I don't think we have plans to expand the UI right now, but I'm certainly happy to dig in more and look into this. I don't think there's necessarily a technical reason. If it's supported via the JSON upload, then we could probably look into doing that in the UI as well. Just for context for other people. So we have a foreign key tool in Tonic that allows you to create virtual foreign keys in case your source database does not have these relationships defined. This is particularly important for subsetting where we're trying to keep those references intact. So we do have a UI that lets you link and add foreign key relationships. And it sounds like there's a more complex use case here. And we also support uploading it via JSON, in addition to doing it in the UI.
Chiara Colombi (26:27):
I think I've seen this question before as well. Does Tonic have support for GraphQL? Have you had that question come up?
Kasey Alderete (26:36):
I have not gotten it. I have heard it in other jobs I've had. Yeah. I'd love to dig in to understand. GraphQL is more of a front end language. I'd want to understand more about where you're wanting to plug in and use Tonic functionality using GraphQL, or how that fits in with your architecture. But no, that is not something that we've gotten before or have on our radar.
Chiara Colombi (26:58):
Okay. All right. If any other questions come through, we do have a minute more we could spend on questions and if not, I will wrap things up here. Thanks to everyone again for joining us. It's been great talking through Smart Linking. Thank you Ander for your demo. And Kasey, thank you for the overview of all the other new features that we've got going on. If you think that Smart Linking in particular is the generator you've been looking for, for your development and data science teams, we'd love to chat. And you can book a demo on tonic.ai. And as we mentioned, you can also request a trial account on our website. So yeah, we'd love to hear from you. Reach out to us and join the workshop in two weeks as well. We've got more webinars coming up on our webinars page. Always check there for the latest. And thanks again to everyone.