What does it take to build and deploy a data generation platform online? How does it differ from on-prem?
In an open conversation of war stories and lessons learned, Tonic CTO Andrew Colombi and Mockaroo CTO Mark Brocato will compare and contrast the architectures they’ve built from the ground up. For Tonic, it’s the story of on-premises enterprise software with critical security considerations to keep in mind. For Mockaroo, it’s the story of an online app unexpectedly shoe-horned into on-prem due to customer demand.
Join us as we pull back the curtain and take you behind the frontend of two of the leading data generation platforms today.
Chiara Colombi (00:07):
Hello everyone. And welcome to today's conversation with the CTOs of Mockaroo and Tonic. My name is Chiara, I'm on the marketing team at Tonic. And today, we're going to focus our talk on software architecture, specifically the choices that were made in building two fake data generation platforms from the ground up. Before I introduce those platforms, I'd like to introduce you to our speakers. We're very glad to have Mark Brocato, founder and CTO of Mockaroo and Andrew Colombi, co-founder and CTO of Tonic joining us today to share what they've learned.
Chiara Colombi (00:35):
We've got a lot of great questions already lined up to ask them, but this is the perfect time to say that we really want to hear your questions. So please send them my way in the chat or in the Q&A at any time, I'm going to try and ask them as they come up because we really want to make this as much of an audience-led conversation as possible. So first, before getting to the questions, I'm going to provide a quick high-level overview of Mockaroo and Tonic and kind of the key difference between them that defines the use cases for each tool.
Chiara Colombi (01:02):
Many of you are probably already familiar with Mockaroo, it's the online data generator that allows you to create mock data from scratch. It's got an extremely intuitive UI, it's got a free tier. You can just hop online and quickly spin up smaller datasets on the fly. It's also got several paid tiers, including an enterprise tier that allows you to deploy the software on-prem.
Chiara Colombi (01:20):
The key distinction of its approach to data generation is that it employs rules-based data synthesis. In other words, it allows users to define the rules of the data they need and then generates the data based on those rules. So for example, maybe you need birth dates specifically between 1960 and 1990, or you need to manage the categorical frequency of male to female to null across the dataset. Mockaroo doesn't require any seed data to start with, it's really useful when you're building a brand new application and you don't have any data to work with, or maybe you need to spin up massive datasets for load testing.
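As a rough illustration of the rules-based approach described above, here is a minimal Python sketch. It is not Mockaroo's implementation: the function names, the 45/45/10 split between male, female, and null, and the seed are all assumptions chosen for the example. It only shows the idea of generating rows from declared rules with no seed data.

```python
import random
from datetime import date, timedelta


def random_birth_date(rng, start_year=1960, end_year=1990):
    """Pick a uniformly random date from Jan 1 of start_year to Dec 31 of end_year."""
    start = date(start_year, 1, 1)
    end = date(end_year, 12, 31)
    return start + timedelta(days=rng.randint(0, (end - start).days))


def weighted_choice(rng, weights):
    """Pick a key from a {value: weight} mapping; the key None models a null."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=1)[0]


def generate_rows(n, seed=0):
    """Generate n rows purely from rules; no input dataset is needed."""
    rng = random.Random(seed)
    # Hypothetical categorical frequencies for the gender column.
    gender_weights = {"Male": 0.45, "Female": 0.45, None: 0.10}
    return [
        {
            "id": i + 1,
            "birth_date": random_birth_date(rng),
            "gender": weighted_choice(rng, gender_weights),
        }
        for i in range(n)
    ]


if __name__ == "__main__":
    for row in generate_rows(5):
        print(row)
```

Because the generator is seeded, the same rules and seed reproduce the same dataset, which tends to be convenient for repeatable tests.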
Chiara Colombi (01:54):
Another common use case is creating realistic mock datasets for sales demos. Mockaroo has thousands of users, from development teams of one to enterprise customers like Salesforce. And it also has a very active community online. If you haven't already checked it out, I highly recommend it. You can compare techniques, learn new tips and Mark also hops in there as well to answer questions sometimes.
Chiara Colombi (02:16):
When Tonic was founded, Mockaroo was already a leading tool in the space and our co-founders reached out early on to Mark to learn more about his experience and his success as well. At the same time, we also set out to do things a little differently. We were looking to solve for a problem that our founders had run into in their own careers, which is specifically the problem of having production data that is too sensitive to share with developers for use in testing and development.
Chiara Colombi (02:42):
Where Mockaroo is rules-based and can also equip developers with rules-based data, Tonic generates data based on your existing data by using your data as a seed. So a big part of what Tonic enables is secure de-identification. And when you combine that de-identification (you can also call it anonymization or obfuscation) with our database subsetting and synthesis capabilities, what you get is what we are calling data mimicking.
Chiara Colombi (03:06):
So our core use case from the start has been enabling developers with useful, safe test data that looks and feels real because it's made from production data. With Tonic, the way it works is you connect directly to your database. And we support a long list of databases from Postgres to Redshift, to MongoDB, which we recently launched. And once you've connected to your database, you build a model of your data. Then using that model, Tonic will transform your data in flight and hydrate the separate output database with secure de-identified data, but it preserves all the utility and complexity of the original data.
Chiara Colombi (03:39):
So our use cases specifically are getting production data out of your lower environments to equip developers with safe data, protecting sensitive data for compliance, legislation, privacy laws, data security, and risk mitigation, or we also enable many teams to create targeted subsets of massive datasets. So it's easier to manage the data on a local laptop. And it's also easier for debugging and QA.
Chiara Colombi (04:05):
Much like Mockaroo, we've found a strong response for a need for what we've built. Our customers include eBay, PwC, VMware, many healthcare companies like Everlywell and Allegiance. They represent a broad cross section of industries, but at each we're speeding up development cycles by making it quick and easy for developers to get the data they need.
Chiara Colombi (04:25):
The last thing I'll quickly mention before we jump into our questions for Mark and Andrew is, I'm sure many of you have tried out using Mockaroo in some form, either as a paying customer or on their free tier. And we're excited to share that Tonic recently launched our own free tier of sorts in the form of a developer sandbox. So if you're interested in trying that out as well, you can pop on over to our website. I can show you where that is, go back here and you can sign up for a developer sandbox right here or even better, we are hosting a hands-on workshop next week.
Chiara Colombi (04:57):
You can go to the webinar's page and at the workshop we will provision sandbox accounts and kind of guide you through learning the ropes of using Tonic. I'll put a link to this in the chat as well, for any of you who are interested. All right. I think that's enough of the high-level overview. Let's switch to the topic of the day, software architecture. And I think the best place to start would be to have Mark, Andrew, if you guys could lay some groundwork by answering the high-level question of what is software architecture?
Andrew Colombi (05:28):
Oh, you're muted.
Mark Brocato (05:29):
Sorry. I guess, I'll start. You always get these kinds of questions in interviews where it's some obvious thing that you should be able to explain and you're like, "I don't know." So software architecture, I think the way I think about it is kind of the way that I think about trying to organize teams.
Mark Brocato (05:46):
Software is a bunch of components and evermore these days with AWS and Azure and all these cloud environments, putting together a piece of software architecture is trying to figure out the different components and the different ways that they'll interact to accomplish a task. The same way of putting together a team of people, is figuring out how to make use of everybody's best skillsets and resources to solve a certain task.
Andrew Colombi (06:07):
Yeah, and I agree with that analogy. There are interesting ways that analogy actually feeds on itself, I think as well, which is when you set out to design your company, you're going to be making trade-offs and compromises about which teams are going to be able to operate quickly together because they're close to each other in the organization and which teams are going to be further apart, so that'll be more difficult for them to operate together.
Andrew Colombi (06:38):
And the same is true with software, the decisions you make about your architecture are going to make some things easy and some things really hard and it's all about finding those compromises. It's always about compromises in software architecture. So yeah, I do like that; it's kind of building a company, but in its own way.
Mark Brocato (06:59):
I've seen an interesting parallel come up, one of the most famous books in our space, I think is Mythical Man-Month, which talks mostly about project management and assembling teams and whatnot. But one of the theses in there was that adding more people doesn't make things go faster, it makes things go slower because of all the communication points.
Mark Brocato (07:16):
And one of the big discussions in the architecture world today is monoliths versus microservices. And I can definitely attest, I work in both extremes in my various lives. Mockaroo is an unashamed Ruby on Rails monolith, love it. But I work a lot with microservices as well, and I can definitely say that every communication point you put between components makes life more difficult in certain-
Andrew Colombi (07:39):
Yeah. And it's absolutely a thing that we think about. At Tonic, our applications are a little more sophisticated... it's just, there's more pieces; it's a bigger team, more things going on. And it's like, "Do I want to add a robust queuing system, like a queuing microservice, or do I want to just make it work with Postgres?" And both of those are valid approaches depending on what you want to make easy, what you want to make hard, how much you want to invest in this queuing service, how important is it that it'd be robust in these different ways?
Andrew Colombi (08:16):
And it's not always the right answer to pick the best, it's not always the right answer to say like, "Well, we're going to use SQS or RabbitMQ, or Kafka for this queue that's going to get 100 messages a day and we can put some retry logic in there, whatever." It's not always the right choice to pick what seems to be the gold star of whatever it might be.
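To make the "just make it work with Postgres" option concrete, here is a minimal sketch of a database-backed job queue with simple retry logic. It is not Tonic's code. To stay self-contained it uses Python's built-in sqlite3 as a stand-in for Postgres, and the table layout, function names, and MAX_ATTEMPTS value are assumptions for illustration.

```python
import sqlite3

MAX_ATTEMPTS = 3  # hypothetical retry limit


def create_queue(conn):
    """Create the jobs table that acts as the queue."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )


def enqueue(conn, payload):
    """Add one job to the queue."""
    with conn:
        conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))


def process_next(conn, handler):
    """Run the oldest pending job; return False when nothing is pending.

    A failing job goes back to 'pending' until it has used MAX_ATTEMPTS,
    after which it is marked 'failed'.
    """
    with conn:
        # With several workers on Postgres, this SELECT would typically use
        # FOR UPDATE SKIP LOCKED so that two workers never claim the same row.
        row = conn.execute(
            "SELECT id, payload, attempts FROM jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return False
        job_id, payload, attempts = row
        try:
            handler(payload)
            new_status = "done"
        except Exception:
            new_status = "failed" if attempts + 1 >= MAX_ATTEMPTS else "pending"
        conn.execute(
            "UPDATE jobs SET status = ?, attempts = ? WHERE id = ?",
            (new_status, attempts + 1, job_id),
        )
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_queue(conn)
    enqueue(conn, "send-report")
    while process_next(conn, print):
        pass
```

At around 100 messages a day, a table like this plus a polling loop may be all the queue a product needs; a dedicated broker earns its place once throughput or delivery guarantees demand it.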
Mark Brocato (08:41):
Yeah, for sure.
Andrew Colombi (08:42):
And that's like the Mythical Man, sorry, just to bring that back like the Mythical Man-Month there; it's not always best to add more people, sometimes it's not best to add more blocks to your block diagram and you need to be very economic or economical about your architecture.
Mark Brocato (09:01):
Yeah, I totally agree.
Chiara Colombi (09:03):
So what is one of the first considerations that kind of impacted the architecture that you chose? Was it maybe the deployment environment, anything along those lines.
Andrew Colombi (09:13):
Yeah. Do you want me to go first or you want to go first?
Mark Brocato (09:15):
Go ahead, Andrew.
Andrew Colombi (09:16):
Yeah, sure. So I think there were two things that really impacted our original thinking, which is, first, we wanted to work directly against databases. We wanted the input to be a database and the output to be a database, we thought that would be the easiest thing for our customers.
Andrew Colombi (09:37):
And the second thing is, we didn't want our customers to have to send their data to us. We knew that the data is going to be sensitive, that's why they're using Tonic, and sending it to us might be a sales hurdle that we didn't want to have to deal with. So those two things were really prominent in our minds when we started to architect how this would work.
Andrew Colombi (10:06):
The second one kind of leads you to an on-prem path or something that's going to be close to on-prem. And the first one, going right from database to database, meant we really wanted the language we picked to have broad support for a lot of different databases, and so it needed to be very mainstream.
Andrew Colombi (10:33):
I think in the data synthesis world, there are a lot of interesting academic problems out there. And some of them are more easily expressed in languages that aren't as widely adopted for database technology. That was definitely something that weighed on our minds.
Mark Brocato (10:50):
Now I'll give you the totally opposite story, which was, Mockaroo started at very humble beginnings. I never even considered making this a business for at least a year when I launched it. I literally was just, I had an itch I needed to scratch.
Mark Brocato (11:04):
I was working in a small life sciences startup that had a high priced product with only a few users and I'd been there for years, and I wanted to make something that a lot of people would use. And I also wanted to learn Ruby on Rails because I've been working in Java forever and I wanted to learn something new.
Mark Brocato (11:22):
So it was literally those two things. I chose the architecture just based on developing my resume, unashamed. And it turns out that it, by random luck, was actually a really good choice because I still think today, if you're a one person or a 10 person or even a 50 person startup, Rails is probably the best choice for most cases. It's just so good at guiding you towards the pit of success and getting so much out of a team with no time and no resources. It does so many things for you.
Mark Brocato (11:56):
I had a lot of luck along the way. Eventually, Mockaroo became popular enough to where larger companies started asking for an on-prem version. And it was like, "Oh God, this is not going to be good." Ruby on Rails on-prem, anybody who's ever had to set up a Ruby on Rails environment knows it's a pain to set up natively, and thank God, Docker has come along. And so, I basically-
Andrew Colombi (12:16):
A big thumbs up.
Mark Brocato (12:17):
Yeah. I ship it as... wrap as a Docker container, and enterprises can do all sorts of crazy things with that. Some people want it in Kubernetes, everything under the sun on every provider. So thank God for that, but I never banked on that early on, but I had some of that same friction where I need to be able to run on-prem.
Mark Brocato (12:37):
So fortunately, the only real dependencies for Mockaroo are: you’ve got to have something that runs Docker, you’ve got to run Redis somewhere and you’ve got to run Postgres somewhere. But AWS provides all those things as services, so I got lucky; I didn't design it that way.
Andrew Colombi (12:51):
Yeah. And just a one up on that Docker thing, 10 years ago when I worked at my last company, which was Palantir, and we did a bunch of on-prem installation back then too, Docker was much more nascent and we spent two man-years developing installation software for Palantir.
Andrew Colombi (13:12):
That's maybe more, probably, if you look at the overall, but my 30-person team, their product: two man-years on installation. And I mean, I don't know, but we've spent like one man-month on installation at Tonic because it's just Docker and done.
Chiara Colombi (13:30):
Mark, I wanted to ask you a question, your use case when you first developed Mockaroo was for healthcare data as well. How did that impact the architecture that you went with?
Mark Brocato (13:40):
Yeah. So I was leading a team of developers and QA, maybe 15 people or so. And so, this was a really complicated domain that no one really learns about, unless you happen to have been a scientist for a while in a biology lab. So it's very difficult to have junior QA engineers understand the domain to the extent that they can put in valid test data when they're testing one of these long workflows.
Mark Brocato (14:06):
And so, very quickly, your product gets filled with this kind of garbage data and it makes it easy to test the first part of the workflow, the first screen, but then 10 screens later, it's just incomprehensible. So I built Mockaroo as an internal tool, originally, just to help engineers build data that had some integrity to it and kind of modeled the real world.
Mark Brocato (14:25):
And then, I figured, "Hey, this might be useful for other people in other domains, and it's pretty easy to get some test data." At least, that's what I thought at the time. And so, I put it up on the internet and it took off from there. One thing that was shocking is, almost from day one, people just materialized out of nowhere to start using it. I didn't promote it, I didn't know how that happened. I think I answered one or two questions on Stack Overflow.
Mark Brocato (14:47):
The internet is such a vast place that if you build something with any merit, somebody will notice. I never knew that going into it, this whole thing has been a learning experience from day one. I came from this very enterprisey kind of sheltered background in the beginning.
Chiara Colombi (15:03):
Andrew, I feel like you kind of already touched on how the use case impacted things, because you mentioned sensitive data, but if there's anything that you'd like to add there.
Andrew Colombi (15:11):
Yeah, I think I covered it; wanting to directly access databases. I think that kind of covers my perspective.
Chiara Colombi (15:23):
Well, there was one thing that you mentioned about... Oh, actually Mark, a question about when you had to make that switch to on-prem, did that change the architecture in any way? How did that influence things?
Mark Brocato (15:35):
It actually did. It kind of defined in a way the architecture for Mockaroo as it stands now, but it's hazy to remember whether it was the on-prem or the scale. So the background was, a very large company came to me on a very large project and Mockaroo couldn't generate large datasets at that point, it literally did everything in the main web thread, it was just thrown together; no background tasks, nothing.
Mark Brocato (16:04):
And the guy came to me, he was like, "Well, I love your app, how much can you scale up? Can you generate terabytes of data?" And it sounded like a very lucrative contract. So I'm smart enough to know just to say yes and then figure it out later, but I had to figure it out later. And what I arrived on, which is a very standard solution in the Rails space, is to use background workers with Sidekiq to do things in parallel and to make use of multiple computers and multiple processors.
Mark Brocato (16:33):
So I threw that together and fortunately, that actually works out really well with the cloud model that most companies install Mockaroo on. I had basically two types of Docker containers, an app instance that runs the website and then a worker instance that generates data. And then, you can just have as many worker instances as you want, as long as you can foot the Amazon bill. That all came about kind of at once as I was developing the first on-prem solution.
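The app-instance and worker-instance split described above can be sketched in a few lines. This is a toy model, not Mockaroo's code: Python threads stand in for separate worker containers, an in-process queue stands in for the Redis-backed job queue that Sidekiq uses, and the function names and chunk sizes are hypothetical.

```python
import queue
import random
import threading


def generate_chunk(chunk_id, rows_per_chunk):
    """Generate one chunk of rows; seeded per chunk so output is reproducible."""
    rng = random.Random(chunk_id)
    start = chunk_id * rows_per_chunk
    return [
        {"id": start + i + 1, "value": rng.randint(0, 999)}
        for i in range(rows_per_chunk)
    ]


def worker(jobs, results):
    """Worker loop: pull chunk jobs until the None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        chunk_id, rows_per_chunk = job
        results.put((chunk_id, generate_chunk(chunk_id, rows_per_chunk)))


def generate_dataset(total_chunks, rows_per_chunk, num_workers):
    """'App' side: enqueue chunk jobs, wait for the workers, merge in order."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for chunk_id in range(total_chunks):
        jobs.put((chunk_id, rows_per_chunk))
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    chunks = dict(results.get() for _ in range(total_chunks))
    return [row for chunk_id in sorted(chunks) for row in chunks[chunk_id]]


if __name__ == "__main__":
    rows = generate_dataset(total_chunks=4, rows_per_chunk=25, num_workers=3)
    print(len(rows), rows[0], rows[-1])
```

Since each chunk is independent, adding workers changes only how quickly the chunks are produced, not the result. In CPython, threads will not speed up CPU-bound generation; real scaling comes from separate processes or machines, as in the container setup described above.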
Chiara Colombi (17:02):
Okay.
Andrew Colombi (17:02):
I want to comment on something there. Just if you're in this audience thinking like, "I want to do my own startup at some point, and what architectural trade-offs should I make?" The example, Mark, you just brought up of everything was in the main web thread, that's exactly the kind of architectural decision you should be doing, that's the first step. It's really easy, especially as an engineer to think, "Well, obviously this needs to be in the background thread. I should get that going right away."
Andrew Colombi (17:38):
And it doesn't matter until it matters; your customers aren't going to care until it matters. And then, when it does, you tell them it works and you fix it. It's very easy to get caught in the premature optimization, that's what this is. And I highly recommend anyone who's thinking about starting their own startup to be doing the unscalable things. These are the trade-offs that are okay to make, because you can just say yes and fix it; you know how to do it, it's just, you shouldn't do it until you need it.
Mark Brocato (18:14):
Yeah.
Chiara Colombi (18:14):
Yeah. That makes a lot of sense. Actually, we have a question for you, Mark, from the audience. Do you recommend using Ruby on Rails when starting a new project today?
Mark Brocato (18:20):
Yes. I don't care what anybody says, I'll die on Ruby on Rails. I love it. And I do a lot of other things, so it's not like I have a very slim background. I probably spend most of my life in front-end JavaScript, Next.js, Vue, Svelte, React. I actually spent the entire last two months doing Rust programming, which is the opposite end of the universe from Ruby on Rails.
Mark Brocato (18:45):
And they all have their merits, but where Ruby on Rails has its merit is if you're a small bootstrapped endeavor and you need to get something up and running and get to an MVP quickly, there's nothing better than Ruby on Rails. I don't care what anybody says. And there's still no better darn ORM than ActiveRecord. I've used Hibernate in Java. I recently used Diesel in Rust, which is actually really good. ActiveRecord is still the OG, it's still the best; it gives you the most control over SQL, it just works.
Mark Brocato (19:17):
I don't care that Ruby on Rails has been around for a while. I've been around for a while, so what are you saying? I still think it's great. So any project I start today, assuming it's the same kind of MVP seeking thing, I'll use Rails.
Mark Brocato (19:34):
And believe it or not, Mockaroo gets a ton of traffic; it's got thousands and thousands of daily active users and broaching a million, I think, total people who've signed up and it still serves it on very cheap infrastructure, way cheaper than probably other jobs I've done with more sophisticated teams. So you can get a long way with Ruby on Rails, especially if you do things in the background. So don't knock it until you try it.
Chiara Colombi (20:01):
Okay. I have a followup question to that. Do you recommend Ruby, even against PHP for a quick MVP?
Mark Brocato (20:10):
The two are competitors, I think, and both would be reasonable choices. In the end, I think your code winds up being more well factored and more maintainable in Rails than PHP. And I have always loved the Rails approach to life, which is just, you bite off big chunks of functionality when you add a library and everything is highly opinionated and thought through for you.
Mark Brocato (20:41):
It's always seemed to me to be put together by really talented people in that community. Some communities are lucky to have really talented people and some communities have a lot of people and some of them are talented. The JavaScript community has an endless amount of people, some of which are very talented, but many of which maybe are not. Rails seems to be a smaller community, but really talented folks. So I'd stand by it. But I think you're, at least in the right ballpark if you're comparing the two because both allow you to get up and running very quickly and allow you to fail fast.
Chiara Colombi (21:10):
Okay. Andrew, a question for you, because we've heard a lot about the architecture that Mark has gone with. How did Tonic sit down the team and weigh kind of the pros and cons of using different technologies when we first started building the platform, did you just apply what you knew and ran with it, or did you kind of-
Andrew Colombi (21:28):
No, it was a little bit more thoughtful than just go with what we know. I mean, certainly, it's a column A and column B; we weren't considering things we didn't know at all. For example, Ruby on Rails, none of us knew Ruby on Rails, and so we didn't even consider it. Sorry, Mark.
Andrew Colombi (21:49):
But we did look at, well, we knew those initial things that we really needed to hit. We wanted to be able to do an on-prem installation, we wanted to be able to hit all the databases that are out there, including the ones that you don't hear about as often. We knew that we would have customers that wanted Db2, that's an ancient technology, truly ancient technology. But if you're working with FinTech or insurance, they're going to have Db2.
Andrew Colombi (22:19):
The desire to work with both ancient databases and new databases, it kind of left a... and by new databases, I mean, obviously Postgres, MySQL, but also newer, cooler things like Spark or MongoDB or Redshift, Snowflake, et cetera. We basically created a table of the things that we want, and scored different approaches against those things.
Andrew Colombi (22:45):
And some of them were technology things directly like database support, et cetera. Some of them were like, how easy is it going to be to hire for people there? How easy is it going to be to deploy on-prem? How easy is it going to be to debug things on-prem? Because that was definitely a concern that I had from my Palantir years of, if something goes wrong, your deployment is super far away, you might not even get logs, how can you have any debugging? So there was definitely a process of thinking about what are the different things we care about and how do different approaches rank up, and what do we know?
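The scoring table described above is essentially a weighted decision matrix. Here is a minimal sketch of one. The criteria names come from the conversation, but the weights, the scores, and the placeholder options option_a and option_b are invented for illustration and do not reflect Tonic's actual evaluation.

```python
def rank_options(weights, scores):
    """Return (option, weighted total) pairs sorted from best to worst."""
    totals = {
        name: sum(weights[criterion] * option_scores[criterion] for criterion in weights)
        for name, option_scores in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: how much each criterion matters (higher means more).
weights = {
    "database_support": 3,
    "ease_of_hiring": 2,
    "onprem_deployment": 3,
    "onprem_debugging": 2,
}

# Hypothetical 1-to-5 scores for two placeholder technology stacks.
scores = {
    "option_a": {
        "database_support": 5,
        "ease_of_hiring": 3,
        "onprem_deployment": 4,
        "onprem_debugging": 4,
    },
    "option_b": {
        "database_support": 3,
        "ease_of_hiring": 5,
        "onprem_deployment": 3,
        "onprem_debugging": 2,
    },
}

if __name__ == "__main__":
    for name, total in rank_options(weights, scores):
        print(name, total)
```

The totals matter less than the exercise of writing the criteria down; the weights are where a team's priorities, such as on-prem debugging, become explicit.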
Chiara Colombi (23:23):
Yeah. That's... Oh, go ahead, Mark.
Mark Brocato (23:26):
Sorry, I'm just so excited. So you guys arrived on Java as the main core tech-
Andrew Colombi (23:32):
No, .NET.
Mark Brocato (23:34):
Oh, okay. I've never worked in .NET. But if I had to pick one, given those requirements, it would be Java because I just assume Java has support for every darn database out there no matter how-
Andrew Colombi (23:46):
Yeah. There were different considerations, but anyway. Yeah, go ahead. What was your...?
Mark Brocato (23:55):
That's really interesting. And so, I guess the actual code is C#?
Andrew Colombi (24:00):
Yeah. That's all C#. I mean, the front-end is obviously TypeScript, but the backend is all done... And there's a bit of Python actually, because we do some machine learning stuff that just wasn't worth trying to do it in C#. So we have a microservice that handles some of the Python stuff.
Chiara Colombi (24:17):
So kind of a follow-up question to that is, given the number of databases that Tonic supports, is it hard to staff an engineering team with the skills to cover all those-
Andrew Colombi (24:31):
Yeah. We interview for it. I mean, if you interview as a backend engineer or a full-stack engineer at Tonic, expect questions about databases. And if you don't do so well on the database stuff that’s not... It's important for us. And there are some things that you can't interview for though, like we're never going to find a Db2 expert if we just interview.
Andrew Colombi (24:56):
So for that, we basically contracted out teachers essentially. So we won't hire someone to do a Db2 implementation, instead we'll hire someone to answer whatever questions we have as we do it ourselves and that gives us the expertise in-house while also leveraging someone else's knowledge. And then, for some of the technologies, they're just so new, stuff like Redshift that you just want to hire people that are eager to learn because that's the only way you're going to be able to catch up with the fast and the new.
Chiara Colombi (25:34):
Yeah. Turning that question, actually, over to you, Mark, how much of your process has involved you adding to, or further developing, your existing areas of expertise?
Mark Brocato (25:45):
Well, so I guess one thing that is different with Mockaroo versus Tonic, is Mockaroo has just a ton of data types. So having to find those and mine them and understand them and figure out... I get requests all the time for this data type or that data type, and trying to figure out which ones are actually more likely to help the most people, I think that's expanded my skillset.
Mark Brocato (26:10):
Certainly, in the beginning of Mockaroo, I really didn't have a lot of experience building an application that would be widely used. I dealt with applications that had very complex data and large datasets, but never had to deal with concurrency of users and high availability, and those kinds of things. All right. Now I'm going to start knocking Ruby on Rails.
Mark Brocato (26:30):
But one of the downsides of Rails is, it has memory leaks. I think pretty much everybody that has an app with Rails that does anything significant, it runs out of memory and it needs to be restarted. And so, I really had to learn the skills of managing servers and managing services and making them reliable and dealing with all the chaos of real-world applications in that setting.
Mark Brocato (26:53):
Before, I had mostly done Java and Java's VM is just so rock solid that you can really rely on it to just stick around and not be restarted for years. Rails, not so much, it's a little bit more simplistic. I mean, it could be worse. I could be using Node. Node needs to constantly be brought out in the back and shot and then restored. Kubernetes is just great for that, but Rails is kind of a happy medium. So I had to learn a lot of skills.
Mark Brocato (27:21):
And then, through other jobs and stuff, I wound up using those quite a bit. So my piece of advice for any employer out there, by the way, is seek out people that have side projects and encourage them to do so, because I brought so much knowledge from stuff I did on my side projects that went on to help my main job that I really feel sorry for some companies that look down upon that or are very cautious about allowing people to use their own time. You only make your staff stronger when you do that.
Chiara Colombi (27:50):
Yeah. That's a great point. We have an interesting question from the audience, how do each of you think about being opinionated versus being more developer friendly, customizable with code?
Andrew Colombi (28:03):
Yeah. I can take a first stab at that. I think it's a really great question. I mean, and once again, everything is a trade-off and there isn't a right answer, but my answer is for... there's value in having more of an opinion within an organization. And then, maybe using libraries that enable you to achieve those goals or those opinions. So for Tonic itself, we are running various static code analysis, various checks on any commit or whatever, that makes sure things are formatted according to their standards, but also other kind of code quality things that might come up.
Andrew Colombi (28:52):
And also, just the architecture of, for example, the front-end, it's there to provide you with the rail that you need to be on so that you can do the thing the way it's supposed to be done. And if someone were to make a PR, I was just looking at a PR recently and one of our engineers introduced an alternative async paradigm. And I was like, "Let's not introduce any new async paradigms, let's stick with the async paradigms we've got, so that we're a little bit more... at least things are a little bit more consistent."
Andrew Colombi (29:26):
So I think within an organization, it's important to be opinionated because it helps communicate, it helps keep people on the same path and then be able to have someone fill in when someone else is sick or whatever, people can read each other's code more easily. When it comes to the library, if I were developing a library and the libraries we use, we tend to use libraries that are pretty mainstream.
Andrew Colombi (29:51):
And therefore, as the libraries get more and more mainstream, they often tend to add more and more features and we're using .NET and Entity Framework and that kind of thing, and they're pretty flexible and I'm fine with that. If we were smaller, maybe something even more opinionated would be better. But I think, when you pick your libraries, pick the ones that are going to support the opinions you want.
Mark Brocato (30:16):
Yeah. I don't know if I'm internally consistent on this one because I'm a lover of Ruby on Rails, which is the opinionated web framework. And I totally agree with what you said, in building a team, I always try to get the team to use as few technologies as is practical. Because it's just so hard to have team members jumping between five or six different languages, to find and recruit a staff that has that skillset. And so, we try to make conscious choices about finding the best tool for the job, but balancing that with using the fewest amount of tools possible because what I want...
Mark Brocato (30:59):
My experience is all in startups and fast moving environments, and you want people to be able to drop in and out of any part of the system at will as is needed. It's so comforting to work on a team when anybody can solve any problem. And you facilitate that by having as few different patterns and technologies as possible so that people can context switch easily.
Mark Brocato (31:20):
But that said, for opinionated things, it's fine, as long as I agree with the opinion. And there's some mainstream ones out there that I've just never agreed with. I'm part of this Clubhouse room every Thursday called JavaScript Thursdays, where we just talk about the latest stuff in JavaScript.
Mark Brocato (31:35):
And every discussion devolves into, so what state management framework should I use for React? Every week it's the same thing. And Redux is the most popular one and I hate Redux, just hate it with a passion. It is just, over-thought, over engineered, God why, none of this stuff makes any intuitive sense. What the heck is a Reducer? It sounds like-
Andrew Colombi (31:55):
What's the answer?
Mark Brocato (31:58):
Since Hooks came out, I just use Hooks. Hooks and useState and context. I just roll my own. Before that, I was very much a proponent of MobX, which I just thought was way more performant, way more simple. The only reason I think that people have a bone to pick with MobX is, it's kind of anti-React in that it takes control away from React in its rendering reconciliation. But in doing so, it makes the whole app much faster and much more reliable and performant, and it's just a simpler model.
Mark Brocato (32:26):
So I get why people use Redux, it's great. The bigger the team you have, the more order Redux enforces. It's got great dev tools. But for me, personally, no, it's just too weird. And the tools that React now gives you, many of which owe their existence to Redux in the patterns that it established, I feel like are just a better choice for the teams I've worked on. They give you a bit more flexibility and are just simpler.
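Editor's sketch (not from the talk): for readers wondering what a reducer is, it is a pure function that takes the current state and an action and returns the next state. The example below illustrates that idea in Python rather than JavaScript; `todos_reducer` and the action shapes are hypothetical names chosen for illustration, not Redux's API.

```python
from functools import reduce


def todos_reducer(state, action):
    """Return a new state for the given action; never mutate the old one."""
    if action["type"] == "add":
        return state + [action["text"]]
    if action["type"] == "remove":
        return [item for item in state if item != action["text"]]
    return state  # unknown actions leave the state unchanged


# Hypothetical actions, applied in order starting from an empty list.
actions = [
    {"type": "add", "text": "write tests"},
    {"type": "add", "text": "ship"},
    {"type": "remove", "text": "write tests"},
]

final_state = reduce(todos_reducer, actions, [])
print(final_state)
```

Folding a list of actions through one function is where the name comes from; the boilerplate the speakers mention is everything needed to wire such functions into a UI.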
Andrew Colombi (32:52):
So I followed a similar path to you, but I kind of got off the train a little... I never left the train I was on, which is, I used Redux. I was just like, "Man, there's so much boilerplate. Let me try something different, let me see what else is out there." And then, I had tried MobX and that's what we're using right now and we haven't made the leap to Hooks and the new React stuff, but maybe someday.
Mark Brocato (33:21):
Yeah. Earlier this year, I rebuilt the front-end for Mockaroo and Mockaroo used to have a little bit of MobX in there, but now it's just useState and context. I don't know if that's the right choice for everybody because I've been with React since the beginning and it's literally a huge part of my day job to be an expert in React. So I have a very different angle on it, I'm very comfortable in the internals of it. But with other junior members on my team, it seems to be fine to just stick to that and not use a framework or a state management library.
Andrew Colombi (33:57):
Yeah.
Chiara Colombi (33:58):
I'll jump in with another question from the audience. Do you think and I think this is probably for both of you, this testing tool is easy for a business analyst team to adapt with, as they know more about business use cases and which data elements need mock-ups of data versus scrambling PII data?
Andrew Colombi (34:19):
Okay, go ahead.
Mark Brocato (34:20):
I can take that at least to start. So one of the things I'm proud of with Mockaroo is that it appears very simple. And so, there's a layer of it that is really meant for anybody of any technical background. Anybody can show up at the homepage and put together data that resembles what they need, but it does have a level under the surface where you can do very sophisticated things with the Ruby scripting, formulas, custom distributions.
Mark Brocato (34:49):
And so, I don't know that I designed it this way, but it turned out this way that it does have a very wide appeal, that folks who are non-technical or not developers, but maybe they're data scientists or just other business users, use it quite frequently; salespeople and QA engineers and people that aren't developers. So I hope the answer is yes, that it adapts well for those folks.
Mark Brocato (35:13):
And one way that everybody out there can help is to be vocal in asking for new data types, don't feel like you have to engineer everything from scratch when it's not there. I add new data types all the time, and it is what it is because of what people have asked for. So, sorry, Andrew, go ahead.
Andrew Colombi (35:29):
Yeah. No, great. I mean, "it is what it is because of what people ask for" is the story of every great product. So I completely agree with that sentiment. I feel like many times, a startup's advantage or any company's advantage is really just who talked to them, and if you got the right conversations, if you were there earlier so you had more conversations, that's your advantage; you know more because you talked to more people.
Andrew Colombi (36:00):
To answer the question, I mean, I also agree. I think Tonic is a tool that is not geared only towards developers, I mean, it's actually very point and click. We have many business analysts and DBAs using Tonic every day to create datasets that satisfy their testing needs and grow with the testing needs as they change. It's a pretty straightforward UI to use, and I think that opens up the ability for anyone to use it.
Andrew Colombi (36:39):
And furthermore, it also kind of, it's keeping track of your database. So one of the big differences with Tonic and Mockaroo, Tonic is connected to your database and actively using your database as a way to know what to create. And because it's connected to your database, it actually can monitor it for changes that might require you to intervene and say like, "Oh, well, a new table was added. Let's make sure that we get data for that table." Or, "A new column was added or PII showed up in a place that PII wasn't showing up before." So those are things that we've all thought about first, as first-class features that we want.
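Editor's sketch (not from the talk): the idea of watching a database for new tables and columns can be illustrated by snapshotting the schema and diffing two snapshots. This is only a minimal illustration against an in-memory SQLite database; it is not Tonic's implementation, and `snapshot_schema`, `diff_schema`, and the table names are hypothetical.

```python
import sqlite3


def snapshot_schema(conn):
    """Map each table name to the set of its column names."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = {column[1] for column in columns}
    return schema


def diff_schema(old, new):
    """Return (new tables, new columns on tables that already existed)."""
    new_tables = sorted(set(new) - set(old))
    new_columns = sorted(
        (table, column)
        for table in set(new) & set(old)
        for column in new[table] - old[table]
    )
    return new_tables, new_columns


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
before = snapshot_schema(conn)

# Simulate the schema drifting after the first snapshot was taken.
conn.execute("ALTER TABLE users ADD COLUMN ssn TEXT")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
after = snapshot_schema(conn)

new_tables, new_columns = diff_schema(before, after)
print(new_tables, new_columns)
```

A real tool would also need to decide whether a new column holds PII, which this sketch does not attempt.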
Chiara Colombi (37:21):
So following up on the, “it is what it is because people ask for it.” This is an interesting question about when you're scaling up, how have you thought about quality versus velocity in your engineering tact, particularly when going through scale-up periods for your products?
Andrew Colombi (37:34):
Yeah. I mean, that's another great question. I feel like it's a wisdom thing to a degree or maybe some people just get it right from the start, I don't know. But it's very much a careful trade off and a careful balance you need to hold because the more you require certain quality checks, it's going to have an impact on your ability to move quickly.
Andrew Colombi (38:06):
Part of the purpose of deep automated testing is to make sure you don't change things. When you create a sophisticated automation test, end to end testing, you're testing that this thing stays the same as it ever was. And sometimes the right answer... How many times, raise your hand if you've had to change unit tests because actually the desired outcome is changed. It's not like, now we have a different outcome. And so, that means you're doing more work.
Andrew Colombi (38:38):
Definitely that's not to say unit tests are bad, they're very valuable and we use them at Tonic, especially there's a lot of algorithmic stuff in Tonic that you need to test the algorithm and it'd be impossible as a human to just look at the output and be like, "Oh yeah, this looks statistically right." That's not a thing, you need a computer to examine it, really, and understand the distributions of the input and the output and compare those distributions, et cetera. But it is a really valid question, how do you trade off velocity versus quality?
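Editor's sketch (not from the talk): one common way to have a computer compare an input distribution with an output distribution is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The transcript does not say which statistics Tonic uses, so the technique, the sample data, and the `THRESHOLD` tolerance below are all assumptions made for illustration.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


rng = random.Random(42)  # fixed seed so the check is repeatable
real = [rng.gauss(50, 10) for _ in range(2000)]
good_fake = [rng.gauss(50, 10) for _ in range(2000)]   # same shape as the input
bad_fake = [rng.uniform(0, 100) for _ in range(2000)]  # wrong shape

THRESHOLD = 0.1  # hypothetical tolerance chosen for this illustration

good_gap = ks_statistic(real, good_fake)
bad_gap = ks_statistic(real, bad_fake)
assert good_gap < THRESHOLD, "generated data drifted from the source distribution"
assert bad_gap > THRESHOLD, "the check should reject a clearly different distribution"
```

The point of the second assertion is that a test like this should also be shown to fail on data that is obviously wrong, otherwise it proves little.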
Andrew Colombi (39:20):
The advice that I would give you is kind of similar to the advice that I gave earlier, which is, it doesn't matter until it does; don't prematurely optimize, be economical about this too. Knock on wood, Tonic hasn't had any major security failures or any major bugs that have caused a really big issue for any of our customers and now we have over 50 customers.
Andrew Colombi (39:51):
So perhaps, you're going to ask me this question again in like two months after there's a major outage and I'll be like, "Quality first every time." But for now, because we have a lot of aspirations, a lot of ambitions and we want to keep the velocity high so we're trying to be economical, but not too economical. It's a really tough question, there's no right answer.
Mark Brocato (40:20):
Yeah. I think it depends on the risks involved and the domain that you're in and what's right for that domain; if you're in finance, then quality first. If you're in safety, quality first. And there's an interesting trade off that I've wrestled with over the years of learning, unit tests versus integration tests. And I recently had to have a kind of a come to Jesus moment in the work that we're doing with Rust.
Mark Brocato (40:49):
So Rust, for those of you who don't know, it's a pretty different language, but it's a language where the compiler really beats you up. The compiler is a strict Catholic school nun standing over you with a switch, just beating on you to write the perfect piece of code that has no memory leaks, that has no concurrent modification of shared references, all this low-level stuff. But it's not easy to write unit tests for, it's not as well developed a unit testing community as JavaScript I think is.
Mark Brocato (41:23):
JavaScript and Ruby actually, both have really great unit testing communities. In Rust, you can't really get a coverage report yet, I mean, I hope that changes. What I found is that unit tests are not as valuable unless you're committed to 100% coverage, which sounds like a tall order, in that the difference between projects where we say we're going to have 100% coverage all the time versus projects where we say we're going to have really good unit testing, there's actually a very wide gap there. Because when you have 100% coverage, you can immediately know if something lapsed, you immediately know if there's something uncovered because going from 100 to 99 is an easy thing to notice.
Mark Brocato (42:09):
Going from like everything's in the 90s and one thing changed by 1% and noticing that in a PR, it's very difficult. If I'm going to really invest in unit testing, I want to go all the way and say, "We're going to do 100% coverage and we're going to check that on every PR." But that aside, we also do a ton of integration testing. And this is not on Mockaroo, but this is another thing that I've been working on for years.
Mark Brocato (42:38):
We do a lot of integration testing, and I found the integration testing is actually much more valuable than the unit testing; it finds more real-world problems. And so, I think if I had to choose one, I would invest more in integration testing. It's a little bit harder to develop, it slows down your velocity at the beginning, but then it really finds the real world problems.
Andrew Colombi (42:59):
Yeah.
Mark Brocato (43:00):
If you're not going to do 100% code coverage with unit testing, where you can make unit testing help you is sometimes there's code that's easier to write test first. I'm not one of these people that subscribes to test-first development as a religion, but sometimes it's just harder to write code that you're going to have to check by manually interacting with your browser than to just write the test and write the code alongside it; it actually helps you write the code faster.
Mark Brocato (43:23):
So it can be used as a tool just for increasing your velocity if you use it right. I would not have said this five years ago, but I'm a big proponent of 100% test coverage now because of what I've seen; the safety that it ensures across teams and the easiness to notice when that diverges from that 100%.
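Editor's sketch (not from the talk): the argument that a drop from 100% is easy to notice can be turned into a pull-request check that fails on anything below 100% and names the files responsible. The report format, file names, and `gate` function below are hypothetical; a real project would feed in the output of its own coverage tool.

```python
def coverage_percent(files):
    """files maps a path to (covered_lines, total_lines)."""
    covered = sum(c for c, _ in files.values())
    total = sum(t for _, t in files.values())
    return 100.0 if total == 0 else 100.0 * covered / total


def uncovered_files(files):
    """Paths that have at least one line not exercised by the tests."""
    return sorted(path for path, (c, t) in files.items() if c < t)


def gate(files):
    """Return (ok, message) for a pull-request check that demands 100%."""
    missing = uncovered_files(files)
    if not missing:
        return True, "coverage is 100%"
    percent = coverage_percent(files)
    return False, f"coverage dropped to {percent:.2f}%: " + ", ".join(missing)


# Hypothetical report: one line in formats.py lost its test.
report = {"generator.py": (120, 120), "formats.py": (79, 80)}
ok, message = gate(report)
print(ok, message)
```

With a threshold of "everything in the 90s" the same one-line lapse would pass silently, which is the gap Mark describes.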
Andrew Colombi (43:43):
There's a couple of things that I wanted to follow up on there. So the idea of what matters and where does it matter, that's a really good point, Mark. Even within a code base too, you can have sort of FinTech and healthcare or whatever, safety, but within our code base, Tonic's code base, there are some really sensitive algorithms that really, really need to be right, and it's hard to know if they're right.
Andrew Colombi (44:08):
For example, we have encryption algorithms, we have a lot of statistical stuff that is poring through data and then trying to create new data like it, we have a subsetting algorithm which is just a really intricate algorithm that slowly crawls our database. Those things are great candidates for automated testing, it's slightly higher than unit testing, but maybe not as far as integration testing, whatever you want to call that, I usually just call it unit testing.
Andrew Colombi (44:38):
And we definitely focus on those because if those go wrong, then data does not get produced the way we promised it did and that's bad for Tonic, we want our customers to be confident that the data we're producing is good. We skimped maybe on some of the UI unit tests, but we're heavier on some of the backend algorithmic tests.
Mark Brocato (45:01):
Yeah. I definitely forgive that. In my projects, where we do the least amount of testing is the UI, unit-testing-wise. I've been on projects where we're actually building component libraries, like for React, and then unit testing is really great there because your surface area is so large, your surface area is your API, so you need automated tests. But if you're just building an application, that's not one of the first things I would invest in, is automated testing of the UI. People that can do that well are not cheap and it's a lot of work to do.
Andrew Colombi (45:32):
And it's very hard. Yeah.
Mark Brocato (45:33):
And then, you change your design because you arrive at a better UX and then it's like, "Well, throw out all your tests now." That's a frustrating row to hoe.
Andrew Colombi (45:40):
Yes, for sure.
Chiara Colombi (45:44):
We have a question that's more about the space that these platforms are in, are there any specific problems that you think synthetic data tools can solve in the future that you're excited about? And the example they gave is, testing and QA for machine learning. And I think there already is work being done there.
Mark Brocato (45:59):
God, I need to read a book on machine learning. It's just so hard to know everything in tech and that is one of my weakest areas. And if I knew more, Mockaroo would be a more useful tool and that's probably where I need to go next.
Mark Brocato (46:16):
One of the things recently I was talking to a customer about was, it's actually an emerging financial standard for credit card information interchange. It's not adopted yet, but they're banking on it being adopted and being the future. And that's where Mockaroo is really helpful, they don't have any data to reproduce or to fake.
Mark Brocato (46:44):
And so, their ask was a very tall one in that there's hundreds and hundreds of fields they need to synthetically create, but I'm proud to be able to help with that kind of stuff. And my heart has always laid with innovation and creating new stuff, and so Mockaroo serves that pretty well, but Andrew-
Andrew Colombi (47:06):
Yeah. It's a really open space right now, and there are a lot of companies coming at it from different angles. Tonic's perspective is that we've been focused on structured data, so that's data that looks like either a table or a JSON document. And I think for Tonic, interesting avenues of further exploration would be around the unstructured stuff like images and text, there's phenomenal advances in both of those already in academia and also in industry; there's GPT-2 and 3, and then there's the Google deep image synthesis stuff.
Andrew Colombi (47:54):
So for Tonic, that's an interesting next step that I think we are interested in taking. But there are other companies, that's their first step. There are companies that are looking at image synthesis as the number one thing to solve for them. So it's hard to even know what's the next step for data synthesis because it's such a broad field right now and so many people are in it, but to Mark's earlier point, it's impossible to know everything in tech and so my next step might be someone's first step.
Chiara Colombi (48:27):
Yeah, that's a good point. A couple more questions from the audience, what trade-offs are there building for a few users versus many users or for building by few versus many engineers?
Andrew Colombi (48:41):
Good questions. I'll start maybe with the few users versus many users. One thing I would say is, well, there's a variety of things here, there's the architecture of few users versus many users. I don't have as much to say about that even though that is the topic here. I was going to say, the user experience of few users versus many users is also another interesting thing which affects architecture.
Andrew Colombi (49:12):
I typically design for many users. And by that, I mean, your UX, how it's going to feel to use your product. Design for many users because that's going to open the gates of who's going to use your product, and making your product easier and more accessible is always better. Even those users that are like, "Oh, I'm the power user. I don't care, just give me a fricking terminal. That's all I need."
Andrew Colombi (49:47):
There are ones that actually are true about that, but a lot of them that use the terminal if you gave them some point and click, they would like it more and they'd be stickier and they would use it longer, and all those things. So I'm a proponent of trying to make your designs as simple as possible.
Andrew Colombi (50:02):
On the architecture side of it, in my time at Palantir and Tonic, we've never had to really architect for many users and internet-scale many users. Most of my services are dealing with hundreds of users at most. And so, I don't know if I have as much to add there. I think, Mark, you might have some thoughts.
Mark Brocato (50:25):
Yeah. My recent work in the last five years has been internet scale, for sure. So I think one thing, designing for few users sucks. In my early days in my career, I designed a product that had few users. It was a very valuable product. It was a great job, loved the people I worked with.
Mark Brocato (50:51):
But what bothered me was, you get these whale customers that come in and they want you to add a salad spinner to your application, it's the most important thing for them. They dictate a lot of your design, and that can hurt down the line where you start getting more users. And then, your application is this weird design that made sense to this one guy who had a huge contract with you, and that just stinks.
Mark Brocato (51:16):
The faster you can get to even a lot of free users, I feel like the better your product will be in terms of design. Because all those opinions and people banging away at it harden it very quickly. In the beginning, it kind of humbles you as an engineer, as a designer, but it's a relief when you arrive at something that is working every day for lots of people of different skillsets and backgrounds.
Mark Brocato (51:41):
For me, personally, I will always try to guide my career towards the things that have lots of users, because it's more fulfilling and it's actually easier, I think, because the feedback comes so rapidly that you know when you've done something wrong and you don't get too far down a wrong path before the internet guides you back to something that's more usable.
Mark Brocato (52:05):
Designing for large teams versus small teams is an interesting one too. I've mostly worked on small teams, but have some experience with larger teams as well. And with the larger the team, the more concrete you need to be in the contracts between either the services or the teams. All those different touch points need to be well-documented and need to be well tested.
Mark Brocato (52:30):
Not having to do that is an advantage for small teams, for sure. If you're a team of five or 10 and you can get away with everybody banging away on the same monolithic code set, you're going to be much faster. That just doesn't scale to infinity, and it's hard to make monoliths scale technically as well.
Mark Brocato (52:52):
The thing that makes startups possible these days and why we have such a boom in innovation is the cloud. The story of the last 15 years of computing was AWS and commoditizing cloud resources. And everybody needs to build for taking advantage of that type of cloud today. And so, whether you like it or not, that is many services that are specialized and using services like AWS to scale. So there's that tension between monoliths and microservices again, but I think you need to be level-headed about picking and choosing.
Andrew Colombi (53:26):
Yeah. I mean, that was a great point about it. I forgot about the big teams versus small teams. And I agree with everything you said, just that point about the contracts needing to be very, very, very solid. And that's where you spend a lot of time communicating, documenting, and testing, is 100% true once you go...
Andrew Colombi (53:47):
And I would encourage people to try to keep things as small as possible for as long as possible because you're so much faster and there's so much less ceremony, so much less overhead. But once you get past 30, 40, 50 developers, things start needing to be different than 15 people crushing it on a monolith.
Mark Brocato (54:07):
Yeah. And in business speak or Silicon Valley speak, don't scale until you have product market fit. That's the other way of thinking about it. It's brutal to be in a business where you thought you had PMF and you've scaled, and now you need to go find PMF. It's just so difficult to move quickly.
Mark Brocato (54:28):
And to try to figure out how to make use of a large team of engineers, you find work for them that really doesn't need to be done and then you have to notice that you're doing that. So yeah, definitely stay small as long as possible until you're forced to get bigger by demand.
Andrew Colombi (54:45):
Yeah.
Chiara Colombi (54:46):
We're coming up on the hour and we've got a question that is great to end on. There are more questions from the audience, what I might do is just ping you guys afterwards and I can shoot off some emails to people to get their questions answered. For this question, I'm going to ask it in two different ways. If you could do everything over again, what would you do differently? Or... What do you know now that you wish you'd known when you started?
Mark Brocato (55:10):
Well, I can take that. There are parts of Ruby that are not a great choice for an application like Mockaroo. Ruby's not the highest performing language out there. With something faster, I would be able to scale to much larger sizes, more cheaply. Now you can pay Amazon to scale anything, no matter how slow, to any size, but your bank account will only go so deep.
Mark Brocato (55:36):
But if I had the foresight to use something lower level, and I think then these technologies didn't even really... the choice would have been something like C or maybe Java. Now, there's a language that I love, which is the same syntax as Ruby called Crystal, which is basically a systems level language with very forgiving syntax. It's super, super fast; it's as fast as C.
Mark Brocato (56:01):
Someday, I will rewrite parts of Mockaroo in Crystal, and I'll be able to generate 1,000 times as much data in the same amount of time. I wish I'd had the foresight to know how successful it would be and maybe choose some different technologies for the computationally intensive parts of Mockaroo. So yeah, that's my, the one thing I would love to change.
Andrew Colombi (56:23):
Yeah. When you're starting a company, you're always trying to, what is the analogy, is a hero is trying to skate to the puck and not know where the puck is or whatever, ice hockey analogy. And I think when we started Tonic, on-prem was a really important model for us and I think it's still a very important model for us. I think 10 years from now, looking back on that, you're going to see that there's more and more hosted opportunities, more and more cloud.
Andrew Colombi (57:01):
And it's not that I would necessarily, if I could do it over, I would just say, "No, we're a cloud only, cloud first." But I definitely foresee in the next years that being a big tension for products like Tonic. And maybe 10 years from now, I'll look back and be like, "That was actually dumb, we should have just done hosted cloud first." Well, we'll see, there are different... And then, with all the legislation happening right now too, maybe on-prem will get like a resurgence because of data sovereignty, and all that jazz.
Chiara Colombi (57:42):
Yeah. There were more questions we could have asked, but this has been awesome. Thank you guys both so much for your time. Any final thoughts you'd like to chime in with or did we cover everything?
Mark Brocato (57:53):
Well, I hope everybody enjoyed, this is a little bit of a different talk for us, nerds talking about architecture and coding and stuff. I love doing this stuff and there was a pretty good turnout, so hopefully we can do something along these lines again. Just thank you everybody for turning out and thank you for using Mockaroo.
Andrew Colombi (58:11):
Yeah, absolutely. I echo those words. Thank you for showing up, and I hope you learned something about Tonic, about Mockaroo, et cetera.
Chiara Colombi (58:20):
Yeah. Definitely we'll be hosting more events like this, because it's wonderful just to have you guys kind of share your thoughts and ideas in all these areas. I did put up on the screen how you can contact us if you'd like to reach out. You've got an email address for both Mockaroo and Tonic. You can find us on Twitter at our different handles and our websites as well.
Chiara Colombi (58:41):
Let me see, just one last quick mention of the workshop that we are hosting next week. Go on to tonic.ai, click on webinars and you'll find the registration page there. Thanks to everyone for joining us. Thank you, Mark. Thank you, Andrew. And we look forward to the next conversation.
Mark Brocato (58:59):
Likewise, thank you, Chiara.
Chiara Colombi (59:00):
Yes.
Andrew Colombi (59:01):
Thanks. Bye.
Chiara Colombi (59:01):
Bye, everyone.