De-identifying test data: K2View’s entity modeling vs Tonic’s native modeling

Andrew Colombi, PhD
March 8, 2024

    The process of de-identifying test databases can be approached in a variety of ways, and we’re often asked how our approach compares to others. In this article, we’ll explore how our approach differs from that of “Data Product Platform” K2View, since we’ve discovered that we’ve built our technologies in two very different ways. Read on to learn which approach will work best for you.

    K2View’s Entity Model Approach

    Similar to Tonic, K2View offers technology for de-identifying sensitive data for use in testing and development. Our end goal is the same, but how we reach that goal is, suffice it to say, not. And when it comes to evaluating implementation and configuration, it’s all about the journey and not the destination.

    To implement K2View, your journey begins by creating an entity model that represents, conceptually, the contents of the database you’d like to de-identify. What is an entity model? It’s probably easier to describe with an example. Imagine a database with two tables, People and Houses. Each row in People represents a person, and each row in Houses represents a house where people live.

    A sketch showing two tables, one for People and one for Houses, with a foreign key column in the People table linked to the primary key ID column in the Houses table.

    The entity model of this simple database represents each row of People and Houses as entities. The columns of these tables become properties of the entities, and the foreign key between People and Houses forms a link between the Person and House entities.

    A sketch showing how different People entities map into different House entities, with multiple people mapping into one house.
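
    To make the mapping concrete, here is a minimal sketch of turning the two tables into entities, properties, and links. It is not K2View's API or data format; the column names, the Entity class, the build_entities helper, and the "lives_in" link name are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical rows; the column names are assumptions for illustration.
houses_rows = [
    {"id": 1, "address": "12 Oak St"},
    {"id": 2, "address": "7 Elm Ave"},
]
people_rows = [
    {"id": 10, "name": "Ana", "house_id": 1},
    {"id": 11, "name": "Ben", "house_id": 1},
    {"id": 12, "name": "Cy", "house_id": 2},
]


@dataclass
class Entity:
    kind: str
    key: int
    properties: dict
    # Each link is a (link_name, target_kind, target_key) tuple.
    links: list = field(default_factory=list)


def build_entities(rows, kind, foreign_keys=None):
    """Turn each row into an entity; foreign key columns become links."""
    foreign_keys = foreign_keys or {}
    entities = []
    for row in rows:
        props = {c: v for c, v in row.items()
                 if c != "id" and c not in foreign_keys}
        links = [(name, target, row[col])
                 for col, (name, target) in foreign_keys.items()]
        entities.append(Entity(kind, row["id"], props, links))
    return entities


houses = build_entities(houses_rows, "House")
people = build_entities(people_rows, "Person",
                        {"house_id": ("lives_in", "House")})

# Follow the links to group people by the house they live in.
residents = {}
for person in people:
    for _, _, house_key in person.links:
        residents.setdefault(house_key, []).append(person.properties["name"])
```

    Even in this toy version, someone had to decide that house_id is a link rather than a property, and what that link is called.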

    Seems easy, right? All you need to do is identify all the entities, properties, and links in your database, et voila: an entity model. Well, having worked at Palantir for nearly 10 years, I know a thing or two about entity models, or “ontologies,” in Palantir’s parlance. (Early Palantir built its entire business on the idea that the right ontology can unlock powerful analytics.) And I can tell you, defining an entity model is not as easy as it sounds.

    The Challenges of Entity Modeling

    To illustrate the challenges of entity modeling, let’s imagine a simple database with three tables that represent people using meeting rooms in an office. The three tables are:

    • People. Each row represents a person in the office.
    • Rooms. Each row represents a room in the office.
    • Attendance. Each row represents the time a person attended a room. This table has three columns: person_id, room_id, and a timestamp for when the person was in the room.

    A sketch of three tables, including People, Attendance, and Rooms, and showing how a column appears in each table.

    Problem 1: Modeling ambiguity

    In this three-table example we have two obvious entities: Person and Room, derived from the People and Rooms tables respectively. But what about Attendance? Is Attendance an entity, or just a link between the Person and Room entities?

    A sketch illustrating a scenario in which Rooms and People are entities, vs a scenario in which Rooms, Attendance, and People are entities.

    To me, it feels more like a link. It doesn’t really represent a real world thing, and I feel like entities should be reserved for something that’s more concrete. 

    Now let’s say I make one small tweak to the database: let’s rename the Attendance table to Meetings. Whoa! A Meeting is a real world thing that happens to people in a room at a certain time. That’s an entity! If you’re feeling uneasy that the name of a table might fundamentally shift the way you think about your entity model, then you’re not alone. There is no right or wrong; the exercise is fundamentally ambiguous.
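
    The ambiguity is easy to see in code. The sketch below models the same hypothetical Attendance rows both ways: once as links between Person and Room, and once as Meeting entities. The row values, function names, and dictionary shapes are assumptions for illustration, not any tool's actual format.

```python
# Hypothetical Attendance rows using the three columns described above.
attendance_rows = [
    {"person_id": 1, "room_id": 100, "timestamp": "2024-03-08T09:00"},
    {"person_id": 2, "room_id": 100, "timestamp": "2024-03-08T09:00"},
    {"person_id": 1, "room_id": 101, "timestamp": "2024-03-08T14:00"},
]


def model_as_links(rows):
    """Option A: each row is a Person -> Room link carrying a timestamp."""
    return [
        {"from": ("Person", r["person_id"]),
         "to": ("Room", r["room_id"]),
         "type": "attended",
         "timestamp": r["timestamp"]}
        for r in rows
    ]


def model_as_entities(rows):
    """Option B: rows sharing a room and timestamp form one Meeting entity."""
    meetings = {}
    for r in rows:
        key = (r["room_id"], r["timestamp"])
        meeting = meetings.setdefault(
            key,
            {"kind": "Meeting", "room": r["room_id"],
             "timestamp": r["timestamp"], "attendees": []},
        )
        meeting["attendees"].append(r["person_id"])
    return list(meetings.values())


links = model_as_links(attendance_rows)        # three links
meetings = model_as_entities(attendance_rows)  # two Meeting entities
```

    Both outputs are defensible readings of the same three rows. Note that Option B had to invent a rule for what counts as one meeting (same room, same timestamp), which is yet another modeling decision someone has to make and defend.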

    This modeling ambiguity presents real challenges. The first is that it’s subjective, and subjectivity kills productivity. Are you looking forward to the meeting to collect feedback on whether Meetings are entities or links? Now multiply that by the countless other examples in your rich dataset that will inevitably be ambiguous in other ways. (Is an Address a property of an entity, or an entity itself? If Address is an entity, should the City and State also be entities that are linked from the Address, or should they be properties?) Every decision is another point of friction on the way to productivity.

    Another issue is that your choice of entity model will have downstream impacts on how you can interact with the modeled data. Tools that work with data ontologies have different actions you can perform on entities, properties, and links. Your modeling will have direct consequences on what you can do, and those consequences are not always obvious upfront. Instead they may show themselves only after you’ve worked with the entity model for some time. Realizing your workflow is blocked because you decided Attendance should be a link instead of an entity is a special kind of frustrating.

    Finally, subjectivity means that different teams may choose to model the same data in different ways. These incompatible entity models will prevent teams from effectively working together down the line.

    Problem 2: Ambiguity at scale

    We’ve established that within one dataset it’s not always obvious how to model data. Now bring multiple datasets into the mix. Our simple three-table database is but one microservice among many. Imagine we have another database that represents HR data, e.g. each employee’s name, title, address, phone numbers, emergency contact, etc. It’s just one table, People, and it has very different data from the other People table. The data may even be incompatible, e.g. name vs first_name and last_name. Are these both modeled as the same Person entity, or as different kinds of Person?

    A sketch of an HR People table.
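
    As a rough sketch of the choice, here are the two People tables modeled both ways. The sample rows, the employee_id column, and the assumption that ids line up across the two databases are all hypothetical; real systems rarely make the join this convenient.

```python
# Hypothetical rows from the two databases; column names are assumptions.
office_people = [{"id": 1, "name": "Ana Diaz"}]
hr_people = [{"employee_id": 1, "first_name": "Ana",
              "last_name": "Diaz", "title": "Engineer"}]


def unify(office_rows, hr_rows):
    """Option A: one Person entity merged from both sources.

    Assumes office ids and HR employee ids refer to the same people.
    """
    merged = {r["id"]: {"name": r["name"]} for r in office_rows}
    for r in hr_rows:
        person = merged.setdefault(r["employee_id"], {})
        # Keep the office name if present; otherwise build one from HR data.
        person.setdefault("name", f'{r["first_name"]} {r["last_name"]}')
        person["title"] = r["title"]
    return merged


def keep_separate(office_rows, hr_rows):
    """Option B: two distinct entity kinds, each with its native columns."""
    return {
        "OfficePerson": {r["id"]: {"name": r["name"]} for r in office_rows},
        "Employee": {
            r["employee_id"]: {k: v for k, v in r.items()
                               if k != "employee_id"}
            for r in hr_rows
        },
    }


unified = unify(office_people, hr_people)
separate = keep_separate(office_people, hr_people)
```

    Option A has to pick a winner when the two sources disagree about a name. Option B avoids that, but leaves every consumer to work out that an OfficePerson and an Employee may be the same human.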

    Each entity modeling approach will have its own advantages and disadvantages. Which you pick will be a matter of opinion and experience, and you’ll probably regret whichever decision you make because there never was a right answer to begin with. In other words, it’s all the same challenges as Problem 1.

    Problem 3: Entity model maintenance

    These challenges never end. Your initial model must adapt as the underlying data changes or as you onboard new datasets. In fact, the issues compound as your entity model grows. With every addition of data, your modeling decisions must be compared against all past modeling decisions, meaning the more you model, the more laborious future modeling becomes. For example, Address is an entity in this database; does that mean it should be an entity everywhere? What if some databases don’t have a separate Address table? How do we reconcile these differences?

    Changes to your underlying data may necessitate changes to the overall entity model too. Every future change to your database (e.g. adding a column, deleting a table, or adding a foreign key) will need to be reflected in the entity model you’ve developed. This kind of tax on development work quickly becomes tiresome.
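
    One way to picture this maintenance tax is as a drift check that has to pass after every schema migration. The sketch below compares a table's live columns with the properties a hand-maintained entity model declares; the model_drift helper and both dictionaries are hypothetical.

```python
def model_drift(table_columns, entity_properties):
    """Compare live table columns with the properties an entity model declares."""
    columns, props = set(table_columns), set(entity_properties)
    return {
        # Columns the database has that the model does not know about yet.
        "unmodeled_columns": sorted(columns - props),
        # Properties the model declares that no longer exist in the table.
        "stale_properties": sorted(props - columns),
    }


# Hypothetical state after a migration split name into first_name/last_name.
entity_model = {"Person": ["id", "name", "house_id"]}
live_schema = {"People": ["id", "first_name", "last_name", "house_id"]}

drift = model_drift(live_schema["People"], entity_model["Person"])
```

    Here drift reports two unmodeled columns and one stale property. Detecting the mismatch is the easy part; someone still has to decide how the entity model should change in response.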

    How Does an Entity Model Protect My Data?

    It doesn’t. Remember, entity modeling is not the goal; it’s simply a journey to get you to the goal. The goal is to create useful, protected data based on production data, and you don’t need an entity model to do that. With Tonic Structural, we focused on streamlining the journey to get you to your goal as efficiently as possible. This means working directly with data in its native form: tables, columns, rows, and documents; JSON, XML, CSV, Avro, and Parquet; Postgres to Oracle, Databricks to Snowflake. Tonic connects directly and immediately to your data where it lives. No need to answer complex modeling questions before getting right into the heart of your data.

    The benefits of this approach are many, including avoiding all of the problems created by entity modeling. With native data connectors:

    1. There are no ambiguous questions about how to model your data, because there is no entity model.
    2. Changes to your underlying data do not require updating an entity model too, because there is no entity model.
    3. You can start protecting your data immediately without engaging in a complex modeling project. Start with just one column or table, and get immediate value. Iterate from there to scale up to your whole data ecosystem.
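
    To illustrate the general idea of working on native columns, here is a minimal sketch that protects a single column in place, with no entity model involved. It is emphatically not Tonic Structural's API or implementation; the pseudonymize and protect helpers, the SECRET key, and the sample rows are assumptions, and keyed hashing simply stands in for whatever transformation you would actually choose.

```python
import hashlib
import hmac

# Hypothetical key for the example only; never hard-code a real secret.
SECRET = b"example-key-not-for-production"


def pseudonymize(value, secret=SECRET):
    """Deterministically replace a value with a keyed-hash token."""
    digest = hmac.new(secret, str(value).encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"anon_{digest[:12]}"


def protect(rows, sensitive_columns):
    """Return copies of rows with the named columns replaced; others untouched."""
    return [
        {c: pseudonymize(v) if c in sensitive_columns else v
         for c, v in row.items()}
        for row in rows
    ]


# Hypothetical People rows; start with just one sensitive column.
people = [
    {"id": 1, "name": "Ana Diaz", "house_id": 7},
    {"id": 2, "name": "Ben Okoro", "house_id": 7},
]
protected = protect(people, {"name"})
```

    Keys and foreign keys pass through unchanged, so the relationship between People and Houses survives without anyone declaring it as a link. Protecting a second column means adding its name to the set, not revisiting a model.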

    Entity modeling is a fine tool if you want to enable analysts while hiding the complexity of the underlying data. But for data practitioners like us, operating at a higher level of abstraction does not serve us; it slows us down. Check out Tonic Structural yourself: start with a free trial or book a demo to connect directly with our team.

    Andrew Colombi, PhD
    Co-Founder & CTO
    Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.