Masquerade: A Postgres Proxy to Mask Data in Realtime

Author
Adam Kamor, PhD
April 24, 2019
    Figure: side-by-side psql sessions. The left side shows psql connected to the proxy, while the right side shows psql connected directly to the database.

    Redact and Replace in Realtime with no Additional Infrastructure

    Many of our customers have multiple databases, complex application logic, and limited time. One of the easiest ways to protect your data is to add a proxy between the consumer (analyst, application, developer, etc.) and the database. Since the proxy doesn’t clone the data, there are no additional infrastructure costs. Most importantly, because the proxy sits between your application and the database, there is no need to replicate application logic, and your application doesn’t need to care whether it’s communicating with the real database or the proxy.

    Have you ever:

    • Wanted to let devs/QA test against production-quality data without exposing customer information?  Pull this off without setting up a new staging instance.
    • Wanted to grant some people access to a database to generate reports, without letting them see sensitive data as they create those reports?  Accomplish this without heavy ETL or data sanitization.
    • Wanted to provide data to sales for demos with minimal effort?  Configure complex relationships simply.

    An Open Source Solution

    About 6 months ago, Tonic open-sourced Condenser, a Python-based tool that helps people subset Postgres databases in a way that preserves referential integrity. People have been using it to create dev instances and quickly subset databases to save on storage costs, and we’ve enjoyed working on it. In a similar vein, we are open-sourcing another one of our in-house tools. This one is called Masquerade (our Head of Engineering’s favorite type of party).

    How does it work?

    Diagram showing query and data flow through proxy

    Masquerade can anonymize data in real time, enabling anonymous analytics, application development, and QA testing with next to no overhead. It does this by running a TCP proxy between your Postgres client and your Postgres database and modifying the result sets generated by SELECT statements according to a set of user-defined rules. The end result is that specific tables, columns, and data types can be masked, so users connected via the proxy never see the true underlying data, only a masked version of it.
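
As a rough illustration of the rule-based masking described above, the sketch below applies per-column masking functions to the rows of a result set. It is written in Python for brevity and is not Masquerade's actual code: the rule format, `mask_rows`, and both masking functions are hypothetical, and a real proxy would do this on rows parsed from the Postgres wire protocol rather than on Python lists.

```python
import hashlib
from typing import Callable, Dict, List, Optional

Row = List[Optional[str]]


# Hypothetical masking functions; Masquerade's real rules may differ.
def redact(value: str) -> str:
    """Replace every character with an asterisk, preserving length."""
    return "*" * len(value)


def pseudonymize(value: str) -> str:
    """Replace a value with a short, deterministic hash-based token."""
    return "user_" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


# Assumed rule format: column name -> masking function.
Rules = Dict[str, Callable[[str], str]]


def mask_rows(columns: List[str], rows: List[Row], rules: Rules) -> List[Row]:
    """Return a masked copy of the result set; NULLs and unlisted columns pass through."""
    masked = []
    for row in rows:
        masked.append([
            rules[col](value) if col in rules and value is not None else value
            for col, value in zip(columns, row)
        ])
    return masked


if __name__ == "__main__":
    columns = ["id", "first_name", "ssn"]
    rows = [["1", "Alice", "123-45-6789"], ["2", "Bob", None]]
    rules = {"first_name": pseudonymize, "ssn": redact}
    for row in mask_rows(columns, rows, rules):
        print(row)
```

Because the client only ever receives the output of something like `mask_rows`, it cannot tell which values were changed, which is what lets the application stay unaware of the proxy.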

    Want to learn more? Check out our technical details post.

    Running Masquerade for the first time

    The repository contains everything you need to try out Masquerade, including a Postgres database.

    To get started, run:

    <p>CODE: https://gist.github.com/chiarabeth/7709ba1b9c9455b6fde636bb9384174b.js</p>

    to bring up the database. The database contains a single table called ‘employees’ that has 500 rows, each representing an employee in a fictitious organization. After the database is up, run:

    <p>CODE: https://gist.github.com/chiarabeth/8c9ca1b7eed9931c78bdd939fb1de4bc.js</p>

    to start the proxy. You can then connect to the proxy using a Postgres client. Using psql, enter:

    <p>CODE: https://gist.github.com/chiarabeth/f08c2dcb5756a8ffdba1406db33666ad.js</p>

    to connect to the proxy using the password ‘password’. Note several things in the above command:

    • By default, the proxy runs on port 20000.
    • Connect to the proxy using the same dbname and credentials you would use when connecting to the real database. In the future, we plan on allowing connections to the proxy with different credentials (since users often won’t have access to the underlying database).
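
To make the two points above concrete: moving a client from the database to the proxy changes only where it connects, not who it connects as. The following is a minimal sketch, assuming the proxy runs on `localhost`. `via_proxy` and `to_dsn` are hypothetical helpers, and the sample host, database, and user are placeholders rather than values from the Masquerade repository.

```python
from typing import Dict

PROXY_PORT = 20000  # default proxy port mentioned above


def via_proxy(db_params: Dict[str, object],
              proxy_host: str = "localhost",
              proxy_port: int = PROXY_PORT) -> Dict[str, object]:
    """Return a copy of db_params that targets the proxy instead of the database.

    Only host and port change; dbname and credentials are kept as-is.
    """
    proxied = dict(db_params)
    proxied["host"] = proxy_host
    proxied["port"] = proxy_port
    return proxied


def to_dsn(params: Dict[str, object]) -> str:
    """Render parameters as a libpq keyword/value connection string.

    Simplification: values containing spaces or quotes are not escaped.
    """
    return " ".join(f"{key}={value}" for key, value in params.items())


if __name__ == "__main__":
    # Placeholder connection details for the real database.
    direct = {"host": "db.example.internal", "port": 5432,
              "dbname": "mydb", "user": "analyst"}
    print(to_dsn(direct))
    print(to_dsn(via_proxy(direct)))
```

The resulting string can be passed to any libpq-based client, psql included.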

    When you are ready to connect to your own database, go into config.json, modify the db_connection_details object, and update the connection details in your Postgres client.

    Future plans

    The README lists known issues, which we will address first. Additionally, we would like to extend our masking framework to more easily support masking based on other values in a given row, for example, masking an e-mail address based on the firstname column, since e-mails usually include a person’s first name. For those interested, consider contributing to help solve these problems. If you just want a solution, check out Tonic for an out-of-the-box product that supports a variety of databases and many complex scenarios and relationships.
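
To illustrate what row-dependent masking could look like, here is a minimal Python sketch in which the masked e-mail is derived from the already-masked first name, so the two columns stay consistent. This is not an existing Masquerade feature: the `email` column name, the replacement-name list, and every function here are hypothetical.

```python
import hashlib
from typing import Dict

# Hypothetical pool of replacement first names.
FAKE_NAMES = ["Avery", "Blake", "Casey", "Devon", "Emery"]


def mask_firstname(name: str) -> str:
    """Deterministically map a real first name to one from the fake pool."""
    digest = hashlib.sha256(name.lower().encode("utf-8")).digest()
    return FAKE_NAMES[digest[0] % len(FAKE_NAMES)]


def mask_row(row: Dict[str, str]) -> Dict[str, str]:
    """Mask firstname, then build the masked email from the masked firstname."""
    masked = dict(row)  # leave the original row untouched
    fake = mask_firstname(row["firstname"])
    masked["firstname"] = fake
    # Drop the real address entirely and use a reserved example domain.
    # Note: different people can end up with the same masked address.
    masked["email"] = f"{fake.lower()}@example.com"
    return masked


if __name__ == "__main__":
    row = {"firstname": "Alice", "email": "alice.smith@corp.test"}
    print(mask_row(row))
```

The key design point is ordering: the e-mail rule reads the output of the first-name rule rather than the raw value, so a masked row never pairs a fake name with an address that still reveals the real one.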

    Adam Kamor, PhD
    Co-Founder & Head of Engineering
