We introduced Masquerade as a way to redact data in real time. You can check out of the motivation here. Briefly, adding a proxy between your database and your application is a great way to preserve privacy in a very low friction way. What follows are some of the implementation details.
Postgres clients communicate with Postgres via a messaging protocol over TCP. The best place to learn the details of the protocol is in Postgres’s own documentation. This post will go over the absolute minimum amount of information needed to understand how the proxy works.
The messaging protocol supports two types of messages which are FrontEnd and BackEnd messages. As you may have guessed, FrontEnd (FE) messages originate from the client and BackEnd (BE) messages originate from the server.
Masquerade does not alter FE messages. They are always sent to the server unaltered. Additionally, most BE messages are left unaltered as well, with the exception of:
During initial startup for a connection, the client might issue an SSLRequest request to determine whether or not the server supports SSL. The server will return either a ParameterStatus (’S’) or a NoticeResponse (‘N’) to indicate that it does or does not support SSL, respectively. Because the proxy currently does not support SSL connections, when we see an SSLRequest come from the FE, we intercept the server response and return a NoticeResponse response regardless. This tells the client that SSL is not supported. If the client uses sslmode=require then the connection will not succeed.
The remainder of this article (and a majority of work in the proxy) is related to how we handle modifying DataRow responses. DataRow messages include a single row from a SELECT query result-set. Their structure is simple and defined below:
Byte - ’D’ to indicate what follows is a DataRow message
Int32 - Length of message contents in bytes, including self
Int16 - Number of column values that follow
Next, the following pair of fields appear for each column:
Int32 - Length of column value, not including self. -1 indicates NULL.
Byte[] - Value of the column. Will be 0 bytes if NULL.
Masquerade provides a hook when a DataRow message is received so that users can easily modify DataRows in place, on the wire, and hence modify what data the Postgres client actually sees.
Modifying a DataRow message comes in several parts. First, the actual Byte[] value of a given column is modified according to some logic (we’ll discuss this more in the section on masking). Second, once the new value is computed we must re-calculate the Int32 that corresponds to the length of the column value. Finally, once all applicable columns in a DataRow message have been modified we must re-compute the Int32 of the message content length.
If you would like to see how this is done in Masquerade, please examine the following function in PostgresDataMasker.cs.
<p>CODE: https://gist.github.com/chiarabeth/1b33aa18504a9b621e4b1a731592bbff.js</p>
Notice how modifyDataRow also takes a RowDescription? RowDescription is another type of BE message which gives additional details on the DataRow messages sent afterwards. To view details on how modifyDataRow gets called, you can refer to PostgresBackendStateMachine.cs. The tl;dr is that we maintain a state machine that tracks exactly where we are in the TCP byte stream. The state machine tracks what type of BE message is currently being streamed, what the latest RowDescription message is, etc. and also ensures that modifyDataRow is called only at the appropriate time, i.e., once a complete DataRow message has been received.
This proxy can be useful without masking data for those interested in getting a better insight into how the Postgres Messaging Protocol works. Those willing to modify the code can use the proxy to register event listeners that fire when the FrontEnd or BackEnd send specific message types. Despite that potentially interesting use case, we expect most people to use Masquerade for actual data masking.
Currently, there are 3 settings one can use to control the masking process.
Masking configuration is determined via config.json, in the root of the repository. It looks something like this:
<p>CODE: https://gist.github.com/chiarabeth/c6beb1aa61294eabf09046dc8851db3e.js</p>
The preserve_keys value determines how we treat foreign and primary keys. If set to true, they are not masked; if set to false, they are masked like any other column.
data_type_masks and column_masks allow you to assign masking rules based on either a data type (which would apply to all columns of that data type in the entire database) or based on a specific column. Specific columns are identified by their schema, table name, and column name. column_masks settings override data_type_masks settings.
Each column and data type mask requires the user to specify a masking function. There are currently a variety of simple masking functions included in this first release and more will be added over time. A masking function takes a column value as a string and returns a new string with the corresponding masked value. There is no requirement that the returned string be the same length as the original string. Users can create their own masking functions by creating a function with the following signature:
<p>CODE: https://gist.github.com/chiarabeth/88e79ee7b776ead50527243c7ae4c342.js</p>
in the PgMaskingProxy.Maskers namespace. The masking_function name given in config.json should be all lower case and should match the name of the function.
Typically the proxy is connected to a fast-follower database that might not support writes. We would not recommend connecting a web application directly to the proxy because, most likely, your application will need to run INSERTS, UPDATES, and DELETES. A workaround is to connect to the proxy via pgdump and dump a masked version of the database then run a restore via pg_restore to give yourself a new database on which you can run any operations.
Enable your developers, unblock your data scientists, and respect data privacy as a human right.