
De-Identifying Your Text Data in Snowflake Using Tonic Textual

Adam Kamor, PhD
February 7, 2024
    Appropriate handling of sensitive customer data is paramount for companies of any size. Period. As businesses increasingly turn to Snowflake for their data warehousing and analytics needs, the challenge of maintaining data privacy without sacrificing utility becomes more complex. While Snowflake does offer some native dynamic masking capabilities for structured data, unstructured text data is an entirely different beast, and the majority of data masking solutions don’t support it. A better option is to use Tonic Textual, our new synthetic data product for automated recognition and redaction of sensitive entities in free-text data.

    Textual offers an innovative approach to preserving the confidentiality of text data that utilizes proprietary named entity recognition models trained on a diverse and comprehensive corpus of text data. Our models are trained on data spanning domains and contexts, allowing them to have better generalized performance than open-source models. We’ve designed the product with an SDK-first approach that lets you integrate Textual directly into your workflows and notebooks. Today, I want to show you how you can de-identify data directly within your existing Snowflake workflows and pipelines, ensuring compliance and maintaining customer trust without hindering data utility.

    In this article, we also provide a step-by-step guide on creating and implementing a User-Defined Function (UDF) in Snowflake that leverages the power of Tonic Textual. My hope is to empower Snowflake users with a new approach to effectively manage your data privacy requirements and enable new use cases to leverage Snowflake data safely. But first…

    Why is this important for Snowflake users?

    Your Snowflake warehouse is a highly valuable asset, filled with rich data about your organization and customers. If your organization cares about data privacy, security, and customer trust, Tonic Textual makes it easy to protect the text data in your warehouse, unlocking its value for use in machine learning, data analytics, and testing data pipelines without compromising data security. Alright, let’s dive in.

    Creating and Implementing the UDF in Snowflake Workflows

    Configuring a UDF to redact sensitive text data with Tonic Textual in Snowflake is fairly easy to do. I’ve recorded a short demo video to walk you through the process:

    All of the code I used to do this is available on GitHub. Step by step, it looks something like this:

    1. Obtain a Tonic Textual API key

    • Sign up for a free Tonic Textual account and create a Tonic Textual API key.
    • Give your API key a name, like "snowflake UDFs", and create it.
    • Make sure to copy this API key, as you'll need it for setting up the UDF in Snowflake.

    2. Set up the UDF in Snowflake

    • Navigate to the GitHub repository to access code examples and scripts for setting up the UDF in Snowflake.
    • Run the 'UDF setup' script in a Snowflake Python worksheet.
    • Insert the API key you created earlier into the appropriate place in the script.
    • This script will create the Python UDF and set up necessary permissions.
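The setup script registers a Python UDF that forwards text to Tonic Textual and returns the redacted result. As a rough illustration of what such a handler does under the hood, here is a minimal sketch; the endpoint URL, payload shape, and response field names are assumptions for illustration only, not the actual Tonic Textual API:

```python
# Hypothetical sketch of the logic inside a Snowflake Python UDF handler
# that calls out to a text-redaction service. The URL, request payload,
# and response schema are assumed for illustration.
import json

TEXTUAL_ENDPOINT = "https://textual.tonic.ai/api/redact"  # assumed URL


def build_request(text: str, api_key: str) -> dict:
    """Assemble the HTTP request the UDF handler might send."""
    return {
        "url": TEXTUAL_ENDPOINT,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"text": text}),
    }


def parse_response(response_body: str) -> str:
    """Extract the redacted text from a (hypothetical) JSON response."""
    return json.loads(response_body)["redacted_text"]
```

In the real UDF, Snowflake external access or a similar mechanism would carry the request over the network; the GitHub setup script handles those permissions for you.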

    3. (Optional) Using Sample Data

    • Optionally, run 'sample_data.sql' to create a table called 'conversations' with sample data for testing purposes.

    4. Execute the UDFRedact Function

    • With the UDF set up, you can now use the UDFRedact function.
    • The function can be called on a string of text or directly on a table column containing this data.
    • For example, SELECT UDFRedact('Your text here') will return the redacted version of the input text.
    • To redact data from a table, use the function in a query like SELECT UDFRedact(column_name) FROM your_table.

    5. Viewing Redacted Output

    • The UDF returns redacted data where sensitive information is replaced with context-specific, tokenized values.
    • These tokens are unique to the original values and are optionally reversible.
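To make the token behavior concrete, here is a toy stand-in for the kind of redaction Textual performs: sensitive values are swapped for unique, type-labeled tokens, and the mapping is retained so the redaction can optionally be reversed. (Textual's real models use trained named entity recognition, not regexes; this sketch only illustrates the tokenization and reversal behavior, with the token format assumed.)

```python
# Toy illustration of tokenized, reversible redaction. A regex stands in
# for Textual's NER models; the [EMAIL_n] token format is an assumption.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def redact_emails(text):
    """Replace each email with a unique token; return text and mapping."""
    mapping = {}

    def token(match):
        value = match.group(0)
        if value not in mapping:  # same original value -> same token
            mapping[value] = f"[EMAIL_{len(mapping) + 1}]"
        return mapping[value]

    return EMAIL_RE.sub(token, text), mapping


def unredact(redacted, mapping):
    """Reverse the redaction using the saved token mapping."""
    for value, tok in mapping.items():
        redacted = redacted.replace(tok, value)
    return redacted
```

Because each token is unique to its original value, repeated values stay consistent across a document, which preserves analytic utility (joins, counts) in the redacted output.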

    And there you have it! Protected free-text data in Snowflake, safe and ready to be put to use in your ML, analytics, and software testing workflows, with data privacy and data utility preserved. For further information, visit the Textual page and create a free account to try it out for yourself.

    Adam Kamor, PhD
    Co-Founder & Head of Engineering