
How to prevent data leakage in your AI applications with Tonic Textual and Snowpark Container Services

Adam Kamor, PhD
March 14, 2024

    For detailed instructions on how to set up and install the containers on Snowpark, see this GitHub repository. If you encounter any issues, file a GitHub issue and we’ll get back to you quickly.

    Tonic Textual provides advanced Named Entity Recognition (NER) and synthetic replacement of sensitive free-text data. It is used to safely train AI models on sensitive private data, preventing data leakage through your AI models. Today, we are excited to announce that Tonic Textual is now available on the Snowflake Data Platform via Snowpark Container Services (SPCS). SPCS enables you to run containerized workloads directly within Snowflake, ensuring that your data doesn’t leave your Snowflake account for processing.

    This means that you can take advantage of Textual’s state-of-the-art NER models without ever having to egress your data out of Snowflake’s secure environment. If you are storing unstructured text data in Snowflake and want to use this data for model training, fine-tuning, or RAG applications, Tonic Textual can help you do that safely at scale while maintaining data utility and compliance.

    Architecture

    Snowflake’s new container service lets you deploy software within Snowflake’s secure walls, so you can leverage the latest technology without ever sending your data to third-party cloud services. Below is an architecture diagram showing how Tonic Textual runs from within your Snowflake infrastructure. In short, we provide a Textual user-defined function (UDF) called textual_redact. It takes a column of text as input and returns a redacted column of text. You can use it directly in a Snowflake Worksheet, or through an existing ETL tool or external IDE connected to your Snowflake databases. The data in your Snowflake tables is sent to the Textual application running in the Snowpark service via a load balancer, which round-robins requests across the available Textual ML Model Service nodes. During installation, you determine how many Model Service nodes to stand up; auto-scaling is also supported. The throughput of the UDF is determined by the number of Textual ML Model Service nodes running.

    Because of SPCS, this all happens on Textual services running on Snowflake’s clusters, so your data never leaves the secure confines of Snowflake.

    An architecture diagram for Tonic Textual's deployment on Snowflake's Snowpark container services

    A basic example

    For this example, we’ll set up a single-node compute pool in Snowpark. It uses a GPU_NV_S instance, which is Snowpark’s smallest and most cost-efficient GPU instance type: a single NVIDIA A10G GPU with 24 GB of memory, plus 8 vCPUs and 32 GB of system RAM. We’ll run our service on top of this compute pool and disable auto-scaling by setting the MIN_INSTANCES and MAX_INSTANCES counts to 1.
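    As a sketch, the compute pool and service setup might look like the following. The pool and service names are placeholders, and the actual service specification comes from the installation instructions in the GitHub repository:

```sql
-- Create a single-node GPU compute pool (names here are illustrative).
CREATE COMPUTE POOL textual_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = GPU_NV_S;

-- Run the Textual service on the pool with auto-scaling disabled.
CREATE SERVICE textual_service
  IN COMPUTE POOL textual_pool
  FROM SPECIFICATION $$ ... $$  -- spec provided by the Textual installer
  MIN_INSTANCES = 1
  MAX_INSTANCES = 1;
```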

    Let’s start with a simple example, calling the UDF directly on a string literal without needing data loaded into a table:
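    A call like the following runs the UDF on a string literal. The input sentence here is an illustrative reconstruction based on the redacted output shown below:

```sql
SELECT textual_redact(
  'My name is Adam, and today I am demo-ing Tonic Textual, a software product created by Tonic.ai'
);
```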

    This returns a single result:

    My name is [NAME_GIVEN_czg72], and [DATE_TIME_joVVM9] I am demo-ing [PRODUCT_uRLPiR3X], a software product created by [ORGANIZATION_QDeGw5]

    Here are the original and redacted text side by side:

    A side-by-side view of the text to be redacted and the redacted text.

    Textual’s NER models recognized all of the sensitive entities in the string and redacted them by replacing the sensitive values with redacted tokens. However, we can also tell Textual to replace the redacted tokens with synthetic text. For example, we might decide to redact PRODUCT and ORGANIZATION entities but synthesize people’s names. To do that, pass a second argument to the textual_redact UDF:
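    As a sketch, the second argument is a configuration mapping entity types to how they should be handled. The exact shape of this configuration and the mode names ('Synthesis', 'Off') are assumptions here; see the installation repository for the precise format:

```sql
-- Hypothetical per-entity configuration: synthesize given names,
-- turn off DATE_TIME detection, redact everything else by default.
SELECT textual_redact(
  'My name is Adam, and today I am demo-ing Tonic Textual, a software product created by Tonic.ai',
  '{"NAME_GIVEN": "Synthesis", "DATE_TIME": "Off"}'
);
```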

    This will now disable DATE_TIME detection and synthesize names while redacting everything else Textual identifies as sensitive in the string. This configuration yields:

    A side-by-side view of the text to be redacted and the redacted and synthesized text.

    A longer example, using a table

    Let’s create a toy Snowflake table that holds some conversational data. The following code will create a Snowflake table representing a transcription of a customer support conversation:
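    A minimal sketch of such a table follows. The table name, columns, and sample rows are all illustrative:

```sql
-- Illustrative transcript table; names and data are placeholders.
CREATE OR REPLACE TABLE support_transcripts (
  conversation_id INTEGER,
  speaker VARCHAR,
  snippet VARCHAR
);

INSERT INTO support_transcripts VALUES
  (1, 'agent',    'Thank you for calling, my name is Sarah. How can I help?'),
  (1, 'customer', 'Hi Sarah, this is John Miller. I have a question about my bill.'),
  (1, 'agent',    'Sure, John. Can you confirm the email address on the account?'),
  (1, 'customer', 'It is john.miller@example.com.');
```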

    In this table, the conversation is broken up across multiple rows. Alternatively, each conversation could live entirely in a single row, or the text might be stored alongside metadata in a JSON blob. Whatever the layout, Textual supports it!

    Now, let’s protect it using the textual_redact UDF:
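    Assuming a table like the hypothetical support_transcripts above, with the text in a snippet column, the query might look like:

```sql
SELECT textual_redact(snippet) AS redacted_snippet
FROM support_transcripts;
```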

    This will return a single column of redacted snippets. We can take it a step further and create a new table (or even a materialized view); because the sensitive information is removed, this table can be safely shared with lower environments for downstream use cases such as model training or analytics.
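    As a sketch, again assuming the hypothetical support_transcripts table and column names from the example above, a CREATE TABLE AS SELECT does the job:

```sql
-- Materialize a fully redacted copy of the transcript table.
CREATE OR REPLACE TABLE support_transcripts_redacted AS
SELECT
  conversation_id,
  speaker,
  textual_redact(snippet) AS snippet
FROM support_transcripts;
```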

    This query will give you an entirely new table of redacted data. Converting it to a view or materialized view is just as easy.

    The Takeaway

    Snowflake is widely known as one of the most secure cloud data stores on the market. Because of this, you trust Snowflake with your organization’s most sensitive data. By deploying Tonic Textual on Snowflake’s clusters using SPCS, your data stays in Snowflake’s secure confines, maintaining data security while still getting the benefits of Textual’s state-of-the-art NER models and synthetic data engine. The combination of SPCS and Tonic Textual makes it safe and easy to redact and synthesize text for training AI models without fear of data leakage. 

    Have sensitive text data on Snowflake? Reach out to us for access to Textual on Snowpark at: textual@tonic.ai.

    Adam Kamor, PhD
    Co-Founder & Head of Engineering