
Understanding Model Memorization in Machine Learning

Author
Ander Steele, PhD
April 17, 2024

Introduction to Model Memorization

What is Model Memorization?

Large generative models, such as GPT or Stable Diffusion, have been shown to memorize parts of their training data. That is, the models can generate verbatim copies of certain parts of their training data. This phenomenon is primarily documented in large language models and is distinct from the classic ML problem of overfitting: large language models have been shown to memorize parts of their training data well before they begin to overfit [6]. However, as with overfitting, model memorization may be a failure of the model to learn generalizable representations of the training data.

Why is Model Memorization a concern?

Memorization can be useful—for example, factual memorization is often a desirable outcome when using an LLM as a knowledge base [1]. However, model memorization is dangerous when the wrong information is memorized.

Costly examples of model memorization can be found in recent legal challenges against Microsoft and OpenAI, where memorization of copyrighted data serves as a key piece of evidence. For example, the New York Times was able to prompt ChatGPT to emit verbatim copies of its articles. Similarly, the class action lawsuit against GitHub Copilot is based on the fact that early versions of Copilot would emit verbatim copies of code released under restrictive licenses. Setting aside the thorny issues around fair use, it’s clear that model memorization has real-world consequences.

In addition to revealing copyrighted material in training data, researchers have found many examples of both open and closed LLMs emitting personally identifiable information (PII). Recent work by DeepMind researchers [5] demonstrates practical attacks for extracting training data from LLMs, even without direct access to the model. For example, they use unusual prompts to circumvent OpenAI’s RLHF safeguards and induce the model to emit samples of training data, in some cases exposing real PII (shown in redacted form in Figure 1 below).

Figure 1. (Redacted) PII emitted by ChatGPT. Figure from Scalable Extraction of Training Data from (Production) Language Models [5].

It is not clear if LLMs amplify the risk of accidental disclosures of private information on the public web. However, Carlini et al.'s examples clearly demonstrate the risk of using private data for training or fine-tuning LLMs. Fine-tuning an LLM on customer support transcripts risks exposing private account or financial information. Building LLMs on top of clinical notes risks exposing highly sensitive (and regulated!) personal health information. Before this information can be used to build LLMs, it must be cleaned of sensitive PII/PHI. Failing to do so can expose developers of LLM-based applications to the costly risks of compliance violations and loss of customer trust.

Preventing Model Memorization

Techniques for Reducing Memorization

Short phrases of text, like common idioms or quotes, naturally occur many times in most training datasets. Long sequences, however, can be repeated in web scrapes for reasons unrelated to how often they appear in natural language. For example, Google researchers [4] found the following text occurring over 40,000 times in the Common Crawl C4 dataset:

HD wallpaper. This wallpaper was upload at April 19, 2019 upload by admin in.You can download it in your computer by clicking resolution image in Download by size:. Don’t forget to rate and comment if you interest with this wallpaper.

The work of Kandpal et al. [3] shows that the risk of memorization increases with the number of duplicates, so finding and removing such duplicates from the training data helps reduce memorization risk and improves model quality (as measured by perplexity on a held-out test set).

Finding exact duplicates is challenging enough on terabyte-scale datasets, but fortunately the authors of [4] have open-sourced their deduplication tool. Finding semantically similar text is an even bigger challenge; see Tirumala et al. [7] for a discussion of techniques.
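
To make the idea concrete, here is a minimal sketch of exact-duplicate detection over fixed-length token windows: it flags any documents that share a sufficiently long run of tokens. This is a toy illustration rather than the open-sourced tool from [4], which uses suffix arrays (and MinHash for approximate matches) to scale to web-sized corpora; the whitespace tokenization, in-memory index, and 50-token window here are simplifying assumptions.

```python
from collections import defaultdict

def find_shared_spans(documents, window=50):
    """Flag documents that share an exact run of `window` tokens with another document.

    Toy stand-in for large-scale deduplication; production pipelines use
    suffix arrays or MinHash to handle terabytes of text.
    """
    seen = defaultdict(set)    # hash of a token window -> ids of docs containing it
    flagged = set()
    for doc_id, text in enumerate(documents):
        tokens = text.split()  # whitespace "tokenization", purely for illustration
        for i in range(len(tokens) - window + 1):
            key = hash(tuple(tokens[i : i + window]))
            if seen[key] - {doc_id}:   # another document already contains this window
                flagged.add(doc_id)
                flagged.update(seen[key])
            seen[key].add(doc_id)
    return flagged
```

Run against a web scrape, boilerplate like the wallpaper text above would be flagged across thousands of pages, while short common phrases fall below the window size and are left alone.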

Importance of Data Anonymization

While de-duplicating training data is an effective tool for reducing model memorization, there is still a risk that PII remains in the cleaned dataset. It is neither practical nor desirable to eliminate all duplicate strings in training data; short substrings like “the” are expected to occur frequently. The common approach is to eliminate duplicates exceeding a minimum length threshold, e.g., 50 tokens. This leaves short substrings of PII intact, so another approach is needed to achieve data anonymization.

As part of the data cleaning process, practitioners should make every effort to find and remove sensitive PII from training data. Training or fine-tuning an LLM for customer support applications does not require training on credit card numbers, phone numbers, or other sensitive information. However, scrubbing this sensitive data at scale requires sophisticated tooling to pinpoint and redact it.
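
To see why this requires more than simple pattern matching, consider the naive baseline below: regular expressions that scrub only rigidly formatted identifiers. The patterns are illustrative rather than production-grade, and they say nothing about names, addresses, or the spelled-out transcriptions discussed in the next section, which have no fixed shape. That gap is what NER-based tooling is meant to close.

```python
import re

# Illustrative patterns only; real validators are stricter and handle more formats.
PATTERNS = {
    "CARD":  re.compile(r"\b(?:\d[- ]?){13,16}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scrub(text: str) -> str:
    """Replace rigidly formatted identifiers with placeholder tags."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Card 4111 1111 1111 1111, call 555-867-5309 or email ana@example.com"))
# Card [CARD], call [PHONE] or email [EMAIL]
```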

Tonic Textual: A Solution for Data Anonymization

How Tonic Textual Mitigates Risks

Tonic Textual uses state-of-the-art named entity recognition (NER) models to automatically detect sensitive data (names, addresses, payment info, etc.). Once found, this information can either be redacted or synthesized. Using synthetic data preserves the contextual structure of the text while also preserving the privacy of the entities represented in the text, ensuring that your LLM retains the contextual information it requires to perform well in production settings.

A graphic illustrating the redaction and synthesis capabilities of Tonic Textual
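
For a sense of what this looks like in code, the sketch below uses spaCy’s off-the-shelf NER model as a generic stand-in to show the redact-or-synthesize pattern. To be clear, this is not Tonic Textual’s model or API: a general-purpose model covers far fewer entity types, and the hard-coded replacement values are placeholders rather than realistic synthetic data.

```python
import spacy

# Generic off-the-shelf NER, not Tonic's models.
# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hard-coded stand-ins; a real synthesis step generates realistic, internally
# consistent fake values instead of fixed strings.
SYNTHETIC = {"PERSON": "Jordan Smith", "GPE": "Springfield", "ORG": "Acme Corp"}

def anonymize(text: str, mode: str = "redact") -> str:
    """Replace detected entities with either a type tag or a synthetic value."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        out.append(text[last:ent.start_char])
        if mode == "redact":
            out.append(f"[{ent.label_}]")
        else:  # "synthesize": keep the sentence structure, swap in a fake entity
            out.append(SYNTHETIC.get(ent.label_, f"[{ent.label_}]"))
        last = ent.end_char
    out.append(text[last:])
    return "".join(out)

text = "Maria Lopez called from Denver about her account with Acme Bank."
print(anonymize(text, "redact"))      # e.g. "[PERSON] called from [GPE] about her account with [ORG]."
print(anonymize(text, "synthesize"))  # e.g. "Jordan Smith called from Springfield about her account with Acme Corp."
```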

Tonic's Named Entity Recognition in Action

Tonic’s NER models are trained on a wide variety of data sources, allowing them to identify sensitive data from contextual clues. This detection is robust to spelling errors, unusual transcriptions, and more. For example, in this (fake) customer support transcript, we detect not only the obvious names, but also spelled-out transcriptions of those names (e.g., “L E M Y”) and partial references to them (“M or N?” and “M as in man”).

A screenshot of Tonic Textual's NER and redaction capabilities in action.

Best Practices for Safe Model Training

Data Cleaning with Tonic Textual

Tonic Textual allows users to easily manage privacy cleaning policies for training data. For example, names are probably not considered sensitive when using public scraped data, but we still want to redact email addresses, payment information, and other data that might be inadvertently published and scraped. On other datasets, redacting names may be absolutely necessary. Textual makes setting policy decisions like this easy.

Integrating Tonic Textual into MLOps Workflows

Tonic Textual’s Python SDK allows developers to easily integrate automatic PII redaction and synthesis into their MLOps workflow. 
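
As an illustration of that integration pattern, the sketch below maps a redaction step over a Hugging Face dataset before it ever reaches tokenization or fine-tuning. The file name, the "text" column, and the redact_text placeholder are all assumptions for the example; the actual client and method names come from the Textual SDK documentation.

```python
from datasets import load_dataset

def redact_text(text: str) -> str:
    """Hypothetical placeholder for the SDK's redaction call."""
    # In a real workflow this would call the Textual API and return the
    # redacted (or synthesized) version of the transcript.
    return text

# Assumed input: one JSON object per line with a "text" field.
raw = load_dataset("json", data_files="support_transcripts.jsonl", split="train")

# Redact every transcript before it enters the training pipeline; batched map
# keeps the step fast, and num_proc can fan the work out across processes.
clean = raw.map(
    lambda batch: {"text": [redact_text(t) for t in batch["text"]]},
    batched=True,
)

clean.save_to_disk("support_transcripts_redacted")
```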

Tonic Textual is also available as a self-managed deployment, ensuring that sensitive data never leaves your VPC. Moreover, Textual scales horizontally, so even trillion-word datasets can be protected in hours.

Future of Secure Model Training

Evolving Techniques in Data Privacy

Deduplication and data anonymization are only two techniques in the growing field of AI and data privacy. Other approaches, such as training with differential privacy, provide additional layers of privacy protection. Given the high cost of sensitive data disclosure, it’s imperative that practitioners consider incorporating multiple layers of protection to achieve secure model training.
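
As one concrete example of an additional layer, the sketch below wraps an ordinary PyTorch training loop with Opacus so that it runs DP-SGD, clipping per-sample gradients and adding calibrated noise at every step. The toy model, data, and noise and clipping values are placeholders; applying differential privacy to large language model training in practice requires far more careful tuning of batch sizes and privacy budgets.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data standing in for a real fine-tuning job.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()
data = TensorDataset(torch.randn(512, 16), torch.randint(0, 2, (512,)))
loader = DataLoader(data, batch_size=64)

# Wrap the training objects so every step clips per-sample gradients and adds
# noise (DP-SGD). The noise multiplier and clipping norm are placeholders.
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Report the privacy budget spent so far for a chosen delta.
print(f"epsilon spent: {engine.get_epsilon(delta=1e-5):.2f}")
```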

Tonic Textual's Role in the Future of AI

As the training sets used to pre-train LLMs asymptotically approach the limit of publicly available data, the next frontier will be private data, which often contains sensitive information about individuals. We believe that Tonic Textual is an essential layer of protection for training or fine-tuning LLMs on potentially sensitive data. It is an important tool for organizations to limit the liabilities of third-party risk and accidental data leakage, enabling them to become good data stewards and practice responsible AI.

References

  1. AlKhamissi, B., Li, M., Celikyilmaz, A., Diab, M., & Ghazvininejad, M. (2022). A Review on Language Models as Knowledge Bases. arXiv. https://doi.org/10.48550/arXiv.2204.06031
  2. Carlini, N., Hayes, J., Nasr, M., et al. (2023). Extracting Training Data from Diffusion Models. USENIX Security Symposium. https://arxiv.org/abs/2301.13188
  3. Kandpal, N., Wallace, E., & Raffel, C. (2022). Deduplicating Training Data Mitigates Privacy Risks in Language Models. arXiv. https://doi.org/10.48550/arXiv.2202.06539
  4. Lee, K., Ippolito, D., Nystrom, A., et al. (2021). Deduplicating Training Data Makes Language Models Better. arXiv. https://doi.org/10.48550/arXiv.2107.06499
  5. Nasr, M., Carlini, N., Hayase, J., et al. (2023). Scalable Extraction of Training Data from (Production) Language Models. arXiv. https://doi.org/10.48550/arXiv.2311.17035
  6. Tirumala, K., Markosyan, A. H., Zettlemoyer, L., & Aghajanyan, A. (2022). Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2205.10770
  7. Tirumala, K., Simig, D., Aghajanyan, A., & Morcos, A. S. (2023). D4: Improving LLM Pretraining via Document De-Duplication and Diversification. arXiv. https://doi.org/10.48550/arXiv.2308.12284
Ander Steele, PhD
Head of AI