Understanding named entity recognition (NER) models

Author

Joe Ferrara, PhD

April 24, 2024

What is Named Entity Recognition (NER)?

Named Entity Recognition (NER), also known as entity chunking or entity extraction, is an NLP task in data science that identifies and classifies words in text into predefined categories, or entity types, such as names of persons, organizations, locations, dates, quantities, and monetary values.

The primary goal of an NER system is to locate and classify entities of interest to use in a broader application. The broader application may be to:

Identify private information in text
Determine whether certain subjects appear in text
Extract structured information from unstructured text to use in downstream analysis and applications

How NER works: the process

NER is usually accomplished using a model. Free text is the input into the model. The model then uses a labeling scheme to output the words in the free text that fall into each predefined entity type.

The most prevalent types of NER models are rule-based models, machine learning models, and neural network artificial intelligence (AI) models.

For example, a simple rule-based NER system to identify phone numbers might follow this rule: label as a phone number any string of 10 digits in which the digits are separated only by spaces, dashes, or parentheses. You can easily imagine examples where this rule-based model succeeds and where it fails.
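To make the phone number rule concrete, here is a minimal sketch of it as a regular expression. The pattern and the function name `find_phone_numbers` are our own illustration of the rule as stated, not part of any particular NER product.

```python
import re

# Ten digits in total. Between consecutive digits, allow any run of
# spaces, dashes, or parentheses. An opening parenthesis may come first.
PHONE_PATTERN = re.compile(r"\(?\d(?:[ \-()]*\d){9}")


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring that the rule labels as a phone number."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


# Success: a conventionally formatted US number is found.
print(find_phone_numbers("Call (555) 123-4567 today."))   # ['(555) 123-4567']

# Failure (false positive): a 10-digit order number also matches.
print(find_phone_numbers("Order 1234567890 shipped."))    # ['1234567890']

# Failure (false negative): a 7-digit local number is missed.
print(find_phone_numbers("Dial 555-1234."))               # []
```

The rule has no notion of context, so it cannot tell an order number from a phone number. That is the kind of case where a trained model can do better.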

To create a machine learning or neural network AI model, you train the model on free text data where the entity types of interest are already identified. In this case, the NER model learns the patterns of the entities in the training data and then applies those patterns to identify entities in any given free text.

One benefit of an AI NER model is that it can learn to determine whether a word is an entity of interest based on the context around the word. For example, an AI NER model could use the context of the sentence to determine whether the word Apple refers to the organization or to the fruit. A rule-based model might struggle with something like this.

NER examples

Named Entity Recognition (NER) is a powerful tool that enables systems to interpret words based on context. For example, NER allows a search engine to differentiate between "Amazon" the company and "Amazon" the rainforest, depending on how each is used in the text. Chatbots and AI assistants also rely on NER to identify key entities in user queries, helping them deliver more precise responses by understanding the relevant names, places, or dates mentioned.

Here’s a quick example to help understand what an NER system does.

Let’s say that we want our NER task to identify the entity types person, organization, location, and date.

For the following text:

Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 in Cupertino, California.

For each entity type, we would identify the following words:

Person: Steve Jobs, Steve Wozniak
Organization: Apple
Location: Cupertino, California
Date: April 1, 1976

That in a nutshell is NER.

You specify the entity types you’re interested in.
You define those entity types precisely.
In a given text, you identify the words that fall into each entity type.
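One way to write down the result of these steps is a per-token labeling scheme. The sketch below uses BIO tags (B = beginning of an entity, I = inside an entity, O = outside any entity), which is one common scheme. This guide does not say which scheme a given model uses, so treat the choice of BIO, the whitespace tokenization, and the names `Entity` and `to_bio` as assumptions made for illustration.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # index of the first token in the entity
    end: int    # index one past the last token in the entity
    label: str


def to_bio(tokens: list[str], entities: list[Entity]) -> list[str]:
    """Convert token-level entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for entity in entities:
        tags[entity.start] = f"B-{entity.label}"
        for i in range(entity.start + 1, entity.end):
            tags[i] = f"I-{entity.label}"
    return tags


text = ("Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 "
        "in Cupertino, California.")
# Naive whitespace tokenization; punctuation stays attached to words.
tokens = text.split()

entities = [
    Entity(0, 2, "PERSON"),        # Steve Jobs
    Entity(3, 5, "PERSON"),        # Steve Wozniak
    Entity(6, 7, "ORGANIZATION"),  # Apple
    Entity(8, 11, "DATE"),         # April 1, 1976
    Entity(12, 14, "LOCATION"),    # Cupertino, California.
]

for token, tag in zip(tokens, to_bio(tokens, entities)):
    print(f"{token:<12} {tag}")
```

The B and I prefixes let the scheme keep "Steve Jobs" and "Steve Wozniak" apart as two separate person entities even when entities of the same type sit next to each other.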

For the entity types person, organization, location, and date, it’s not too difficult to describe in general terms what they mean. However, it can be difficult to define each entity type so precisely that no ambiguity remains.

Even in the simple example above, the word Apple is used to refer to the Apple organization. In different contexts, the word Apple could instead refer to the fruit.

Measuring NER results and performance

NER accuracy

NER accuracy is measured using the usual data science classification metrics, such as accuracy, precision, recall, and F1 score.

To measure the accuracy of an NER model, one needs a collection of text data that is labeled to indicate where the entities of interest appear.

If the NER model is an AI model, then the data used to measure its accuracy should not be contained in the data used to train the model.
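As a rough illustration of these metrics, the sketch below scores a set of predicted entities against a labeled set. It assumes exact-match, entity-level scoring, where a prediction counts only if both the text and the entity type match. That is one common convention, not the only one. The function name `score_entities` and the imagined model output are hypothetical.

```python
def score_entities(gold: set, predicted: set) -> dict:
    """Compute exact-match precision, recall, and F1 over entity sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Labeled entities for the example sentence, as (text, entity type) pairs.
gold = {
    ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"),
    ("Apple", "ORGANIZATION"),
    ("April 1, 1976", "DATE"),
    ("Cupertino, California", "LOCATION"),
}

# Imagined model output: Apple gets the wrong type, the location is missed.
predicted = {
    ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"),
    ("Apple", "LOCATION"),
    ("April 1, 1976", "DATE"),
}

scores = score_entities(gold, predicted)
print(scores)  # precision 0.75, recall 0.6, f1 about 0.667
```

In practice the entities would usually be keyed by character offsets rather than by text, so that two mentions of the same name in one document are scored separately.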

NER efficiency

NER efficiency is defined as how fast an NER model processes text data. This is usually measured in words per second.

When you use an AI NER model, efficiency is an important consideration, because although AI NER models are usually much more accurate than rule-based models, they are also much slower.

For an AI NER model, NER efficiency is determined by the size of the model and the computer architecture that runs the model.
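A words-per-second figure can be estimated with a simple timing loop. The sketch below is a minimal version that assumes the model is exposed as a Python callable taking one string. The names `words_per_second` and `placeholder_ner` are hypothetical, and the placeholder stands in for a real model.

```python
import time
from typing import Callable


def words_per_second(ner_fn: Callable[[str], object], texts: list[str]) -> float:
    """Run ner_fn over every text and return the words processed per second."""
    total_words = sum(len(text.split()) for text in texts)
    start = time.perf_counter()
    for text in texts:
        ner_fn(text)
    elapsed = time.perf_counter() - start
    return total_words / elapsed if elapsed > 0 else float("inf")


def placeholder_ner(text: str) -> list[str]:
    """Stand-in for a real model: flags capitalized words as entities."""
    return [word for word in text.split() if word[:1].isupper()]


sample_texts = ["Steve Jobs and Steve Wozniak founded Apple."] * 1000
print(f"{words_per_second(placeholder_ner, sample_texts):,.0f} words/second")
```

For a meaningful comparison, run the same texts through each model on the same hardware, and repeat the measurement a few times, because a single run can be noisy.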

Synthesizing and redacting sensitive information

The most fundamental use case for an NER system is detecting sensitive information in text so that the sensitive information can be redacted or synthesized. This is also known as data cleansing with NER.

For example, it’s common to need to remove any entity types that correspond to personally identifying information (PII). This could include names, phone numbers, email addresses, street addresses, credit card numbers, or Social Security numbers.

If you’re redacting sensitive information, then you use the NER model to identify the words that are sensitive, and then you redact those words. For example, the example sentence:

Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 in Cupertino, California.

would be redacted as:

[PERSON_0] and [PERSON_1] founded [ORGANIZATION] on [DATE] in [LOCATION].
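The redaction step can be sketched in a few lines once the entity spans are known. In the sketch below, the spans are supplied by hand in place of a real NER model. The helper names `find_spans` and `redact` are hypothetical, and the numbering rule (add an index only when an entity type appears more than once) is an assumption chosen to reproduce the example output above, not a documented convention.

```python
from collections import Counter


def find_spans(text: str, labeled_phrases: list[tuple[str, str]]) -> list[tuple[int, int, str]]:
    """Locate each (phrase, label) in order and return (start, end, label)."""
    spans = []
    cursor = 0
    for phrase, label in labeled_phrases:
        start = text.index(phrase, cursor)
        end = start + len(phrase)
        spans.append((start, end, label))
        cursor = end
    return spans


def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each span with a placeholder such as [DATE] or [PERSON_0]."""
    label_totals = Counter(label for _, _, label in spans)
    seen = Counter()
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        # Index the placeholder only if this entity type occurs more than once.
        if label_totals[label] > 1:
            pieces.append(f"[{label}_{seen[label]}]")
        else:
            pieces.append(f"[{label}]")
        seen[label] += 1
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = ("Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 "
        "in Cupertino, California.")
# Hand-written stand-in for the output of an NER model.
spans = find_spans(text, [
    ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"),
    ("Apple", "ORGANIZATION"),
    ("April 1, 1976", "DATE"),
    ("Cupertino, California", "LOCATION"),
])

print(redact(text, spans))
# [PERSON_0] and [PERSON_1] founded [ORGANIZATION] on [DATE] in [LOCATION].
```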

If you’re synthesizing sensitive information, then you use the NER model in the same way, but instead of redacting the words, you replace them with fake non-sensitive words. So our example sentence would be synthesized as something like:

John Doe and Jane Moe founded Try Inc. on May 7, 1984 in Kansas City, Missouri.

The sensitive information is replaced with non-sensitive information:

Steve Jobs → John Doe
Steve Wozniak → Jane Moe
Apple → Try Inc.
April 1, 1976 → May 7, 1984
Cupertino, California → Kansas City, Missouri

When you synthesize, the resulting text looks like the original real text and has a similar meaning, but the sensitive information is removed.
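Synthesis can be sketched the same way. The version below is deliberately simplified: a fixed lookup table holding the replacements listed above stands in for both the NER model, which would find the entities, and the generator, which would produce the fake values. The function name `synthesize` is hypothetical.

```python
import re


def synthesize(text: str, replacements: dict[str, str]) -> str:
    """Replace every known sensitive value with its fake counterpart."""
    # Try longer values first so that a short value cannot match inside a longer one.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(value) for value in ordered))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


replacements = {
    "Steve Jobs": "John Doe",
    "Steve Wozniak": "Jane Moe",
    "Apple": "Try Inc.",
    "April 1, 1976": "May 7, 1984",
    "Cupertino, California": "Kansas City, Missouri",
}

text = ("Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 "
        "in Cupertino, California.")
print(synthesize(text, replacements))
# John Doe and Jane Moe founded Try Inc. on May 7, 1984 in Kansas City, Missouri.
```

Because the table maps each real value to exactly one fake value, every mention of the same entity is replaced consistently, which helps the synthesized text keep a similar meaning to the original.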

Tonic Textual’s advanced NER technologies

Tonic Textual is a text de-identification product that uses state-of-the-art proprietary named-entity recognition (NER) models to identify sensitive information in text.

[Screenshot: Tonic Textual redacting sensitive information in a legal document.]

The NER models in Tonic Textual are artificial intelligence models that are trained on proprietary data that Tonic owns. The data is specifically tailored to enterprise use cases, and the models achieve state-of-the-art NER accuracy.

Tonic Textual supports both redaction and synthesis of sensitive information, and can identify a variety of entity types, with new types added regularly. The supported entity types in Tonic Textual include first name, last name, street address, city, state, zip code, phone number, and datetime.

Tonic Textual’s custom models feature lets you create an NER model for an entity type that Tonic Textual’s core models do not cover. The feature uses LLM technology to generate NER training data for the entity type that you choose, and then trains an NER model on the generated training data. This is a form of custom data models for NER.

You can deploy Tonic Textual on-premises, with capabilities to run efficiently on a CPU or to leverage GPUs for optimized performance. Tonic Textual’s deployment has configuration options to maximize NER efficiency, so that you can easily do bulk document redaction.

Superior performance over open source solutions

Tonic Textual NER models are trained on proprietary data that is specifically tailored to enterprise use cases, which gives Tonic Textual’s NER models much better performance than open source solutions.

Open source solutions rely on open source NER datasets for training data. These datasets (CoNLL 2012 and CoNLL 2003 for instance) come from public data that is very different from the internal data a company may want to work with.

In addition to being trained on different data than open source models, Tonic Textual’s NER models support more entity types than the most popular open source solutions.

Tonic Textual’s NER models are also easily configurable on your hardware to be optimized for speed and efficiency on CPUs and GPUs. The deployment support and configuration that we provide are not easily obtained for open source models.

Application scenarios for Tonic Textual’s NER models

Use cases in data cleansing and model training

Tonic Textual can fit in your language model training pipeline to synthesize the training data. You can remove any private information you do not want your language model to have access to.

Declassifying and protecting documents

Tonic Textual is great for declassifying and protecting documents. In addition to working with text files or being used programmatically via its API, Tonic Textual can do NER on PDF files, which allows you to redact classified entity types in your PDF files.

Integrating Tonic Textual into MLOps workflows

The Tonic Textual API allows Tonic Textual to fit easily into an MLOps workflow.

Whether you deploy Tonic Textual on-premises or you use the Tonic Textual cloud offering (textual.tonic.ai), you can use the Textual API to access the Textual NER models programmatically.

The tonic-textual Python SDK allows you to easily integrate Tonic Textual into your data processing pipelines to de-identify your free text data at any step in the process.

Getting started with Tonic Textual

To get started with Tonic Textual, sign up for a free account at textual.tonic.ai or book a demo directly with the Textual team. Also, be sure to check out the Tonic Textual documentation for comprehensive overviews of the product’s capabilities.

Joe Ferrara, PhD

Staff AI Scientist

Joe Ferrara is a Staff AI Scientist at Tonic.ai, where he uses the latest developments in artificial intelligence to improve named entity recognition and synthetic data generation in Tonic Textual. He holds a Ph.D. in Mathematics from the University of California, Santa Cruz, and a B.A. in Mathematics from the University of California, Berkeley. Prior to his role at Tonic.ai, Joe served as a Data Scientist at ICW Group, where he gained experience applying traditional data science techniques in the context of the insurance industry. His background in theoretical mathematics research gives him a unique perspective on his work in artificial intelligence and machine learning.
