Named Entity Recognition (NER) is a task where words within a body of text are identified and classified into predefined categories known as entity types. Common entity types are names of persons, organizations, locations, dates, quantities, and monetary values.
NER is part of the broader field of natural language processing (NLP), which sits within data science and data analysis.
NER’s primary goal is to locate and classify entities of interest for use in a broader application.
NER is usually accomplished using a model. Free text is the input into the model. The model then uses a labeling scheme to output the words in the free text that fall into each predefined entity type.
The most prevalent types of NER models are rule-based models, machine learning models, and neural network artificial intelligence (AI) models.
For example, a simple rule-based NER model to identify phone numbers might follow this rule: identify as a phone number any string of 10 digits separated only by spaces, dashes, or parentheses. You can easily imagine cases where this rule-based model succeeds and cases where it fails.
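A minimal sketch of that rule in Python, using a regular expression (the pattern and function names are illustrative):

```python
import re

# Rule: a phone number is 10 digits separated only by spaces,
# dashes, or parentheses.
PHONE_RULE = re.compile(r"\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}")

def find_phone_numbers(text):
    return PHONE_RULE.findall(text)

# The rule succeeds on common phone formats...
print(find_phone_numbers("Call (555) 123-4567 or 555 987 6543."))
# ...but it also flags any bare 10-digit number, such as an order ID --
# a typical rule-based failure mode.
print(find_phone_numbers("Order 5551234567 shipped today."))
```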
To create a machine learning or neural network AI model, you train the model on free text data where the entity types of interest are already identified. In this case, the NER model learns the patterns of the entities in the training data and then applies those patterns to identify entities in any given free text.
One benefit of an AI NER model is that it can learn to determine whether a word is an entity of interest based on the context around the word. For example, an AI NER model could use the context of the sentence to determine whether the word Apple refers to the organization or to the fruit. A rule-based model might struggle with something like this.
Here’s a quick example to help understand what NER does.
Let’s say that we want our NER task to identify the entity types person, organization, location, and date.
For the following text:
Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 in Cupertino, California.
For each entity type, we would identify the following words:

- Person: Steve Jobs, Steve Wozniak
- Organization: Apple
- Location: Cupertino, California
- Date: April 1, 1976
That in a nutshell is NER.
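As a toy illustration of the mapping above, the example can be reproduced with a simple dictionary lookup (a gazetteer). Real NER models are far more sophisticated; every name and entry here is illustrative:

```python
# A toy gazetteer "model": look up known entity phrases in the text.
# The entries and function names are illustrative, not a real NER system.
GAZETTEER = {
    "Steve Jobs": "person",
    "Steve Wozniak": "person",
    "Apple": "organization",
    "Cupertino, California": "location",
    "April 1, 1976": "date",
}

def tag_entities(text):
    return [(phrase, label) for phrase, label in GAZETTEER.items()
            if phrase in text]

sentence = ("Steve Jobs and Steve Wozniak founded Apple "
            "on April 1, 1976 in Cupertino, California.")
for phrase, label in tag_entities(sentence):
    print(f"{label}: {phrase}")
```

A gazetteer like this fails as soon as a phrase is missing from the dictionary or a word like Apple appears in a different sense, which is exactly why context-aware AI models are preferred.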
For the entity types person, organization, location, and date, it’s not too difficult to define roughly what they mean. However, it can be difficult to define each entity type without any ambiguity.
Even in the simple example above, the word Apple is used to refer to the Apple organization. In different contexts, the word Apple could instead refer to the fruit.
NER accuracy is measured using the standard classification metrics of data science, such as accuracy, precision, recall, and F1 score.
To measure the accuracy of an NER model, one needs a collection of text data that is labeled to indicate where the entities of interest appear.
If the NER model is an AI model, then the data used to measure its accuracy must not overlap with the data used to train the model.
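As a sketch of how these metrics are computed, the snippet below scores per-token predictions against labels, where "O" marks a token that is not an entity. The labels and data are made up for illustration:

```python
# Token-level scoring sketch (illustrative labels and data).
# gold: true entity label per token; pred: model output per token.
# "O" means the token is not an entity.
gold = ["PERSON", "O", "ORG", "O", "DATE"]
pred = ["PERSON", "O", "ORG", "PERSON", "O"]

def entity_scores(gold, pred):
    # True positive: model predicted the correct entity label.
    tp = sum(g == p != "O" for g, p in zip(gold, pred))
    # False positive: model predicted an entity label that is wrong.
    fp = sum(p != "O" and g != p for g, p in zip(gold, pred))
    # False negative: a true entity the model failed to label correctly.
    fn = sum(g != "O" and g != p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(entity_scores(gold, pred))
```

In practice, NER is often scored at the entity-span level rather than per token, but the precision/recall/F1 arithmetic is the same.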
NER efficiency is defined as how fast an NER model processes text data. This is usually measured in words per second.
When you use an AI NER model, efficiency is an important consideration, because although AI NER models are usually much more accurate than rule-based models, they are also much slower.
For an AI NER model, NER efficiency is determined by the size of the model and the computer architecture that runs the model.
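A simple way to measure words per second is to time the model over a batch of documents. The model interface below (a `predict` method) is an assumption for illustration; substitute whatever call your model exposes:

```python
import time

# `DummyModel` stands in for a real NER model; its .predict interface
# is an assumed placeholder for illustration.
class DummyModel:
    def predict(self, text):
        return []  # a real model would return entity spans

def words_per_second(model, texts):
    total_words = sum(len(t.split()) for t in texts)
    start = time.perf_counter()
    for t in texts:
        model.predict(t)
    elapsed = time.perf_counter() - start
    return total_words / elapsed

rate = words_per_second(DummyModel(), ["Steve Jobs founded Apple."] * 1000)
print(f"{rate:,.0f} words/second")
```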
The most fundamental use case for NER is detecting sensitive information in text so that the sensitive information can be redacted or synthesized. This is also known as data cleansing with NER.
For example, it’s common to need to remove any entity types that correspond to personally identifying information (PII). This could include names, phone numbers, email addresses, street addresses, credit card numbers, or Social Security numbers.
If you’re redacting sensitive information, then you use the NER model to identify the words that are sensitive, and then you redact those words. For example, the sentence:
Steve Jobs and Steve Wozniak founded Apple on April 1, 1976 in Cupertino, California.
would be redacted as:
[PERSON_0] and [PERSON_1] founded [ORGANIZATION] on [DATE] in [LOCATION].
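A sketch of the redaction step, assuming the NER model has already produced the entity spans (here they are hard-coded for illustration):

```python
from collections import Counter

# Entity spans that would normally come from an NER model.
entities = [
    ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"),
    ("Apple", "ORGANIZATION"),
    ("April 1, 1976", "DATE"),
    ("Cupertino, California", "LOCATION"),
]

def redact(text, entities):
    totals = Counter(label for _, label in entities)
    seen = Counter()
    for span, label in entities:
        # Number the placeholder only when a type occurs more than once,
        # as in [PERSON_0] and [PERSON_1].
        if totals[label] > 1:
            placeholder = f"[{label}_{seen[label]}]"
            seen[label] += 1
        else:
            placeholder = f"[{label}]"
        text = text.replace(span, placeholder)
    return text

sentence = ("Steve Jobs and Steve Wozniak founded Apple "
            "on April 1, 1976 in Cupertino, California.")
print(redact(sentence, entities))
```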
If you’re synthesizing sensitive information, then you use the NER model in the same way, but instead of redacting the words, you replace them with fake non-sensitive words. So our example sentence would be synthesized as something like:
John Doe and Jane Moe founded Try Inc. on May 7, 1984 in Kansas City, Missouri.
The sensitive information is replaced with non-sensitive information: the synthesized text looks like, and has similar meaning to, the original real text, but the sensitive information is removed.
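Synthesis uses the same identified spans, but swaps each for a fake value of the same entity type. The fake values below are made up for illustration; a real system generates realistic, consistent replacements:

```python
# Fake replacement values, keyed by the real spans the NER model found.
# These mappings are illustrative only.
FAKE_VALUES = {
    "Steve Jobs": "John Doe",
    "Steve Wozniak": "Jane Moe",
    "Apple": "Try Inc.",
    "April 1, 1976": "May 7, 1984",
    "Cupertino, California": "Kansas City, Missouri",
}

def synthesize(text, replacements):
    for real, fake in replacements.items():
        text = text.replace(real, fake)
    return text

sentence = ("Steve Jobs and Steve Wozniak founded Apple "
            "on April 1, 1976 in Cupertino, California.")
print(synthesize(sentence, FAKE_VALUES))
```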
Tonic Textual is a text de-identification product that uses state-of-the-art proprietary named-entity recognition (NER) models to identify sensitive information in text.
The NER models in Tonic Textual are artificial intelligence models that are trained on proprietary data that Tonic owns. The data is specifically tailored to enterprise use cases, and the models achieve state-of-the-art NER accuracy.
Tonic Textual supports both redaction and synthesis of sensitive information, and can identify a variety of entity types, with new types added regularly. The supported entity types in Tonic Textual include first name, last name, street address, city, state, zip code, phone number, and datetime.
Tonic Textual’s custom models feature uses advanced LLM technology to let you create an NER model for an entity type that Tonic Textual’s core models do not cover. The feature uses an LLM to generate NER training data for any entity type that you choose, and then trains an NER model on that generated data.
You can deploy Tonic Textual on-premises, with capabilities to run efficiently on a CPU or to leverage GPUs for optimized performance. Tonic Textual’s deployment has configuration options to maximize NER efficiency, so that you can easily do bulk document redaction.
Tonic Textual NER models are trained on proprietary data that is specifically tailored to enterprise use cases, which gives Tonic Textual’s NER models much better performance than open source solutions.
Open source solutions rely on open source NER datasets for training data. These datasets (CoNLL 2012 and CoNLL 2003 for instance) come from public data that is very different from the internal data a company may want to work with.
In addition to being trained on different data than open source models, Tonic Textual’s NER models support more entity types than the most popular open source solutions.
Tonic Textual’s NER models are also easily configurable on your hardware to be optimized for speed and efficiency on CPUs and GPUs. The deployment support and configuration we provide are not easily obtained with open source models.
Tonic Textual can fit in your language model training pipeline to synthesize the training data. You can remove any private information you do not want your language model to have access to.
Tonic Textual is great for declassifying and protecting documents. In addition to working with text files or being used programmatically via its API, Tonic Textual can do NER on PDF files, which allows you to redact classified entity types in your PDF files.
The Tonic Textual API allows Tonic Textual to fit easily into an MLOps workflow.
Whether you deploy Tonic Textual on-premises or you use the Tonic Textual cloud offering (textual.tonic.ai), you can use the Textual API to access the Textual NER models programmatically.
The tonic-textual Python SDK allows you to easily integrate Tonic Textual into your data processing pipelines to de-identify your free text data at any step in the process.
To get started with Tonic Textual, sign up for a free account at textual.tonic.ai or book a demo directly with the Textual team. Also, be sure to check out the Tonic Textual documentation for comprehensive overviews of the product’s capabilities.