Technical deep dive

Deep dive: small vs large language models for token classification

Author
Ander Steele, PhD
June 20, 2025

Check out this talk from Ander Steele, Head of AI at Tonic.ai, which was first presented at ODSC East in Boston, and explores the evolving landscape of named entity recognition (NER); specifically comparing the performance of small language models versus large language models for token classification tasks. With real-world examples, practical evaluation criteria, and insights drawn from experiments using both public web data and proprietary datasets, this presentation offers valuable guidance for practitioners working with unstructured data in domains like healthcare, customer support, and compliance. Whether you're fine-tuning models or evaluating the trade-offs between cost, speed, and accuracy, this session delivers a nuanced perspective on how to get the most out of modern NLP tools.

Interested in a custom demo to support your specific use case? Book a demo with one of our experts.

Full Transcript: 

Let's get into the talk, which is about small language models versus large language models (LLMs) for token classification. Do smaller models still outperform LLMs? The token classification task I want to talk about here is named entity recognition, meaning extracting named entities from unstructured text.

A named entity is something like a name, an organization, a location, or another specific entity in the text that needs to be detected for any number of reasons. It could also be things like nationality or religious preference, or modifiers and adjectives describing specific entities in the text.

Or it could be other interesting nouns like diseases, medications, or dosages of those medications. These are all examples of things one would try to detect in the context of named entity recognition, and for a variety of purposes; information extraction is one.

Imagine you have a clinical note and you want to extract the medications and dosages from that note into a machine-parsable format. From our perspective at Tonic, we care about this from the point of view of protecting privacy. We want to be able to detect the named entities in text, including elements like personally identifying information, and then either redact or synthesize those entities.

The classification task here starts with annotation guidelines, in other words, descriptions of what the entities are. Then we take text, which we're going to break up into tokens, or words. And for each of those tokens, we're going to predict a label.

The label can be a person entity or some other category of word for redaction. We ask: which entity, if any, in our guidelines does that token correspond to? From those token-level predictions, you can roll the tokens up into individual entities. The way we're going to frame our evaluation and think about structuring this problem is motivated by the CoNLL (Conference on Natural Language Learning) 2003 benchmark, a named entity recognition benchmark which posited the following evaluation criteria: predictions should be evaluated based on spans, meaning an entity's start and stop location in the text.

That means start and stop indices in the text as well as labels. So in the example 'My name is G. Ander Steele', my given name 'G. Ander', which goes from characters 11 to 19, is highlighted in light blue.

My family name 'Steele' is highlighted in green, and that runs from characters 20 to 26. To be scored correctly with respect to these evaluation criteria, a model would be required to produce two predictions with the same start and stop indices and the same labels. All of the other predictions shown, even though they overlap, are considered incorrect.

In the first example here, I've broken 'G. Ander' up into separate given names, which is not correct according to my ground truth labels and annotation guidelines. So it's a very strict evaluation criterion; that's one thing to keep in mind. And I mentioned the CoNLL 2003 benchmark, which is basically: extract names, locations, and organizations from Reuters headlines.
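
To make the strictness concrete, here's a minimal sketch of span-level scoring under exact matching. The label names are illustrative and this is not the official CoNLL scorer; the character offsets come from the example above.

```python
# A minimal sketch of strict, CoNLL-style span matching: a prediction counts
# only if its start, end, and label all match a gold entity exactly.
def strict_span_f1(gold, predicted):
    """Entities are (start, end, label) tuples over character offsets."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# "My name is G. Ander Steele"
gold = [(11, 19, "GIVEN_NAME"), (20, 26, "FAMILY_NAME")]
# Splitting "G. Ander" into two given-name spans overlaps the gold span
# but matches neither exactly, so only "Steele" is counted as correct.
pred = [(11, 13, "GIVEN_NAME"), (14, 19, "GIVEN_NAME"), (20, 26, "FAMILY_NAME")]
print(strict_span_f1(gold, pred))  # 0.4
```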

And you can see the progress on this benchmark over the past 10 years; it has largely saturated at an F1 score of around 0.94. But there are a couple of interesting points that are not reflected in this plot. The first interesting point is a release in 2019.

That release is BERT, Google's small language model, which could be fine-tuned on a variety of NLP tasks. You can see it here in the LSTM-CRF+ELMo+BERT+Flair model, a combination of various techniques that produced the best-scoring model of its time.

The other interesting point, which is not visible here, is March 14th, 2023, when GPT-4 was released. We don't see a point indicating that this benchmark has been solved by large language models, which may or may not be surprising, depending on how optimistic you are about LLMs.

I'll talk a little bit about the small language models in this cast of characters. In this talk, we mean fine-tuned versions of BERT or RoBERTa. These are encoder models that take sequences of tokens and produce a vector embedding for each of those tokens, representing not only what the token is but what it's doing in context.

These are very powerful because you can build classifiers on top of those vector embeddings. In particular, for the NER task, you can build classifiers on top of the token embeddings that classify each token as either one of our entities or not.
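
As a minimal sketch of that setup, here's token classification with an encoder model using the Hugging Face transformers library. The checkpoint dslim/bert-base-NER is a public CoNLL-style model used as a stand-in; it is not Tonic Textual's model.

```python
# Each token embedding is mapped to a label distribution by the classification
# head; argmax over the labels gives the predicted entity tag per token.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dslim/bert-base-NER"  # public stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

inputs = tokenizer("My name is G. Ander Steele and I work in Boston.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, num_labels)

predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id.item()])
```

The library's pipeline("token-classification", aggregation_strategy="simple") helper performs the same prediction and also rolls the per-token tags up into entity spans with character offsets, which is the rollup step mentioned earlier.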

In order to do that fine-tuning, you need a large number of annotated data samples, which can be on the order of a thousand examples of your entity in context. The quality of inference from the fine-tuned model depends very much on the quality of the training data.

Let's say you have enough data; maybe it's a thousand examples, or for more complicated classes it's many more. There's no hard and fast rule here, but you need a significant amount of human-labeled data to make this work. And these labels, of course, need to be internally consistent and correct.

The labels need to follow the annotation guidelines and be consistent across the many examples. Finally, the data you're using to fine-tune the model needs to be representative of the data you're going to run inference on. A concrete example: if you take a model that's trained on public web data, like spaCy's, and try to use it for inference on automatic speech recognition transcripts, which are not part of its training data, the performance will generally suffer quite a lot, because the inference data is well out of distribution with respect to the training data.

So your training data needs to be representative of the downstream task to leverage these models successfully. On the other side, we have large language models (LLMs), which, unlike BERT and RoBERTa, have been pre-trained on much more data.

These are much more comprehensive 'world models', if you will, based on pre-training over roughly 1000x more tokens. These models appear to work well for a variety of information extraction problems, which means they should also work for this use case, in principle, right?

We've all seen demos of people using LLMs to structure notes, i.e., extract JSON representations of text. In principle this should work, and we could imagine it working even better if we provide examples and context using few-shot prompts.

We can engineer prompts for LLMs to work well by giving examples of inputs and expected outputs. By 2023 we were thinking maybe NER is simply solved by GPT-4, but there are certainly lots of challenges with making that work in practice.

One of the big challenges was structuring the outputs, because we're basically predicting tokens here. We're predicting entities, which are substrings of our text, and we have to turn the output into a machine-parsable list, i.e., structured output like JSON.

At the time that was difficult to do, but now it's trivial; this is a solved problem. All the major providers offer structured outputs, and if you're running your own model, you can impose constraints on the generation to make sure you're outputting the kind of JSON structure you expect.
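
As a rough illustration, here's a minimal sketch using the OpenAI Python SDK's structured-output parse helper with a pydantic schema; the model name, prompt, and field names are illustrative assumptions, not the exact setup used in the experiments below.

```python
# The response is constrained to the EntityList schema, so no ad-hoc parsing
# of free-form text is needed. Model, prompt, and schema are illustrative.
from openai import OpenAI
from pydantic import BaseModel

class Entity(BaseModel):
    text: str   # the entity substring as it appears in the input
    label: str  # e.g. "GIVEN_NAME", "LOCATION"

class EntityList(BaseModel):
    entities: list[Entity]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract named entities according to the annotation guidelines."},
        {"role": "user", "content": "My name is G. Ander Steele and I live in Boston."},
    ],
    response_format=EntityList,
)
print(completion.choices[0].message.parsed.entities)
```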

So that's one problem solved. But there's still another problem, which is: how do you localize entities in the text? I'll give you two example sentences where the same substring occurs with many different meanings.

Sentence 1: The Washington Monument in Washington celebrates the life and achievements of Washington.

In this first sentence, if we were asked to extract names and locations: which Washington corresponds to which? The Washington Monument, in Washington, celebrates the life and achievements of [George] Washington. The second Washington here is obviously a location, and the third is a person. So whatever our prediction is, it needs to indicate which occurrence of the string is which.

Sentence 2: Buffalo buffalo buffalo buffalo buffalo buffalo buffalo 

This second sentence is a famous pathological sentence made of many occurrences of the word 'buffalo'. The first 'buffalo' indicates the region in New York, the second 'buffalo' is the animal, and the third 'buffalo' is the verb meaning to harass.

So these 'buffalo's all have different meanings, even though they're the same word. That's a problem that needs to be solved, because the output of an LLM is just tokens: you need to be able to describe which occurrence of the word is which.

Finally, there's the general problem of using LLMs: you need to align them to your intent. In this case that means crafting good annotation guidelines that the LLM can follow. This is similar to how you might build annotation guidelines for human annotators; we have the same problem of describing what you want in a way that can be followed like a recipe.

It may not be obvious how hard this is in practice. Take the simple example of the instruction 'tag all person names'. If we were labeling names, this already raises several questions that have to be answered.

For example: does 'Dr. Watson' include the prefix 'Dr.'? Should 'the president', which refers to a specific individual but not by name, count? Do we care that Dr. Watson is a fictional character? All of these questions have to be answered, and of course I have strong opinions on them, because I've written annotation guidelines for these cases.

And I personally think the answer to all of these questions should be 'no', but that's for my annotation guidelines. The counterpoint is that when you're writing your own annotation guidelines, it becomes your priority to figure out all of these rules and edge cases. So it's actually pretty important to review the data as a source of edge cases and to formulate these rules.

So, with all that said about the challenges of using LLMs, I also want to talk about some experiments where we compare small language models against large language models on a couple of different datasets and a few different tasks.

The two datasets we're going to look at are: one, public web data, which is presumably in the pre-training corpus of these large language models. These are things like Common Crawl samples or SEC filings. The data themselves probably belong to the pre-training corpus, but the annotations are our own, so it's reasonable to expect that those haven't been trained on.

The second dataset I want to talk about is ASR transcripts, that is, automatic speech recognition transcripts. This is data we reasonably believe not to be included in the pre-training corpus of any LLM. These are transcripts of conversations between a customer support agent and a customer.

These, of course, are rife with PII: things like payment information, names, locations, and so on. We want to compare the performance of these models. On one side, the models we'll evaluate are the small language models, the fine-tuned encoder models described earlier, and we'll show two different variants.

One is Tonic Textual, our product, which we've trained on tens of thousands of these examples. But to showcase the importance of good training data, we also fine-tuned a version of Textual that hasn't seen any ASR transcripts; we'll call this 'non-ASR Textual'. We simply threw away all of the ASR data in our training corpus to train that model. On the LLM side, we'll take a smattering of foundation models, the premier non-reasoning models from the big labs, and evaluate those.

To describe a little bit about how we do this evaluation: I've prompted these LLMs to produce structured outputs following a schema where each entity is predicted as a JSON object containing the text (that's the substring), the label of the substring, and then the occurrence of that substring in the text.

So, in particular, 'which Washington is it? Is it the first Washington, the second Washington, or the third Washington?' That's the occurrence index. We output a list of these, and that potentially solves the problem of localizing text in the outputs.
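
Here's a hedged reconstruction of that schema, plus a resolver that maps each prediction back to character offsets; the field and function names are my own, a sketch of the approach rather than the exact code used in these experiments.

```python
# Each predicted entity carries its substring, its label, and which occurrence
# of that substring it refers to; a resolver maps this back to a character span.
from pydantic import BaseModel

class Entity(BaseModel):
    text: str        # the entity substring, e.g. "Washington"
    label: str       # e.g. "PERSON", "LOCATION"
    occurrence: int  # 1-based: which occurrence of `text` in the source

def resolve_span(source: str, entity: Entity):
    """Return (start, end) offsets for the prediction, or None if the substring
    or occurrence doesn't exist in the source (i.e. a hallucinated entity)."""
    start = -1
    for _ in range(entity.occurrence):
        start = source.find(entity.text, start + 1)
        if start == -1:
            return None
    return start, start + len(entity.text)

sentence = ("The Washington Monument in Washington celebrates "
            "the life and achievements of Washington.")
# The third "Washington" is the person:
print(resolve_span(sentence, Entity(text="Washington", label="PERSON", occurrence=3)))
```

Predictions that fail to resolve this way are exactly the invalid, hallucinated entities discussed in the failure-mode analysis below.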

We also provide pretty comprehensive annotation guidelines to the model; these include the annotation guidelines that our internal annotators use. As an example, here are our guidelines for given names. We have the definition and rules describing in detail things like prefixes, which we choose not to include.

How do we treat middle names? Et cetera. There are also a bunch of examples and edge cases, which we include but which I haven't shown here. Those are all part of the annotation guidelines the LLM is prompted with. Now, on public data, Textual performs quite well on this problem, and it's very fast.

It beats all of these models on this particular task, pretty substantially. The best-performing large language model is Gemini 2.5, which has an F1 of around 0.84; I would consider that a good model. All the other models never break the F1 threshold of 0.8, which is what I consider the qualitative threshold between a good and a bad model.

So it's surprising that there's a chasm between Gemini and these other models. What's also worth noting is that Gemini is substantially slower and substantially more expensive than any of the other LLMs. So you can achieve reasonably good performance using large language models on this specific task of NER on this dataset, but it's quite slow and very expensive. It's much more cost-effective, more performant, and even more accurate to use a small language model here.

We can also break it down by entity type. It's interesting to note how the performance varies across types. It's funny that money becomes a difficult thing for these models to predict, as do datetimes. This probably has to do with the rules around the boundaries of these spans.

Names, of course, are also difficult for the non-Gemini models, which is surprising but probably also comes down to issues around span boundaries. So let's focus specifically on the problem of names, and on the most difficult example: ASR transcripts, where a name can occur over multiple turns of a conversation, as when a customer spells their name out to a support representative.

We provide very detailed rules on how to handle that and how it should be tagged. Our model performs quite well here, with an F1 score of 0.95. Essentially all the other models fail to follow these annotation guidelines comprehensively.

Gemini almost reaches the 0.8 barrier on this subtask of names and still outperforms the other models, but again, it's very slow and expensive. If we expand the problem to other PII types, like payment information, credit card numbers, et cetera, then there are more things to predict and the scores change.

Again, Textual of course has an F1 greater than 0.95 on this test set. Gemini still breaks the 0.8 barrier; it's reasonably good here, but still not nearly as performant as a small language model that has been fine-tuned on similar data.

It has been fine-tuned on lots of similar data, and it's worth pointing out that the LLMs haven't seen any of this data, so in some sense it's an uneven comparison, right? The Textual model has been fine-tuned on many thousands of examples of similar data, whereas the LLMs haven't seen any of it.

They've only seen the annotation guidelines. So let's talk in a little more detail about some of the failure modes of the LLMs and the ways we tried to make these models perform better. I've shown comparisons between LLMs and our small language model, but the design space of prompts is infinite.

I haven't proven that LLMs perform worse; I've only shown that this particular invocation of them performed worse. So we should do an earnest job of actually trying to make these things work. Here is a first analysis of the failure modes of these models in predicting real entities.

Even though I've described a schema for entities, some of the failures that happen are that the models predict a substring that doesn't occur in the text, or they predict a substring with an occurrence index that doesn't exist in the text. All of these are invalid, hallucinated entities, which are useless.

They can't be reconciled with the text in any meaningful way, and they're significant in some models: with GPT-4, basically 5% of the entities detected here were invalid, so that's not very useful. GPT-4.5 does a much better job of following the instructions here and not hallucinating entities, but the predictions are still qualitatively not good.

Gemini 2.5 Pro does almost as well, and overall the entities it predicts are more correct. So how can we make this better? One of the experiments we tried was providing short versus full annotation guidelines. The short version of the guidelines is basically a one-sentence description of each entity, and the full guidelines were the full thing I showed you, with rules, edge cases, and examples. For most of these LLMs we see a significant performance boost from using the full guidelines, except for GPT-4.1, which strangely performs worse.

Perhaps because it already had some sense of what it needed to do here, it was distracted by the extra context. The other result, which I think is striking for its strangeness, is how performance changes when you go from zero examples of annotated data to a few; in this case, two examples of annotations.

A lot of these models perform significantly better, with the exception of Gemini, which was our best-performing model and appears to be confused by the examples we chose here. Those examples were carefully chosen by me, so maybe they were not useful to the model and misaligned it.

It's worth exploring this as an area for further improvement. Maybe there's a way of using, rather than two examples or hundreds of examples, some small number of carefully selected examples to boost the performance of the LLM on this task.

I'll stop here with just a couple of conclusions.

One thing I haven't addressed is fine-tuning large language models for this task. This has essentially been answered by prior work, the UniversalNER paper, which fine-tunes LLMs of fewer than 32 billion parameters on the task of predicting these NER samples.

It found that the performance of these models was neck and neck with the fine-tuned small language models. So I don't see a whole lot of benefit at the moment in fine-tuning large language models on the NER task in the same way that we fine-tune small language models; the small models are more cost-effective and better performing.

However, as these LLMs get better and better, I think it's increasingly possible to use them as data annotators, particularly when you have humans in the loop generating test sets against which the annotations can be reviewed. This is something we use in practice today at Tonic.

We use LLM annotators to help extend our training corpus. I think there's also an interesting angle here for using reasoning models, which I haven't touched on at all. We use them internally as data annotation reviewers, and it's interesting to watch the reasoning traces of these models as they compare guidelines with annotations. We've found this helpful for filtering annotations for quality. With that, I'll pause for questions.

Ander Steele, PhD
Head of AI

Make your sensitive data usable for testing and development.

Unblock data access, turbocharge development, and respect data privacy as a human right.
Accelerate development with high-quality, privacy-respecting synthetic test data from Tonic.ai.