What is model hallucination?

Ethan P
April 26, 2024

Large language models (LLMs) like GPT-3 and GPT-4 have revolutionized the technology landscape. They've become incredibly adept at generating human-like text, answering questions, and even creating content that's indistinguishable from what a human might write. However, these models have one major limitation: hallucination.

What causes hallucination in AI?

Hallucination occurs when a language model generates information that is either false, nonsensical, or not grounded in reality. This doesn't mean the AI is 'delusional' in a human sense, but rather that it's producing content without a clear basis in its training data or real-world facts. Hallucinations can range from minor inaccuracies to outright fabrications, and they represent a significant challenge for the reliability and trustworthiness of language models.

The causes of model hallucination are multifaceted. They can stem from biases in the training data, the inherent limitations of the model's architecture, or the ambiguity of the user's prompt. Additionally, because LLMs generate responses based on patterns in their training data rather than retrieving specific pieces of information, they're inherently prone to making things up; there's no ground-truth data for them to pull from.

Retrieval-Augmented Generation (RAG) systems

This is where Retrieval-Augmented Generation systems come into play. RAG systems combine the generative capabilities of LLMs with retrieval-based components, which pull in information from external databases or documents in real-time. The idea is to ground the LLM's responses in actual data, thereby reducing the likelihood of hallucinations.

In a RAG system, when a query is received, the retrieval component first searches a database or set of documents to find relevant information. This information is often called the “context” or “retrieved context” for the LLM. This retrieved context is then fed into the LLM, guiding it to generate responses based on actual data rather than solely relying on the patterns it learned during training. This approach not only helps mitigate hallucinations but also enables the model to provide up-to-date information that wasn't available during its training period.
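To make that flow concrete, here is a minimal sketch of a RAG pipeline in Python. The retriever, the prompt template, and the `call_llm` stub are all hypothetical stand-ins for whatever vector store and LLM client you actually use; the point is simply to show where the retrieved context enters the prompt.

```python
# Minimal RAG sketch. The retriever and `call_llm` are hypothetical
# placeholders for your own vector database and LLM client.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    # A real system would use embeddings and a vector store.
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    # Ground the LLM by injecting the retrieved context into the prompt.
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your LLM client (OpenAI, Anthropic, a local model, etc.).
    raise NotImplementedError("plug in your LLM client here")

def answer(query: str, documents: list[str]) -> str:
    context = retrieve(query, documents)
    return call_llm(build_prompt(query, context))
```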

How RAG systems address AI hallucinations

RAG systems directly address the problem of model hallucination by providing a factual grounding for the LLM's responses. They work by scanning your data, finding the pieces that most closely relate to the question posed to the LLM, and injecting those pieces into the prompt. By integrating retrieval into the generative process, these systems ensure that the LLM has access to real-world information, which can significantly reduce the incidence of fabricated or irrelevant responses.

Moreover, RAG systems can be designed to highlight when information is retrieved from a specific source, offering users a way to verify the accuracy of the response. This transparency is crucial for building trust and ensuring that the generated content meets the user's expectations for accuracy and relevance.

The limitations of RAG systems

RAG systems do have one major limitation, though: even if you provide ground-truth facts through retrieval, hallucination can still occur, for several reasons. Sometimes the LLM ignores the retrieved information and generates its own answer. Other times the LLM includes all of the information provided to it but then adds details to the response that aren't supported by the ground-truth facts. The LLM can also misinterpret the facts in the context (the information that the RAG system retrieved) and give an incorrect answer. Finally, the retrieval step itself can fail by injecting irrelevant information into the prompt, which confuses the LLM and causes an incorrect response.

The role of Tonic Validate in enhancing RAG systems

As you can see, even with RAG, model hallucination is still a problem. This is part of the reason we built Tonic Validate. Validate is a tool we designed to help measure the performance of RAG systems and LLMs. It consists of two parts:

  1. A Python library for evaluating performance
  2. A web app for visualizing the results from the Python library
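To give a sense of how the Python library fits into a workflow, a scoring run looks roughly like the sketch below. The class names, the callback shape (returning the LLM's answer together with its retrieved context), and the `overall_scores` field follow the Tonic Validate documentation as we understand it; treat this as a sketch and check the current docs for exact signatures.

```python
# Rough sketch of a Tonic Validate scoring run. Verify names and signatures
# against the current Tonic Validate documentation before relying on them.
from tonic_validate import Benchmark, ValidateScorer

# Questions paired with reference ("ground truth") answers.
benchmark = Benchmark(
    questions=["What is the capital of France?"],
    answers=["Paris is the capital of France."],
)

def get_rag_response(question: str) -> dict:
    # Call your own RAG system here (hypothetical `my_rag_system`) and return
    # both the LLM's answer and the list of retrieved context chunks.
    llm_answer, context_list = my_rag_system(question)
    return {"llm_answer": llm_answer, "llm_context_list": context_list}

scorer = ValidateScorer()               # uses the default set of metrics
run = scorer.score(benchmark, get_rag_response)
print(run.overall_scores)
```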

In our Python library, we have many different “metrics” which measure your LLM/RAG system’s performance. Each metric targets a specific problem with RAG systems to help you figure out whether or not your system is performing as expected. For instance, here are some of the things Validate can help you measure.

A screenshot of the Tonic Validate UI

Ignored retrieval information

As mentioned in the previous section, hallucination often occurs in a RAG system when the retrieved information is ignored. Thankfully, Tonic Validate has two metrics that can help with this.

Augmentation Precision: This metric scores whether the relevant context appears in the LLM's answer. For example, when querying the capital of France, the system might retrieve "The capital of France is Paris" and "France is a country in Europe." Augmentation Precision checks whether the LLM's response contains the relevant context, "The capital of France is Paris." A response containing this information (e.g., "France's capital is Paris") will receive a perfect score, while a response without it (e.g., "France's capital is London" or "France is a European country") will score poorly.

Augmentation Accuracy: This metric scores whether all of the retrieved context appears in the answer. It differs from Augmentation Precision in that it doesn't try to determine whether the context is relevant; it simply checks that all of the context is used in the answer. Following the previous example, an answer of "Paris is the capital of France and France is in Europe" will score a perfect 1.0/1.0, whereas an answer of "Paris is the capital of France" will score 0.5/1.0 (since it uses one of the two pieces of retrieved information). Finally, "The capital of France is London" will score 0.0/1.0 since it uses none of the retrieved information.
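To build intuition for the difference between the two metrics, here is a deliberately simplified, keyword-matching version of the idea. Tonic Validate scores these with an LLM judge rather than string checks, so treat this purely as an illustration of what each metric looks at.

```python
# Toy illustration of Augmentation Precision vs. Augmentation Accuracy.
# Tonic Validate uses an LLM judge; the keyword checks here are only for intuition.
retrieved_context = [
    "The capital of France is Paris.",
    "France is a country in Europe.",
]
relevant_context = ["The capital of France is Paris."]  # relevant to the question

def appears_in(chunk: str, answer: str) -> bool:
    # Crude proxy for "the answer uses this piece of context".
    keyword = "paris" if "Paris" in chunk else "europe"
    return keyword in answer.lower()

def augmentation_precision(answer: str) -> float:
    # Fraction of the *relevant* context reflected in the answer.
    used = [c for c in relevant_context if appears_in(c, answer)]
    return len(used) / len(relevant_context)

def augmentation_accuracy(answer: str) -> float:
    # Fraction of *all* retrieved context reflected in the answer.
    used = [c for c in retrieved_context if appears_in(c, answer)]
    return len(used) / len(retrieved_context)

print(augmentation_accuracy("Paris is the capital of France and France is in Europe."))  # 1.0
print(augmentation_accuracy("Paris is the capital of France."))                          # 0.5
print(augmentation_precision("France's capital is Paris."))                              # 1.0
print(augmentation_precision("France is a European country."))                           # 0.0
```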

Adding unsubstantiated information

Hallucination can also occur when the LLM adds information that isn't in the retrieved context. Validate has another metric that can help with this, called Answer Consistency.

Answer Consistency: This metric checks whether the LLM's answer contains information that does not come from the context. Say you ask "What is the capital of the UK?" and the RAG system retrieves "The capital of the UK is London." If the LLM's answer is "The UK's capital is London. London is famous for Buckingham Palace," that answer will score poorly because it includes information about Buckingham Palace that isn't in the retrieved context.
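A toy version of the same idea: split the answer into sentences and flag any sentence that can't be traced back to the retrieved context. Again, Validate does this with an LLM judge; the keyword check below is just to make the concept concrete.

```python
# Toy illustration of Answer Consistency: flag answer sentences that aren't
# supported by the retrieved context. (Tonic Validate uses an LLM judge.)
context = ["The capital of the UK is London."]
answer = "The UK's capital is London. London is famous for Buckingham Palace."

def supported(sentence: str, context: list[str]) -> bool:
    # Crude proxy: a sentence is "supported" if most of its longer words
    # appear somewhere in the retrieved context.
    words = {w.strip(".,").lower() for w in sentence.split() if len(w) > 3}
    context_text = " ".join(context).lower()
    hits = sum(1 for w in words if w in context_text)
    return hits >= 0.6 * len(words)

sentences = [s.strip() for s in answer.split(".") if s.strip()]
unsupported = [s for s in sentences if not supported(s, context)]
print(unsupported)                             # ["London is famous for Buckingham Palace"]
print(1 - len(unsupported) / len(sentences))   # 0.5 -- half the answer is unsupported
```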

Misinterpretation of facts

LLMs can also misinterpret the information that the RAG system retrieved. This often happens when the LLM is given context that is ambiguous or hard to understand. Say you ask "What powers a quasar and what does it emit?" and the retrieved context is "A quasar is powered by a GCCO" and "A black hole is theorized to emit Hawking Radiation." This might confuse an LLM that doesn't realize a GCCO is another name for a black hole. Our Answer Similarity metric can help here by flagging cases where the LLM misinterprets the facts.

Answer Similarity: This metric measures how well the LLM's answer matches a ground-truth reference answer. In the quasar example, the reference answer would be "A quasar is powered by a black hole, which is theorized to emit Hawking radiation." If the LLM responds with an answer like "A quasar is powered by a GCCO, but we don't know what it emits," it will score low on Answer Similarity, since it misinterpreted the facts and thus didn't match the reference answer. If the LLM correctly interprets the facts, the Answer Similarity score will be high.
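Because Answer Similarity compares against a reference answer, the benchmark needs to include one for each question. A sketch of what that might look like for the quasar example is below; the metric class name and the ability to pass a metric list to ValidateScorer follow the Tonic Validate docs as we understand them, so double-check before copying.

```python
# Sketch: scoring only Answer Similarity on the quasar example.
# Confirm class names against the current Tonic Validate documentation.
from tonic_validate import Benchmark, ValidateScorer
from tonic_validate.metrics import AnswerSimilarityMetric

benchmark = Benchmark(
    questions=["What powers a quasar and what does it emit?"],
    answers=[
        "A quasar is powered by a black hole, which is theorized to emit "
        "Hawking radiation."
    ],
)

scorer = ValidateScorer([AnswerSimilarityMetric()])
run = scorer.score(benchmark, get_rag_response)  # callback from the earlier sketch
print(run.overall_scores)
```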

Retrieval failure

The final way hallucination can occur in a RAG system is when the retriever fails to fetch the correct context. To catch this, Validate offers two metrics you can use.

Retrieval Precision: This metric measures how relevant the retrieved context is to a given question. For instance, if you ask "What is the capital of France?", a retrieved context of "France's capital is Paris" will score high, while a retrieved context of "The capital of Japan is Tokyo" will score low. This helps you determine whether your RAG system is retrieving information that actually helps answer the question posed to the LLM.

Answer Consistency: In addition to catching unsubstantiated information, the Answer Consistency metric can also help surface retrieval failures. If your RAG system retrieves information that is incorrect or irrelevant, the LLM often falls back on its own knowledge, producing an answer that isn't grounded in the retrieved context, which drags the Answer Consistency score down. When that happens, you can infer that retrieval might be the problem and do a deeper dive into the retrieved context for the question you asked.
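If you suspect retrieval is the weak link, you can score just these two metrics. As with the earlier sketches, the metric class names are our best understanding of the Tonic Validate API; confirm them against the current docs.

```python
# Sketch: isolating retrieval problems with two targeted metrics.
# Verify class names against the current Tonic Validate documentation.
from tonic_validate import ValidateScorer
from tonic_validate.metrics import AnswerConsistencyMetric, RetrievalPrecisionMetric

# Reuse the benchmark and get_rag_response callback from the earlier sketches.
scorer = ValidateScorer([RetrievalPrecisionMetric(), AnswerConsistencyMetric()])
run = scorer.score(benchmark, get_rag_response)
print(run.overall_scores)

# A low Retrieval Precision score is a cue to inspect the retrieved context
# for the offending questions before blaming the LLM itself.
```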

Summary

In the rapidly evolving landscape of AI, Large Language Models like GPT-3 and GPT-4 exhibit exceptional text generation capabilities but are prone to hallucinations, where they generate misleading or incorrect information. Retrieval-Augmented Generation (RAG) systems aim to mitigate this by anchoring model responses in real-world data, yet challenges such as ignoring crucial information or introducing erroneous details persist.

Tonic Validate, a toolkit for evaluating RAG system and LLM performance, can help detect and prevent hallucinations in RAG systems. It offers a range of metrics to pinpoint areas where RAG systems falter, such as ignoring retrieved information or adding unsubstantiated content. By providing actionable insights, Validate aids developers in refining their systems, ensuring responses are both accurate and trustworthy.

Ultimately, Tonic Validate empowers developers to improve their RAG systems, fostering more reliable and fact-based AI responses, thereby enhancing user trust and the practical utility of language models in real-world applications. Get started for free today.
