Today, we officially launch Tonic Validate, a new platform built for engineers implementing RAG in their work. We’ve been implementing generative AI technologies in our products since the very beginning of Tonic.ai. Our engineering team is obsessed with finding the latest and greatest technologies out there to experiment with and see how they can be incorporated into our products to bring our customers more value.
Recently, our experimentation involved internal efforts to build Retrieval Augmented Generation (RAG) applications. Never heard of RAG? Check out this blog post to get up to speed. In short, RAG is a solution designed to enhance the performance of LLMs.
Building safe and effective tools on top of LLMs presents novel engineering challenges, including the fundamental problem of testing and measuring success when inputs and outputs are natural language. As a test data company with experience in generative AI, we quickly saw that our RAG testing tool could be useful to any engineer building RAG applications — and thus Tonic Validate was born.
What is Tonic Validate?
Tonic Validate is a platform that streamlines evaluating and iterating on the performance of RAG applications. It consists of three components:
- Tonic Validate - A UI for visualizing and tracking your RAG experiments. Sign up for a free account to check it out.
- Tonic Validate Logging (aka tvallogging) - An open source Python SDK for sending your RAG application’s inputs and outputs to Tonic Validate.
- Tonic Validate Metrics (aka tvalmetrics) - An open source Python package containing RAG metrics for evaluating the accuracy of RAG responses. These metrics were developed by us and are built into Tonic Validate, but they can also be used independently of the UI.
Together, these components provide an evaluation framework for your RAG application. With Tonic Validate, you get:
- Automatic logging and metrics calculations in just a few lines of Python code.
- A convenient UI to help visualize your experiments, iterations, and results.
Tonic Validate gives you the tools you need to rigorously improve your RAG applications.
Why Tonic Validate?
Earlier this year we built a RAG chatbot to allow Tonic employees to intuitively interact with our internal documents. While it is often easy to build compelling LLM demos using cherry-picked examples, robust application performance depends heavily on the careful tuning of many RAG parameters and pipeline components. Tuning these parameters without metrics or evaluations is an impossible task, and the tooling for managing these experiments did not yet exist. So we developed a standard evaluation framework for measuring the accuracy of a RAG application's responses. It follows three steps:
- Create a benchmark dataset of questions and answers to test the RAG application with.
- Measure the accuracy of the RAG application’s natural language responses by using LLMs to derive scores from different aspects of the responses.
- For each change to the architecture of the RAG application, obtain new responses to the questions in the benchmark dataset and observe how the metrics change.
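The three-step framework can be sketched in plain Python. Everything here is an illustrative stand-in, assuming a simple question/answer benchmark format; `rag_app` and `score_answer` are hypothetical placeholders, not the Tonic Validate API:

```python
# Illustrative sketch of the three-step evaluation loop.
# `benchmark`, `rag_app`, and `score_answer` are hypothetical stand-ins,
# not part of the Tonic Validate SDKs.

# Step 1: a benchmark dataset of questions and reference answers.
benchmark = [
    {"question": "What does RAG stand for?",
     "answer": "Retrieval Augmented Generation"},
    {"question": "Which LLM scores the responses?",
     "answer": "An LLM evaluator such as GPT-4"},
]

def rag_app(question: str) -> str:
    """Stand-in for the RAG application under test."""
    return "Retrieval Augmented Generation"

def score_answer(response: str, reference: str) -> float:
    """Stand-in for an LLM-derived similarity score on a 0-5 scale."""
    return 5.0 if response == reference else 2.0

def run_experiment(benchmark) -> float:
    # Steps 2 and 3: collect fresh responses and average their scores,
    # so each architecture change is compared on the same benchmark.
    scores = [
        score_answer(rag_app(item["question"]), item["answer"])
        for item in benchmark
    ]
    return sum(scores) / len(scores)
```

Re-running `run_experiment` after each change to the RAG pipeline yields a single comparable number per iteration, which is the core of the workflow the UI visualizes.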
Defining the metrics with which to evaluate the RAG application was itself a challenge, and that challenge is two-fold:
- A RAG application is built of many moving pieces. Any small change to one of these pieces can have a significant impact on the answers returned by the application.
- It is difficult to evaluate natural language responses. Unlike numeric predictions in a supervised machine learning task, there aren’t well-defined mathematical metrics for measuring the accuracy of natural language responses.
How are Tonic Validate’s metrics calculated?
The steps of a typical RAG application are:
- The user asks a question.
- The application retrieves contextual information relevant to the question from the documents database, called the retrieved context.
- The application augments the user’s question to prompt an LLM.
- The LLM generates an answer to the question, which is returned to the user.
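The four steps above can be sketched end to end with stubbed components. The document store, keyword retriever, and LLM call below are all illustrative assumptions, standing in for a real vector database and model:

```python
# Minimal sketch of the four RAG steps with stubbed components.
# The documents, retriever, and LLM call are illustrative stand-ins,
# not a specific framework's API.

DOCUMENTS = [
    "Tonic Validate evaluates RAG applications.",
    "RAG retrieves context from a documents database.",
]

def retrieve_context(question: str, docs: list[str]) -> list[str]:
    # Step 2: naive keyword overlap standing in for a vector search.
    terms = set(question.lower().split())
    return [d for d in docs if terms & set(d.lower().split())]

def build_prompt(question: str, context: list[str]) -> str:
    # Step 3: augment the user's question with the retrieved context.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

def llm_answer(prompt: str) -> str:
    # Step 4: stand-in for the LLM generating an answer from the prompt.
    return prompt.splitlines()[1]

question = "What does Tonic Validate do?"       # Step 1: the user asks.
context = retrieve_context(question, DOCUMENTS)  # the retrieved context
answer = llm_answer(build_prompt(question, context))  # the returned answer
```

The two variables `context` and `answer` at the end are exactly the two outputs the metrics below evaluate.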
In this flow, the retrieved context and the returned answer are the two outputs of the RAG application that need to be evaluated and improved. To evaluate these natural language outputs, Tonic Validate uses an LLM evaluator (e.g., GPT-4) to score them by comparing them to the question asked and the correct answer (what the answer should be). The LLM evaluator is asked the following questions, one per metric, to assess each aspect of the RAG application:
- Answer similarity score: How well does the returned answer match the correct answer?
- Retrieval precision: Is the retrieved context relevant to answer the question?
- Augmentation precision: Is the relevant retrieved context included in the LLM prompt?
- Augmentation accuracy: How much of the retrieved context is in the answer?
- Answer consistency: Does the answer contain information that is not derived from the retrieved context?
The LLM is prompted with each question in such a way that it will return a structured, numeric score. For example, for the answer similarity score, the LLM is prompted as follows:
“Considering the reference answer and the new answer to the following question, on a scale of 0 to 5, where 5 means the same and 0 means not similar, how similar in meaning is the new answer to the reference answer? Respond with just a number and no additional text.”
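A sketch of how such a structured-scoring prompt might be assembled and its reply parsed, assuming the wording quoted above; the LLM call itself is omitted, and the clamping behavior is an illustrative defensive choice, not the documented implementation:

```python
# Sketch of assembling the answer-similarity prompt and parsing the
# model's numeric reply. The template mirrors the quoted prompt; the
# parsing and clamping logic is an illustrative assumption.

SIMILARITY_TEMPLATE = (
    "Considering the reference answer and the new answer to the following "
    "question, on a scale of 0 to 5, where 5 means the same and 0 means "
    "not similar, how similar in meaning is the new answer to the "
    "reference answer? Respond with just a number and no additional text.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "New answer: {new}"
)

def parse_score(reply: str, low: float = 0.0, high: float = 5.0) -> float:
    # The model is told to reply with a bare number; clamp it defensively
    # in case it drifts outside the requested range.
    return min(high, max(low, float(reply.strip())))

prompt = SIMILARITY_TEMPLATE.format(
    question="What is RAG?",
    reference="Retrieval Augmented Generation",
    new="A technique that augments LLM prompts with retrieved context",
)
```

Because the prompt demands a bare number, the reply can be consumed directly as a metric value rather than requiring further natural language interpretation.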
It may sound counterintuitive to score the performance of one LLM with another; however, research has found that an LLM evaluator of natural language performs almost as well as a human evaluator. Further, breaking down the question, answer, retrieved context, and correct answer into the questions above gives the LLM evaluator simpler tasks than the one facing the LLM that generates the RAG application's responses.
For more information on these metrics, including the complete definition of each, check out our docs and this Jupyter notebook to see these metrics in action and how they vary with the most common RAG parameters.
Get Started With Tonic Validate
RAG is a revolutionary method for improving LLM performance when you have a large and dynamic free-text dataset. We are excited to share our open-source libraries and intuitive UI to help you avoid the struggles we experienced in improving our own RAG application.