RAG Evaluation Series: Validating the RAG performance of OpenAI vs CustomGPT

Adam Kamor, PhD
February 27, 2024

    This is the sixth installment in a multi-part series evaluating various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform. All the code and data used in this article is available here.


    We’ve been focusing our RAG evaluation series on specialized tools and frameworks that integrate retrieval-based components with LLMs. These tools are great for hackers and developers exploring RAG use cases for their generative AI apps; however, there is another side of the market we’ve been following, one focused on abstracting away the technical complexity of building a chatbot on top of your company data: no-code. It’s great for organizations that may not have the technical talent or resources to build a de novo system on their own.

    One platform in particular that you might have heard of has been making waves in the space since the launch of ChatGPT in 2022: CustomGPT is a platform designed to offer no-code solutions for implementing LLMs into chatbots, allowing users to create, customize, and deploy AI-powered chatbots without needing to write code.

    I’m excited about this one because it enables an entirely different side of the market to make use of the awesome potential of generative AI. For this head-to-head evaluation, given that the two products essentially share the same Zodiac sign, I wanted to see how CustomGPT’s out-of-the-box RAG capabilities compare with OpenAI’s. As always, I used Tonic Validate to compare the answer quality of both RAG systems and took advantage of the native performance metrics, logging, and visualizations that the platform offers.

    Setting up the experiment

    As in our previous posts, I used a collection of 212 Paul Graham essays and evaluated each RAG system on a set of 55 benchmark questions and ground-truth answers from the corpus of text.

    In order to get started we’ll first need to take care of a few housekeeping items:

    1. Create a CustomGPT account
    2. Obtain an API key from OpenAI
    3. Create a Tonic Validate account

    You can also use this GitHub repository to grab the Jupyter notebook used for this article. In the following sections you’ll see code for exactly how to implement each step, but the Jupyter notebook contains more details and allows you to run the experiment yourself.

    CustomGPT Setup

    Setting up CustomGPT is fairly straightforward. Once your account is created on CustomGPT, you can create a CUSTOM_GPT_API_KEY. With this API key, you can then set up your project by uploading the essay file. For convenience, we’ve combined all 212 essays into a single text file.

    The get_customgpt_rag_response function does the following: it takes in a question and answer (a.k.a. a Benchmark Item, in Tonic Validate parlance) and a project so the agent knows which knowledge base to query from.
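    To make this concrete, here is a minimal sketch of what such a function could look like. The endpoint path, payload fields, and response field names below are assumptions modeled on a typical REST chatbot API, not CustomGPT's documented interface; consult CustomGPT's API docs for the real names.

```python
import json
import urllib.request

CUSTOMGPT_API_KEY = "your-api-key"  # created in the CustomGPT dashboard
BASE_URL = "https://app.customgpt.ai/api/v1"  # base URL assumed; check the docs

def build_query_payload(question: str) -> dict:
    """Build the JSON body sent to the project's conversation endpoint."""
    return {"prompt": question, "response_source": "own_content"}

def get_customgpt_rag_response(question: str, project_id: int) -> str:
    """Ask the CustomGPT project (knowledge base) a question and return the answer text."""
    req = urllib.request.Request(
        f"{BASE_URL}/projects/{project_id}/conversations/messages",
        data=json.dumps(build_query_payload(question)).encode(),
        headers={
            "Authorization": f"Bearer {CUSTOMGPT_API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Response field name is an assumption; inspect the actual payload.
    return body["data"]["openai_response"]
```

    The Tonic Validate benchmark item supplies the question; the reference answer is held back for scoring later.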

    OpenAI Setup

    I set up the OpenAI RAG assistant using the same process as in our previous installments. I’ll use the Assistants API to create an assistant, upload the combined text file, and then ask questions of the chatbot. Based on prior experience, I know that the OpenAI RAG Assistant has a limitation of working with 20 files at once, so combining all 212 essays into a single text file is a requirement. I’ll hold my comments on the practicality of this setup for production deployments for another time.

    From here, you can initialize your client by running the following:

    Spot checking the systems

    To start the comparison, let’s give both CustomGPT and OpenAI an easy question about one of the essays: "What was Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator?"

    The reference answer from our benchmark set is:

    Airbnb's financial goal for ramen profitability was to earn $4000 a month.

    OpenAI and CustomGPT both scored a 5/5. OpenAI responded with:

    Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator was $4000 a month, which included $3500 for rent and $500 for food.

    And CustomGPT answered:

    Airbnb's financial goal for ramen profitability was to earn $4000 a month.

    Both answers are correct. OpenAI provided slightly more detail and verbosity. This can be a good thing for many use cases but sometimes brevity in response is desired as well, particularly when character-based tokens seem to be the major currency of our time.

    A more thorough evaluation

    For a more thorough analysis I’ll use the Tonic Validate platform. Tonic Validate ships with 6 different metrics which you can use to evaluate the quality of an answer, its likelihood of being a hallucination, and the relevance of the RAG-provided context. Each metric requires a different set of information to be provided, and additional examples and documentation for each metric are available in their GitHub repository.

    For today, I’ll be using our Answer Similarity Score metric, which compares the RAG-provided answer with the human-provided reference answer from the benchmark set. It yields a score between 0 and 5, with 5 indicating perfect similarity to the reference answer.

    I’ll run each of the 55 questions through both RAG systems and have Tonic Validate calculate answer similarity scores for each system. This code will loop through our benchmark set and collect the LLM responses for each system:
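    Assuming the two query helpers shown earlier (the CustomGPT function and an analogous `get_openai_rag_response` for the assistant, which is a hypothetical name here), the collection loop can be sketched generically:

```python
def collect_responses(benchmark, ask_fn):
    """Run every benchmark question through a RAG system and pair each
    LLM answer with its reference answer for later scoring."""
    responses = []
    for item in benchmark:
        answer = ask_fn(item["question"])
        responses.append({
            "question": item["question"],
            "reference_answer": item["answer"],
            "llm_answer": answer,
        })
    return responses

# customgpt_responses = collect_responses(benchmark, lambda q: get_customgpt_rag_response(q, project))
# openai_responses = collect_responses(benchmark, get_openai_rag_response)
```

    The benchmark is represented here as a plain list of question/answer dicts; Tonic Validate's own Benchmark and LLMResponse objects wrap the same information.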

    Now that the LLM responses are collated, I’ll ask Tonic Validate to score both sets:

    Now, using the results of Tonic Validate’s evaluation, I can draw some comparisons about the relative performance of each system. Let’s start by looking at some aggregates and score distributions:

    A bar graph and table showing the distribution of scores for OpenAI and CustomGPT.

    A few quick observations. Both systems perform admirably, with generally high scores. However, CustomGPT wins on a few fronts. First, its aggregates are better, with a mean score of 4.4 vs OpenAI’s 3.5. Additionally, CustomGPT provided only 6 answers with a score below 4, which is really fantastic and generally better than most systems we have reviewed in the past. Compare this to OpenAI’s results, which yielded 21 answers below a 4. Of note, CustomGPT’s median score was a 5, which is not something we’ve seen before from the RAG assistants we’ve evaluated.
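    The aggregates above can be computed directly from each system's list of similarity scores, for example:

```python
import statistics

def summarize(scores):
    """Aggregate a list of 0-5 answer similarity scores into the
    statistics discussed above: mean, median, and count below 4."""
    return {
        "mean": round(statistics.mean(scores), 1),
        "median": statistics.median(scores),
        "below_4": sum(1 for s in scores if s < 4),
    }
```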

    Let’s take a deeper dive and analyze a question on which the RAG systems performed poorly. Both assistants scored a 1/5 on the question:

    What is the main constraint on the number of new startups in the US?

    The correct answer, generated by a human, is:

    The main constraint on the number of new startups in the US is the immigration policy that doesn't allow potential startup founders to enter the country.

    The answer is clear and concise: the US’s immigration policy. It’s a simple answer that can be given in a few words.

    OpenAI gave an answer of:

    The main constraint on the number of new startups in the US, as suggested by Paul Graham, is not explicitly mentioned; however, he does discuss what would happen if the number of startups increases dramatically. He points out the challenge for those whose job it is to judge new startups, specifically investors and acquirers in handling a dramatic increase in applications. Graham questions how to deal with, for example, a situation where there are 10,000 applications in a year, up from 1,000, indicating that the existing systems would need to improve to manage such growth. He suggests that finding a solution would likely involve writing some software. This implies that the ability to effectively evaluate and manage an increasing number of startup applications is a crucial factor in the growth of new startups.

    While CustomGPT said:

    The main constraint on the number of new startups in the US, according to the context provided, is the pool of potential founders. It is suggested that some people would make good founders, and others wouldn't, and that the pool size can be estimated by identifying who is relentlessly resourceful, as this is a key trait of a good founder.

    OpenAI was unfortunately not able to find an answer and instead feigned one with a verbose response drawn from the sea of context it was provided, whereas CustomGPT provided an answer that disagrees with the reference answer. The answer, by the way, can be found very easily in the essay titled “Founders Visa.” The lack of answers from both assistants is unfortunate, but on a positive note Tonic Validate correctly identified both answers as subpar, and the scores given are certainly appropriate.

    On a final note, we have noted performance issues with OpenAI’s RAG Assistant in previous posts, with slow answer times and some flakiness in API responses. Unfortunately, in my experience, this remains true today (over 3 months after the initial product release). I recommend using an exponential backoff approach when polling the assistant to check the status of a query.
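    A simple version of that backoff loop might look like the following, with a `get_status` callable standing in for the Assistants API run-status check (the status strings mirror the run states the API reports):

```python
import time

def poll_run(get_status, max_wait=120.0, base_delay=1.0, factor=2.0):
    """Poll get_status until the run reaches a terminal state, doubling
    the delay between checks so a slow or flaky API isn't hammered."""
    delay = base_delay
    waited = 0.0
    while waited < max_wait:
        status = get_status()
        if status in ("completed", "failed", "cancelled", "expired"):
            return status
        time.sleep(delay)
        waited += delay
        delay = min(delay * factor, 30.0)  # cap the delay between polls
    raise TimeoutError("assistant run did not finish in time")
```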

    Conclusion

    CustomGPT is the clear winner in this RAG face-off. It is easy to configure, just like OpenAI, but its answer quality was superior. Of course, I should mention that CustomGPT’s secret sauce is its proprietary embedding and retrieval models and prompt engineering strategies, which you get out of the box with little configuration needed. Even more important, if you choose to use the tool, you won’t need to write any of the code I used above.

    Overall, CustomGPT offers an approachable, user-friendly, no-code platform aimed at enabling non-technical users to build and deploy AI chatbots with advanced features like RAG, without needing to understand the underlying technology. I think this is going to be a powerful paradigm in the future adoption of LLM technology across industries, and I'm looking forward to watching this side of the market grow.

    P.S. Like many LLM-based applications, CustomGPT uses OpenAI GPT models under the hood, so OpenAI does deserve some credit here 🙂.

    Adam Kamor, PhD
    Co-Founder & Head of Engineering