Technical deep dive

RAG evaluation series: validating OpenAI Assistant’s RAG performance

Author

Ethan P

November 15, 2023

Updated 1/29/24: Since the publication of this article, we have released a new version of the tvalmetrics SDK called tonic-validate. The new SDK includes several improvements that make it easier to use. Because tvalmetrics has been discontinued in favor of the new SDK, some parts of this article are no longer up to date. For details about the new SDK, visit our GitHub page here.

This is the first of a multi-part series I will be doing to evaluate various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform, and the open source tool tvalmetrics. All the code and data used in this article is available here. I’ll be back shortly with a comparison of other RAG tools, such as LlamaIndex, Vectara, LangChain, and more!

Disclaimer

I conducted this analysis between 11-13-2023 and 11-14-2023, and the results described below are based on the Assistants API versions active on those dates. During my analysis, the performance issues briefly improved for a 10-minute window on 11-14-2023, and I got the following results:

A table displaying the analysis results from a 10 minute timespan on 11/14/2023

However, after I got those results, performance dropped back to the numbers shown in this article. It seems that OpenAI has brief periods during which its RAG system works properly on multiple documents, but in my testing I got good results only that one time.

Introduction

Like many other developers, I was excited when OpenAI announced GPTs and the Assistants API on Dev Day, particularly the promise of building a retrieval-augmented generation (RAG) system directly into OpenAI’s platform.

For those new to the space, RAG is a framework for improving the quality of LLM responses by supplying the model with more up-to-date and contextually relevant information. For instance, if you ask an LLM, "What was Airbnb's goal during YCombinator?", the RAG system might retrieve a piece of text like "Airbnb's goal was $4,000 a month." This text is then included in the prompt sent to the LLM (here, ChatGPT) to help it answer the original question. In this case, the RAG system changes the prompt to "What was Airbnb's goal during YCombinator? Here is some context that might help: 'Airbnb's goal was $4,000 a month'." This method grounds ChatGPT’s responses in your data or documents, which improves their accuracy.
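
The prompt rewriting described above can be sketched in a few lines of Python. This is an illustration only, not how OpenAI implements retrieval: `retrieve` and `augment_prompt` are hypothetical names, and the word-overlap ranking stands in for the embedding-based search that real RAG systems typically use.

```python
def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever (assumption): rank documents by words shared with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment_prompt(question: str, contexts: list[str]) -> str:
    """Append the retrieved passages to the question before it is sent to the LLM."""
    if not contexts:
        return question
    quoted = " ".join(f"'{context}'" for context in contexts)
    return f"{question} Here is some context that might help: {quoted}"


documents = [
    "Airbnb's goal was $4,000 a month.",
    "Lisp is a programming language.",
]
question = "What was Airbnb's goal during YCombinator?"
prompt = augment_prompt(question, retrieve(question, documents))
```

Here `prompt` ends up containing the original question followed by the Airbnb passage, which is the text the LLM would actually receive.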

Historically, to build a RAG system, there were few out-of-the-box solutions available; you could do it by hand (not a good option at scale) or use an open-source library such as LlamaIndex, which integrates with various data sources like Notion or Google Docs. Either approach requires a significant amount of effort, and in the case of open-source libraries you would have to maintain the RAG pipeline itself and a separate vector database to hold all the contextual information needed to run the RAG pipeline. If OpenAI can provide an out-of-the-box RAG system that works well at scale, then the barrier to entry to build a use case-specific GPT without having to maintain your own infrastructure is drastically lowered. Instead, you upload your documents, and that's it. As a result, I became very curious about how OpenAI's RAG solution worked and if it could serve as a native replacement to other open-source RAG systems like LlamaIndex.

To benchmark the performance of OpenAI’s RAG system, I used the Tonic Validate Metrics (tvalmetrics) Python package. Tonic Validate Metrics provides fundamental RAG metrics and an evaluation framework for experimenting with RAG applications. You can read more about tvalmetrics and try it out yourself here.

Using OpenAI’s RAG

There are two ways to access OpenAI's RAG system. The first is to use GPTs in ChatGPT Plus. GPTs are customized assistants that can do things such as browse the web or answer questions about files you provide to the GPT (i.e., RAG). This is likely how most consumers will interact with OpenAI's RAG system since it's built directly into ChatGPT. However, anyone building their own LLM-powered app is more likely to use OpenAI Assistants. Assistants are similar to GPTs but are accessed via the Assistants API, which lets developers integrate the technology directly into their own applications. Because I want to see whether OpenAI can replace an existing RAG system in a custom app, I will focus my analysis on the Assistants API.

To get started with assistants, you first need to set up the files and the assistant itself. We will use a collection of Paul Graham's essays to test the RAG system, which you can find here. I used the following code to set up the assistant with these files:

This code uploads the Paul Graham essay files to OpenAI, creates the assistant with the files, and primes the Assistant with a system prompt that tells it how to answer questions. The assistant uses GPT-4 Turbo as OpenAI’s RAG tool requires Turbo for usage (source). However, in the process of running this code, I encountered my first issue:

Apparently, OpenAI has set a maximum of 20 files per assistant, with a further limit of 512 MB per file.

This limitation poses a problem for anyone wanting to use OpenAI's Assistant RAG in production. A sufficiently large amount of data would not be compatible with their RAG system given these API limits. One workaround is to start combining multiple files, although you would eventually hit the 512 MB limit if you had enough data (which is likely given the scale of data in the era of AI). This makes it impractical to use in comparison to a tool like LlamaIndex, which has no limits on file counts or size.

Despite this limitation, I was able to continue the test by combining the files as previously mentioned. To do so, I wrote the following code:

This code divides the original files into five groups, puts them together, and appends each group to its own file, resulting in 5 files containing all the data from the original 212 files. Why only 5 files when the limit is 20? The reason is that during testing I discovered that the more files you have, the harder it is for the assistant to find the appropriate answers in the data pertaining to your question/prompt. As a result, I decided to limit it to 5 files to give the assistant a better shot at performing well. With 20 files, the Assistant API was almost never able to respond correctly. Additionally, the API failure rates for OpenAI with 20 files were too high for me to run a full analysis for 20 files.
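
The original code is not reproduced here, but the grouping step it describes can be sketched roughly as follows. The function names, the contiguous split, and the blank-line separator are my assumptions; the original may have grouped or joined the files differently.

```python
def split_into_groups(items: list[str], n_groups: int) -> list[list[str]]:
    """Split items into n_groups contiguous groups whose sizes differ by at most one."""
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    base, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for index in range(n_groups):
        # The first `extra` groups each take one additional item.
        size = base + (1 if index < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


def combine_essays(essays: list[str], n_groups: int = 5, separator: str = "\n\n") -> list[str]:
    """Join each group of essay texts into one combined document."""
    return [separator.join(group) for group in split_into_groups(essays, n_groups)]


# Placeholder texts standing in for the 212 essays.
essays = [f"essay {i}" for i in range(212)]
group_sizes = [len(group) for group in split_into_groups(essays, 5)]
combined = combine_essays(essays)
```

With 212 essays and 5 groups, two groups hold 43 essays and three hold 42. Each string in `combined` would then be written to its own file before upload.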

After creating the test set, I configured the Assistant again with the new test set:

Now, we can start testing!

Testing the Assistant API

Preparing the experiment

To evaluate OpenAI's assistant, we will use a series of question-answer pairs relevant to 30 random essays from the 212 original essays. The reasoning behind selecting questions from only a portion of the essays is that in the real world, a user will only ask questions from a small percentage of the documents indexed. It's unlikely a user will ask a question about every single document indexed by a RAG system. To generate these questions, I tasked GPT-4 Turbo to read over the 30 documents and generate a series of 3 question-answer pairs about each document. I then exported these question-answer pairs to JSON and cleaned them up to fix some bad question-answer pairs generated by the LLM. The question-answer pairs used for my testing can be accessed here. To load the question-answer pairs, I executed the following code:
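
As a rough sketch of the loading step, assuming the exported JSON is a list of objects with `question` and `answer` fields (the real file's schema may differ, and `QAPair` and `load_qa_pairs` are hypothetical names):

```python
import json
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


def load_qa_pairs(raw_json: str) -> list[QAPair]:
    """Parse question-answer pairs, skipping records with a missing or empty field."""
    pairs = []
    for record in json.loads(raw_json):
        question = record.get("question", "").strip()
        answer = record.get("answer", "").strip()
        if question and answer:
            pairs.append(QAPair(question, answer))
    return pairs


# Made-up records that only illustrate the assumed schema.
raw = '[{"question": "Q1?", "answer": "A1"}, {"question": " ", "answer": "A2"}, {"question": "Q3?"}]'
pairs = load_qa_pairs(raw)
```

Dropping incomplete records is only a stand-in for the manual cleanup described above. At 3 pairs for each of the 30 essays, the generated set starts at 90 pairs before any cleanup.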

To evaluate the performance of the assistant, I set up two things: a function called get_response, which collates the answers from GPT-4 Turbo during testing, and tvalmetrics, an open-source library we developed to measure the performance of RAG systems:

In this configuration, tvalmetrics will only score how similar the LLM answer is to the ideal answer. It does this by using GPT-4 Turbo to compare the answers and assign a score from 0 to 5, with 0 being no similarity and 5 being perfect similarity.

Spot checking the setup

Once I had everything set up for evaluation, I tried a quick spot check to see how the Assistants API performed before running the full benchmark:

This resulted in the following answer:

Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator was to have revenues of only $3000 a month【15†source】.

However, this answer is completely incorrect. If we look at the relevant section of the text from the original source, it says:

For the Airbnbs, ramen profitability was $4,000 a month: $3,500 for rent, and $500 for food.

Further, the Assistants API says there were 15 sources for this when the text only contains one mention of the answer. If you check the context that OpenAI provides, it doesn’t contain 15 sources. In general, it seems problematic to cite so many sources when the answer only exists in one source.

For good practice, I tried the same question again. Using the same question and data, the RAG system responded correctly with the following:

Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator was $4,000 a month. This amount included $3,500 for rent and $500 for food.

Oddly, it didn't include the source count this time, but responded correctly. In general, I've noticed that the response will include the source count about 50% of the time. On my last attempt, I got the following:

The search did not return any specific information about Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator. It's possible that the details of their financial goals are not covered in the essays we have. If you have any other questions regarding Paul Graham's essays, feel free to ask.

Evidently, the major issue here is the variability of the responses: the RAG system alternates between correct answers, incorrect answers, and no answer at all. This would be expected if the data were changing or the questions were different, but it happened with exactly the same question and exactly the same data. The fundamental purpose of RAG is to make LLM responses more accurate by grounding them in data from your documents. In theory, with RAG, you should get relatively consistent responses because the same question should generally be matched with the same documents each time. However, this isn't the case with OpenAI's assistants. Instead, the responses are so variable that they defeat the accuracy benefits of RAG.

Evaluating the RAG system

The spot check has already revealed some major problems, but let's see how the Assistants API performs when we run through the full set of question-answer pairs. To do this, I ran the following code.

The approach I used might look odd to you because it goes through a section of the question_list instead of the full list. This is because of yet another problem with OpenAI's RAG system: reliability. The RAG system will often freeze up and never return an answer. This will happen multiple times in a row for a long period where it exceeds the maximum number of exceptions. For context, in get_response, I allow 3 minutes to get an answer (seems reasonable, if not generous), and then in the code above, I allow 3 retries for a total of 9 minutes to get an answer. Despite taking a long time to get a single response, OpenAI's API often fails to return any response. At that point, the code will fail, and I wait until later to resume where the code left off by iterating over the remaining questions using question_list[len(openai_responses):]. I chose not to increase the exception count to allow more tries because often the issue won't resolve itself in a timely manner. To me, the fact that the API would fail every few minutes for hours on end made it almost completely unusable.
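
The retry-and-resume pattern described above can be sketched as follows. This is not the original code: `collect_responses` and `max_attempts` are hypothetical names, and `get_response` is passed in as a plain callable so the sketch runs without calling OpenAI.

```python
from typing import Callable


def collect_responses(
    question_list: list[str],
    openai_responses: list[str],
    get_response: Callable[[str], str],
    max_attempts: int = 3,
) -> None:
    """Answer the remaining questions in order; stop once a question exhausts its attempts."""
    # Resume where the previous run stopped: one stored response per answered question.
    for question in question_list[len(openai_responses):]:
        last_error = None
        for _ in range(max_attempts):
            try:
                openai_responses.append(get_response(question))
                break
            except Exception as error:  # e.g. a timeout raised inside get_response
                last_error = error
        else:
            # Every attempt failed: give up so the run can be resumed later.
            raise RuntimeError(f"no response for {question!r}") from last_error
```

Because responses are appended in question order, calling `collect_responses` again with the same two lists picks up at the first unanswered question instead of starting over.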

After hours of trying to get all the responses from OpenAI that I needed for the benchmark, I finally had them all and ran the following code to benchmark the response quality using tvalmetrics.

Table of Answer Similarity Score results

Disappointingly, the mean similarity score was 2.41, and the median was 2.0. This is poor performance. The main reason so many responses scored 2.0 is that they mentioned the question in the response, which kept them from being scored 0 for no similarity. In the cases where the Assistant does return an answer, the answer is often incorrect. Only a small percentage of the questions are both answered and answered correctly. This is a major problem that significantly limits the RAG system's usefulness. It indicates that in a production system, the OpenAI Assistants API will almost never return a correct answer to your queries, which makes it unsuitable for any use case.
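
To make the relationship between the mean and the median concrete, here is a small sketch that summarizes a list of 0 to 5 similarity scores. The scores in it are made up for illustration and are not the benchmark data; `summarize_scores` is a hypothetical helper, not part of tvalmetrics.

```python
from collections import Counter
from statistics import mean, median


def summarize_scores(scores: list[float]) -> dict:
    """Return the mean, median, and per-score counts for 0 to 5 similarity scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(not 0 <= score <= 5 for score in scores):
        raise ValueError("scores must be between 0 and 5")
    return {
        "mean": round(mean(scores), 2),
        "median": median(scores),
        "distribution": dict(sorted(Counter(scores).items())),
    }


# Illustrative scores only: a cluster of 2s plus one perfect answer.
summary = summarize_scores([0, 2, 2, 2, 5])
```

A cluster of 2s pins the median at 2 while a few perfect scores pull the mean above it, which is the same shape as a mean of 2.41 against a median of 2.0.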

One last attempt…

As a final attempt, instead of splitting the 212 essays across 5 documents, I put them all into one document and discovered something interesting: a dramatic performance increase. Using the same spot-check question, OpenAI finally responds correctly on a consistent basis with the following:

During their time at Y Combinator, Airbnb's monthly financial goal to achieve ramen profitability was $4,000 a month. This amount included $3,500 for rent and $500 for food【7†source】.

The source count is still wrong and OpenAI only actually cites one source in the list provided, so it seems there’s some hallucination there, but at least the response is correct. In terms of scores from tvalmetrics, I got the following:

Distribution of scores for a single document

This is dramatically better than before, with a mean similarity score of 4.16 (median, 5.0). Another hidden benefit when you use a single document is that the reliability dramatically increases. I no longer experienced random crashes while running tests and, in general, tests ran dramatically faster (minutes vs hours).

My hypothesis is that OpenAI’s system simply cannot handle RAG across multiple documents well. Once you put everything into a single document, performance increases dramatically because the RAG can actually work properly. I did see a mention in OpenAI’s docs that for some documents, they put the whole document into the context window instead of running RAG, which might explain the better performance. But when I counted the tokens in the combined document, it came to over 600k tokens, which is considerably over the 128k token limit. From that, we can conclude that OpenAI is indeed running RAG here and that, with a single document, the performance is actually quite good. The catch is that a single combined document does not fit well within their API limitations or the probable use cases for the technology.
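
The arithmetic behind that conclusion is simple enough to sketch. The function names below are hypothetical, and 600,000 is used as a round lower bound for the "over 600k" token count.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # the 128k token limit cited above


def fits_in_context(document_tokens: int, context_window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """True if the whole document could be placed in the prompt without retrieval."""
    return document_tokens <= context_window


def context_windows_needed(document_tokens: int, context_window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Number of full context windows needed to hold the document (ceiling division)."""
    return -(-document_tokens // context_window)


combined_document_tokens = 600_000  # lower bound; the real count was higher
fits = fits_in_context(combined_document_tokens)
windows = context_windows_needed(combined_document_tokens)
```

At 600k tokens the combined document does not fit and would fill at least five 128k windows, so the model cannot be reading the whole file on every query.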

Conclusion

While OpenAI's RAG system seems promising, its limitations with multiple documents significantly decrease its usefulness. Most people will likely run their RAG system on a variable set of documents to maximize the data provided to the LLM. While they can stuff all their data into a single document, that’s a hack and shouldn’t be a requirement of a well-performing RAG system. Additionally, the API's failure rate made it nearly impossible to use across multiple files. Almost invariably, I would get an API failure every few minutes while using the API with multiple documents, which is not acceptable for production use. That, coupled with the 20-file limit, makes me hesitant to recommend that anyone replace their existing RAG pipeline with OpenAI's RAG anytime soon. However, OpenAI has room to improve. While running some spot checks on the RAG system for GPTs, I noticed that performance was much better on multiple documents. The bad performance is limited to the Assistants API itself. If OpenAI brought the Assistants API's quality up to that of GPTs and removed the file limit, then I could see companies migrating from LlamaIndex, provided they are willing to give up some of the customizability that systems like LlamaIndex offer in favor of a managed, out-of-the-box RAG solution with lower maintenance overhead. Until that day comes, I don’t recommend migrating to OpenAI's Assistants for RAG.

All the code and data used in this article is available here. I’m curious to hear your take on OpenAI Assistants, GPTs, and tvalmetrics! Reach out to me at ethanp@tonic.ai

Ethan P

Software Engineer

Ethan is a software engineer at Tonic AI. He has a degree in Computer Science with a minor in Math from NYU. He has experience working on machine learning systems for robotics and finance. He also founded a not for profit that builds affordable space satellites.
