RAG evaluation series: validating the RAG performance of OpenAI’s RAG Assistant vs Google’s Vertex Search and Conversation

Author

Adam Kamor, PhD

February 16, 2024

This is the fifth installment in a multi-part series on evaluating various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform. All the code and data used in this article is available here. We’ll be back in a bit with another comparison of more RAG tools!

I’m curious (and others likely are too) to hear how you’ve been using the tool to optimize your RAG setups! Use Tonic Validate to score your RAG and visualize experiments, then let everyone know on X (@tonicfakedata) what you’re building, which RAG system you used, and which parameters you tweaked to improve your scores. Bonus points if you also include your charts from the UI. We’ll promote the best write-up and send you some Tonic swag as well.

Introduction

Hello again! For this installment in the series, I decided to evaluate OpenAI’s RAG Assistant to see how it compares to Google’s Vertex Search and Conversation offering. You may recall that we have evaluated OpenAI’s RAG Assistant in the past, both here and here. We were curious whether it has improved over time (read ahead to find out!). We haven’t reviewed Google’s RAG offering until now, so we are excited for the head-to-head evaluation.

Setting up the experiment

Since both offerings we are evaluating today are end-to-end RAG solutions, the setup is simple. For each product, we upload the documents on which we test retrieval, generate answers via the built-in RAG system, and send the generated answers directly to Tonic Validate for scoring.
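The flow described above can be sketched roughly as follows. This is a minimal outline, not the code from our notebook: `evaluate_rag`, `get_rag_answer`, and `score_answer` are hypothetical names, and the two callables stand in for whatever client code talks to the RAG system and to the scorer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredResponse:
    question: str
    reference_answer: str
    rag_answer: str
    score: float  # answer similarity on a 0-5 scale, higher is better


def evaluate_rag(
    questions: List[str],
    reference_answers: List[str],
    get_rag_answer: Callable[[str], str],
    score_answer: Callable[[str, str, str], float],
) -> List[ScoredResponse]:
    """Ask every benchmark question and score each generated answer."""
    if len(questions) != len(reference_answers):
        raise ValueError("questions and reference_answers must have equal length")
    results = []
    for question, reference in zip(questions, reference_answers):
        # Hypothetical call into the RAG system under test (Vertex or OpenAI).
        answer = get_rag_answer(question)
        # Hypothetical scorer taking (question, reference answer, RAG answer).
        score = score_answer(question, reference, answer)
        results.append(ScoredResponse(question, reference, answer, score))
    return results
```

Because the RAG system and the scorer are passed in as plain functions, the same loop can be reused for both products by swapping `get_rag_answer`.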

Our test set consists of 212 Paul Graham essays, which you can find in our GitHub here.

For Google Vertex, you’ll need to create a new Vertex application, upload the collection of 212 Paul Graham essays to Google Cloud Storage, and create a datastore in your Vertex application that references the Google Cloud Storage bucket containing the essays.

For OpenAI, things are somewhat more convoluted. The OpenAI RAG Assistant only supports up to 20 files at a time, so, just like in our previous blog posts, we combine all 212 essays into one large text file. The text file with all of the combined essays can be found here.
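If you want to build a combined file like this yourself, something along these lines should work. It is a sketch under assumptions: the essays are taken to be UTF-8 `.txt` files in a single directory, and `combine_essays` is a hypothetical helper name, not necessarily how our combined file was produced.

```python
from pathlib import Path


def combine_essays(essay_dir: str, output_path: str) -> int:
    """Concatenate every .txt file in essay_dir into one file; return the count."""
    out = Path(output_path).resolve()
    # Sort for a deterministic order, and skip the output file itself
    # in case it lives in the same directory as the essays.
    essay_paths = sorted(
        p for p in Path(essay_dir).glob("*.txt") if p.resolve() != out
    )
    with out.open("w", encoding="utf-8") as combined:
        for path in essay_paths:
            combined.write(path.read_text(encoding="utf-8").rstrip("\n"))
            combined.write("\n\n")  # blank line between essays
    return len(essay_paths)
```

The blank line between essays is a design choice to keep essay boundaries visible in the combined text; adjust the separator if your retrieval setup chunks documents differently.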

Go check out the notebook we provide to see how both assistants are set up and initialized.

Starting with a simple question

Let’s start off by asking both assistants a simple question related to the essays. In siliconvalley.txt, Paul Graham discusses the ingredients necessary to turn a city into a tech hub, so let’s ask our assistants about it.

The question we ask is:

What key components are necessary to create a technology hub according to Paul Graham?

Google Vertex responds with:

According to Paul Graham, you only need two kinds of people to create a technology hub: rich people and nerds [1]. To create a startup hub, you need to establish a first-rate research university in a place so nice that rich people wanted to live there [4]. The town would be hospitable to both groups you need: both founders and investors [4].

OpenAI responds with:

According to Paul Graham, several key components are necessary to create a technology hub:

Personality and density in town planning: To attract the "creative class" which includes nerds who are a distinct subset, a town must have personality and not feel mass-produced, which often means avoiding large development projects and building one structure at a time. The town should have density, which is easier in older towns which were laid out before cars.
The right personality for nerds: Nerds are attracted to towns where people are generally happy. Such places often include quieter activities like cafes, used bookshops, hiking, and are places where smart people congregate.
Youth appeal and an intact center: A technology hub often needs to feel young and appeal to young nerds who start startups. This does not necessarily mean the town must be new, but rather it must not have a large, existing population of non-progressive people. Startup hubs have intact city centers as opposed to abandoned ones with growth exclusively in the suburbs.
Liberal culture and tolerance for odd ideas: High-tech cities in the U.S. are also the most liberal, not necessarily because liberals are smarter, but because liberal cities tolerate odd ideas which are essential for innovation and startups.
A great university: A great university is also an important attribute of a technology hub, as the original Silicon Valley began with a critical mass of experts in a new technology clustering around Stanford University.
Time and organic growth: Time is required for a startup community to develop organically. You cannot build a technology hub instantly; it must grow through a self-sustaining chain reaction as startups beget startups.
Avoid big government projects: A government that wants to build a "Silicon Valley" is likely doomed to fail because such hubs need to grow naturally, not be constructed on demand.

Our Analysis

Both services provide at least an OK answer that ultimately addresses the question. We contend that Google Vertex performs much better here because it gets to the right answer more directly and more concisely, i.e.

The key components to create a technology hub are rich people and nerds.

Tonic Validate gives Answer Similarity scores of 5/5 for Vertex (a perfect score) and 2/5 for OpenAI.

Vertex Performance

OK, let’s now look at how Vertex performs across a larger set of questions and reference answers. The Q&A set we use can be found here. It is a set of ~50 questions that are relevant to Paul Graham’s essays.

Below is how Vertex performed on Validate’s Answer Similarity metric.

A bar graph showing the distribution of scores for Vertex.

These are strong results, but let’s still analyze Vertex’s misses. There were three questions where Vertex scored 0/5:

What was the long-term goal of Airbnb as explained to potential investors?

Vertex misunderstood this question and instead described Airbnb’s short-term goals during its time in Y Combinator.

Why do some people believe that wealth should be distributed equally?

Here, Vertex failed to find any relevant context in the essays and responded that it could not provide an answer. This is better than hallucinating but still a disappointment because the answer can be found in the essay gap.txt.

What should be the perceived improvement in a startup's average outcome for an investment to be considered beneficial when giving up 7% equity?

The answer can be found in equity.txt. In that essay Paul Graham talks about the perceived outcomes of an investment from the POV of both the entrepreneur and the investor. The question is asking about the entrepreneur whereas the answer provided by Vertex is from the POV of the investor. I’d like to point out that it is AWESOME that Tonic Validate was able to call out this answer as being incorrect given the subtlety involved.

OpenAI Performance

Let’s move on to OpenAI’s performance. The chart below shows a few things. First, OpenAI has many more perfectly answered questions than Google Vertex (25 vs. 17). However, it also has four questions where it scored 0, compared with Vertex’s three.

A bar graph showing the distribution of scores for OpenAI.

Let’s analyze a few of the questions where OpenAI scored a 0/5.

What often happens to startups that turn down acquisition offers?
How does a wealth tax differ from income tax in terms of its application to assets or income?
In what month and year was the talk regarding Lisp for web-based applications given?

For all three of the above questions, OpenAI responded that it couldn’t find the relevant context to answer the question. As with Vertex, this is better than hallucinating, but the answers can be found in the essays googles.txt, wtax.txt, and lwba.txt, respectively.

What is the main constraint on the number of new startups in the US?

Here, the LLM answers by stating things that used to be constraints but are no longer constraints, such as lack of open source software, expensive hardware, improvements in programming languages, etc.

But it never really answers the question. According to Paul Graham, the main constraint is actually US immigration policy (see foundervisa.txt).

Final Results

Alright, now for the main event. Below, we show the distribution and summary statistics of Tonic Validate scores for OpenAI’s RAG Assistant and Google’s Vertex Search and Conversation.

A bar graph showing the distribution of Vertex vs OpenAI, and a table summarizing the statistics of these scores.

The results show that OpenAI is the winner, though it is a relatively close call: OpenAI has a mean score of 3.47 vs. Vertex’s 3.3. However, we should also point out that OpenAI’s scores are a bit more variable, with more scores at both ends of the spectrum.
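For readers who want to compute this kind of summary from their own runs, here is a small sketch. It assumes integer scores from 0 to 5, `summarize_scores` is a hypothetical helper, and the sample list is placeholder data for illustration only, not the scores from this article.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def summarize_scores(scores: List[int]) -> Dict[str, object]:
    """Return the mean, population standard deviation, and per-score counts."""
    counts = Counter(scores)
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        # One bucket per possible score, so empty buckets show up as 0.
        "distribution": {s: counts.get(s, 0) for s in range(6)},
    }


# Placeholder scores for illustration only; these are NOT the article's data.
toy_scores = [5, 5, 4, 3, 0, 5, 2, 4]
summary = summarize_scores(toy_scores)
print(summary["mean"], summary["std"], summary["distribution"])
```

Comparing the standard deviation alongside the mean is one way to see the pattern described above, where one system has a higher average but also more scores at both extremes.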

Something our evaluation doesn’t cover is throughput and the practicality of using the technology in real-world situations. Both solutions are a breeze to set up, but a few things make Vertex a more viable candidate for a production system. The first is throughput: Vertex answered all 55 questions in 1 or 2 minutes, whereas OpenAI took well over 30 minutes and had to be run multiple times because of intermittent failures. Additionally, the OpenAI RAG Assistant only supports up to 20 files, which is limiting.
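One common way to cope with intermittent failures like these is to wrap each request in a retry with exponential backoff. The sketch below is an assumption about how one might do that, not what our notebook does: `call_with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay_s, then 2x, 4x, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised without actually waiting. In practice you would likely catch only the specific transient errors your client raises rather than every `Exception`.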

Adam Kamor, PhD

Co-Founder & Head of Engineering