RAG evaluation series: validating the RAG performance of Amazon Titan vs Cohere using Amazon Bedrock

Author

Joe Ferrara, PhD

February 9, 2024

This is the fourth installment in a multi-part series on evaluating various RAG systems using Tonic Validate, a RAG evaluation and benchmarking platform. All the code and data used in this article are available here. We’ll be back in a bit with another comparison of more RAG tools!

I’m curious (and others likely are too) to hear how you’ve been using the tool to optimize your RAG setups! Use Tonic Validate to score your RAG system and visualize experiments, then post on X (@tonicfakedata) about what you’re building, which RAG system you used, and which parameters you tweaked to improve your scores. Bonus points if you also include your charts from the UI. We’ll promote the best write-up and send you some Tonic swag as well.

Introduction

Hello again! In this series, we’ve focused heavily on young companies building RAG tooling, but the big cloud providers also offer a host of RAG product suites. So, for this installment, I decided to evaluate Amazon Bedrock to see how some of its offerings perform at RAG. Amazon Bedrock has base models in the modalities of text, embedding, and image that anyone can use to build AI applications. For RAG specifically, I’ll be looking at the text and embedding models. Bedrock has text models from Anthropic, Cohere, AI21 Labs, Meta, and Amazon, as well as image models from Stability AI and embedding models from Amazon and Cohere. As you can see, there are a lot of models to choose from when building a RAG application on Bedrock. For this post, I’ll use Amazon Bedrock to compare, head to head, a RAG system built on Amazon’s Titan models with a RAG system built on Cohere’s models.

Setting up the experiment

Setting up the experiment consists of three main steps:

1. Setting up a Knowledge Base in Amazon Bedrock to serve as the vector db for our RAG system.
2. Choosing a text model in Amazon Bedrock to serve as the LLM in our RAG system.
3. Writing the RAG system in Python to take user questions and use the Knowledge Base and LLM from Amazon Bedrock to answer the user questions.

In the following sections you’ll see code for exactly how to implement each step. For now, I’ll summarize what each step consists of and show the code for a base class for implementing a simple RAG system.

To set up a Knowledge Base in Amazon Bedrock, you choose an S3 folder containing your data and the text embedding model you’d like to use. Bedrock handles chunking, embedding, and storing the data from the S3 folder. Given a user query, you retrieve the relevant context from the Knowledge Base through an AWS API; the Knowledge Base handles the retrieval process itself. A good tutorial for setting up a Knowledge Base in Amazon Bedrock is found here. As usual, I used the collection of 212 Paul Graham essays that has been used in the previous posts in this RAG evaluation series. You can read more about how this dataset is prepared here.

Choosing a text model to serve as the LLM is easy: you just decide which one you want to use from the list of models in the AWS console. You interact with the chosen text model through the AWS API via the name of the model (you’ll see specifics of this below). Not all models are created equal, so it may be prudent to try a couple of them and use Tonic Validate to understand the impact.

Writing the RAG system in Python consists of determining the logic for how to take a user question, retrieve the relevant context for the question from the Knowledge Base, and prompt the LLM with the question and the retrieved context to answer the question. For this purpose, we used a simple abstract base class in Python.

The knowledge_base_id parameter is used to specify which knowledge base is used as the vector db. The get_response_from_model function is where the text model specific API logic is included for calling the LLM. The boto3 Python package is used to call the AWS APIs. The RAG system itself is simple and just meant to be used to test how well the embedding model and LLM do at answering each of the questions in my test set of questions about the Paul Graham essays.
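To make the structure concrete, here is a minimal sketch of what such a base class might look like. It is not the exact code from the experiment. The names knowledge_base_id, retrieve_context, and get_response_from_model come from the description above; the class name BedrockRAG, the build_prompt and answer helpers, the prompt wording, and the optional client arguments (added so the class can be exercised with stub clients instead of live AWS calls) are my own assumptions.

```python
from abc import ABC, abstractmethod


class BedrockRAG(ABC):
    """Sketch of a simple RAG base class backed by a Bedrock Knowledge Base."""

    def __init__(self, knowledge_base_id, agent_client=None, runtime_client=None):
        self.knowledge_base_id = knowledge_base_id
        if agent_client is None or runtime_client is None:
            # Imported lazily so stub clients can be used without boto3 or AWS access.
            import boto3

            agent_client = agent_client or boto3.client("bedrock-agent-runtime")
            runtime_client = runtime_client or boto3.client("bedrock-runtime")
        self.agent_client = agent_client
        self.runtime_client = runtime_client

    def retrieve_context(self, query, num_results=5):
        """Ask the Knowledge Base for the chunks most relevant to the query."""
        response = self.agent_client.retrieve(
            knowledgeBaseId=self.knowledge_base_id,
            retrievalQuery={"text": query},
            retrievalConfiguration={
                "vectorSearchConfiguration": {"numberOfResults": num_results}
            },
        )
        return [r["content"]["text"] for r in response["retrievalResults"]]

    def build_prompt(self, query, context_chunks):
        """Combine the retrieved chunks and the question (hypothetical wording)."""
        context = "\n\n".join(context_chunks)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )

    @abstractmethod
    def get_response_from_model(self, prompt):
        """Model-specific call to the LLM; implemented by each subclass."""

    def answer(self, query):
        """Retrieve context, build the prompt, and return the LLM's answer."""
        chunks = self.retrieve_context(query)
        return self.get_response_from_model(self.build_prompt(query, chunks))
```

A subclass only needs to implement get_response_from_model, which keeps the retrieval and prompting logic identical across the systems being compared.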

Preparing Amazon Titan RAG

For our Amazon Titan RAG system, we used the Titan Embeddings G1 - Text model as the embedding model and the Titan Text G1 - Express model as the LLM. The Knowledge Base is configured in the AWS Console UI and called using the retrieve_context function from the base class. The LLM is called through a Titan-specific implementation of the base class’s get_response_from_model function.
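As a rough sketch of what the Titan call could look like (not the exact code from the experiment), the function below sends a prompt to Titan Text G1 - Express through the Bedrock runtime's invoke_model API and extracts the generated text. The function names, the generation settings, and the optional runtime_client argument are assumptions of mine, and the model ID is the one I believe corresponds to Titan Text G1 - Express, so check the Bedrock console for the identifier available in your region.

```python
import json

# Assumed model ID for Titan Text G1 - Express; verify it in the Bedrock console.
TITAN_MODEL_ID = "amazon.titan-text-express-v1"


def titan_request_body(prompt, max_tokens=512, temperature=0.0):
    """Build the JSON request body in the shape Titan text models expect."""
    return json.dumps(
        {
            "inputText": prompt,
            "textGenerationConfig": {
                "maxTokenCount": max_tokens,
                "temperature": temperature,
                "topP": 1.0,
                "stopSequences": [],
            },
        }
    )


def get_titan_response(prompt, runtime_client=None):
    """Send the prompt to Titan and return the generated text."""
    if runtime_client is None:
        # Imported lazily so a stub client can be passed in for testing.
        import boto3

        runtime_client = boto3.client("bedrock-runtime")
    response = runtime_client.invoke_model(
        modelId=TITAN_MODEL_ID,
        body=titan_request_body(prompt),
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream containing JSON with a "results" list.
    payload = json.loads(response["body"].read())
    return payload["results"][0]["outputText"].strip()
```

In a subclass of the base class, get_response_from_model would simply delegate to a call like this one.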

Preparing Cohere RAG

The Cohere RAG system is set up the same way as the Titan one. In this case, the Cohere Embed English model is used as the embedding model and the Cohere Command model is used as the LLM. The RAG base class is implemented with calls to the Command model in place of the Titan calls.
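The main difference from Titan is the shape of the request and response JSON. The sketch below is my own illustration of calling Cohere Command through invoke_model, not the code from the experiment. The function name, the generation settings, and the optional runtime_client argument are assumptions, and the model ID is the one I believe corresponds to Cohere Command on Bedrock.

```python
import json

# Assumed model ID for Cohere Command; verify it in the Bedrock console.
COHERE_MODEL_ID = "cohere.command-text-v14"


def get_cohere_response(prompt, runtime_client=None, max_tokens=512, temperature=0.0):
    """Send the prompt to Cohere Command and return the generated text."""
    if runtime_client is None:
        # Imported lazily so a stub client can be passed in for testing.
        import boto3

        runtime_client = boto3.client("bedrock-runtime")
    # Command takes a flat body with "prompt" rather than Titan's "inputText".
    body = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    )
    response = runtime_client.invoke_model(
        modelId=COHERE_MODEL_ID,
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    # Command returns its text under "generations" instead of "results".
    payload = json.loads(response["body"].read())
    return payload["generations"][0]["text"].strip()
```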

An easy lob to start things off

To start the comparison, let’s give both Amazon Titan and Cohere an easy question about one of the essays: "What was Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator?"

This goal was set as the company aimed to make just enough money to pay the founders' living expenses. This threshold, as Paul Graham notes, is important for startups because it is the point at which a company can operate independently of investors.

The Titan RAG system responded to this question with the answer: $4000.

The Cohere RAG system responded to this question with the answer: Airbnb's monthly financial goal to achieve ramen profitability during their time at Y Combinator was $3500 for rent, and $500 for food, totaling $4,000 per month.

Both of these answers are correct, though Cohere’s provides much more information. This foreshadows the results we will see later on.

Evaluating the RAG systems

To run a more thorough analysis on these systems, I am going to use Tonic Validate’s Python SDK, which provides an easy way to score RAG systems based on various metrics (you can read more about these in the GitHub repo). In our case, we are going to use the answer similarity score, which scores how similar the LLM’s answer is to the correct answer for a given question. For running Tonic Validate, I created a benchmark of 55 question-answer pairs from a random selection of 30 Paul Graham essays. I can then run both RAG systems through all the questions, collect the RAG-facilitated LLM responses, and pass both the LLM’s answers and the ideal answers from the benchmark set to Tonic Validate. Using this data, Tonic Validate will automatically score the LLM responses, giving me a quantitative idea of how each RAG system is performing.

To get started with this, I ran code that loads the questions and gathers the LLM’s answers using both RAG systems.
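A minimal sketch of that gathering step might look like the following. It assumes the benchmark is stored as a JSON list of objects with "question" and "answer" keys, which is a hypothetical format, and that each RAG system is exposed as a function from question to answer. The function names load_benchmark and gather_answers are also my own.

```python
import json


def load_benchmark(path):
    """Load question-answer pairs from a JSON file (hypothetical format)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def gather_answers(benchmark, rag_systems):
    """Run every benchmark question through every RAG system.

    rag_systems maps a system name to a callable that takes a question
    string and returns that system's answer string.
    """
    results = {name: [] for name in rag_systems}
    for item in benchmark:
        for name, ask in rag_systems.items():
            results[name].append(
                {
                    "question": item["question"],
                    "reference_answer": item["answer"],
                    "llm_answer": ask(item["question"]),
                }
            )
    return results
```

Keeping the reference answer next to each LLM answer makes it straightforward to hand both to a scorer afterwards.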

After the LLM’s answers are stored, I can pass them to Tonic Validate to score them.

Once Tonic Validate finished processing, I observed the following results:

A bar graph and table displaying the distribution of scores for Amazon Titan and Cohere.

Across the board, Cohere performed better, although Amazon Titan’s performance was also strong (especially considering the low scores of competitor systems like OpenAI Assistants). With a higher average and a higher minimum answer similarity score, Cohere’s RAG system provided a correct (or close to correct) answer more often than Amazon Titan’s system. Cohere’s lower standard deviation also means that its response quality was more consistent across the 55 questions I ran through both systems. The results are promising, but I encourage you to use Tonic Validate and replicate the experiment using your own models and benchmarks.
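If you want to make the same comparison on your own runs, the summary statistics discussed above (average, minimum, and standard deviation) can be computed from the per-question scores with Python's standard library. This is a small sketch under my own assumptions: the function names are hypothetical, and it expects one list of numeric scores per system with at least two scores in each.

```python
import statistics


def summarize_scores(scores):
    """Return the mean, minimum, maximum, and sample standard deviation."""
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.stdev(scores),  # requires at least two scores
    }


def compare_systems(scores_by_system):
    """Summarize each system's scores, keyed by system name."""
    return {
        name: summarize_scores(scores)
        for name, scores in scores_by_system.items()
    }
```

A higher mean and minimum with a lower standard deviation indicates a system that is both more accurate and more consistent.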

Conclusion

While both systems performed well, Cohere is the winner here, performing better overall than Amazon Titan. It was super easy to set up these RAG systems using Knowledge Bases and the models provided by Amazon Bedrock. Amazon Bedrock’s Knowledge Bases completely manage chunking, embedding, storing, and retrieving your data for the retrieval portion of RAG. Amazon Bedrock also provides a plethora of models to choose from as the LLM in your RAG system. I can’t wait to try the other ones out!

All the code and data used in this article are available here. I’m curious to hear your take on Amazon Bedrock, building RAG systems, and Tonic Validate! Reach out to me at joeferrara@tonic.ai to chat more.

Joe Ferrara, PhD

Staff AI Scientist

Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.