March RAGness

We pitted the top RAG systems against each other and used Tonic Validate to rank their performance.

RAG Leaderboard

We used Tonic Validate's answer similarity score to evaluate the response quality of each RAG system on the same set of benchmark Q&As. The metric is scored on a 0-5 scale, where 0 means no similarity to the benchmark answer and 5 means a perfect match. You can learn more about it in Tonic Validate's documentation.
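To make the 0-5 scale concrete, here is a self-contained sketch of mapping an answer/reference similarity onto that range. Note the stand-in: Tonic Validate's actual answer similarity score is computed by an LLM judge, not by the simple token-overlap (Jaccard) similarity used below, which is here only so the example runs without any API keys.

```python
# Illustrative only: Tonic Validate's real metric uses an LLM judge.
# This stand-in uses token-overlap (Jaccard) similarity purely to show
# how a similarity in [0, 1] maps onto the 0-5 scale described above.

def answer_similarity_sketch(reference: str, answer: str) -> float:
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    if not ref_tokens and not ans_tokens:
        return 5.0  # two empty answers are trivially identical
    overlap = len(ref_tokens & ans_tokens) / len(ref_tokens | ans_tokens)
    return round(5 * overlap, 2)  # scale [0, 1] similarity to 0-5

print(answer_similarity_sketch("Paris is the capital of France",
                               "The capital of France is Paris"))  # 5.0
```

Identical token sets score a perfect 5, disjoint ones score 0, and partial overlap lands in between, mirroring the "no similarity" to "perfect similarity" endpoints of the real metric.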
1. Cohere: 4.62 / 5
RAG system built using Amazon Bedrock. The vector DB and retrieval system is an Amazon Bedrock Knowledge Base using Cohere’s Embed English embedding model. The LLM is Cohere’s Command model.
cohere.com

2. Haystack: 4.40 / 5
RAG system built with Haystack’s Python package, haystack-ai, using default parameters. The embedding model is OpenAI’s ada-002. The LLM is GPT-4.
haystack.deepset.ai

3. LangChain: 4.00 / 5
RAG system built with LangChain’s Python package, langchain, using default parameters. The embedding model is OpenAI’s ada-002. The LLM is GPT-4.
langchain.com

4. LlamaIndex: 3.78 / 5
RAG system built with LlamaIndex’s Python package, llama-index, using default parameters. The embedding model is OpenAI’s ada-002. The LLM is GPT-4.
llamaindex.ai

5. OpenAI: 3.47 / 5
RAG system built using OpenAI’s Assistants API. It is a black-box RAG system, so we don’t know exactly which embedding model or LLM it uses, though presumably they are the most recent versions of OpenAI’s embedder and GPT-4. We tested it at two different times: on 11/22/23, when the API was first released, and again more recently on 2/16/24. The score here is from our most recent testing; the Assistant performed very poorly when it was first released.
openai.com

6. Vertex: 3.33 / 5
RAG system built using Vertex AI Search and Conversation. This is a black-box RAG system using a Google embedding model and LLM.
cloud.google.com/vertex-ai

7. Titan: 3.16 / 5
RAG system built using Amazon Bedrock. The vector DB and retrieval system is an Amazon Bedrock Knowledge Base using Amazon’s Titan Embeddings G1 embedding model. The LLM is Amazon’s Titan Text G1 model.
aws.amazon.com/bedrock/titan/
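The head-to-head setup behind the leaderboard (the same benchmark Q&A set run through every system, each answer scored against its reference on the 0-5 scale, and scores averaged per system) can be sketched generically as below. The system callables and the exact-match scorer are hypothetical stand-ins for illustration, not Tonic Validate's API or the scorer we actually used.

```python
from statistics import mean
from typing import Callable

def rank_rag_systems(
    systems: dict[str, Callable[[str], str]],  # name -> question-answering fn
    benchmark: list[tuple[str, str]],          # (question, reference answer) pairs
    score: Callable[[str, str], float],        # (reference, answer) -> 0-5 score
) -> list[tuple[str, float]]:
    """Run every system over the same benchmark and rank by mean score."""
    results = {
        name: mean(score(ref, ask(q)) for q, ref in benchmark)
        for name, ask in systems.items()
    }
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Toy demo with stub systems and a trivial exact-match scorer.
benchmark = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
systems = {
    "always_right": lambda q: {"What is 2+2?": "4",
                               "Capital of France?": "Paris"}[q],
    "always_wrong": lambda q: "no idea",
}
exact = lambda ref, ans: 5.0 if ref == ans else 0.0
print(rank_rag_systems(systems, benchmark, exact))
# [('always_right', 5.0), ('always_wrong', 0.0)]
```

Keeping the benchmark and scorer fixed while only the system under test varies is what makes the resulting averages comparable across vendors.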

Optimize your RAG stack today with Tonic Validate

Build, measure, iterate, and monitor system performance automatically in production.
About Tonic Validate