Retrieval-augmented generation, aka RAG, is a technique for augmenting an LLM’s capabilities via a document store that is separate from the LLM itself. Think of it as a kitchen sink of all the text data in your organization—your emails, Slack channels, Notion documents, policies, etc.—made available to improve an LLM’s performance.
If you’re familiar with Tonic (a 6-year-old generative AI player in the data synthesis space), you might be asking yourself why we’re talking about RAG right now. In the spirit of safely unlocking an organization’s data to maximize its use in software development and testing, we’ve been looking into how teams can safely implement LLMs to answer targeted questions using the multitudes of data organizations produce every day. In other words, we’ve been having some fun playing around with RAG.
We’ve found RAG to be the cheat code to leveraging LLMs to get responses specific to your organization, by linking your prompts to the data they need. In this article, we’ll share everything we’ve learned about this method, from how it works to how it differs from alternatives like fine-tuning, as well as the benefits of RAG. Our goal? Helping you leverage RAG to speedrun your LLM development.
How does RAG work?
RAG was introduced in the 2020 paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks and has exploded in popularity as LLM capabilities have accelerated. The method uses an LLM as part of a larger system that is integrated with a document store, but does not change the LLM itself.
RAG works behind the scenes of your LLM queries. Given a user’s prompt, RAG consists of the following steps:
- The documents with relevant contextual information for answering the user’s prompt are retrieved from the document store (steps 3 and 4 in the above diagram).
- The user’s prompt is augmented with the relevant contextual information and sent to the LLM (step 5 in the above diagram).
- The LLM generates a response that is sent back to the user (steps 6 and 7 in the above diagram).
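The steps above can be sketched end to end in a few lines. Everything here is illustrative: the toy document store, the keyword-overlap retriever, and the stubbed-out `generate` function (which in a real system would be a call to your LLM provider) are all assumptions, not part of any specific library.

```python
# Minimal sketch of the RAG request flow: retrieve -> augment -> generate.
# The document store and retrieval logic are deliberately toy-sized.

DOCUMENT_STORE = {
    "pto-policy": "Employees accrue 1.5 days of PTO per month.",
    "wfh-policy": "Remote work is allowed up to three days per week.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 1: rank stored documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        DOCUMENT_STORE.values(),
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, context: list[str]) -> str:
    """Step 2: prepend the retrieved context to the user's prompt."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Step 3: placeholder for the LLM call (e.g., a chat-completions API)."""
    return f"[LLM response grounded in a {len(prompt)}-character prompt]"

prompt = augment("How many PTO days do employees accrue?",
                 retrieve("PTO days accrue"))
print(generate(prompt))
```

Production retrievers use semantic vector search rather than keyword overlap, but the control flow is the same.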
The retrieval process is what makes RAG unique. In the above diagram, the documents repository represents the vector database that stores the semantic information from the documents in the document store. Each document in the document store is chunked into pieces and converted to vectors that capture the semantic meaning of each chunk. The vectors, along with the metadata for each chunk, are stored in this vector database. A similarity-search algorithm embeds the user’s query and retrieves the text chunks from the vector database most relevant to answering it. From there, a variety of methods are used to augment the query and construct a robust prompt to pass to the LLM.
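The chunk-embed-retrieve loop might look something like the sketch below. Note the big assumption: the “embedding” here is a toy bag-of-words vector so the example stays self-contained; real systems use a learned embedding model (e.g., OpenAI’s text-embedding models or sentence-transformers) to capture semantic meaning rather than exact word matches.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 8) -> list[str]:
    """Split a document into fixed-size word chunks (real systems often
    chunk by tokens, sentences, or semantic boundaries instead)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# The "vector database": each chunk stored with its vector and metadata.
doc = ("Our refund policy allows returns within 30 days. "
       "Shipping is free for orders over 50 dollars.")
index = [{"chunk": c, "vector": embed(c), "source": "faq.md"} for c in chunk(doc)]

query = embed("how many days do I have to return an item")
best = max(index, key=lambda row: cosine(query, row["vector"]))
print(best["chunk"])
```

The same structure—chunks, vectors, and metadata queried by similarity—is what dedicated vector databases provide at scale.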
The generated response you get from the LLM is more precise, as it draws on your specific data rather than only the general knowledge a foundation LLM was trained on.
Data sources for the document store can be any of the following:
- Google docs
- Word docs
- Slack channels
- Notion documents
- …Basically anything that stores free-text data
Constructing a chatbot to converse with these documents is a versatile use case for RAG. For example, RAG can be used on your organization’s internal data and documentation to get quick answers to company-specific questions. Since, obviously (hopefully), your company’s data was not used to train any of OpenAI’s models or other publicly available LLMs, they wouldn’t be able to provide sufficient answers to company-specific queries. Using RAG, employees can “chat” with the company’s documentation to get quick answers to questions specifically about internal documents and intuitively find the references.
Another great use case for RAG is a chatbot for your company’s website. While in this case, the information on your site may have been included in OpenAI’s training dataset (as it’s publicly available data), a RAG-powered chatbot would not only be able to answer natural language questions from your customers, but it would also be able to point them to the specific page on your site where the answer can be found. Further, generated responses would explicitly use the language found on your site, making it that much more tailored to your customers’ needs.
RAG versus fine-tuning
RAG is a smart prompt engineering method to improve the accuracy and specificity of LLM-generated responses to context-specific queries.
Fine-tuning is another popular method used to generate responses from LLMs that are context-specific to a certain set of data. Rather than focusing on prompt-engineering, however, fine-tuning is a method used to further train a foundation LLM on a specific set of data to perform a specific task. This method to improve the accuracy of LLM outputs works best when:
- You have labeled data
- You want to change the format or style of the responses of an LLM
- You have a static set of training data
There isn’t a clear consensus in the AI community on whether fine-tuning is an advantageous method for expanding a foundation LLM’s knowledge base to data it wasn’t originally trained on. Detractors claim that effectively fine-tuning an LLM would require just as much data and processing power as the pre-training of the original LLM, at which point you might as well train your own model.
Why you should be curious about implementing RAG
The more we looked into RAG, the more excited we got about it as a method to improve and enrich LLM responses. Criticisms of LLMs often include a lack of transparency into the information sources, only being able to reference static data sources, limited context windows, and tendencies toward hallucinations. RAG addresses all of these concerns in the following ways:
✅ Returns sources of knowledge: The LLM’s response includes the information the retrieval system provided in the prompt to the LLM. This means you will always know where your answers came from and can go back and reference the source of truth.
✅ Works on ever-evolving datasets: The stream of data never stops. Since RAG searches your document store for context with every query, each time it will be drawing on the most up-to-date information.
✅ Extends the context window of an LLM: In theory, if you have a question for an internal document you could just pass that whole document to the LLM in your prompt. If the document is large, however, or the answer is contained across multiple documents, this method is impractical. Since RAG provides only the context-relevant chunks to answer your question, your LLM prompts are much more efficient, freeing up space in your context windows.
✅ Reduces hallucinations: An LLM making up an uninformed answer is one of the biggest drawbacks to implementing generative AI. Since the references sent to the LLM for context are also returned with the generated output, you can easily check that the answer you received is accurate. Further, you can design your RAG system so that the LLM draws only on the information passed to it from the retrieval system, not on the foundation LLM’s original knowledge base, reducing the risk that it invents an answer from unrelated training data.
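One common way to implement that constraint is in the prompt itself. The template below is a hypothetical example—the exact instruction wording is an assumption that teams tune for their own models, not a standard:

```python
# Hypothetical grounded-prompt template that tells the model to answer
# only from the retrieved context and to cite its sources by number.

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the sources below. If the answer is not in the "
        "sources, say you don't know. Cite sources by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What is the PTO accrual rate?",
    ["Employees accrue 1.5 days of PTO per month."],
)
print(prompt)
```

Numbering the sources also makes it easy to map the model’s citations back to the original documents when displaying the answer.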
RAG is not without its challenges, however…
In our experience playing around with implementing RAG, we have certainly run into our fair share of challenges. The main issue we ran up against was how to evaluate the effectiveness of the retrieval system. So we went ahead and built a solution to tackle that challenge. Part open source, part intuitive UI: stay tuned to this space, because we’ll soon be going live with the RAG validation tool you’ve been looking for… 👀