The Cutting-Edge of Production-Ready RAG Techniques

Maximilian Freiermuth
20 Feb 2024
10 min read

Aboard the current AI hype train, turning Large Language Models (LLMs) like GPT-4 or Gemini into usable tools for enterprises requires two non-negotiable capabilities: (A) integrating information that appeared after their training cut-off and (B) accessing proprietary or confidential data. So, how do we turn LLMs into trusted insiders, equipped with up-to-date intel?

This is where Retrieval-Augmented Generation (RAG) and model fine-tuning come into play. Far from being mere additions to the AI repertoire, they're the secret sauce that makes LLMs not just smart, but also contextually aware and insightful about current events. This article, inspired by our hands-on experience with a RAG project for a public-service client in Germany, is your guide through both classic and cutting-edge RAG strategies for giving LLMs the up-to-date context they need to be really useful.

When to Use RAG, When to Use Fine-Tuning?

When starting any LLM project, you should always begin with a few prompt engineering experiments. It is akin to testing the waters before diving in: it's quick, it's easy, and it gives you a taste of what's possible, e.g., by leveraging "few-shot learning".

Yet, this approach only takes you so far. Eventually, you'll face a pivotal decision: whether to enhance context awareness via RAG or to adjust the communication style through fine-tuning.

When to use RAG
Turning your LLM application into something useful means adapting the model's style or tone (x-axis), its content (y-axis), or both

Choose RAG when your application demands fresh, up-to-date information (which is imho mostly the case) or needs access to specific knowledge bases, such as dynamic Q&A systems. Opt for fine-tuning when the challenge lies not in what the model knows, but in how it communicates (e.g., speaking Shakespearean English, because - why not?). This is your pick for teaching the model to mimic a new writing style, adapt to a particular tone, or even switch languages. In later stages, fine-tuning can also help reduce your token burn rate by making the model more context-efficient out of the box.

The most effective solutions often involve a combination of both techniques. By iteratively applying RAG and fine-tuning, it's possible to not only keep the model informed but also ensure that its outputs align with desired stylistic and domain-specific nuances while keeping costs at bay.

In essence, employ RAG if…

  • You need to incorporate new, relevant information into the model's outputs
  • You want to ground the model's responses in accurate, up-to-date context to minimize hallucinations

Probably not if…

  • Your aim is a comprehensive overhaul of the model's knowledge in a broad domain, which might be better served by fine-tuning
  • You seek to significantly alter the model's output style or language, a task for which fine-tuning is more suited
  • Cost efficiency and token usage are priorities, and the goal is to maintain content quality while managing computational resources

How to use RAG in production

Embarking on the RAG journey is a bit like assembling furniture without the manual: it seems straightforward until you're left with extra screws. But fear not, here is a blueprint for:

  1. Naive RAG process “out of the box”
  2. “Go-to” RAG techniques to improve the output quality into something usable
  3. Bleeding-edge / advanced techniques

1. Naive RAG process

The naive RAG process can be likened to your first date — it feels magical, but probably won’t get you far. As outlined in the drawing (and the short code sketch further below), here’s how most companies (we did too, in our first project!) begin their RAG adventure:

  1. Ingest and index docs: Use an embedding model, such as OpenAI's text-embedding model, to convert text from various sources (e.g., a company's intranet) into numerical vectors that enable semantic search.
  2. Vector storage: These embeddings are then stored in a vector database (e.g., Pinecone) to allow performant and scalable searching.
  3. Vector search: When a user poses a question, the system performs a vector search to find the top-k chunks of text that resonate most closely with the query.
  4. Integration with LLM: The user's question and the retrieved context chunks are sent to the LLM (e.g., the GPT-4 API).
  5. Answer generation: Finally, the answer is generated with the relevant context baked in.

Naive RAG process
The naive RAG process flow, from document ingestion to answer generation
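To make these five steps concrete, here is a minimal sketch of the naive pipeline using the OpenAI Python SDK and a plain in-memory cosine-similarity search in place of a real vector database. The model names, example documents, and prompt are illustrative assumptions, not recommendations.

```python
# Minimal naive RAG sketch (assumptions: OpenAI Python SDK >= 1.x, numpy installed,
# OPENAI_API_KEY set; in production the in-memory array would be a vector DB).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Step 1: turn text chunks into embedding vectors
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = ["Our vacation policy grants 30 days per year.",
        "Expenses must be filed within 14 days.",
        "The cafeteria opens at 8 am."]
doc_vectors = embed(docs)  # Step 2: "vector storage" (here: just an array in memory)

def retrieve(question, k=2):
    # Step 3: cosine similarity between the query vector and document vectors
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question):
    # Step 4: send the question plus retrieved context to the LLM
    context = "\n".join(retrieve(question))
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": "Answer using only the provided context."},
                  {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}])
    return completion.choices[0].message.content  # Step 5: answer generation

print(answer("How many vacation days do I get?"))
```

Swapping the in-memory array for a managed vector database and layering on the tuning techniques below is where the real work starts.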

2. Effective go-to RAG tuning techniques

The effectiveness and quality of RAG output correlate with several factors you can adjust during the RAG process. Below are the most-used key methodologies, which should be exhausted first before moving on to more adventurous methods.

Best practice RAG tuning options
Most effective tuning parameters for RAG in production

Parsing Techniques: The Art of Data Preparation

The initial step in the RAG process involves parsing through documents to extract and index pertinent information. Advanced parsing techniques enable the system to dissect various document types, such as presentations or multimedia files, and accurately convert them into an analyzable format. This ensures that the RAG system can interpret and utilize the full spectrum of available data.
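As a small illustration, a parsing layer might route each file type to a dedicated extractor before chunking and embedding. The libraries below (pypdf, python-pptx) are one possible choice, not a prescription.

```python
# Sketch of a parsing step that normalizes different document types into plain text
# before chunking/embedding (assumes `pypdf` and `python-pptx` are installed).
from pathlib import Path
from pypdf import PdfReader
from pptx import Presentation

def parse_pdf(path: Path) -> str:
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parse_pptx(path: Path) -> str:
    prs = Presentation(str(path))
    texts = [shape.text_frame.text
             for slide in prs.slides
             for shape in slide.shapes
             if shape.has_text_frame]
    return "\n".join(texts)

PARSERS = {".pdf": parse_pdf, ".pptx": parse_pptx, ".txt": lambda p: p.read_text()}

def parse_document(path: Path) -> str:
    # Route each file to the matching extractor; unknown types fail loudly
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"No parser registered for {path.suffix}")
    return parser(path)
```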

Adjusting Chunk Sizes: The Goldilocks Principle

Optimal chunk size is critical for balancing the relevance and precision of the information retrieved. Chunks that are too small may omit necessary context, while excessively large chunks can dilute the specificity of the information and escalate operational costs. The objective is to identify a chunk size that conveys a coherent idea independently, mirroring the level of understanding expected from a human reading the text.
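A simple way to experiment with this trade-off is a sliding window with overlap; the sizes below are illustrative defaults to tune per corpus, and word counts stand in for tokens.

```python
# Sliding-window chunking sketch: split text into overlapping word-based chunks.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window moves each iteration
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: see how many chunks (and therefore embeddings) different sizes produce
document = "lorem ipsum " * 1000  # placeholder text
for size in (100, 200, 400):
    print(size, len(chunk_text(document, chunk_size=size)))
```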

Metadata Filtering: The VIP List

Incorporating metadata into the retrieval process refines the search capabilities of the RAG system. By using document metadata, such as publication dates or thematic tags, the system can perform targeted filtering, effectively homing in on the most pertinent chunks of information. This approach is typically supported by many vector databases (e.g., Chroma, Pinecone) which allow for sophisticated query customization.
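As a sketch of what this looks like in practice, here is a Chroma query that restricts the semantic search to documents with a given department tag and a minimum publication year; the collection name, metadata fields, and values are made up for illustration.

```python
# Metadata filtering sketch with Chroma (assumes `chromadb` is installed).
import chromadb

client = chromadb.Client()
collection = client.create_collection("intranet_docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=["2024 travel expense policy ...", "2019 onboarding checklist ..."],
    metadatas=[{"department": "finance", "year": 2024},
               {"department": "hr", "year": 2019}],
)

# Only chunks from the finance department published 2023 or later are searched
results = collection.query(
    query_texts=["How do I submit travel expenses?"],
    n_results=3,
    where={"$and": [{"department": {"$eq": "finance"}},
                    {"year": {"$gte": 2023}}]},
)
print(results["documents"])
```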

Hybrid Search: Best of both worlds

To enhance the RAG system's retrieval accuracy, a hybrid search approach is often implemented. This combines semantic vector search with traditional keyword search methods, such as BM25 or TF-IDF. Such a combination capitalizes on the strengths of each search mechanism, ensuring that the system can handle a broad array of queries, including those that fall outside the domain where dense retrieval might underperform. Frameworks like Haystack and LlamaIndex as well as vector DBs like Pinecone and Weaviate facilitate this hybrid search approach, providing a more robust and reliable retrieval system.
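A minimal way to see the idea is to blend normalized scores from a keyword ranker (BM25) and a dense embedding model. The sketch below uses rank_bm25 and sentence-transformers purely as stand-ins, and the 50/50 weighting is an assumption to tune.

```python
# Hybrid search sketch: combine BM25 keyword scores with dense embedding similarity
# (assumes `rank_bm25` and `sentence-transformers` are installed).
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["Invoice INV-4711 was rejected due to a missing cost center.",
        "Travel costs are reimbursed within two weeks.",
        "The office dog is named Bruno."]

# Sparse/keyword side
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Dense/semantic side
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5, k: int = 2):
    sparse = np.array(bm25.get_scores(query.lower().split()))
    dense = doc_emb @ encoder.encode(query, normalize_embeddings=True)

    def norm(s):
        # Min-max normalize so both score ranges are comparable before mixing
        return (s - s.min()) / (s.max() - s.min() + 1e-9)

    combined = alpha * norm(dense) + (1 - alpha) * norm(sparse)
    return [docs[i] for i in np.argsort(combined)[::-1][:k]]

print(hybrid_search("Why was invoice INV-4711 rejected?"))
```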

A word on handling complexity in a non-deterministic system

By employing these standard tuning techniques, you should normally be able to get most RAG use cases into an accuracy territory that makes them actually deployable in production. However, it becomes evident that the complexity of the process increases significantly, as Jerry Liu (CEO of LlamaIndex) also pointed out on X:

"A big issue with building RAG (+ LLM apps more generally) is that you’re adding a ton of parameters to a stochastic system that require tuning for good results" - Jerry Liu, LlamaIndex

... which is why several companies and startups have included LLM/RAG validation suites in their offerings: besides LlamaIndex itself, there are LangChain ("LangSmith"), Galileo, and TruLens (part of TruEra).

3. Advanced RAG techniques

As we delve deeper into the realm of RAG, we encounter a suite of advanced techniques tailored for sophisticated applications. Let's spotlight the most promising of them!

Advanced RAG tuning options
The go-to (blue) and more advanced (red) RAG tuning techniques

Re-ranking

Re-ranking can significantly boost the relevancy of retrieved documents. The re-ranking model optimizes the set of documents initially fetched by the retriever by filtering out results or re-arranging their order. By prioritizing the most relevant documents, it limits the total number to be considered, thus making the retrieval both more efficient and responsive. This two-stage approach, where the first stage involves fast retrieval (i.e., comparing the query vector with document chunk vectors) and the second involves more accurate but slower re-ranking (running both query and chunks through a transformer to calculate a similarity score), is a common practice to balance speed and accuracy.
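A common way to implement the second stage is a cross-encoder that scores each (query, chunk) pair jointly. The sketch below uses a public sentence-transformers cross-encoder as one possible choice, and the candidate list is assumed to come from the fast first-stage vector search.

```python
# Two-stage retrieval sketch: fast vector search first, cross-encoder re-ranking second
# (assumes `sentence-transformers` is installed; the model name is one common public choice).
from sentence_transformers import CrossEncoder

# Candidates as returned by the fast first-stage retriever (top-k vector search)
candidates = [
    "Employees get 30 vacation days per year.",
    "Vacation requests must be approved by your manager.",
    "The parking garage closes at 10 pm.",
]
query = "How many vacation days do employees get?"

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep only the best-scoring chunks for the LLM prompt
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])
```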

Recursive retrieval or sentence window retrieval

Recursive retrieval, also known as Small-to-Big or Child-Parent retrieval, involves initially fetching smaller 'child' information chunks. These chunks reference larger 'parent' chunks for additional context as needed. Similarly, Sentence Window Retrieval starts with a single sentence, gradually expanding the context to include surrounding sentences for deeper understanding. Both methods offer a balanced mix of efficiency and rich context.
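The core mechanic of sentence window retrieval is simple enough to show directly: match on single sentences, but hand the LLM a window of neighbouring sentences around each hit. The sketch below deliberately simplifies the matching step to keyword overlap so the windowing logic stays in focus; in practice it would be an embedding similarity.

```python
# Sentence window retrieval sketch: match individual sentences, return them with
# their surrounding sentences as context (matching simplified to keyword overlap).
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_window_retrieve(query: str, text: str, window: int = 1) -> list[str]:
    sentences = split_sentences(text)
    query_terms = set(query.lower().split())
    results = []
    for i, sentence in enumerate(sentences):
        # Naive relevance check; a real system would compare embeddings here
        if query_terms & set(sentence.lower().split()):
            start, end = max(0, i - window), min(len(sentences), i + window + 1)
            results.append(" ".join(sentences[start:end]))
    return results

doc = ("The new policy was announced in March. Remote work is allowed three days "
       "per week. Exceptions require manager approval. The cafeteria menu changed too.")
print(sentence_window_retrieve("remote work", doc))
```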

Query transformations

Query transformations are a family of techniques using an LLM as a reasoning engine to modify user input in order to improve retrieval quality.

For example, if the query is complex, the LLM can decompose it into several sub-queries, which are all run, thus potentially improving output quality (at the cost of tokens). Many frameworks have this capability built in; e.g., LangChain calls it “Multi Query Retriever” and LlamaIndex “Sub Question Query Engine”.
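Without tying it to either framework's API, the decomposition idea can be sketched as: ask the LLM for sub-queries, retrieve for each, and answer over the merged context. The prompt and the `retrieve()` placeholder below are illustrative assumptions.

```python
# Query decomposition sketch: let the LLM split a complex question into sub-queries,
# retrieve per sub-query, then answer over the merged context.
# (Assumes the OpenAI Python SDK; `retrieve()` stands in for your retriever.)
from openai import OpenAI

client = OpenAI()

def retrieve(query: str) -> list[str]:
    ...  # your vector / hybrid search goes here
    return []

def decompose(question: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system",
                   "content": "Split the user question into at most 3 standalone "
                              "sub-questions, one per line. Output only the sub-questions."},
                  {"role": "user", "content": question}])
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def answer(question: str) -> str:
    sub_queries = decompose(question)
    context = "\n".join(chunk for q in sub_queries for chunk in retrieve(q))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nAnswer the question: {question}"}])
    return resp.choices[0].message.content
```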

Fine-tune embedding model with synthetic questions

Using a standard embedding model, such as OpenAI’s text-embedding-ada-002, may result in embeddings that are not optimized for your specific dataset, potentially diminishing the quality and precision of retrieval. The solution involves fine-tuning the embedding model to generate more meaningful representations across your data distribution, enhancing retrieval performance.

Fine-tuning requires labeled data (queries and relevant documents), which you can create with the LLM ("You are an award-winning query writer for the following docs...") to obtain a synthetic query dataset. With this domain-specific synthetic query dataset (e.g., question answering on specific legal docs), we can then fine-tune the embedding model for improved vector search performance.
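A rough sketch of the second half of that recipe: fine-tune an open embedding model on (synthetic query, relevant chunk) pairs with sentence-transformers and MultipleNegativesRankingLoss. The base model, batch size, and epoch count are placeholder assumptions, and the pairs would come from the LLM-generated dataset described above.

```python
# Embedding fine-tuning sketch with sentence-transformers: train on
# (synthetic query, relevant chunk) pairs using in-batch negatives.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Synthetic (query, relevant document chunk) pairs generated by the LLM
pairs = [
    ("What is the notice period for terminating the lease?",
     "The lease may be terminated with three months' notice ..."),
    ("Who is liable for water damage?",
     "Liability for water damage rests with the tenant if ..."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Every other document in the batch acts as a negative for a given query
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned-embedding-model")
```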

Multi-document agents / agent based RAG implementation

The basic idea is to initialize an agent for each document which allows for Q&A and summary capabilities on each document. These document agents can search through embeddings and summarize responses, while a top-level 'meta-agent' can orchestrate the retrieval and utilize Chain-of-Thought reasoning to answer user queries. This approach leverages the power of LLMs like GPT-4 to create a hierarchical structure of agents that work together to produce a comprehensive response.

Agentic RAG schema
Multi-document agents / agent based RAG implementation
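Stripped of any specific framework, the hierarchy can be sketched as one Q&A/summary agent per document plus a meta-agent that routes the question and synthesizes the answers. The class names, routing prompt, and `llm()` helper below are illustrative placeholders, not a particular library's API.

```python
# Multi-document agent sketch: one agent per document (Q&A + summary over its own
# content), plus a meta-agent that routes the question and synthesizes the answers.
from dataclasses import dataclass

def llm(prompt: str) -> str:
    ...  # call your LLM of choice (e.g., GPT-4) here
    return ""

@dataclass
class DocumentAgent:
    name: str
    text: str

    def answer(self, question: str) -> str:
        # A real system would run RAG over this document's own embeddings here
        return llm(f"Document '{self.name}':\n{self.text[:4000]}\n\nQuestion: {question}")

    def summarize(self) -> str:
        return llm(f"Summarize document '{self.name}':\n{self.text[:4000]}")

class MetaAgent:
    def __init__(self, agents: list[DocumentAgent]):
        self.agents = {a.name: a for a in agents}

    def answer(self, question: str) -> str:
        # Step 1: let the LLM pick which document agents are relevant
        routing = llm(f"Available documents: {list(self.agents)}.\n"
                      f"Which are relevant to: '{question}'? Answer with names, comma-separated.")
        chosen = [n.strip() for n in routing.split(",") if n.strip() in self.agents]
        # Step 2: query each chosen agent, then synthesize a final answer
        partials = [f"{n}: {self.agents[n].answer(question)}" for n in chosen]
        return llm("Combine these partial answers into one response:\n" + "\n".join(partials))
```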

Tunable retriever (“RePlug LSR”)

"RePlug LSR" stands for Retrieval-Augmented Black-Box Language Models with LM-Supervised Retrieval, a method that refines the retriever component by harnessing the predictive power of the Language Model (LM) to guide document retrieval. This approach dynamically adjusts the retrieval process based on the LM's feedback, aiming to select documents that contribute to a lower perplexity in the model's output. Essentially, RePlug LSR seeks to enhance the harmony between the retrieved documents and the LM's generated sequence to improve coherence and accuracy.

RePlug LSR training process
The RePlug LSR training process: (1) retrieving documents and computing retrieval likelihood, (2) scoring the retrieved documents by the LM, (3) minimizing the KL divergence (a measure of statistical distance between the two probability distributions) between the retrieval likelihood and the LM’s score distribution, and (4) asynchronously updating the datastore index with newly computed document embeddings​

The training process for RePlug LSR consists of four primary steps:

  1. Retrieval and likelihood calculation: Initially, the system retrieves a batch of documents and calculates a retrieval likelihood for each one, indicating their relevance to the input query.
  2. Scoring by the language model: Next, the language model assesses the retrieved documents to determine their potential impact on reducing the model's perplexity (how “surprised” was the model by the new data?), effectively scoring each document's usefulness.
  3. KL divergence minimization: The retriever's parameters are then adjusted by minimizing the Kullback-Leibler divergence, which measures the difference between the retrieval likelihood and the language model's score distribution.
  4. Index updating: Finally, the document embeddings in the datastore index are updated asynchronously to ensure the retriever reflects the most current model parameters and understanding.

Through these steps, RePlug LSR refines the retriever's accuracy, ensuring it selects the most contextually relevant documents to support the language model's performance.
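The heart of step 3 can be written down compactly: turn the retrieval scores and the LM-derived scores into two distributions over the retrieved documents and minimize the KL divergence between them, updating only the retriever. The PyTorch sketch below is a schematic reading of that objective, with made-up tensors standing in for real retriever and LM outputs.

```python
# Schematic sketch of the RePlug LSR objective in PyTorch: minimize the KL divergence
# between the retrieval likelihood and the LM's score distribution over retrieved docs.
# (All tensor values are placeholders; only the retriever receives gradients.)
import torch
import torch.nn.functional as F

# Similarity scores between the query and k retrieved documents (from the retriever;
# requires gradients because only the retriever is updated)
retriever_scores = torch.tensor([2.1, 1.3, 0.4, -0.2], requires_grad=True)

# Per-document LM scores, e.g. how much each document lowers the LM's perplexity on
# the ground-truth continuation (treated as fixed supervision, no gradients)
lm_scores = torch.tensor([1.8, 1.9, 0.1, -1.0])

temperature = 0.1  # sharpens/softens both distributions (a tunable assumption)

retrieval_probs = F.softmax(retriever_scores / temperature, dim=-1)  # retrieval likelihood
lm_log_probs = F.log_softmax(lm_scores / temperature, dim=-1)        # LM score distribution

# KL(retrieval likelihood || LM distribution), written out explicitly;
# backprop updates the retriever parameters behind `retriever_scores`
loss = torch.sum(retrieval_probs * (torch.log(retrieval_probs) - lm_log_probs))
loss.backward()
print(loss.item(), retriever_scores.grad)
```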

Outlook

The domain of gen AI in general and RAG specifically is advancing at light speed. So whatever is written here will likely be outdated a few months from now.

For instance, with Google's Gemini showcasing an ability to handle context windows of up to 10 million tokens, the future of RAG application and implementation might evolve. Such advancements could potentially allow the integration of extensive external knowledge bases directly into the context window, reshaping current RAG methodologies. Nonetheless, the need for continuous updates and the implementation of access controls like RBAC remains a critical aspect where - in my opinion - RAG retains its relevance.

Exciting time to be alive! What we're seeing today is yet another step-change in the capabilities of neural networks. With each (unpredictable) leap forward, we seem to be getting closer to a future where artificial intelligence integrates seamlessly into every facet of our lives, redefining our relationship with technology and opening doors to uncharted possibilities.