Generative AI
Boonyawee Sirimaya
4 min read
October 10, 2024

How Contextual Retrieval Enhances AI Understanding

According to Anthropic, an AI model needs relevant background knowledge to be effective in a specific scenario. For instance, a customer support chatbot needs insight into the business it represents, and a legal bot must have access to past cases to function effectively.

A commonly used approach to enhance AI’s knowledge is Retrieval-Augmented Generation (RAG). RAG retrieves relevant information from a knowledge base and attaches it to the user’s input, improving the AI’s responses significantly. However, traditional RAG methods often strip away context when encoding data, which sometimes leads to irrelevant or incomplete retrieval from the knowledge base.

Anthropic presents an improved solution called Contextual Retrieval, which incorporates two techniques: Contextual Embeddings and Contextual BM25. These methods help reduce retrieval errors by up to 49%, and when combined with reranking, the improvement rises to 67%. This leap in accuracy leads to better downstream task performance.

Using Longer Prompts for Small Knowledge Bases

In some cases, the simplest solution might suffice. If your knowledge base is under 200,000 tokens (roughly 500 pages), you can bypass RAG entirely by placing the whole knowledge base directly in the model's prompt. This removes the retrieval step and simplifies the pipeline.
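
As a rough illustration, a minimal sketch of this "no RAG" path might look like the following, assuming the Anthropic Python SDK. The model name, file path, question, and the 4-characters-per-token estimate are illustrative assumptions, not exact figures.

```python
# A rough sketch of skipping RAG for a small knowledge base: if the corpus fits
# comfortably under ~200k tokens, pass it directly in the prompt.
import anthropic

knowledge_base = open("knowledge_base.txt", encoding="utf-8").read()

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude rule of thumb for English text, not an exact count

if approx_tokens(knowledge_base) < 200_000:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=f"Answer questions using this knowledge base:\n\n{knowledge_base}",
        messages=[{"role": "user", "content": "What is our refund policy?"}],
    )
    print(response.content[0].text)
```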

Additionally, Anthropic’s recently introduced prompt caching for Claude speeds up this process and reduces costs. Prompt caching allows frequently used prompts to be stored between API calls, reducing latency by over 2x and cutting costs by as much as 90%.

However, for larger knowledge bases, a more scalable solution like Contextual Retrieval is necessary.

Scaling with RAG: Managing Larger Knowledge Bases

For bigger knowledge bases that exceed the context window, RAG remains the preferred solution. It works by processing a knowledge base through the following steps:

  1. Break the knowledge base into smaller chunks of text.
  2. Convert these chunks into vector embeddings that encode meaning.
  3. Store these embeddings in a vector database for efficient semantic searches.

At runtime, when a user submits a query, the vector database identifies the most relevant chunks based on semantic similarity, which are then added to the prompt.
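
As a concrete illustration, here is a minimal in-memory sketch of those steps, assuming the sentence-transformers library. The model name, chunk size, and file path are placeholder choices, and a real system would use a dedicated vector database rather than a NumPy array.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, size: int = 800) -> list[str]:
    # Naive fixed-size chunking; production systems usually split on sentence
    # or section boundaries instead.
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk_text(open("knowledge_base.txt", encoding="utf-8").read())
chunk_vectors = model.encode(chunks, normalize_embeddings=True)  # stand-in for a vector DB

def semantic_retrieve(query: str, k: int = 5) -> list[int]:
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vec                # cosine similarity on unit vectors
    return np.argsort(scores)[::-1][:k].tolist()      # indices of the top-k chunks

relevant_chunks = [chunks[i] for i in semantic_retrieve("What was Q2 revenue growth?")]
```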

While embeddings are great at capturing meaning, they may overlook exact matches. That’s where BM25 comes in—a ranking function that finds precise matches by focusing on lexical similarities. This is especially useful for queries with technical terms or identifiers.

Optimizing Retrieval with BM25

BM25 enhances traditional RAG by accounting for exact term matches, building upon the TF-IDF (Term Frequency-Inverse Document Frequency) method. This approach helps pinpoint exact matches, making it ideal for specific queries, such as retrieving documentation based on error codes or unique identifiers.
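
A small sketch of the lexical side, assuming the rank_bm25 package and the `chunks` list from the previous sketch; the whitespace tokenizer is a simplification.

```python
from rank_bm25 import BM25Okapi

tokenized_chunks = [chunk.lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_chunks)

def bm25_retrieve(query: str, k: int = 5) -> list[int]:
    scores = bm25.get_scores(query.lower().split())
    # Indices of the k highest-scoring chunks by exact term overlap.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```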

By combining semantic embeddings with BM25, developers can improve retrieval accuracy by following these steps:

  1. Break the knowledge base into manageable text chunks.
  2. Encode these chunks with both TF-IDF (for BM25) and semantic embeddings.
  3. Use BM25 to find the top chunks based on exact term matches.
  4. Use embeddings to find the top chunks based on semantic similarity.
  5. Merge and deduplicate the two result sets using rank-fusion techniques (see the sketch after this list).
  6. Add the highest-ranked chunks to the prompt so the model generates its response from the most relevant information.
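
One common way to implement the merge step is reciprocal rank fusion (RRF). The sketch below assumes the `bm25_retrieve` and `semantic_retrieve` helpers from the earlier sketches, and the constant 60 is the conventional RRF smoothing value rather than anything Anthropic prescribes.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], c: int = 60) -> list[int]:
    # Each ranking is a list of chunk indices, best first; scores are summed
    # across rankings so chunks found by both retrievers rise to the top.
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

query = "documentation for error code TS-999"
fused = reciprocal_rank_fusion([
    bm25_retrieve(query, k=20),      # lexical candidates
    semantic_retrieve(query, k=20),  # semantic candidates
])
top_chunks = [chunks[i] for i in fused[:5]]  # added to the prompt
```
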
A standard Retrieval-Augmented Generation (RAG) system uses both embeddings and Best Match 25 (BM25) for information retrieval.

This hybrid approach enables scalable and cost-effective retrieval from vast knowledge bases. However, traditional RAG still has a major limitation—it often loses important context.

The Context Challenge in Traditional RAG

RAG typically divides documents into small chunks for efficient retrieval. While effective, this strategy can sometimes cause individual chunks to lose the necessary context, leading to incomplete information retrieval. For instance, a chunk might reference a company’s quarterly growth, but without context, it’s unclear which company or period is being discussed.

How Contextual Retrieval Solves the Problem

Contextual Retrieval solves this by prepending a short, chunk-specific explanation to each chunk before it is embedded and added to the BM25 index. Because each chunk carries its own context, even small chunks remain unambiguous, and the model can retrieve and interpret information more accurately.

While manually annotating every chunk in a knowledge base would be impractical, Anthropic’s Claude model can automate this process. By using Claude to generate context for each chunk, the entire knowledge base can be preprocessed efficiently.
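
A minimal sketch of that preprocessing step might look like the following, assuming the Anthropic Python SDK. The model name and prompt wording are illustrative, loosely following the kind of prompt Anthropic describes, not the official implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short, succinct context that situates this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

def generate_context(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    return response.content[0].text.strip()

# Prepend the generated context so it gets embedded and BM25-indexed together
# with the original chunk text.
document = open("knowledge_base.txt", encoding="utf-8").read()
contextualized_chunks = [f"{generate_context(document, c)}\n\n{c}" for c in chunks]
```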

Contextual Retrieval preprocessing workflow, including chunk creation, context generation, and integration with the embedding and TF-IDF processes.

Reducing Costs with Prompt Caching

Claude's prompt caching feature allows for efficient contextual retrieval by loading reference documents into the cache once and then referencing this cached content across multiple chunks. This method avoids the need to repeatedly pass in the same reference document for every chunk. 
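
In code, this might look like the variant below, which marks the document block as cacheable. The `cache_control` field and beta header follow Anthropic's prompt-caching documentation at the time of writing, and the model name is again illustrative; check the current documentation for exact flags.

```python
import anthropic

client = anthropic.Anthropic()

def generate_context_cached(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        messages=[{
            "role": "user",
            "content": [
                {   # Cached once per document, then reused across all of its chunks.
                    "type": "text",
                    "text": f"<document>\n{document}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {   # Only this small per-chunk portion changes between calls.
                    "type": "text",
                    "text": ("Here is the chunk we want to situate within the whole document:\n"
                             f"<chunk>\n{chunk}\n</chunk>\n"
                             "Give a short, succinct context that situates this chunk within "
                             "the overall document. Answer only with the succinct context."),
                },
            ],
        }],
    )
    return response.content[0].text.strip()
```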

For example, with 800-token chunks, an 8k-token document, 50 tokens of context instructions, and 100 tokens of additional context per chunk, the one-time cost to generate contextualized chunks is estimated at $1.02 per million document tokens. This makes it a low-cost way to preprocess even large document collections.

Maximizing Performance with Reranking

Reranking adds an extra layer of refinement to retrieval by ensuring only the most relevant chunks are passed to the model. Initially, the system retrieves the top potential chunks. Then, a reranking model scores them based on relevance, selecting the top-K chunks to include in the prompt. 
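
A sketch of that final step, using a generic cross-encoder from the sentence-transformers library as the reranking model. Anthropic's write-up mentions commercial rerankers such as Cohere's, so treat this particular model as a stand-in.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, chunk) pair jointly, then keep only the top_k chunks
    # that will actually be placed in the prompt.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# `candidate_chunks` would be the larger first-pass result set from hybrid
# retrieval (e.g. a few dozen chunks), trimmed here to the final top_k.
candidate_chunks = [chunks[i] for i in fused]  # from the rank-fusion sketch
final_chunks = rerank("documentation for error code TS-999", candidate_chunks)
```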

While reranking introduces a small increase in latency, it significantly enhances retrieval accuracy, especially for large knowledge bases. Tests show that combining contextual retrieval and reranking can reduce retrieval failures by up to 67%.

Comparison of average retrieval failure rates for standard and contextual retrieval methods.

Key Takeaways

Anthropic’s experiments show that combining embeddings, BM25, contextual retrieval, and reranking leads to a significant boost in retrieval accuracy. This stack of techniques can help developers unlock new levels of AI performance, particularly when managing vast knowledge bases. For further guidance, Anthropic provides a detailed cookbook to help developers experiment with these methods and achieve optimal results.

Consult with our experts at Amity Solutions for additional information here.