Two-Stage Retrieval and Reranker

Emma Ke

on June 10, 2024

14 min read

Retrieval Augmented Generation (RAG) is a powerful technique for building chatbots on top of a specified knowledge base. Rather than retraining the model, RAG grounds an AI customer chatbot in your unique knowledge repository at query time. It outperforms traditional fine-tuning for long-context processing and chatbot development for several reasons:

  • It is cost-effective and quick to set up, since no model training is required.
  • Debugging and making improvements are straightforward.
  • The process is intuitive, requiring no advanced machine learning expertise.
  • It allows easy updates with a new knowledge base when the old one becomes outdated.
  • It is highly scalable, capable of handling large knowledge bases.
  • It supports dynamic updates, ensuring the chatbot has the latest information.
  • It enhances contextual understanding over long conversations.
  • It reduces the risk of overfitting compared to traditional methods.
  • It offers flexibility, adaptable to various domains and industries.
  • It improves user experience with more accurate and relevant responses.
  • It can integrate with external sources for real-time data retrieval.

The core functionality of RAG involves retrieving the most relevant chunks of text from the provided documents or knowledge base and presenting these chunks to the large language model (LLM) chatbot as context for answering questions. Accurately retrieving the most pertinent text segments from millions of candidates is therefore crucial for the chatbot's performance: if the chatbot is given irrelevant context, it cannot provide correct answers, regardless of how sophisticated it is. In this blog, we will discuss how we enhance our chatbot's accuracy in the RAG process using a two-stage retrieval and reranking system.
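To make that flow concrete, here is a minimal sketch of the prompting step. The `retrieve` function and `llm_complete` client are hypothetical placeholders for whatever retriever and LLM API you actually use.

def answer_with_rag(question, retrieve, llm_complete, top_k=5):
    # `retrieve` and `llm_complete` are placeholders for your own retriever and LLM client.
    # Fetch the most relevant chunks from the knowledge base.
    chunks = retrieve(question, top_k=top_k)

    # Present the chunks to the LLM as context for answering the question.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_complete(prompt)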

Recall vs. Context Windows

Before diving into the solution, let's look at the problem. With RAG, we perform a semantic search across a large collection of text documents, which can range from tens of thousands to tens of billions of entries.

To achieve fast search times at scale, we typically use vector search. This involves transforming our text into vectors, placing them into a vector space, and comparing their proximity to a query vector using a similarity metric like cosine similarity.
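As a rough sketch of what this looks like in code, assuming the sentence-transformers library (the embedding model name and documents below are illustrative only):

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any bi-encoder works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Cats may avoid a litter box that is too small.",
    "Dry food can aggravate urinary problems in cats.",
]

# Transform the text into vectors and place them in a shared vector space.
doc_vectors = model.encode(documents, normalize_embeddings=True)
query_vector = model.encode("My cat has urinary issues", normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
similarities = doc_vectors @ query_vector
top_k = np.argsort(-similarities)[:1]
print([documents[i] for i in top_k])

In a production system the document vectors would be precomputed and stored in a vector database rather than encoded on every query.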

Vector search requires that we first encode our text into vectors. These vectors compress the "meaning" of the text into typically 768 or 1536 dimensions. This compression inevitably loses some information, because we condense an entire passage into a single vector.

Due to this information loss, the top results from vector search may miss relevant information. Sometimes, the relevant information appears below our top_k cutoff.

What can we do if relevant information at a lower rank would help our LLM provide a better response? The simplest approach is to increase the number of documents we return (increase top_k) and pass them all to the LLM.

The metric we measure here is recall — which indicates how many of the relevant documents we are retrieving. Recall does not consider the total number of retrieved documents, so we could technically achieve perfect recall by returning everything.

recall@K = (number of relevant documents retrieved in the top K results) / (total number of relevant documents in the dataset)
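As a quick illustration, here is a small helper that computes recall@K on a toy example (the document IDs are made up):

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all relevant documents that appear in the top-k retrieved results.
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# 3 relevant documents exist in the dataset; 2 of them are retrieved in the top 5.
print(recall_at_k(["d1", "d7", "d3", "d9", "d4"], ["d1", "d3", "d8"], k=5))  # ~0.67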

However, we cannot return everything. LLMs have limits on the amount of text we can pass to them, known as the context window. Some LLMs, like Anthropic's Claude, have large context windows of up to 100K tokens [1]. This would allow us to fit many tens of pages of text, so we might consider returning many documents to improve recall.

However, this approach also falls short. We cannot use context stuffing because it reduces the LLM's recall performance.

GPT-3.5 Turbo recall

When information is stored in the middle of a context window, an LLM's ability to recall that information becomes worse than if it had not been provided at all [2].

LLM recall refers to the capacity of a language model to extract information from the text within its context window. Studies indicate that LLM recall diminishes as the number of tokens in the context window increases. Additionally, LLMs tend to follow instructions less effectively when the context window is overloaded with information, making context stuffing a poor strategy.

While we can boost retrieval recall by increasing the number of documents our vector database returns, we can't simply pass all these documents to the LLM without impairing its recall ability.

The solution to this dilemma is twofold: first, we maximize retrieval recall by fetching a large number of documents; then, we maximize LLM recall by reducing the number of documents fed into the LLM. This is achieved by reranking the retrieved documents and selecting only the most relevant ones for the LLM.
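Put together, a two-stage pipeline might look roughly like the following sketch, again assuming the sentence-transformers library; the model names are examples, not a prescription.

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

# Stage 1 model (fast retriever) and stage 2 model (slower, more accurate reranker).
retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def two_stage_retrieve(query, corpus, first_stage_k=100, final_k=10):
    # Stage 1: maximize retrieval recall with a cheap vector search over the corpus.
    # (In production the corpus vectors would live in a vector database, precomputed once.)
    doc_vectors = retriever.encode(corpus, normalize_embeddings=True)
    query_vector = retriever.encode(query, normalize_embeddings=True)
    candidate_idx = np.argsort(-(doc_vectors @ query_vector))[:first_stage_k]

    # Stage 2: maximize LLM recall by reranking the candidates and keeping only the best few.
    pairs = [(query, corpus[i]) for i in candidate_idx]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(candidate_idx, scores), key=lambda x: x[1], reverse=True)
    return [corpus[i] for i, _ in reranked[:final_k]]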

Role of Rerankers

A reranking model, also referred to as a cross-encoder, evaluates a query-document pair to produce a similarity score used for reordering documents based on relevance to the query.

The two-stage RAG process with rerankers

For years, search engineers have implemented rerankers in two-stage retrieval systems. Initially, a first-stage model (an embedding model or retriever) fetches a subset of relevant documents from a large dataset. Subsequently, a second-stage model (the reranker) reorganizes these retrieved documents.

The reason for employing two stages is efficiency: retrieving a small document subset is significantly faster than reranking a large set. We'll delve into this shortly, but in essence, retrievers are optimized for speed, whereas rerankers are slower.

Why Use Rerankers?

Given the slower speed of rerankers, why do we use them? The key lies in their superior accuracy compared to embedding models.

Bi-encoders, used in embedding models, face accuracy limitations because they compress all potential meanings of a document into a single vector, resulting in information loss. Moreover, they lack query context since embeddings are created prior to receiving the query.

In contrast, rerankers directly process raw information through extensive transformer computations, minimizing information loss. By running rerankers during user query time, we can analyze document meaning specific to the query rather than generating a generalized interpretation.

While rerankers overcome the information loss of bi-encoders, they introduce a trade-off: increased processing time.

The one-step (retrieval-only) RAG process

A bi-encoder model condenses the meaning of a document or query into a single vector. It processes the query in the same manner as it processes documents, but this occurs at the time of the user query. It's important to note that in this context, document A functions as our query.

With bi-encoder models in vector search, heavy transformer computations are conducted during vector creation. Hence, when a user queries the system:

  1. Only a single transformer computation is required to create the query vector.
  2. The query vector is then compared with the precomputed document vectors using a lightweight metric such as cosine similarity.

Conversely, rerankers operate without pre-computation. Each query and document pair undergoes complete transformer inference, resulting in a single similarity score.
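A back-of-the-envelope comparison makes the cost difference clear; the corpus and query counts below are purely illustrative.

# Assumed sizes, for illustration only.
num_docs = 1_000_000
num_queries = 1_000

# Bi-encoder: encode the corpus once offline, then one transformer pass per query.
bi_encoder_passes = num_docs + num_queries        # 1,001,000 forward passes

# Cross-encoder over the whole corpus: one pass per (query, document) pair.
cross_encoder_passes = num_docs * num_queries     # 1,000,000,000 forward passes

print(bi_encoder_passes, cross_encoder_passes)

This is why the reranker is applied only to the small candidate set returned by the first stage, never to the entire corpus.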

The process for calculating the similarity score

A reranker evaluates both the query and the chunked text together, generating a single similarity score through a complete transformer inference step. It's important to note that in this context, document A functions as our query.

To illustrate how the reranking process works, let's walk through a concrete example.

Implementing Two-Stage Retrieval with Reranking

In this example, we skip the initial retrieval step and assume it has already returned six chunked texts. From these, we select the two with the highest scores to serve as context information for the chatbot.

Here is the original query:

My cat is having some urinary issues. I think he is actually a girl because he spends a long time in the litter box. What do I do?

Here are the six chunked texts, along with their first-stage vector similarity scores:

[score=0.853] People forget that litter box usage needs to be addressed from the cat’s point of view, not the owner’s. Keep in mind that urinating takes less time than defecating. Your kitten may not like to spend a lot of time in the litter box and may choose to defecate outside the litter box.

[score=0.851] The owner can gradually move the litter box to a more desirable location. If the cat attempts to use the litter box but misses, other measures must be taken. The box itself should be evaluated; it may be too small or the cat may position itself too near the edge.

[score=0.840] Toss him a treat and then send or bring him to his safe room with the litter box. Keep in mind that some cats become even more stressed when confined, so confining them is not always appropriate. Others will use the litter box when confined but continue to soil the house when let out.

[score=0.834] Toileting can occur at any age, with marking behaviors typically seen in cats older than 6 months. Housesoiling can occur in either sex, intact or altered.

[score=0.833] Care should be taken to avoid providing comfortable sitting places at bottlenecks where dominant cats could use them to keep guard.

[score=0.821] When female cats get FUS, they frequently run to the litter box and sometimes have blood in their urine. However, females usually don’t get urinary blockages and surgery isn’t necessary. Nevertheless, many people know that dry food in the diet can bring on an attack of FUS but still feed dry food to their female cat.

Without Reranker

If we do not use a reranker, only the first two text chunks (ranked by vector similarity) will be provided to the chatbot to answer the question. However, the second text chunk is not related to the cat's urinary issue, making it less useful for addressing the query.

With Reranker

We can use the following Python code to apply the reranker:

from sentence_transformers import CrossEncoder

# Load pre-trained model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

# Define a function to compute relevance scores
def compute_relevance_scores(model, query, texts):
    input_pairs = [(query, text) for text in texts]
    scores = model.predict(input_pairs)
    return scores

# Query
query = "My cat is having some urinary issues. I think he is actually a girl because he spends a long time in the litter box. What do I do?"

# Text chunks
texts = [
    "People forget that litter box usage needs to be addressed from the cat’s point of view, not the owner’s. Keep in mind that urinating takes less time than defecating. Your kitten may not like to spend a lot of time in the litter box and may choose to defecate outside the litter box.",
    "The owner can gradually move the litter box to a more desirable location. If the cat attempts to use the litter box but misses, other measures must be taken. The box itself should be evaluated; it may be too small or the cat may position itself too near the edge.",
    "Toss him a treat and then send or bring him to his safe room with the litter box. Keep in mind that some cats become even more stressed when confined, so confining them is not always appropriate. Others will use the litter box when confined but continue to soil the house when let out.",
    "Toileting can occur at any age, with marking behaviors typically seen in cats older than 6 months. Housesoiling can occur in either sex, intact or altered.",
    "Care should be taken to avoid providing comfortable sitting places at bottlenecks where dominant cats could use them to keep guard.",
    "When female cats get FUS, they frequently run to the litter box and sometimes have blood in their urine. However, females usually don’t get urinary blockages and surgery isn’t necessary. Nevertheless, many people know that dry food in the diet can bring on an attack of FUS but still feed dry food to their female cat."
]

# Compute relevance scores
scores = compute_relevance_scores(model, query, texts)

# Combine texts and scores, then sort by scores in descending order
text_scores = list(zip(texts, scores))
text_scores_sorted = sorted(text_scores, key=lambda x: x[1], reverse=True)

# Print sorted texts with their relevance scores
print(f"Query: {query}")
for text, score in text_scores_sorted:
    print(f"Text: {text} - Relevance Score: {score:.4f}")

This is the output of running the above code:

Query: My cat is having some urinary issues. I think he is actually a girl because he spends a long time in the litter box. What do I do?
Text: When female cats get FUS, they frequently run to the litter box and sometimes have blood in their urine. However, females usually don’t get urinary blockages and surgery isn’t necessary. Nevertheless, many people know that dry food in the diet can bring on an attack of FUS but still feed dry food to their female cat. - Relevance Score: 0.4802
Text: People forget that litter box usage needs to be addressed from the cat’s point of view, not the owner’s. Keep in mind that urinating takes less time than defecating. Your kitten may not like to spend a lot of time in the litter box and may choose to defecate outside the litter box. - Relevance Score: -0.2213
Text: Toss him a treat and then send or bring him to his safe room with the litter box. Keep in mind that some cats become even more stressed when confined, so confining them is not always appropriate. Others will use the litter box when confined but continue to soil the house when let out. - Relevance Score: -4.8956
Text: The owner can gradually move the litter box to a more desirable location. If the cat attempts to use the litter box but misses, other measures must be taken. The box itself should be evaluated; it may be too small or the cat may position itself too near the edge. - Relevance Score: -5.6148
Text: Toileting can occur at any age, with marking behaviors typically seen in cats older than 6 months. Housesoiling can occur in either sex, intact or altered. - Relevance Score: -8.5576
Text: Care should be taken to avoid providing comfortable sitting places at bottlenecks where dominant cats could use them to keep guard. - Relevance Score: -9.1630

If we choose the top two text chunks after reranking, the chunks most relevant to the user's query now appear in the top positions and are served to the chatbot. The text, "When female cats get FUS, they frequently run to the litter box and sometimes have blood in their urine. However, females usually don’t get urinary blockages and surgery isn’t necessary. Nevertheless, many people know that dry food in the diet can bring on an attack of FUS but still feed dry food to their female cat," directly addresses the user's question. This ensures that the chatbot receives more accurate context and can provide a more precise answer.

How to Enable Two-Stage Reranking in Chat Data

Chat Data users can turn on two-stage reranking by clicking the Enable RAG enhancement button in the Settings > Model tab.

Turn on reranking in Chat Data

After enabling reranking, we will first retrieve the top 100 text chunks based on similarity scores. Then, we will apply the reranking algorithm to these 100 texts to select the 10 most relevant chunks. This ensures that the texts are more accurately related to the query.

Without the reranking process, we would retrieve only the top 10 text chunks based solely on similarity scores. This approach might introduce irrelevant text chunks and miss some important ones that are helpful in answering the query.

Because reranking requires a full transformer inference pass for each query-document pair, there is an additional charge of 5 credits per message on top of the original message credit cost.

Embark on your journey with us today, absolutely free of charge.

References

[1] Introducing 100K Context Windows (2023), Anthropic

[2] N. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the Middle: How Language Models Use Long Contexts (2023)

[3] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), UKP-TUDA
