Implementing a High-Speed, Performant RAG with Reranking Using Qdrant and FastEmbed (No GPU Required)
Introduction
In this article, we describe the development of a Retrieval-Augmented Generation (RAG) system that operates solely on a CPU. The system combines reranking with Qdrant, a free, open-source vector database, to extract relevant text snippets from input documents. These snippets are then passed to a GPT model, specifically the GPT-3.5 Turbo API, known for its speed and cost-efficiency. Our system aims to increase the likelihood of selecting the most pertinent information, enhancing performance and cost-effectiveness.
Video Demo:
Background
What is RAG?
Retrieval-Augmented Generation (RAG) combines retrieval-based and generative models to enhance the quality and relevance of responses in tasks like document retrieval and question answering. The system uses a vector database to store representations of data as vectors, enabling rapid retrieval.
Main Steps in a RAG System:
- Query: The process initiates with a user input, known as a query. This query is the question or prompt that the system needs to address.
- Embedding Model: The query is processed by an embedding model that converts the text into a numerical form known as an embedding. This transformation makes the query comparable with other stored data in terms of similarity.
- Vector DB: The query's embedding is used to search a vector database (Vector DB). This database contains embeddings of various data points. The goal is to find the most relevant data points (or contexts) that are similar to the query's embedding.
- Retrieved Contexts: The most relevant data points found in the Vector DB are retrieved as contexts. These contexts contain information that is presumed to be helpful in generating an accurate and informed response to the query.
- LLM (Large Language Model): The retrieved contexts, along with the original query, are fed into a large language model (LLM). The LLM uses this combined information to generate a coherent and contextually appropriate response.
- Response: Finally, the LLM outputs the response. This response is designed to answer or address the user's initial query, making use of both the input data and the information retrieved from the Vector DB.
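To make this flow concrete, here is a minimal illustrative sketch in Python. The embedding_model, vector_db, and llm objects are hypothetical placeholders for the components described above; concrete implementations using FastEmbed, Qdrant, and GPT follow later in this article.
def rag_answer(query, embedding_model, vector_db, llm, k=5):
    """Illustrative sketch of the steps above (placeholder components)."""
    query_embedding = embedding_model.embed(query)          # step 2: embed the query
    contexts = vector_db.search(query_embedding, top_k=k)   # steps 3-4: retrieve similar contexts
    prompt = f"Context:\n{contexts}\n\nQuestion: {query}"   # step 5: combine contexts and query
    return llm.generate(prompt)                             # step 6: generate the response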
Why Use Reranking?
Reranking plays a crucial role in the Retrieval-Augmented Generation (RAG) process, significantly enhancing the quality of the initial search results from the vector database. It reassesses and reorders these results according to their relevance to the input query, reducing errors such as irrelevant outputs and ensuring that the most accurate information feeds into the generative model. This is typically done with a cross-encoder model, which evaluates each retrieved document alongside the query, offering a more detailed relevance assessment than the initial embedding-based retrieval.
Reference
Note: In this context, FlashRank is the name of a library used for reranking, which should be implemented following the retrieval step.
Challenges in Using RAG
Choosing the Number of Data Points (Chunks) to Retrieve
Determining the optimal number of data points to retrieve is crucial for maximizing the efficiency and effectiveness of a Retrieval-Augmented Generation (RAG) system. An ideal balance minimizes computational demands and costs by reducing the number of generative calls required for models like GPT to produce responses. Moreover, focusing on a concise set of relevant data points speeds up processing and enhances the coherence of the generated content. This precision in selection helps prevent the system from generating hallucinated or irrelevant content. To achieve this, we integrate advanced reranking strategies that fine-tune the selection process, ensuring prioritization of the most likely informative chunks.
Managing Hallucinations
One significant challenge in generative models like RAG is dealing with hallucinations, where the model produces plausible but factually incorrect information. Implementing effective reranking helps mitigate this issue by prioritizing chunks of information that are not only relevant but also verified. Our strategy includes enhancing our reranking processes to more effectively sift and validate the content, thereby reducing the incidence of inaccurate outputs.
Enhancing Computational Efficiency
Achieving high computational efficiency without relying on GPUs is a critical challenge. We address this by utilizing Qdrant and FastEmbed, which are selected for their processing speed and scalability. Qdrant is particularly beneficial for managing large data volumes efficiently, even in constrained resource environments. Additionally, our reranking process, powered by a library called FlashRank, operates entirely on CPUs. This integration ensures that our system maintains swift and efficient performance across various scenarios, effectively managing both retrieval and reranking processes without the need for GPU resources.
Motivation for the Project
The development of a high-performing RAG model addresses several critical challenges: optimizing chunk retrieval, managing hallucinations effectively, balancing retrieval with generation, enhancing computational efficiency, and refining relevance scoring mechanisms. By using tools like Qdrant and FastEmbed, and integrating CPU-based reranking through FlashRank, this project aims to create efficient and reliable RAG systems that significantly advance current information retrieval practices.
Having outlined the key motivations and challenges associated with the RAG system, the next section will dig into the practical implementation aspects. We will start by establishing the foundational technology of the RAG system—the vector database. This setup is essential for efficient data storage and rapid retrieval, both of which are critical for the successful application of reranking mechanisms. Detailed explanations and step-by-step guides will illustrate how to create and integrate a vector database using Qdrant, preparing us to apply these techniques in a real-world scenario.
Implementation
Before we discuss the specifics, it's important to note that the code examples provided in this article are simplified to maintain clarity and focus on key concepts. The complete code, featuring a full application with a user interface via Streamlit, is available through external links. To keep the article concise, we present streamlined code snippets here. For those interested in a more comprehensive exploration, additional resources and a recording of the application in action can be accessed via the provided links.
Github repo:
Creation of a Vector Database
The implementation phase begins with the crucial step of creating a vector database. This database is essential for the efficient storage and retrieval of data, which supports the advanced reranking capabilities of our RAG model. In this section, we outline the process of setting up a vector database using Qdrant. We will show how it integrates with the RAG model to enhance data retrieval speed and efficiency, providing practical insights and detailed configuration tips for building this crucial component of the RAG system.
Qdrant
Qdrant is a vector database designed for scalable and efficient vector search. It provides a robust infrastructure for storing and querying high-dimensional vector embeddings.
FastEmbed
FastEmbed, developed by Qdrant, is a library used for creating vector embeddings. It delivers performance comparable to more resource-intensive models but operates efficiently without the need for a GPU. By optimizing the creation and use of vector embeddings, FastEmbed enhances the retrieval accuracy and speed, making it an ideal choice for RAG systems aiming to perform well under limited computational resources.
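As a quick illustration, here is a minimal sketch of generating embeddings with FastEmbed. It assumes a recent fastembed release where TextEmbedding is the main entry point; the model shown is the library's default, CPU-friendly model.
from fastembed import TextEmbedding

# Load a small, CPU-friendly embedding model and embed two example sentences
embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = list(embedding_model.embed(["Qdrant is a vector database.",
                                         "FastEmbed runs efficiently on CPU."]))
print(len(embeddings), len(embeddings[0]))  # 2 vectors, 384 dimensions for this model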
Installation of Qdrant
You can install Qdrant using Docker with the following commands:
docker pull qdrant/qdrant
docker run -p 6333:6333 -v $(pwd)/path/to/data:/qdrant/storage qdrant/qdrant
This setup starts a Qdrant instance accessible at localhost:6333.
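Once the container is running, a quick sanity check from Python (a small sketch using the qdrant-client package) confirms that the instance is reachable:
from qdrant_client import QdrantClient

# Connect to the local Qdrant instance started above and list its collections
client = QdrantClient(host="localhost", port=6333)
print(client.get_collections())  # a fresh instance returns an empty list of collections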
Utilizing LangChain Wrapper for Qdrant
In this article, I will use the LangChain wrapper for Qdrant, along with other utility functions. It is also possible to achieve the same results using LlamaIndex or without any additional libraries, but these tools can simplify the job for a RAG application.
LangChain provides a streamlined interface for integrating with various vector databases, including Qdrant, making it easier to manage and manipulate vector data.
LangChain GitHub Repository
LangChain Website
Retrieving Relevant Documents Using RAG
Example Data
We will implement a RAG system to retrieve data using an example meeting transcription generated with GPT-4 via WebUI (https://chatgpt.com/). The original text is 20,000 characters long and written in English. It pertains to a fictitious data science meeting where the company's data science team discusses the current project statuses and advancements.
Text Extract:
Moderator: Good morning, everyone. Thank you for joining today's meeting. We have a packed agenda, so let's dive right into it. First up, we'll discuss the various projects currently underway. John, could you give us an update on Project Alpha?
John: Sure, I'd be happy to. Project Alpha is progressing well. We've completed the initial phase, and we're now moving into the development stage. The team has been working hard to meet the deadlines, and we're confident we can deliver the project on time.
Moderator: That's great to hear. Can you provide more details on the milestones you've achieved so far?
John: Absolutely. In the initial phase, we focused on gathering requirements and understanding the client's needs. We conducted several workshops and interviews to ensure we had a comprehensive understanding of the project scope. We've also finalized the project plan and timeline, and the client has signed off on it.
Moderator: Excellent. Can you walk us through some of the challenges you've faced during this initial phase?
John: One of the main challenges was aligning the client's expectations with our capabilities.
...
Link to full text:
ML Meeting Transcription
Additionally, when creating this text, I asked ChatGPT to generate a set of questions that I would use to test the RAG system:
- What phase is Project Alpha currently in?
- How does the retrieval component of the RAG system work?
- What technique does Optuna use to optimize hyperparameters?
- What makes BERT different from other transformer models?
- What new feature has David's team created to improve the LightGBM classifier?
- What are the quantitative metrics used to evaluate the RAG system's performance?
- How has transfer learning been beneficial for fine-tuning the BERT model?
- What challenge is associated with training the BERT model?
- What approach has been used to improve the LightGBM classifier's performance?
- What is the primary use of the BERT model in Sarah's projects?
We will use these questions to test whether the RAG system can accurately identify and select text chunks from the provided document that contain the answers.
Create The Collection (Vector Database) using Qdrant
First, we need to create a collection of vectors from the meeting transcription. I saved the transcription to a txt file and passed it to the script shown below to create a collection.
Here’s an example of how to create a Qdrant collection and add vectors to it:
import os
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Qdrant

def extract_text_from_file(file_path):
    """Extracts text from a file based on its extension."""
    file_extension = os.path.splitext(file_path)[1].lower()
    text = ""
    if file_extension == '.pdf':
        with open(file_path, "rb") as file:
            pdf_reader = PdfReader(file)
            text = "".join([page.extract_text() or "" for page in pdf_reader.pages])
    elif file_extension in ['.txt', '.md']:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    else:
        raise ValueError("Unsupported file type: only PDF, TXT, and MD are supported.")
    return text

def create_qdrant_collection(file_path):
    """Creates a Qdrant collection from a file."""
    collection_name = os.path.splitext(os.path.basename(file_path))[0]
    document_text = extract_text_from_file(file_path)
    if document_text:
        embedding_model = FastEmbedEmbeddings()
        # Split the raw text into overlapping chunks of ~500 characters
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        texts = text_splitter.split_text(document_text)
        # Embed the chunks and upload them to the local Qdrant instance via the LangChain wrapper
        Qdrant.from_texts(texts, embedding_model, url="http://localhost:6333",
                          collection_name=collection_name)
        print(f"Collection '{collection_name}' created successfully!")

# Example usage with file path
file_path = 'path/to/your/document.pdf'
create_qdrant_collection(file_path)
This will create a collection on Qdrant.
For the full project, I developed a more complete script that can be called from a main function to create a collection from a given file, or run as a standalone script that takes a file path.
When creating a collection, you can either persist the database to disk or keep it in memory (in which case it disappears once the Docker container stops). In my case, I decided to persist the collections locally so I can reuse them later; a short sketch of both options follows.
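Here is a short sketch of both options using the LangChain wrapper (parameter names as exposed by langchain_community at the time of writing):
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

texts = ["chunk one ...", "chunk two ..."]  # illustrative chunks
embeddings = FastEmbedEmbeddings()

# Persistent: stored by the Qdrant server (backed by the volume mounted in Docker)
db = Qdrant.from_texts(texts, embeddings, url="http://localhost:6333",
                       collection_name="ml_meeting")

# Ephemeral: kept in memory only, useful for quick experiments
db_tmp = Qdrant.from_texts(texts, embeddings, location=":memory:",
                           collection_name="scratch")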
For more information, see the Qdrant page on LangChain.
It is possible to check out the created collections locally (via browser) through:
http://localhost:6333/dashboard#/collections
Here you can view details about each collection. By selecting a collection, you can access information about the vectors it contains. For instance, in the ml_meeting collection:
In this example, we observe a "point" (vector), along with associated information:
- metadata (absent in this demo) could include details like the file name or the source page of the information. This is customizable during the collection's creation.
- page_content: This represents the text segment that has been converted into a vector.
- default vector: displays the vector values and the length, i.e., the number of dimensions in the vector, which is determined by the embedding model used.
Now we have a collection, so we are ready to use it for our RAG application.
Building a User Interface with Streamlit
We can create a simple user interface (UI) using Streamlit to allow users to input their queries and receive answers from the RAG system using the previously created collection.
import streamlit as st
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

def main():
    st.title("Simple Query Interface with Qdrant")
    st.markdown("Enter your query to retrieve relevant documents from the Qdrant collection.")
    input_query = st.text_area("Query:", height=150)
    if st.button("Submit"):
        if input_query:
            results = query_qdrant(input_query)
            st.write(results)
        else:
            st.warning("Please enter a query.")

def query_qdrant(query):
    client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)
    db = Qdrant(client=client, embeddings=FastEmbedEmbeddings(), collection_name="ml_meeting")
    retrieved_entries = db.similarity_search_with_score(query=query, k=25)
    retrieved_results = [{"id": doc.metadata['_id'], "text": doc.page_content, "cosine_similarity": score}
                         for doc, score in retrieved_entries]
    return retrieved_results

if __name__ == "__main__":
    main()
By running this code (streamlit run script_name.py in the terminal), you will activate a simple UI where users can enter queries, as shown in the screenshot below:
The query_qdrant function in the script processes the user's query by embedding it. Specifically, in this article, I utilize the collection named "ml_meeting", as introduced previously. The collection to load from Qdrant can be specified through the collection_name input parameter when defining the database:
db = Qdrant(client=client, embeddings=FastEmbedEmbeddings(), collection_name="ml_meeting")
This setup then connects to the Qdrant client to retrieve the top k chunks based on cosine similarity:
retrieved_entries = db.similarity_search_with_score(query=input_query, k=25)
I have chosen to retrieve k=25 chunks because this number generally offers a balanced dataset that's substantial enough for detailed post-analysis while effectively managing the information load. It's important to note that the optimal number of chunks can vary depending on the specific use of the document, the length of the chunks, and other factors. This flexibility allows for adjustments to optimize performance across various scenarios.
To test our RAG system, we will use one of the questions generated above:
What is the primary use of the BERT model in Sarah's projects?
This query will be input into the RAG system to retrieve the most relevant document chunks.
After typing the question into the text area and clicking Submit, the script will output the most similar chunks in terms of cosine similarity (via Qdrant).
On Github, I implemented a more comprehensive user interface that allows users to upload their own documents. You can find the relevant file on GitHub at the following link:
Retrieved Chunks Analysis
For improved visualization and analysis of the retrieved chunks, sorting them by their cosine similarity to the input query and plotting these values provides a clearer view of the similarity distribution. This approach makes it easy to understand how closely each chunk relates to the input.
Example Query and Results
Consider the input query:
What is the primary use of the BERT model in Sarah's projects?
The top 5 retrieved chunks ranked by cosine similarity are:
Content chunk 1: Moderator: That's very thorough.Sarah, could you provide an update on the BERT transformer model?Sarah: Sure. The BERT transformer model has been a game-changer for natural language processing tasks. We've been using it for various applications, including text classification, sentiment analysis, and question-answering systems. The model's ability to understand context and generate human-like responses has been invaluable.
Content chunk 2: Moderator: That's very interesting. How have you been applying BERT in your projects?Sarah: We've been using BERT for a variety of tasks. In text classification, it's been highly effective at categorizing documents based on their content. For sentiment analysis, it accurately identifies the sentiment of a piece of text, whether it's positive, negative, or neutral. In our question-answering systems, BERT has been able to provide accurate and relevant answers based on the input query.
Content chunk 3: David: Certainly. One example is the use of interaction features, where we combine multiple variables to create new features that capture the interactions between them. For instance, we've created a feature that combines the customer's tenure with their recent activity level to better predict their likelihood of churn. This has significantly improved the model's predictive power.Moderator: Very innovative. Sarah, do you have any final thoughts on the BERT model?
Content chunk 4: Sarah: Actually, before we wrap up, I wanted to bring up a potential collaboration between our BERT project and Emily's RAG system. I believe there are synergies we can leverage to improve both projects. For instance, we could use BERT's capabilities to enhance the generation component of the RAG system, leading to even more accurate and contextually relevant responses.Moderator: That sounds like a great idea, Sarah. Emily, what do you think about this potential collaboration?
Content chunk 5: Moderator: Impressive. What challenges have you faced while working with BERT? Sarah: One of the main challenges is the computational resources required to train and fine-tune the model. BERT is a large model with millions of parameters, so it requires powerful hardware and a significant amount of training time. We've also had to carefully manage the trade-off between model complexity and performance to ensure that the model is both accurate and efficient.
By sorting the top 25 retrieved chunks according to their cosine similarity with the user input and displaying this in a graph, we can observe the following:
Reference code on Github to create the visualization:
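As a minimal sketch of such a visualization, assuming the retrieved_results list returned by the query_qdrant function defined earlier:
import matplotlib.pyplot as plt

def plot_similarities(retrieved_results):
    # Sort the cosine similarities in descending order and plot them by rank
    scores = sorted((r["cosine_similarity"] for r in retrieved_results), reverse=True)
    plt.plot(range(1, len(scores) + 1), scores, marker="o")
    plt.xlabel("Chunk rank")
    plt.ylabel("Cosine similarity")
    plt.title("Cosine similarity of retrieved chunks vs. input query")
    plt.show()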
The graph of cosine similarities shows that the initial five to six segments each maintain a similarity above 75% with the input query. After this, there is a significant drop of over 10% beginning from the eighth segment onwards. This pattern indicates high relevance of the early segments, which gradually lessens. However, determining the optimal segments to retrieve based solely on their cosine similarity can be challenging, as there is no clear threshold that marks the shift from relevant to irrelevant.
In a retrieval-augmented generation (RAG) system, the usual strategy involves selecting the top k chunks (for example, the top 5) under the assumption that they hold the most pertinent information relative to the query. Although this method is simple and direct, it may not always produce the best outcomes.
When providing a different question:
What new feature has David's team created to improve the LightGBM classifier?
We get the following plot:
The plot shows two noticeable drops in similarity: one at the 2nd chunk and one at the 4th chunk. Although the drop is small (around 5%), it highlights the difficulty in defining a proper threshold for stopping document retrieval. Depending on the input query and the document in use, it could be challenging to establish an effective threshold. Alternatively, this issue could be addressed by deciding on a default number of chunks to retrieve (e.g., top 5), as defined above.
Addressing Questions Outside Provided Content
Challenges also arise when the query's answer is not directly available in the content. For instance:
What is self attention?
The top 5 chunks retrieved are as follows:
Content chunk 1: between them. This has helped us uncover patterns that are not immediately apparent when looking at individual features in isolation.
Content chunk 2: Moderator: That's very thorough. Emily, do you have any final thoughts on the RAG system?Emily: Yes, I'd like to mention that we're also exploring the use of reinforcement learning to further improve the retrieval component. By continuously learning from user interactions and feedback, we hope to make the retrieval process even more accurate and efficient.
Content chunk 3: Moderator: Thank you for that update, Sarah. Before we wrap up, does anyone have any questions or additional updates to share? John: I have a question for Emily. How do you handle the evaluation of the RAG system's performance?
Content chunk 4: Moderator: Very innovative. How do you handle feature engineering and selection?
Content chunk 5: Moderator: That's fantastic. Can you dive deeper into the specific features you've been using in your churn prediction model?
The retrieved chunks, despite being top-ranked by cosine similarity, do not directly address the query.
Moreover, looking at the cosine similarities between the user query and all the retrieved chunks, there is no clear pattern or signal indicating that the retrieved chunks are not relevant to the query at all.
To address these problems, we will use a reranker, specifically a cross-encoder model, which takes the actual sentences as input rather than their embeddings. This reduces information loss and improves the accuracy of the retrieval process by comparing the sentences directly. By reranking the retrieved chunks with a cross-encoder model, we can better ensure that the most relevant chunks are selected, enhancing the overall performance and reliability of our RAG system.
Implementing a Reranker
To enhance the performance of our retrieval system, we integrate a reranker, specifically using the FlashRank library. FlashRank evaluates and reorders retrieved vectors based on their relevance, ensuring that the most pertinent content is used in the generative model phase. This approach addresses the challenge of selecting the most relevant chunks, thereby improving the overall effectiveness and reliability of our Retrieval-Augmented Generation (RAG) system.
Why Use a Reranker?
In traditional Retrieval-Augmented Generation (RAG) systems, chunks are selected based on their cosine similarity to the input query. However, this method can lead to information loss, as embedding vectors may not fully capture the semantic nuances of the text. A reranker, especially one utilizing a cross-encoder model, addresses this limitation by directly comparing the text of the sentences, bypassing the intermediate vector representation. This direct approach allows for a more precise and contextually aware selection of relevant text chunks.
What is a Cross-Encoder?
A cross-encoder model inputs pairs of sentences and evaluates their relevance directly, rather than through pre-computed embeddings. This approach significantly reduces information loss and improves the accuracy of the retrieval process. By using a cross-encoder for reranking, we can better ensure that the most relevant chunks are selected, enhancing the performance and reliability of our RAG system.
Below is a visual comparison of a Bi-Encoder (a typical embedding model) and a Cross-Encoder:
As illustrated, the Cross-Encoder processes the sentences directly without converting them into pooled vectors, providing a classification score based on a more holistic text analysis.
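To make the idea concrete, here is a small illustrative sketch (separate from the FlashRank code used later in this article) that scores query-passage pairs directly with a cross-encoder from the sentence-transformers library; the model name and passages are placeholders chosen for the example:
from sentence_transformers import CrossEncoder

# Score (query, passage) pairs directly with a cross-encoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What is the primary use of the BERT model in Sarah's projects?"
passages = [
    "Sarah: We've been using BERT for text classification, sentiment analysis, and question answering.",
    "John: Project Alpha is progressing well and moving into the development stage.",
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)  # higher score = more relevant pair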
More Information about Cross-Encoders
FlashRank Library
FlashRank is a super-fast, ultra-lightweight Python library designed for re-ranking search and retrieval pipelines. It leverages state-of-the-art (SoTA) cross-encoders and other models to provide efficient and effective re-ranking solutions. FlashRank supports pairwise and listwise reranking based on large language models (LLMs) and cross-encoders, and it runs efficiently on CPU without requiring heavy dependencies like Torch or Transformers.
Key Features of FlashRank Include:
- Ultra-lightweight: The smallest model is approximately 4MB.
- Super-fast: Reranking speed is influenced by the number of tokens and model depth.
- Cost-effective: Designed to minimize costs in serverless environments by reducing memory usage and cold start times.
- Supports various models: includes ms-marco-TinyBERT-L-2-v2 and ms-marco-MiniLM-L-12-v2, providing options for different performance and size requirements.
Implementing Reranking with FlashRank
Below is a script that demonstrates how to implement reranking using FlashRank. It defines a reranking function that accepts a query and a list of passages, along with a model choice for reranking.
from flashrank.Ranker import Ranker, RerankRequest

def reranking(query, passages, choice):
    # Select the cross-encoder model to use for reranking
    if choice == "ms-marco-TinyBERT-L-2-v2":
        ranker = Ranker(model_name="ms-marco-TinyBERT-L-2-v2")
    elif choice == "ms-marco-MiniLM-L-12-v2":
        ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")
    else:
        print("Did not select a valid model")
        return None
    rerank_request = RerankRequest(query=query, passages=passages)
    reranked_passages = ranker.rerank(rerank_request)
    return reranked_passages
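As a usage sketch (with shortened placeholder passages), FlashRank's examples pass each passage as a dictionary with "id" and "text" keys (an optional "meta" dictionary is also supported), and each reranked passage should come back with a relevance "score":
# Hypothetical example passages, shortened for illustration
query = "What technique does Optuna use to optimize hyperparameters?"
passages = [
    {"id": 1, "text": "David: Optuna uses Bayesian optimization to find the best hyperparameters..."},
    {"id": 2, "text": "John: Project Alpha is progressing well and moving into development..."},
]
reranked = reranking(query, passages, choice="ms-marco-MiniLM-L-12-v2")
for item in reranked:
    print(round(item["score"], 3), item["text"][:60])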
You can reference my full code in the GitHub repo:
After applying reranking to the retrieved chunks, significant changes are evident in the order of the top chunk IDs.
The image below illustrates the shift in the order of IDs among the top 10 chunks:
To gain a clearer understanding of how reranking affects the relevance scores, we will conduct a more detailed analysis in the next section, where we evaluate the RAG system's performance quantitatively.
Evaluating RAG System Performance: Quantitative Metrics and Reranking
Retrieval-Augmented Generation (RAG) systems are powerful tools for generating responses based on retrieved data. To optimize their performance, it's crucial to evaluate and rank the retrieved chunks effectively. This section discusses the quantitative metrics used to assess RAG system performance and demonstrates how reranking retrieved chunks can significantly improve relevance. We also explore setting an appropriate threshold for filtering chunks, ensuring that only the most relevant information is passed to the GPT model.
Input Query Example:
What new feature has David's team created to improve the LightGBM classifier?
When using a reranker, it is recommended to retrieve a substantial number of chunks (e.g., 25) and then select those with the highest reranking scores. This process helps efficiently narrow down the chunks to the most pertinent ones.
Cosine Similarity Plot
The following plot illustrates the cosine similarity between the input query and the top 25 retrieved chunks. Higher cosine similarity values indicate greater relevance to the query. The plot shows that the top-ranked chunks have significantly higher similarity scores, which gradually decrease as we move to lower-ranked chunks.
We see that the first 3 chunks have a cosine similarity with the input query of over 75%, with a significant drop (around 10%) at the 4th chunk. This indicates that the top chunks are highly relevant to the query, but relevance decreases noticeably beyond this point.
Addressing Information Loss with Cross-Encoder Models
Using embedding vectors for comparison often results in information loss during the embedding process. In traditional RAG systems, the common approach is to select the top N chunks (e.g., top 5) based on cosine similarity. However, this method can be overly simplistic and may not always provide the best results.
Enhanced Retrieval Accuracy with Cross-Encoder Models
To combat information loss and enhance retrieval accuracy, we utilize a cross-encoder model equipped with FlashRank. Unlike bi-encoders that generate embeddings, cross-encoders input actual sentences directly, allowing for more precise comparisons. This approach significantly reduces information loss and improves the evaluation of sentence relevance during the retrieval process.
Visualization of Cosine Similarities Before and After Reranking
Plot 1: Cosine Similarities Before and After Reranking
This plot demonstrates the change in the order of chunk relevance after reranking with a cross-encoder. Initially ranked chunks based solely on cosine similarity are reevaluated, resulting in a potential shift in their relevance order due to the cross-encoder's more accurate similarity scores.
Plot 2: Similarity Scores by Reranker
The green line in this plot clearly illustrates the drop-off in similarity scores assigned by the cross-encoder, aiding in determining an effective threshold for selecting the most relevant chunks.
Analysis and Conclusion
In summary, our examination of cosine similarities and reranking with cross-encoders in our RAG system demonstrates that the initial top 5-6 chunks show high relevance, with over 75% cosine similarity to the input query. However, relevance decreases noticeably after the 7th chunk.
While cosine similarity offers a starting point for chunk selection, its ambiguity in indicating content relevance limits its utility. Conversely, cross-encoders provide a sharper distinction between relevant and irrelevant chunks by directly analyzing the sentences, aiding in the precise setting of thresholds for chunk selection.
In the following chapter, we will explore the implementation of a dynamic threshold algorithm designed to refine our selection process. By introducing a flexible threshold, we aim to filter out less pertinent chunks, ensuring that only the most relevant chunks—those with a higher likelihood of containing answers to the input query—are retained. This strategic adjustment enhances the efficiency and accuracy of our RAG system.
Optimizing Chunk Selection
To enhance the chunk selection process, we have developed a comprehensive strategy featuring clearly defined thresholds and conditions, refined through rigorous testing. This systematic approach aims to ensure optimal relevance and efficiency in chunk selection, accommodating the inherent variability in data relevance without rigid guidelines.
Dynamic Threshold Parameters for Chunk Selection
- High Score Threshold: Chunks scoring above 0.8 are immediately prioritized, ensuring that highly relevant chunks are included.
- Soft Score Threshold: Chunks with scores ranging from 0.4 to 0.8 are considered under specific conditions, providing the flexibility to include chunks of moderate relevance.
- Low Score Threshold: Chunks scoring below 0.2 are automatically excluded to eliminate those of low relevance from the selection.
- Drop Threshold: A maximum score drop of 0.4 between consecutive chunks is enforced to avoid selecting chunks with significant declines in relevance.
- Minimum Chunks: At least five chunks that meet the established criteria are selected to ensure robustness and data quality. Any chunks failing to meet these conditions are excluded, maintaining the integrity of the selection process.
Code Implementation of Dynamic Threshold Algorithm
The dynamic threshold algorithm is carefully designed to select the top N chunks based on the specified logic, ensuring that only the most relevant chunks are used for further processing.
def filter_chunks_reranked(reranked_results,
                           high_score_threshold=0.8,
                           soft_score_threshold=0.4,
                           low_score_threshold=0.2,
                           drop_threshold=0.4,
                           min_chunks=5):
    # Reorder reranked results by score in descending order
    reranked_results = sorted(reranked_results, key=lambda x: x["score"], reverse=True)
    reranked_scores = [result["score"] for result in reranked_results]

    selected_indices = []
    prev_score = None
    for i, score in enumerate(reranked_scores):
        if score >= high_score_threshold:
            selected_indices.append(i)
        elif score >= soft_score_threshold:
            if prev_score is not None and (prev_score - score) > drop_threshold:
                break  # Stop if drop threshold exceeded
            selected_indices.append(i)
        elif score < low_score_threshold:
            break  # Stop adding if score is below low score threshold
        prev_score = score

    # Ensure at least min_chunks are selected, provided they meet the low score threshold
    if len(selected_indices) < min_chunks:
        additional_indices = [i for i in range(len(reranked_results))
                              if i not in selected_indices and reranked_results[i]["score"] >= low_score_threshold]
        selected_indices += additional_indices[:max(0, min_chunks - len(selected_indices))]

    final_selection = [reranked_results[i] for i in selected_indices]
    return final_selection
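As a usage sketch, the pieces defined so far can be chained together, reusing the query_qdrant, reranking, and filter_chunks_reranked functions from the previous sections:
# Retrieve, rerank, and filter chunks for a sample query
query = "What technique does Optuna use to optimize hyperparameters?"
retrieved = query_qdrant(query)  # top-25 chunks with id / text / cosine_similarity
passages = [{"id": r["id"], "text": r["text"]} for r in retrieved]
reranked = reranking(query, passages, choice="ms-marco-MiniLM-L-12-v2")
selected = filter_chunks_reranked(reranked)
print(f"Kept {len(selected)} of {len(reranked)} chunks after thresholding")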
Explanation
- Reorder Chunks: The reranked results are sorted in descending order based on their scores.
- Evaluate Each Chunk:
  - High Relevance: Chunks with scores above the high score threshold (0.8) are directly included.
  - Moderate Relevance: Chunks with scores between the soft score threshold (0.4) and the high score threshold are considered, provided the drop from the previous score does not exceed the drop threshold (0.4).
- Stop Adding Chunks: The addition of chunks stops when scores fall below the low score threshold (0.2), ensuring no low-relevance chunks are considered.
- Ensure Minimum Chunks: If the number of selected chunks is less than the minimum required (5), additional chunks that meet the low score threshold are added to ensure robustness.
- Final Selection: Chunks that satisfy the threshold conditions are compiled into the final selection, ensuring the most relevant data is processed.
By employing this chunk selection strategy, we not only reduce computational costs but also ensure that the most pertinent information is fed to the GPT model, thereby improving the quality of the generated responses. Fine-tuning the threshold values for specific datasets can further optimize performance and adaptiveness, making the system more effective across various scenarios.
Visualizing the Threshold with Plots
The following visualization leverages the threshold logic from our dynamic threshold algorithm to illustrate how similarity scores are used to identify the most relevant chunks.
Similarity Score by Reranker with Threshold Indicator
User input query:
What technique does Optuna use to optimize hyperparameters?
In the plot, we've marked the threshold with a red dashed line, indicating the final chunk selected for retrieval. In this example, there is a noticeable decline in similarity scores after the third chunk, which aids in establishing a precise threshold. This distinct separation allows for effectively distinguishing the most relevant chunks from those that do not meet the criterion.
Initial Retrieval Analysis:
Before reranking, examining the content of the first six retrieved chunks reveals the following:
Content chunk 1: Moderator: Can you explain how Optuna works in conjunction with LightGBM? David: Sure. Optuna uses a technique called Bayesian optimization to find the best hyperparameters for the LightGBM model. It creates a search space of possible hyperparameters and evaluates the model's performance on a validation set. Based on these evaluations, it iteratively refines the search space to find the optimal hyperparameters. This process significantly improves the model's accuracy and reduces overfitting.
Content chunk 2: Moderator: That's very insightful. How have the results been so far? David: The results have been excellent. We've seen a significant reduction in churn rates, and the model's accuracy has improved by over 10% compared to our previous approaches. The use of Optuna has also reduced the time required for hyperparameter tuning, allowing us to deploy the model more quickly.
Content chunk 3: Moderator: Excellent work, Emily. Let's move on to the next topic. David, can you give us an overview of the LightGBM classifier with Optuna for churn prediction? David: Certainly. The LightGBM classifier is a powerful tool for predictive modeling. We've been using it to predict customer churn, and the results have been very promising. Optuna is an optimization framework that helps us fine-tune the hyperparameters of the LightGBM model, resulting in improved performance.
Content chunk 4: Moderator: That sounds quite complex. Have you conducted any user tests to evaluate the system's performance?
Content chunk 5: Moderator: That sounds promising, John. Can you give us a brief overview of how this new module will work? John: The new module will leverage stream processing technologies to analyze data as it comes in. We'll be using Apache Kafka for data ingestion and Apache Flink for real-time processing. This setup will allow us to process large volumes of data with low latency and provide near-instantaneous insights.
Content chunk 6: Moderator: That's fantastic. Can you dive deeper into the specific features you've been using in your churn prediction model?
The first three chunks contain detailed information about Optuna, demonstrating its relevance to the query. Subsequent chunks lack related content, highlighting the efficiency of selecting only the top three for focused retrieval.
Post-Reranking Validation
On the other hand, we can examine the content of the chunks after reranking to verify that the chunks selected after thresholding (the first three in terms of similarity score) relate to the input query, while the subsequent ones lack relevant information:
Content chunk 1: Moderator: Can you explain how Optuna works in conjunction with LightGBM? David: Sure. Optuna uses a technique called Bayesian optimization to find the best hyperparameters for the LightGBM model. It creates a search space of possible hyperparameters and evaluates the model's performance on a validation set. Based on these evaluations, it iteratively refines the search space to find the optimal hyperparameters. This process significantly improves the model's accuracy and reduces overfitting.
Content chunk 2: Moderator: That's very insightful. How have the results been so far? David: The results have been excellent. We've seen a significant reduction in churn rates, and the model's accuracy has improved by over 10% compared to our previous approaches. The use of Optuna has also reduced the time required for hyperparameter tuning, allowing us to deploy the model more quickly.
Content chunk 3: Moderator: Excellent work, Emily. Let's move on to the next topic. David, can you give us an overview of the LightGBM classifier with Optuna for churn prediction? David: Certainly. The LightGBM classifier is a powerful tool for predictive modeling. We've been using it to predict customer churn, and the results have been very promising. Optuna is an optimization framework that helps us fine-tune the hyperparameters of the LightGBM model, resulting in improved performance.
Content chunk 4: Emily: One of the main challenges was ensuring the latency of the system remained low. Since the retrieval component needs to quickly fetch documents and pass them to the generation component, any delays could affect the user experience. We had to optimize our database queries and streamline the data flow to minimize latency. Another challenge was maintaining the relevance of the retrieved documents. We implemented several techniques to fine-tune the retrieval process, including tweaking the
Content chunk 5: Sarah: Yes, I'd like to mention that we're exploring the use of transfer learning to fine-tune BERT for specific tasks. By leveraging pre-trained models and fine-tuning them on our specific datasets, we've been able to achieve state-of-the-art performance with less training time and computational resources.
Content chunk 6: Emily: Great question, John. We use a combination of quantitative and qualitative metrics to evaluate the RAG system's performance. Quantitatively, we measure the accuracy and relevance of the generated responses using metrics like BLEU and ROUGE scores. Qualitatively, we conduct user studies to gather feedback on the system's performance and identify areas for improvement.
From the content analysis, it's evident that only the first three chunks directly discuss Optuna and detail its functionality and impact. These chunks were identified both through initial cosine similarity measures and subsequent similarity scoring during reranking, proving their high relevance to the query. It's important to note that while high cosine similarity often suggests relevance, it doesn’t always guarantee high reranking scores.
Thanks to the established threshold, we successfully isolated these three most pertinent chunks. Employing this targeted approach allows us to feed precisely relevant information to a GPT generative model, enabling it to produce an informed and accurate response based on the curated input.
Dynamic Threshold Adjustments for Various Queries
To further illustrate the adaptability of the threshold, here are examples showing how it adjusts for different user queries:
User input query:
What initiative is John planning to maintain a high level of expertise within the team?
TOP 6 Chunks by reranking score:
Content chunk 1: John: One last thing from my side. We're planning to conduct a few workshops and training sessions for the team to ensure everyone is up to speed with the latest tools and technologies we're using. This should help us maintain a high level of expertise and keep everyone aligned with our goals.
Content chunk 2: Moderator: That's a great initiative, John. Keeping the team well-trained and informed is essential for our success. Let's make sure to schedule these sessions soon. If there are no more updates, we'll wrap up the meeting. Thank you all for your contributions and hard work. Let's keep pushing forward and achieve great things together. Meeting adjourned.
Content chunk 3: Moderator: It sounds like you managed those challenges well. How is the team handling the development stage?John: The team is doing a fantastic job. We've set up regular sprint reviews and planning sessions to keep everyone aligned. We've also implemented a continuous integration and deployment pipeline to streamline the development process. This has helped us catch issues early and ensure that we maintain a high level of code quality.
Content chunk 4: Moderator: That's fantastic. It sounds like everyone is making great progress on their projects. Thank you all for the updates. Let's continue to push forward and strive for excellence in our work. Meeting adjourned.John: Before we adjourn, I just wanted to highlight one more thing regarding Project Alpha. We're planning to integrate a new module that will allow for better real-time analytics. This should help us provide more timely insights to our clients.
Content chunk 5: John: Sure, I'd be happy to. Project Alpha is progressing well. We've completed the initial phase, and we're now moving into the development stage. The team has been working hard to meet the deadlines, and we're confident we can deliver the project on time.Moderator: That's great to hear. Can you provide more details on the milestones you've achieved so far?
Content chunk 6: John: Absolutely. In the initial phase, we focused on gathering requirements and understanding the client's needs. We conducted several workshops and interviews to ensure we had a comprehensive understanding of the project scope. We've also finalized the project plan and timeline, and the client has signed off on it.Moderator: Excellent. Can you walk us through some of the challenges you've faced during this initial phase?
Indeed, we can see that the chunks selected via the threshold (the top 3 chunks) contain the answer to our query.
User input query:
What new feature has David's team created to improve the LightGBM classifier?
We can check the TOP 8 retrieved chunks in terms of reranking score.
Content chunk 1: David: I have an update regarding the LightGBM classifier. We've recently incorporated feature engineering techniques that have further improved the model's performance. By creating new features based on domain knowledge, we've been able to capture additional patterns in the data that were previously missed. Moderator: That's excellent news, David. Can you provide an example of the new features you've created?
Content chunk 2: Moderator: Excellent work, Emily. Let's move on to the next topic. David, can you give us an overview of the LightGBM classifier with Optuna for churn prediction?David: Certainly. The LightGBM classifier is a powerful tool for predictive modeling. We've been using it to predict customer churn, and the results have been very promising. Optuna is an optimization framework that helps us fine-tune the hyperparameters of the LightGBM model, resulting in improved performance.
Content chunk 3: David: I just wanted to add that we're also looking into incorporating explainability features into our LightGBM model. This will help us understand and interpret the model's predictions better, making it easier to communicate the results to stakeholders and gain their trust. Moderator: That's a great addition, David. Explainability is crucial for gaining stakeholder buy-in and ensuring the model's predictions are transparent and trustworthy. Thank you for bringing that up.
Content chunk 4: David: Certainly. One example is the use of interaction features, where we combine multiple variables to create new features that capture the interactions between them. For instance, we've created a feature that combines the customer's tenure with their recent activity level to better predict their likelihood of churn. This has significantly improved the model's predictive power. Moderator: Very innovative. Sarah, do you have any final thoughts on the BERT model?
Content chunk 5: Moderator: Can you explain how Optuna works in conjunction with LightGBM?David: Sure. Optuna uses a technique called Bayesian optimization to find the best hyperparameters for the LightGBM model. It creates a search space of possible hyperparameters and evaluates the model's performance on a validation set. Based on these evaluations, it iteratively refines the search space to find the optimal hyperparameters. This process significantly improves the model's accuracy and reduces overfitting.
Content chunk 6: David: Feature engineering is a critical part of our process. We start by brainstorming potential features based on domain knowledge and previous research. We then use techniques like correlation analysis and feature importance scores to select the most relevant features. Additionally, we've been experimenting with automated feature engineering tools that can generate and evaluate a large number of features quickly. This has allowed us to iterate rapidly and continuously improve our model.
Content chunk 7: David: Certainly. We've incorporated a variety of features into our model, including customer demographics, transaction history, and engagement metrics. For example, we've found that features such as the frequency of customer interactions with our service and the recency of their last transaction are strong predictors of churn. We've also been exploring the use of interaction features, where we combine multiple variables to capture the interactions between them. This has helped us uncover
Content chunk 8: Moderator: That's very insightful. How have the results been so far?David: The results have been excellent. We've seen a significant reduction in churn rates, and the model's accuracy has improved by over 10% compared to our previous approaches. The use of Optuna has also reduced the time required for hyperparameter tuning, allowing us to deploy the model more quickly.
We can see that the first 5 chunks, selected by our dynamic threshold, contain content related to the input query, while the subsequent chunks are not relevant to the user input.
These visualizations show how the threshold dynamically adapts based on the input queries, ensuring that only the most relevant chunks are selected. This adaptability significantly enhances the efficiency and accuracy of the RAG system's information retrieval capabilities.
The next step involves developing a function to invoke a GPT model to generate answers based on the text extracted using the specified thresholds.
Generating Answers from Selected Context
To generate accurate answers from the context extracted by our retrieval system, we will develop a function to invoke a GPT model. This function will utilize the context filtered through our dynamic threshold algorithm to produce responses.
Implementation
This script will enable the generation of precise and relevant answers by processing the context that has been rigorously selected based on relevance scores:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

def generate_full_prompt(query, context):
    prompt = f"""
You are an expert assistant. Use only the information from the provided context to answer the question accurately and comprehensively.

Context:
{context}

Question:
{query}

Please provide a detailed and clear answer strictly based on the context provided, without relying on any external knowledge or pre-existing information.
"""
    return prompt

def create_response(query, context):
    prompt = generate_full_prompt(query, context)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )
    answer = response.choices[0].message.content.strip()
    return answer
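As a usage sketch, assuming reranked_results holds the reranked chunks from the previous sections:
# Build the context from the chunks kept by the dynamic threshold and query the model
query = "What phase is Project Alpha currently in?"
selected_chunks = filter_chunks_reranked(reranked_results)
context = "\n\n".join(chunk["text"] for chunk in selected_chunks)
print(create_response(query, context))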
Link to the full code on GH:
Example: Generating Answers
Consider the query:
What phase is Project Alpha currently in?
The following plots show the chunks selected for generating the answer:
This plot indicates that we will take the first 3 chunks reordered by reranking scores.
By applying the reranking scores, the context is narrowed down to the most relevant chunks as demonstrated in the next image:
This image shows that when using the top 25 chunks sorted by cosine similarity (Original Response Context), we input 1,333 words into the GPT model. However, when we apply reranking with a dynamic threshold, the input is significantly reduced to just 175 words (Optimized Response Context after Filtering).
When this selected context is fed into our GPT-based script, the response is as follows:
Answer: Based on the context provided, Project Alpha is currently in the development stage. John mentioned that they have completed the initial phase and are now moving into the development stage of the project. Additionally, he highlighted that the team has been working hard to meet the deadlines, indicating active progress in the development phase of Project Alpha.
This answer accurately reflects the current stage of Project Alpha, validating the effectiveness of our context selection and answer generation approach.
Evaluating Cost Savings with Filtered Context
Employing the filtered context using the reranker and dynamic threshold can substantially reduce the expenses associated with using a GPT model. Here’s a breakdown of the costs:
Model in use: gpt-3.5-turbo-0125
- Input token cost: $0.50 per 1M tokens
- Output token cost: $1.50 per 1M tokens
Traditionally, if we processed the top 25 retrieved chunks, each of size 500 tokens, the computation would be as follows:
selected_context = chunk_length * 25 # Calculates to 12,500 tokens
However, by implementing a reranker with a threshold, we significantly refine our selection. Instead of processing all 25 chunks, we focus on just the top 3 most relevant chunks:
So the selected context becomes
selected_context = chunk_length * 3 # Reduces to 1,500 tokens
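A back-of-the-envelope sketch of the per-query input cost, using the input price listed above:
input_price_per_token = 0.50 / 1_000_000   # $0.50 per 1M input tokens

tokens_without_filtering = 500 * 25        # 12,500 tokens (top 25 chunks)
tokens_with_filtering = 500 * 3            # 1,500 tokens (top 3 chunks after thresholding)

print(tokens_without_filtering * input_price_per_token)  # ~$0.00625 of input cost per query
print(tokens_with_filtering * input_price_per_token)     # ~$0.00075 of input cost per query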
This strategic approach not only boosts efficiency by shortening processing times and reducing computational demands but also drastically lowers costs by minimizing the amount of tokens needed for processing. This demonstrates a significant advantage in optimizing resource use while maintaining high-quality output.
Addressing Unanswerable Questions with Textual Context
Sometimes a query has no answer within the provided text, which presents its own challenge. For example, consider the machine learning meeting transcript we used earlier and a query about a concept not discussed in that meeting:
Input query:
What is self-attention?
Using cosine similarity alone, as shown in the following plot, can be insufficient for determining which chunk, if any, to select.
The highest cosine similarity here is approximately 0.56, and the scores drop only slightly among the subsequent chunks. Such "low" scores generally indicate a poor match, underscoring the difficulty of setting a reliable threshold with cosine similarity alone. The reranker's similarity scores, on the other hand, are extremely low (around 0.0). Thanks to the dynamic threshold we defined, the system correctly chooses not to select any chunks and informs the user that it cannot answer the question.
Effectively Retrieving Answers When Present in the Source Material
As outlined above, when the source document lacks relevant information, our system recognizes the absence of suitable chunks and informs the user that the answer is not available within the provided document.
To test the system’s capability to retrieve answers accurately, we can use a relevant document, such as the famous paper "Attention is All You Need," and pose the same query.
Example using the app_debugging.py available in my GH:
Input query:
What is self-attention?
Retrieved Chunk analysis:
The application found 13 relevant chunks after reranking.
After the filtering, we can see that we select around 800 words instead of 1760.
Answer provided by the model:
This example illustrates the system’s capability to accurately provide answers when the necessary information is included in the input text, showcasing its precision and adaptability.
Conclusion
Our Retrieval-Augmented Generation (RAG) system has seen substantial improvements through the integration of dynamic thresholding and cross-encoder models. These enhancements have significantly reduced the number of chunks retrieved and have dramatically sped up the system—delivering results in under one second. This efficiency is achieved using high-performance vector databases and rapid reranking techniques, all operating on a CPU without incurring additional costs. The primary expense in our system stems from GPT calls, which can be substantially reduced by switching to an open-source model, further lowering operational costs.
This method significantly improves the chances that only the most relevant text chunks are processed, substantially reducing computational demands while enhancing the quality of the generated responses. Furthermore, our system adeptly handles instances where the necessary answer is not present in the source text, ensuring that users consistently receive accurate and relevant information.
Challenges and Future Directions
Despite these advancements, the system faces several challenges, with substantial room for improvement:
- Optimizing Threshold and Chunk Selection: Determining the optimal threshold for chunk selection is crucial and remains a complex task. Currently, thresholds and parameters are set based on empirical data and could benefit from automated tuning processes. Advanced machine learning techniques, such as reinforcement learning or Bayesian optimization, could be employed to dynamically adjust these parameters, ensuring more precise selections based on the query context.
- Automating Collection Selection: Presently, the collection from which answers are retrieved is manually specified. Automating this process by developing a router-like feature within the RAG system could dramatically increase its efficiency. Such a feature would automatically determine the most appropriate collection to search based on the user's query, eliminating the need for manual setup and potentially increasing the accuracy of retrieved information.
- Incorporating Advanced RAG Features: Adding more sophisticated features to the RAG system could address some of its current limitations. For example, contextual awareness or domain-specific models could refine the retrieval process to better handle specialized queries or nuanced topics.
- Scaling and Efficiency: As the demand for the RAG system grows, maintaining efficiency with larger datasets becomes imperative. Exploring more efficient data structures and parallel processing techniques will be essential to manage this increased load without slowing down response times.
Moving forward, these enhancements could lead to a more robust, intelligent, and user-friendly RAG system, pushing the boundaries of what automated retrieval and generation technologies can achieve.