Implementing a High-Speed, Performant RAG with Reranking Using Qdrant and FastEmbed (No GPU Required)
Introduction
In this article, we describe the development of a Retrieval-Augmented Generation (RAG) system that runs entirely on a CPU. The system uses reranking together with Qdrant, a free, open-source vector database, to extract the most relevant text snippets from input documents. These snippets are then passed to a GPT model, specifically OpenAI's GPT-3.5 API, chosen for its speed and cost-efficiency. The goal is to increase the likelihood of selecting the most pertinent information, improving both answer quality and cost-effectiveness.
Video Demo:
Background
What is RAG?
Retrieval-Augmented Generation (RAG) combines retrieval-based and generative models to enhance the quality and relevance of responses in tasks like document retrieval and question answering. The system uses a vector database to store representations of data as vectors, enabling rapid retrieval.
Main Steps in a RAG System:
- Query: The process initiates with a user input, known as a query. This query is the question or prompt that the system needs to address.
- Embedding Model: The query is processed by an embedding model that converts the text into a numerical form known as an embedding. This transformation makes the query comparable with other stored data in terms of similarity.
- Vector DB: The query's embedding is used to search a vector database (Vector DB). This database contains embeddings of various data points. The goal here is to find the most relevant data points (or contexts) that are similar to the query's embedding.
- Retrieved Contexts: The most relevant data points found in the Vector DB are retrieved as contexts. These contexts contain information that is presumed to be helpful in generating an accurate and informed response to the query.
- LLM (Large Language Model): The retrieved contexts, along with the original query, are fed into a large language model (LLM). The LLM uses this combined information to generate a coherent and contextually appropriate response.
- Response: Finally, the LLM outputs the response. This response is designed to answer or address the user's initial query, making use of both the input data and the information retrieved from the Vector DB.
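To make these steps concrete, here is a minimal sketch of the same flow in Python. It uses FastEmbed for the embedding step, stands in a plain cosine-similarity search for the vector database, and leaves the LLM call as a comment; the toy documents and the model choice are illustrative assumptions, not part of the system built later in this article.

# Minimal end-to-end sketch of the RAG flow described above.
# Assumes `fastembed` and `numpy` are installed; the LLM call is left as a comment.
import numpy as np
from fastembed import TextEmbedding

# 1. Query
query = "What phase is Project Alpha currently in?"

# 2. Embedding model: convert texts into vectors
embedder = TextEmbedding()  # defaults to a small CPU-friendly model
documents = [
    "Project Alpha has completed the initial phase and is moving into development.",
    "The LightGBM classifier was improved with new interaction features.",
]
doc_vectors = np.array(list(embedder.embed(documents)))
query_vector = np.array(list(embedder.embed([query])))[0]

# 3. + 4. Vector "DB" search: cosine similarity between the query and each document
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
retrieved_contexts = [documents[i] for i in np.argsort(scores)[::-1][:1]]

# 5. + 6. LLM: build a prompt from the query plus retrieved contexts and send it to a model
prompt = f"Answer the question using the context.\nContext: {retrieved_contexts}\nQuestion: {query}"
# response = llm_client.chat.completions.create(...)  # call your LLM of choice here
print(prompt)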
Why Use Reranking?
Reranking plays a crucial role in the Retrieval-Augmented Generation (RAG) process, significantly enhancing the quality of the initial search results from the vector database. It reassesses and reorders these results according to their relevance to the input query, filtering out irrelevant passages and ensuring that the most accurate information feeds into the generative model. This is typically done with a cross-encoder model, which evaluates each retrieved document alongside the query, offering a more fine-grained relevance assessment than the initial embedding-based search.
Reference
Note: In this context, FlashRank is the name of a library used for reranking, which should be implemented following the retrieval step.
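To illustrate what a cross-encoder reranker does before we get to FlashRank in Part 2, here is a small sketch using the sentence-transformers CrossEncoder. It is used here purely as an example of the cross-encoder idea; the model name and toy chunks are assumptions, not the setup used in this article.

# Illustrative cross-encoder reranking (sentence-transformers is used here only
# as an example; this article itself uses FlashRank, covered in Part 2).
from sentence_transformers import CrossEncoder

query = "What phase is Project Alpha currently in?"
retrieved_chunks = [
    "Project Alpha has completed the initial phase and is moving into development.",
    "The BERT model is used for text classification and sentiment analysis.",
]

# The cross-encoder scores each (query, chunk) pair jointly, unlike the
# embedding model, which encodes the query and chunks independently.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])

# Keep the highest-scoring chunks for the generation step
reranked = sorted(zip(retrieved_chunks, scores), key=lambda pair: pair[1], reverse=True)
for chunk, score in reranked:
    print(f"{score:.3f}  {chunk}")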
Challenges in Using RAG
Choosing the Number of Data Points (Chunks) to Retrieve
Determining the optimal number of data points to retrieve is crucial for maximizing the efficiency and effectiveness of a Retrieval-Augmented Generation (RAG) system. An ideal balance minimizes computational demands and costs by reducing the number of generative calls required for models like GPT to produce responses. Moreover, focusing on a concise set of relevant data points speeds up processing and enhances the coherence of the generated content. This precision in selection helps prevent the system from generating hallucinated or irrelevant content. To achieve this, we integrate advanced reranking strategies that fine-tune the selection process, ensuring prioritization of the most likely informative chunks.
Managing Hallucinations
One significant challenge in generative models like RAG is dealing with hallucinations, where the model produces plausible but factually incorrect information. Implementing effective reranking helps mitigate this issue by prioritizing chunks of information that are not only relevant but also verified. Our strategy includes enhancing our reranking processes to more effectively sift and validate the content, thereby reducing the incidence of inaccurate outputs.
Enhancing Computational Efficiency
Achieving high computational efficiency without relying on GPUs is a critical challenge. We address this by utilizing Qdrant and FastEmbed, which are selected for their processing speed and scalability. Qdrant is particularly beneficial for managing large data volumes efficiently, even in constrained resource environments. Additionally, our reranking process, powered by a library called FlashRank, operates entirely on CPUs. This integration ensures that our system maintains swift and efficient performance across various scenarios, effectively managing both retrieval and reranking processes without the need for GPU resources.
Motivation for the Project
The development of a high-performing RAG model addresses several critical challenges: optimizing chunk retrieval, managing hallucinations effectively, balancing retrieval with generation, enhancing computational efficiency, and refining relevance scoring mechanisms. By using tools like Qdrant and FastEmbed, and integrating CPU-based reranking through FlashRank, this project aims to create efficient and reliable RAG systems that significantly advance current information retrieval practices.
Having outlined the key motivations and challenges associated with the RAG system, the next section will dig into the practical implementation aspects. We will start by establishing the foundational technology of the RAG system—the vector database. This setup is essential for efficient data storage and rapid retrieval, both of which are critical for the successful application of reranking mechanisms. Detailed explanations and step-by-step guides will illustrate how to create and integrate a vector database using Qdrant, preparing us to apply these techniques in a real-world scenario.
Implementation
Before we discuss the specifics, it's important to note that the code examples provided in this article are simplified to maintain clarity and focus on key concepts. The complete code, featuring a full application with a user interface via Streamlit, is available through external links. To keep the article concise, we present streamlined code snippets here. For those interested in a more comprehensive exploration, additional resources and a recording of the application in action can be accessed via the provided links.
Github repo:
Creation of a Vector Database
The implementation phase begins with the crucial step of creating a vector database. This database is essential for the efficient storage and retrieval of data, which supports the advanced reranking capabilities of our RAG model. In this section, we outline the process of setting up a vector database using Qdrant. We will show how it integrates with the RAG model to enhance data retrieval speed and efficiency, providing practical insights and detailed configuration tips for building this crucial component of the RAG system.
Qdrant
Qdrant is a vector database designed for scalable and efficient vector search. It provides a robust infrastructure for storing and querying high-dimensional vector embeddings.
FastEmbed
FastEmbed, developed by Qdrant, is a library used for creating vector embeddings. It delivers performance comparable to more resource-intensive models but operates efficiently without the need for a GPU. By optimizing the creation and use of vector embeddings, FastEmbed enhances the retrieval accuracy and speed, making it an ideal choice for RAG systems aiming to perform well under limited computational resources.
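As a quick illustration of how FastEmbed is used on its own, the following sketch embeds two short texts. The model name shown is commonly the library default, but treat it as an assumption and adjust it to your needs.

# Minimal FastEmbed usage sketch
from fastembed import TextEmbedding

embedder = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
texts = ["Qdrant is a vector database.", "FastEmbed creates embeddings on CPU."]

# embed() returns a generator of numpy arrays, one vector per input text
vectors = list(embedder.embed(texts))
print(len(vectors), len(vectors[0]))  # 2 vectors, 384 dimensions for this model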
Installation of Qdrant
You can install Qdrant using Docker with the following commands:
docker pull qdrant/qdrant
docker run -p 6333:6333 -v $(pwd)/path/to/data:/qdrant/storage qdrant/qdrant
This setup starts a Qdrant instance accessible at localhost:6333.
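Assuming the container is running, a quick way to verify that the instance is reachable from Python is:

# Sanity check: connect to the local Qdrant instance started with Docker above
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # a fresh instance returns an empty list of collections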
Utilizing LangChain Wrapper for Qdrant
In this article, I will use the LangChain wrapper for Qdrant, along with other utility functions. It is also possible to achieve the same results using LlamaIndex or without any additional libraries, but these tools can simplify the job for a RAG application.
LangChain provides a streamlined interface for integrating with various vector databases, including Qdrant, making it easier to manage and manipulate vector data.
LangChain GitHub Repository
LangChain Website
Retrieving Relevant Documents Using RAG
Example Data
We will implement a RAG system to retrieve data using an example meeting transcription generated with GPT-4 via WebUI (https://chatgpt.com/). The original text is 20,000 characters long and written in English. It pertains to a fictitious data science meeting where the company's data science team discusses the current project statuses and advancements.
Text Extract:
Moderator: Good morning, everyone. Thank you for joining today's meeting. We have a packed agenda, so let's dive right into it. First up, we'll discuss the various projects currently underway. John, could you give us an update on Project Alpha?
John: Sure, I'd be happy to. Project Alpha is progressing well. We've completed the initial phase, and we're now moving into the development stage. The team has been working hard to meet the deadlines, and we're confident we can deliver the project on time.
Moderator: That's great to hear. Can you provide more details on the milestones you've achieved so far?
John: Absolutely. In the initial phase, we focused on gathering requirements and understanding the client's needs. We conducted several workshops and interviews to ensure we had a comprehensive understanding of the project scope. We've also finalized the project plan and timeline, and the client has signed off on it.
Moderator: Excellent. Can you walk us through some of the challenges you've faced during this initial phase?
John: One of the main challenges was aligning the client's expectations with our capabilities.
...
Link to full text:
ML Meeting Transcription
Moreover, when creating this text, I asked ChatGPT to generate some questions to use for testing the RAG system.
- What phase is Project Alpha currently in?
- How does the retrieval component of the RAG system work?
- What technique does Optuna use to optimize hyperparameters?
- What makes BERT different from other transformer models?
- What new feature has David's team created to improve the LightGBM classifier?
- What are the quantitative metrics used to evaluate the RAG system's performance?
- How has transfer learning been beneficial for fine-tuning the BERT model?
- What challenge is associated with training the BERT model?
- What approach has been used to improve the LightGBM classifier's performance?
- What is the primary use of the BERT model in Sarah's projects?
We will use these questions to test whether the RAG system can accurately identify and select the text chunks from the provided document that contain the answers.
Create The Collection (Vector Database) using Qdrant
First, we need to create a collection of vectors from the generated meeting transcription. I saved the transcription to a txt file and provided it to the script defined below to create a collection.
Here’s an example of how to create a Qdrant collection and add vectors to it:
import os

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.vectorstores import Qdrant

def extract_text_from_file(file_path):
    """Extracts text from a file based on its extension."""
    file_extension = os.path.splitext(file_path)[1].lower()
    text = ""
    if file_extension == '.pdf':
        with open(file_path, "rb") as file:
            pdf_reader = PdfReader(file)
            text = "".join([page.extract_text() or "" for page in pdf_reader.pages])
    elif file_extension in ['.txt', '.md']:
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    else:
        raise ValueError("Unsupported file type: only PDF, TXT, and MD are supported.")
    return text

def create_qdrant_collection(file_path):
    """Creates a Qdrant collection from a file."""
    collection_name = os.path.splitext(os.path.basename(file_path))[0]
    document_text = extract_text_from_file(file_path)
    if document_text:
        embedding_model = FastEmbedEmbeddings()
        # Split the raw text into overlapping chunks before embedding
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        texts = text_splitter.split_text(document_text)
        # Embed the chunks with FastEmbed and upload them to the local Qdrant instance
        Qdrant.from_texts(
            texts,
            embedding_model,
            url="http://localhost:6333",
            collection_name=collection_name,
        )
        print(f"Collection '{collection_name}' created successfully!")

# Example usage with file path
file_path = 'path/to/your/document.pdf'
create_qdrant_collection(file_path)
This will create a collection on Qdrant.
For the full project, I developed a more complete script that can be called from a main function to create a collection from a provided file, or run as a standalone script given a file path.
When creating a collection, you can either persist the database to disk or keep it in memory (in which case it disappears once the Docker container is stopped). In my case, I decided to store the collections locally so I can reuse them later.
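For reference, the Python client can also point at different storage modes. The sketch below contrasts them; the local paths are placeholders, and this article itself uses the Docker server with the mounted volume shown earlier.

# Different ways to point the client at Qdrant storage (a sketch)
from qdrant_client import QdrantClient

# 1. Server mode: the Docker container started above, persisted via the mounted volume
server_client = QdrantClient(url="http://localhost:6333")

# 2. Local on-disk mode: no server at all, data stored in a local folder
local_client = QdrantClient(path="./qdrant_local_storage")

# 3. Ephemeral in-memory mode: useful for tests, everything is lost on exit
memory_client = QdrantClient(":memory:")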
For more information, see the Qdrant page on LangChain.
It is possible to check out the created collections locally (via browser) through:
http://localhost:6333/dashboard#/collections
Here you can view details about each collection. By selecting a collection, you can access information about the vectors it contains. For instance, in the ml_meeting collection:
In this example, we observe a "point" (vector), along with associated information:
- metadata (absent in this demo) could include details like the file name or the source page of the information. This is customizable during the collection's creation.
- page_content: This represents the text segment that has been converted into a vector.
- default vector: Displays the vector values and their length, i.e. the number of dimensions in the vector, which is determined by the embedding model used.
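The same details can be inspected programmatically with the Qdrant client. The sketch below pulls a single point from the ml_meeting collection, assuming the collection was created as described above.

# Programmatic peek at the stored points (equivalent to what the dashboard shows)
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

points, _next_page = client.scroll(
    collection_name="ml_meeting",
    limit=1,              # fetch a single point
    with_payload=True,    # include page_content / metadata
    with_vectors=True,    # include the embedding values
)
point = points[0]
print(point.payload)      # page_content and metadata
print(len(point.vector))  # for a single unnamed vector this is a plain list of floats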
Now we have a collection, so we are ready to use it for our RAG application.
Building a User Interface with Streamlit
We can create a simple user interface (UI) using Streamlit to allow users to input their queries and receive answers from the RAG system using the previously created collection.
import streamlit as st
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

def main():
    st.title("Simple Query Interface with Qdrant")
    st.markdown("Enter your query to retrieve relevant documents from the Qdrant collection.")
    input_query = st.text_area("Query:", height=150)
    if st.button("Submit"):
        if input_query:
            results = query_qdrant(input_query)
            st.write(results)
        else:
            st.warning("Please enter a query.")

def query_qdrant(query):
    # Connect to the local Qdrant instance and load the existing collection
    client = QdrantClient(url="http://localhost:6333", prefer_grpc=False)
    db = Qdrant(client=client, embeddings=FastEmbedEmbeddings(), collection_name="ml_meeting")
    # Retrieve the top 25 chunks together with their cosine similarity scores
    retrieved_entries = db.similarity_search_with_score(query=query, k=25)
    retrieved_results = [
        {"id": doc.metadata["_id"], "text": doc.page_content, "cosine_similarity": score}
        for doc, score in retrieved_entries
    ]
    return retrieved_results

if __name__ == "__main__":
    main()
By running this code (streamlit run script_name.py in the terminal), you will launch a simple UI where users can enter queries, as shown in the screenshot below:
The query_qdrant function in the script processes the user's query by embedding it. Specifically, in this article, I utilize the collection named "ml_meeting", as introduced previously. The collection to load from Qdrant can be specified through the collection_name input parameter when defining the database:
db = Qdrant(client=client, embeddings=FastEmbedEmbeddings(), collection_name="ml_meeting")
This setup then connects to the Qdrant client to retrieve the top k chunks based on cosine similarity:
retrieved_entries = db.similarity_search_with_score(query=query, k=25)
I have chosen to retrieve k=25 chunks because this number generally offers a balanced dataset that's substantial enough for detailed post-analysis while effectively managing the information load. It's important to note that the optimal number of chunks can vary depending on the specific use of the document, the length of the chunks, and other factors. This flexibility allows for adjustments to optimize performance across various scenarios.
To test our RAG system, we will use one of the questions generated above:
What is the primary use of the BERT model in Sarah's projects?
This query will be input into the RAG system to retrieve the most relevant document chunks.
After writing the question in the text area and clicking on Submit, the script will output the most similar chunks in terms of cosine similarity (via Qdrant).
On Github, I implemented a more comprehensive user interface that allows users to upload their own documents. You can find the relevant file on GitHub at the following link:
Retrieved Chunks Analysis
For improved visualization and analysis of the retrieved chunks, sorting them by their cosine similarity to the input query and plotting these values provides a clearer view of the similarity distribution. This approach makes it easy to understand how closely each chunk relates to the input.
Example Query and Results
Consider the input query:
What is the primary use of the BERT model in Sarah's projects?
The top 5 retrieved chunks ranked by cosine similarity are:
Content chunk 1: Moderator: That's very thorough.Sarah, could you provide an update on the BERT transformer model?Sarah: Sure. The BERT transformer model has been a game-changer for natural language processing tasks. We've been using it for various applications, including text classification, sentiment analysis, and question-answering systems. The model's ability to understand context and generate human-like responses has been invaluable.
Content chunk 2: Moderator: That's very interesting. How have you been applying BERT in your projects?Sarah: We've been using BERT for a variety of tasks. In text classification, it's been highly effective at categorizing documents based on their content. For sentiment analysis, it accurately identifies the sentiment of a piece of text, whether it's positive, negative, or neutral. In our question-answering systems, BERT has been able to provide accurate and relevant answers based on the input query.
Content chunk 3: David: Certainly. One example is the use of interaction features, where we combine multiple variables to create new features that capture the interactions between them. For instance, we've created a feature that combines the customer's tenure with their recent activity level to better predict their likelihood of churn. This has significantly improved the model's predictive power.Moderator: Very innovative. Sarah, do you have any final thoughts on the BERT model?
Content chunk 4: Sarah: Actually, before we wrap up, I wanted to bring up a potential collaboration between our BERT project and Emily's RAG system. I believe there are synergies we can leverage to improve both projects. For instance, we could use BERT's capabilities to enhance the generation component of the RAG system, leading to even more accurate and contextually relevant responses.Moderator: That sounds like a great idea, Sarah. Emily, what do you think about this potential collaboration?
Content chunk 5: Moderator: Impressive. What challenges have you faced while working with BERT? Sarah: One of the main challenges is the computational resources required to train and fine-tune the model. BERT is a large model with millions of parameters, so it requires powerful hardware and a significant amount of training time. We've also had to carefully manage the trade-off between model complexity and performance to ensure that the model is both accurate and efficient.
By sorting the top 25 retrieved chunks according to their cosine similarity with the user input and displaying this in a graph, we can observe the following:
Reference code on Github to create the visualization:
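Since the full visualization code lives in the repository, here is a rough sketch of how such a plot can be produced with matplotlib, assuming results is the list of dictionaries returned by query_qdrant above.

# Rough sketch of the similarity plot (the full version is in the GitHub repo)
import matplotlib.pyplot as plt

def plot_similarities(results):
    # Sort chunks from most to least similar to the query
    scores = sorted((r["cosine_similarity"] for r in results), reverse=True)
    plt.figure(figsize=(8, 4))
    plt.bar(range(1, len(scores) + 1), scores)
    plt.xlabel("Retrieved chunk (ranked)")
    plt.ylabel("Cosine similarity")
    plt.title("Similarity of retrieved chunks to the query")
    plt.tight_layout()
    plt.show()

# Example usage:
# plot_similarities(query_qdrant("What is the primary use of the BERT model in Sarah's projects?"))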
The graph of cosine similarities shows that the initial five to six segments each maintain a similarity above 75% with the input query. After this, there is a significant drop of over 10% beginning from the eighth segment onwards. This pattern indicates high relevance of the early segments, which gradually lessens. However, determining the optimal segments to retrieve based solely on their cosine similarity can be challenging, as there is no clear threshold that marks the shift from relevant to irrelevant.
In a retrieval-augmented generation (RAG) system, the usual strategy involves selecting the top k chunks (for example, the top 5) under the assumption that they hold the most pertinent information relative to the query. Although this method is simple and direct, it may not always produce the best outcomes.
When providing a different question:
What new feature has David's team created to improve the LightGBM classifier?
We get the following plot:
The plot shows two noticeable drops in similarity: one at the 2nd chunk and one at the 4th chunk. Although the drop is small (around 5%), it highlights the difficulty in defining a proper threshold for stopping document retrieval. Depending on the input query and the document in use, it could be challenging to establish an effective threshold. Alternatively, this issue could be addressed by deciding on a default number of chunks to retrieve (e.g., top 5), as defined above.
Addressing Questions Outside Provided Content
Challenges also arise when the query's answer is not directly available in the content. For instance, consider an input query whose answer cannot be found in the provided document, such as:
What is self attention?
The top 5 chunks retrieved are as follows:
Content chunk 1: between them. This has helped us uncover patterns that are not immediately apparent when looking at individual features in isolation.
Content chunk 2: Moderator: That's very thorough. Emily, do you have any final thoughts on the RAG system?Emily: Yes, I'd like to mention that we're also exploring the use of reinforcement learning to further improve the retrieval component. By continuously learning from user interactions and feedback, we hope to make the retrieval process even more accurate and efficient.
Content chunk 3: Moderator: Thank you for that update, Sarah. Before we wrap up, does anyone have any questions or additional updates to share? John: I have a question for Emily. How do you handle the evaluation of the RAG system's performance?
Content chunk 4: Moderator: Very innovative. How do you handle feature engineering and selection?
Content chunk 5: Moderator: That's fantastic. Can you dive deeper into the specific features you've been using in your churn prediction model?
The retrieved chunks, despite being top-ranked by cosine similarity, do not directly address the query.
Moreover, by looking at the cosine similarity between the user query and all the retrieved chunks, we cannot find any clear pattern or signal indicating that the retrieved chunks are not relevant to the user query at all.
To address these problems, we will use a reranker, specifically a cross-encoder model, which takes the actual sentences as input rather than their embeddings. This reduces information loss and improves retrieval accuracy by comparing the sentences directly. By reranking the retrieved chunks with a cross-encoder model, we can better ensure that the most relevant chunks are selected, enhancing the overall performance and reliability of our RAG system.
Part 2: Implementing a Reranker