LLM RAG in Practice (57) | A Comparison of 20+ Common RAG Algorithms

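This article walks through more than 20 commonly used RAG algorithms. For ease of presentation, each one is demonstrated in a Jupyter notebook. The code directory is laid out as follows: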

  
├── 1_simple_rag.ipynb  
├── 2_semantic_chunking.ipynb...  
├── 9_rse.ipynb  
├── 10_contextual_compression.ipynb  
├── 11_feedback_loop_rag.ipynb  
├── 12_adaptive_rag.ipynb...  
├── 17_graph_rag.ipynb  
├── 18_hierarchy_rag.ipynb  
├── 19_HyDE_rag.ipynb  
├── 20_crag.ipynb  
└── data/  
    └── val.json                             
    └── AI_information.pdf                   
    └── attention_is_all_you_need.pdf

Original source code: https://github.com/FareedKhan-dev/all-rag-techniques

Extended version of the source code: https://github.com/ArronAI007/Awesome-AGI/tree/main/RAG/examples/rag_examples

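1. Preparation

a) Test queries and ground-truth answers;

b) The PDF documents to test on;

c) An embedding model;

d) A large language model and a multimodal large model.

To keep the demonstration simple, the following complex query is used throughout the article: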

  
test query:  
How does AI’s reliance on massive data sets act as a double-edged sword?  
  
True Answer:  
It drives rapid learning and innovation while also   
risking the amplification of inherent biases,   
making it crucial to balance data volume with fairness and quality.

Clone the project repository:

  
# Cloning the repo  
git clone https://github.com/FareedKhan-dev/all-rag-techniques.git  
cd all-rag-techniques

Install the dependencies:

  
# Installing the required libraries  
pip install -r requirements.txt

Environment variables are loaded from a .env file (which must sit in the project root), so the python-dotenv package is also required. Install it with:

  
pip install python-dotenv

Add the LLM API keys to the .env file:

  
OPENAI_API_KEY="xxx"  
NEBIUS_API_KEY="xxx"

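2. Conclusion

This article is long, so here is the result up front: the best-performing RAG technique in these tests was Adaptive RAG, with a score of 0.86.

It works by classifying each query and choosing the retrieval strategy best suited to that question type. The ability to switch dynamically between the factual, analytical, opinion, and contextual strategies lets it handle different information needs accurately.

The RAG algorithms are introduced one by one below: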

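3. Simple RAG

Let's start with the simplest form of RAG. We first look at how it works, then test and evaluate it.

The workflow is as follows:

Extract the text from the PDF;
Split the text into smaller chunks;
Convert the chunks into numeric embeddings;
Search for the chunks most relevant to the query;
Generate a response from the retrieved chunks;
Compare the response with the correct answer to assess accuracy.

First, load the document and split it into chunks: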

  
# Define the path to the PDF file  
pdf_path = "data/AI_information.pdf"  
  
# Extract text from the PDF file, and create smaller, overlapping chunks.  
extracted_text = extract_text_from_pdf(pdf_path)  
text_chunks = chunk_text(extracted_text, 1000, 200)  
  
print("Number of text chunks:", len(text_chunks))  
  
### OUTPUT ###  
Number of text chunks: 42

extract_text_from_pdf extracts all the text from the PDF file. chunk_text then splits it into chunks of about 1000 characters, with 200 characters of overlap between adjacent chunks.
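
The article does not list the body of chunk_text, so the following is only a minimal sketch of fixed-size chunking with overlap, assuming the same argument order as the call above (text, chunk size, overlap). The guard against an overlap that is too large is an addition of this sketch.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous one,
    so adjacent chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of toy text
chunks = chunk_text(sample, chunk_size=10, overlap=4)
print(len(chunks))              # number of chunks
print(chunks[0], chunks[1])     # the tail of the first equals the head of the second
```

With chunk_size=1000 and overlap=200 the window advances 800 characters at a time, which is why a sentence cut at one chunk boundary usually appears whole in the neighboring chunk.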


Next, convert these chunks into embeddings (numeric vector representations):

  
# Create embeddings for the text chunks  
response = create_embeddings(text_chunks)

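create_embeddings takes text_chunks and uses the embedding model to generate one vector per chunk.

Next, semantic_search runs a semantic search to find the chunks most relevant to the test query: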

  
# Our test query, and perform semantic search.  
query = '''How does AI's reliance on massive data sets act as a double-edged sword?'''  
# response.data holds the per-chunk embeddings returned by create_embeddings above  
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

With the relevant chunks in hand, let the LLM generate an answer:

  
# Define the system prompt for the AI assistant  
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"  
  
# Create the user prompt based on the top chunks, and generate AI response.  
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n========\n" for i, chunk in enumerate(top_chunks)])  
user_prompt = f"{user_prompt}\nQuestion: {query}"  
ai_response = generate_response(system_prompt, user_prompt)  
print(ai_response.choices[0].message.content)

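The code above formats the retrieved chunks into a prompt and sends it to the LLM through generate_response. The LLM then answers from the supplied context and question.


Let's see how Simple RAG performs: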

  
# Define the system prompt for the evaluation system  
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."  
  
# Create the evaluation prompt and generate the evaluation response  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
... Therefore, the score of 0.5 being not very close to the true response, and not perfectly aligned.

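4. Semantic Chunking

In Simple RAG we just cut the text into fixed-size chunks. That is crude: it can split a sentence in half or lump unrelated sentences together.

Semantic Chunking tries to be smarter. Instead of using a fixed size, it splits the text by meaning, keeping semantically related sentences together.

The idea is that sentences about similar things belong in the same chunk. We use the same embedding model to measure how similar sentences are.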

  
# Splitting text into sentences (basic split)  
sentences = extracted_text.split(". ")  
  
# Generate embeddings for each sentence  
embeddings = [get_embedding(sentence) for sentence in sentences]  
  
print(f"Generated {len(embeddings)} sentence embeddings.")  
  
### OUTPUT ###  
Generated 233 sentence embeddings.

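This code splits extracted_text into individual sentences and then creates an embedding vector for each sentence.


Now compute the similarity between consecutive sentences: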

  
# Compute similarity between consecutive sentences  
similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]

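The cosine_similarity helper (defined in the notebook) tells us how similar two embeddings are. A score of 1 means the vectors point in the same direction (very similar), while a score near 0 means they are unrelated. We compute the score for every pair of adjacent sentences.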
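
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a pure-Python version written for this explanation; the notebook's own helper most likely uses NumPy, and the zero-vector guard is an assumption added here.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The value ranges from -1 (opposite directions) to 1 (same direction); text embeddings rarely produce strongly negative values in practice, which is why 0 is usually read as "unrelated".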


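The heart of semantic chunking is deciding where to split the text into chunks. We use a "breakpoint" approach, here with the percentile method, which looks for places where similarity drops sharply: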

  
# Compute breakpoints using the percentile method with a threshold of 90  
breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)

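The compute_breakpoints function uses the "percentile" method to identify points where the similarity between sentences drops significantly. These become our chunk boundaries.


Now create the semantic chunks: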

  
# Create chunks using the split_into_chunks function  
text_chunks = split_into_chunks(sentences, breakpoints)  
print(f"Number of semantic chunks: {len(text_chunks)}")  
  
### OUTPUT ###  
Number of semantic chunks: 145

split_into_chunks takes the list of sentences and the breakpoints we found, and groups the sentences into chunks.
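
The grouping step can be sketched as follows. This is an illustration rather than the notebook's exact code: it assumes that a breakpoint index i means "cut after sentence i" (because similarities[i] compares sentence i with sentence i + 1), and it rejoins sentences with ". " since that is the separator used to split them.

```python
def split_into_chunks(sentences, breakpoints):
    """Group consecutive sentences into chunks, cutting after each breakpoint index."""
    chunks = []
    start = 0
    for bp in sorted(breakpoints):
        # Sentences start..bp (inclusive) form one chunk
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1
    if start < len(sentences):
        # Whatever follows the last breakpoint becomes the final chunk
        chunks.append(". ".join(sentences[start:]))
    return chunks


sentences = ["A one", "A two", "B one", "B two", "C one"]
chunks = split_into_chunks(sentences, breakpoints=[1, 3])
print(chunks)
```

The more breakpoints the threshold produces, the smaller the chunks become, which is worth keeping in mind when reading the chunk count above.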


Next, we need to create embeddings for these chunks:

  
# Create chunk embeddings using the create_embeddings function  
chunk_embeddings = create_embeddings(text_chunks)

Let's look at the answer generated with semantic chunking:

  
# Create the user prompt based on the top chunks  
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])  
user_prompt = f"{user_prompt}\nQuestion: {query}"  
  
# Generate AI response  
ai_response = generate_response(system_prompt, user_prompt)  
print(ai_response.choices[0].message.content)

Evaluate the result:

  
# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"  
  
# Generate the evaluation response using the evaluation system prompt and evaluation prompt  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
  
# Print the evaluation response  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
Based on the evaluation criteria,  
I would assign a score of 0.2 to the AI assistant response.

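The score is only 0.2. Semantic chunking sounds good in theory, but it did not help here. In fact, the score dropped compared with simple fixed-size chunking.


This shows that changing the chunking strategy alone does not guarantee a win. The approach needs to be more sophisticated, so let's try something else in the next section.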

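5. Context Enriched Retrieval

As we saw, semantic chunking is a good idea in principle but did not improve our results.

One problem is that even semantically defined chunks can be too narrowly focused. They may lack key context from the surrounding text.

Context-Enriched Retrieval addresses this by fetching not only the best-matching chunk but also its neighbors.

Let's see how this works in code. We need a new function, context_enriched_search, to handle retrieval: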

  
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):  
    """      
    Retrieves the most relevant chunk along with its neighboring chunks.      
    """      
    # Convert the query into an embedding vector      
    query_embedding = create_embeddings(query).data[0].embedding      
    similarity_scores = []   
         
    # Compute similarity scores between query and each text chunk embedding      
    for i, chunk_embedding in enumerate(embeddings):   
        # Calculate cosine similarity between the query embedding and current chunk embedding          
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))          
        # Store the index and similarity score as a tuple          
        similarity_scores.append((i, similarity_score))      
          
    # Sort the similarity scores in descending order (highest similarity first)      
    similarity_scores.sort(key=lambda x: x[1], reverse=True)      
      
    # Get the index of the most relevant chunk      
    top_index = similarity_scores[0][0]      
      
    # Define the range for context inclusion      
    # Ensure we don't go below 0 or beyond the length of text_chunks      
    start = max(0, top_index - context_size)      
    end = min(len(text_chunks), top_index + context_size + 1)      
      
    # Return the relevant chunk along with its neighboring context chunks      
    return [text_chunks[i] for i in range(start, end)]

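The core logic is similar to our earlier search, but instead of returning only the single best chunk, we grab a "window" of chunks around it. context_size controls how many chunks are included on each side.


The text extraction and chunking steps are the same as in Simple RAG, so they are skipped here.


We use fixed-size chunks, as in the Simple RAG section, keeping chunk_size = 1000 and overlap = 200.


Use the LLM to generate a response: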

  
# Create the user prompt based on the top chunks  
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])  
user_prompt = f"{user_prompt}\nQuestion: {query}"  
  
# Generate AI response  
ai_response = generate_response(system_prompt, user_prompt)  
print(ai_response.choices[0].message.content)

Evaluate the result:

  
# Create the evaluation prompt and generate the evaluation response  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
Based on the evaluation criteria,  
I would assign a score of 0.6 to the AI assistant response.

This time the score is 0.6: a clear improvement over semantic chunking (0.2) and a modest one over Simple RAG (0.5).

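6. Contextual Chunk Headers

We have seen that adding context by including neighboring chunks helps. But what if the content of a chunk itself is missing important information?

Documents usually have a clear structure of titles, headings, and subheadings, and these provide key context. Contextual Chunk Headers (CCH) take advantage of that structure.

The idea is simple: before creating the embedding, we prepend a descriptive header to each chunk. The header can be a short summary or something similar.

The generate_chunk_header function analyzes each text chunk and generates a concise, meaningful title that summarizes its content. This helps organize and retrieve relevant information effectively.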

  
# Chunk the extracted text, this time generating headers  
text_chunks_with_headers = chunk_text_with_headers(extracted_text, 1000, 200)  
  
# Print a sample to see what it looks like  
print("Sample Chunk with Header:")  
print("Header:", text_chunks_with_headers[0]['header'])  
print("Content:", text_chunks_with_headers[0]['text'])  
  
### OUTPUT ###  
Sample Chunk with Header:  
Header: A Description about AI Impact  
Content: AI has been an important part of society since ...
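
To make the data shape concrete, here is a minimal sketch of how chunk_text_with_headers might pair each chunk with a header. The header_fn parameter and the first_words_header placeholder are hypothetical names introduced for this sketch; in the article the header comes from an LLM call (generate_chunk_header), which is replaced here by a trivial stand-in so the example runs offline.

```python
def first_words_header(chunk, max_words=6):
    """Placeholder header: the chunk's first few words (the article uses an LLM instead)."""
    return " ".join(chunk.split()[:max_words])


def chunk_text_with_headers(text, chunk_size, overlap, header_fn=first_words_header):
    """Split text into overlapping chunks and attach a header to each one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunk = text[i:i + chunk_size]
        if chunk:
            chunks.append({"header": header_fn(chunk), "text": chunk})
    return chunks


doc = "AI systems learn from data. " * 4
chunks_with_headers = chunk_text_with_headers(doc, chunk_size=40, overlap=10)
print(chunks_with_headers[0]["header"])
print(chunks_with_headers[0]["text"])
```

Each item is a dict with "header" and "text" keys, which is the shape the embedding loop in this section expects.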

Now create embeddings for both the header and the text:

  
# Generate embeddings for each chunk (both header and text)  
embeddings = []  
for chunk in tqdm(text_chunks_with_headers, desc="Generating embeddings"):  
    text_embedding = create_embeddings(chunk["text"])      
    header_embedding = create_embeddings(chunk["header"])      
    embeddings.append({"header": chunk["header"], "text": chunk["text"], "embedding": text_embedding, "header_embedding": header_embedding})

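This loops over all chunks, computes embeddings for both the header and the text, and stores everything together.


Since semantic_search already works with embeddings, we only need to make sure both the headers and the text chunks are embedded correctly. That way, the search can consider both the high-level summary (the header) and the detailed content (the chunk text) when looking for the most relevant information.


Now modify the retrieval step so that it returns not only the matching chunks but also their headers, giving better context for generating the response.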

  
# Perform semantic search using the query and the new embeddings  
top_chunks = semantic_search(query, embeddings, k=2)  
  
# Create the user prompt based on the top chunks. note: no need to add header  
# because the context is already created using header and chunk  
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk['text']}\n=====================================\n" for i, chunk in enumerate(top_chunks)])  
user_prompt = f"{user_prompt}\nQuestion: {query}"  
  
# Generate AI response  
ai_response = generate_response(system_prompt, user_prompt)  
print(ai_response.choices[0].message.content)  
  
### OUTPUT ###  
Evaluation Score: 0.5

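Adding these contextual headers gives the system a better chance of finding the right information and gives the LLM a better chance of generating a complete and accurate answer. In this run, though, the score was 0.5, the same as Simple RAG.


This illustrates the idea of enriching the data before it enters the retrieval system: we did not change the core RAG pipeline, but we made the data itself more informative.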

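7. Document Augmentation

We have seen that adding context around our chunks (such as context windows or headers) can help. Now let's try a different kind of augmentation: generating questions from the text chunks. The idea is that these questions can act as alternative "queries" that may match the user's intent better than the raw chunk text does.

This step is added between chunking and embedding generation. The generate_questions function handles it: it takes a text_chunk and returns a few questions generated from it.


Let's see how to implement document augmentation through question generation: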

  
# Process the document (extract text, create chunks, generate questions, build vector store)  
text_chunks, vector_store = process_document(  
    pdf_path,      
    chunk_size=1000,      
    chunk_overlap=200,      
    questions_per_chunk=3)  
      
print(f"Vector store contains {len(vector_store.texts)} items")  
  
### OUTPUT ###  
Vector store contains 214 items

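Here, the process_document function performs the document augmentation. It takes pdf_path, chunk_size, chunk_overlap, and questions_per_chunk, and returns the text chunks together with a vector_store.


The vector_store now holds embeddings of the generated questions as well as embeddings of the document chunks.


We can run a semantic search over this vector_store as before. Here we use a simple function to find similar vectors.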

  
# Perform semantic search to find relevant content  
search_results = semantic_search(query, vector_store, k=5)  
  
print("Query:", query)  
print("\nSearch Results:")  
  
# Organize results by type  
chunk_results = []  
question_results = []  
  
for result in search_results:  
    if result["metadata"]["type"] == "chunk":   
        chunk_results.append(result)      
    else:          
        question_results.append(result)

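The way search results are handled changes slightly. The vector store now contains two types of items: original text chunks and generated questions. We separate them here to see which type of content matches the query best.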
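
A store that mixes both item types can be sketched as a small in-memory class. The class below is an assumption made for illustration: the article only shows that the store exposes a texts attribute and a similarity_search method and that results carry "text" and "metadata" keys. The add_item name, the "similarity" key, and the toy 2-D embeddings are invented for this sketch.

```python
import math


class SimpleVectorStore:
    """Minimal in-memory vector store holding text, embedding, and metadata per item."""

    def __init__(self):
        self.vectors, self.texts, self.metadata = [], [], []

    def add_item(self, text, embedding, metadata=None):
        self.vectors.append(embedding)
        self.texts.append(text)
        self.metadata.append(metadata or {})

    def similarity_search(self, query_embedding, k=5):
        """Return the k items whose embeddings have the highest cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [(i, cosine(query_embedding, vec)) for i, vec in enumerate(self.vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [
            {"text": self.texts[i], "metadata": self.metadata[i], "similarity": score}
            for i, score in scored[:k]
        ]


store = SimpleVectorStore()
store.add_item("Bias in training data", [1.0, 0.0], {"type": "chunk"})
store.add_item("How can data amplify bias?", [0.9, 0.1], {"type": "question"})
store.add_item("History of robotics", [0.0, 1.0], {"type": "chunk"})

results = store.similarity_search([1.0, 0.0], k=2)
for r in results:
    print(r["metadata"]["type"], r["text"])
```

Because questions and chunks live in the same index, a generated question can outrank the chunk it came from; the "type" metadata is what lets the calling code tell them apart.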


Finally, build the context and run the evaluation:

  
# Prepare context from search results  
context = prepare_context(search_results)  
  
# Generate response  
response_text = generate_response(query, context)  
  
# Get reference answer from validation data  
reference_answer = data[0]['ideal_answer']  
  
# Evaluate the response  
evaluation = evaluate_response(query, response_text, reference_answer)  
  
print("\nEvaluation:")  
print(evaluation)  
  
### OUTPUT ###  
Based on the evaluation criteria, I would assign a   
score of 0.8 to the AI assistants response.

The evaluation score is 0.8: adding the generated questions to the search improved performance further.

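8. Query Transformation

So far we have focused on improving the data used by the RAG system. But what about the query itself?

The way a user phrases a question is not always the best way to search the knowledge base. Query transformation tries to address this. We explore three approaches:

Query rewriting: make the query more specific and detailed;
Step-back prompting: create a broader, more general query to retrieve background context;
Sub-query decomposition: break a complex query into several simpler sub-queries.

Let's see these transformations in action, using our standard test query: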

  
# Query Rewriting  
rewritten_query = rewrite_query(query)  
  
# Step-back Prompting  
step_back_query = generate_step_back_query(query)

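generate_step_back_query does the opposite of rewriting: it creates a semantically broader query, which may retrieve useful background information and improve the model's answer.

Next comes sub-query decomposition: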

  
# Sub-query Decomposition  
sub_queries = decompose_query(query, num_subqueries=4)

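decompose_query breaks the original query into several smaller, more focused questions. The idea is that these sub-queries, taken together, may cover the intent of the original query better than any single query could.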
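
Once each sub-query has been searched separately, the result lists have to be combined. The article does not show how transformed_search does this, so the function below is a hypothetical sketch of one reasonable approach: deduplicate by chunk text, keep the best score seen for each chunk, and return the top k. The name merge_subquery_results and the "similarity" key are assumptions.

```python
def merge_subquery_results(results_per_subquery, k=3):
    """Merge result lists from several sub-queries, keeping each chunk's best score."""
    best = {}
    for results in results_per_subquery:
        for result in results:
            current = best.get(result["text"])
            if current is None or result["similarity"] > current["similarity"]:
                best[result["text"]] = result
    ranked = sorted(best.values(), key=lambda r: r["similarity"], reverse=True)
    return ranked[:k]


sub_query_1 = [{"text": "A", "similarity": 0.9}, {"text": "B", "similarity": 0.5}]
sub_query_2 = [{"text": "B", "similarity": 0.8}, {"text": "C", "similarity": 0.4}]
merged = merge_subquery_results([sub_query_1, sub_query_2], k=2)
print([(r["text"], r["similarity"]) for r in merged])
```

Taking the maximum score favors chunks that answer at least one sub-query very well; summing scores instead would favor chunks that are moderately relevant to several sub-queries.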


Now, to see how these transformations affect our RAG system, let's use a function that combines all of the approaches above:

  
def rag_with_query_transformation(pdf_path, query, transformation_type=None):  
    """      
    Run complete RAG pipeline with optional query transformation.      
    Args:   
        pdf_path (str): Path to PDF document          
        query (str): User query          
        transformation_type (str): Type of transformation (None, 'rewrite', 'step_back', or 'decompose')      
    Returns:          
        Dict: Results including query, transformed query, context, and response      
    """      
    # Process the document to create a vector store      
    vector_store = process_document(pdf_path)      
      
    # Apply query transformation and search      
    if transformation_type:          
        # Perform search with transformed query          
        results = transformed_search(query, vector_store, transformation_type)      
    else:          
        # Perform regular search without transformation          
        query_embedding = create_embeddings(query)          
        results = vector_store.similarity_search(query_embedding, k=3)      
          
    # Combine context from search results      
    context = "\n\n".join([f"PASSAGE {i+1}:\n{result['text']}" for i, result in enumerate(results)])      
      
    # Generate response based on the query and combined context      
    response = generate_response(query, context)      
      
    # Return the results including original query, transformation type, context, and response      
    return {          
        "original_query": query,          
        "transformation_type": transformation_type,          
        "context": context,          
        "response": response      
    }

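We use the evaluate_transformations function to compare the different query transformation techniques (rewriting, step-back, and decomposition). This shows which approach retrieves the most relevant information and therefore leads to a better response.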

  
# Run evaluation  
evaluation_results = evaluate_transformations(pdf_path, query, reference_answer)  
print(evaluation_results)  
  
### OUTPUT ###  
Evaluation Score: 0.5

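The evaluation score is 0.5, so query transformation does not consistently beat the simpler approaches.

Query transformations can be powerful, but they are no silver bullet. Sometimes the original query is already well formed, and trying to "improve" it actually makes things worse.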

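9. Reranking

We have tried improving the data (with chunking strategies) and the query (with transformations). Now let's focus on the retrieval process itself. A simple similarity search often returns a mix of relevant and irrelevant results.

Reranking reorders the initially retrieved results so that the best ones come first.

The rerank_with_llm function takes the initially retrieved chunks and uses an LLM to reorder them by relevance. This helps ensure that the most useful information appears first.

After reranking, a final function, generate_final_response, takes the reranked chunks, formats them into a prompt, and sends it to the LLM to produce the final response.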

  
def rag_with_reranking(query, vector_store, reranking_method="llm", top_n=3, model="meta-llama/Llama-3.2-3B-Instruct"):      
    """      
    Complete RAG pipeline incorporating reranking.      
    """      
    # Create query embedding      
    query_embedding = create_embeddings(query)          
      
    # Initial retrieval (get more than we need for reranking)      
    initial_results = vector_store.similarity_search(query_embedding, k=10)          
      
    # Apply reranking      
    if reranking_method == "llm":  
        reranked_results = rerank_with_llm(query, initial_results, top_n=top_n)      
    elif reranking_method == "keywords":          
        reranked_results = rerank_with_keywords(query, initial_results, top_n=top_n) # we are not using it.      
    else:          
        # No reranking, just use top results from initial retrieval          
        reranked_results = initial_results[:top_n]          
          
    # Combine context from reranked results      
    context = "\n\n===\n\n".join([result["text"] for result in reranked_results])          
      
    # Generate response based on context      
    response = generate_response(query, context, model)          
      
    return {      
        "query": query,          
        "reranking_method": reranking_method,          
        "initial_results": initial_results[:top_n],          
        "reranked_results": reranked_results,          
        "context": context,          
        "response": response      
    }

The function takes the query, the vector_store, and a reranking_method. We use "llm" as the reranking_method.

  
# Run RAG with LLM-based reranking  
llm_reranked_result = rag_with_reranking(query, vector_store, reranking_method="llm")  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{llm_reranked_result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
Evaluation score is 0.7

The evaluation score is 0.7. Reranking surfaces more relevant documents, which in turn improves the generated answer.
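
The "keywords" branch of rag_with_reranking is not exercised in this article and its body is not shown. As a rough illustration of what a non-LLM reranker could look like, the sketch below scores each result by how many query keywords it contains. The scoring rule (words longer than three characters, ties broken by original rank) is an assumption of this sketch, not the repository's implementation.

```python
import re


def rerank_with_keywords(query, results, top_n=3):
    """Reorder results by the number of distinct query keywords found in each text."""
    keywords = {w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 3}
    scored = []
    for position, result in enumerate(results):
        words = set(re.findall(r"[a-z0-9]+", result["text"].lower()))
        overlap = len(keywords & words)
        # Negative position keeps the original order among equally scored results
        scored.append((overlap, -position, result))
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [result for _, _, result in scored[:top_n]]


query_text = "How does reliance on massive data sets create bias?"
initial = [
    {"text": "Robots are used in manufacturing."},
    {"text": "Massive data sets can amplify bias in models."},
    {"text": "Data quality matters."},
]
reranked = rerank_with_keywords(query_text, initial, top_n=2)
print([r["text"] for r in reranked])
```

A keyword reranker is cheap and needs no extra model calls, but it cannot recognize paraphrases, which is the main reason to prefer an LLM or cross-encoder when latency allows.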

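10. Relevant Segment Extraction (RSE)

We have been focusing on individual chunks, but sometimes the best information is spread across several consecutive chunks. Relevant Segment Extraction (RSE) addresses this.

Instead of just taking the top k chunks, RSE tries to identify and extract whole segments of relevant text.

For RSE we define the rag_with_rse function, which takes a pdf_path and a query and returns the response. It chains several function calls together to perform RSE.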

  
# Run RAG with RSE  
rse_result = rag_with_rse(pdf_path, query)

The evaluation is as follows:

  
# Evaluate  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{rse_result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
However, the Response from Standard Retrieval includes ...  
0.8 is the score I would assign to AI Response

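By focusing on contiguous segments of relevant text, RSE gives the LLM more coherent and complete context, which leads to more accurate and comprehensive answers.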
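
The internals of rag_with_rse are not shown here. One common way to pick a contiguous segment, sketched below under that assumption, is to subtract a fixed penalty from each chunk's relevance score (so weak chunks count against a segment) and then find the run of chunks with the highest total, which is the classic maximum-sum subarray problem. The function name best_segment and the penalty value are hypothetical.

```python
def best_segment(relevance_scores, penalty=0.2):
    """Return (start, end) of the contiguous run with the highest total value.

    `end` is exclusive. Each chunk's value is its relevance score minus `penalty`,
    so chunks scoring below the penalty reduce a segment's total.
    """
    values = [score - penalty for score in relevance_scores]
    best_sum, best_range = float("-inf"), (0, 0)
    current_sum, current_start = 0.0, 0
    for i, value in enumerate(values):
        if current_sum <= 0:
            # A non-positive running total cannot help: start a new segment here
            current_sum, current_start = value, i
        else:
            current_sum += value
        if current_sum > best_sum:
            best_sum, best_range = current_sum, (current_start, i + 1)
    return best_range


scores = [0.1, 0.7, 0.6, 0.15, 0.8, 0.05]
start, end = best_segment(scores)
print(start, end)  # chunks start..end-1 would be passed to the LLM as one segment
```

In this toy input the slightly weak chunk at index 3 is kept because it bridges two strong neighbors, which is the behavior that distinguishes segment extraction from plain top-k selection.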

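11. Contextual Compression

We have kept adding more and more context: neighboring chunks, generated questions, whole segments. But sometimes less is more.

An LLM has a limited context window, and filling it with irrelevant information can hurt RAG performance.

Contextual compression is about being selective. We retrieve a large amount of context, then compress it, keeping only the parts directly relevant to the query.

The main difference is the "Contextual Compression" step that runs before generation. We do not change what we retrieve, but we refine it before passing it to the LLM.

Contextual compression is implemented by the rag_with_compression function. Internally, it uses an LLM to analyze the retrieved chunks and extract only the sentences or paragraphs directly relevant to the query.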

  
def rag_with_compression(pdf_path, query, k=10, compression_type="selective", model="meta-llama/Llama-3.2-3B-Instruct"):      
    """      
    RAG (Retrieval-Augmented Generation) pipeline with contextual compression.      
      
    Args:      
        pdf_path (str): Path to the PDF document.          
        query (str): User query for retrieval.          
        k (int): Number of top relevant chunks to retrieve. Default is 10.          
        compression_type (str): Type of compression to apply to retrieved chunks. Default is "selective".          
        model (str): Language model to use for response generation. Default is "meta-llama/Llama-3.2-3B-Instruct".      
          
    Returns:      
        dict: A dictionary containing the query, original and compressed chunks, compression stats, and the final response.      
    """          
      
    print(f"\n=== RAG WITH COMPRESSION ===\nQuery: {query} | Compression: {compression_type}")          
      
    # Process the document to extract, chunk, and embed text      
    vector_store = process_document(pdf_path)          
      
    # Retrieve top-k relevant chunks based on query similarity      
    results = vector_store.similarity_search(create_embeddings(query), k=k)      
    retrieved_chunks = [r["text"] for r in results]      
      
    # Apply compression to retrieved chunks      
    compressed = batch_compress_chunks(retrieved_chunks, query, compression_type, model)          
      
    # Filter out empty compressed chunks; fallback to original if all are empty      
    compressed_chunks, compression_ratios = zip(*([(c, r) for c, r in compressed if c.strip()] or [(chunk, 0.0) for chunk in retrieved_chunks]))
      
    # Combine compressed chunks to form context for response generation      
    context = "\n\n---\n\n".join(compressed_chunks)          
      
    # Generate a response using the compressed context      
    response = generate_response(query, context, model)      
      
    print(f"\n=== RESPONSE ===\n{response}")          
      
    # Return detailed results      
    return {      
        "query": query,          
        "original_chunks": retrieved_chunks,          
        "compressed_chunks": compressed_chunks,          
        "compression_ratios": compression_ratios,          
        "context_length_reduction": f"{sum(compression_ratios)/len(compression_ratios):.2f}%",          
        "response": response      
    }

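rag_with_compression offers several compression types:

"selective" keeps only the directly relevant sentences;
"summary" creates a short summary focused on the query;
"extraction" extracts only the sentences that contain the answer (very strict!).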
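
The three modes differ mainly in the instruction given to the compressing LLM. The sketch below shows one way that mapping could be organized. The instruction wording and the build_compression_prompt helper are hypothetical, since the article does not show the actual prompts.

```python
# Hypothetical instruction text for each mode; the real prompts are not shown in the article.
COMPRESSION_INSTRUCTIONS = {
    "selective": "Keep only the sentences that are directly relevant to the query.",
    "summary": "Write a short summary of the chunk that focuses on the query.",
    "extraction": "Extract, word for word, only the sentences that contain the answer.",
}


def build_compression_prompt(chunk, query, compression_type="selective"):
    """Build the prompt sent to the LLM for compressing a single chunk."""
    if compression_type not in COMPRESSION_INSTRUCTIONS:
        raise ValueError(f"Unknown compression type: {compression_type}")
    instruction = COMPRESSION_INSTRUCTIONS[compression_type]
    return f"{instruction}\n\nQuery: {query}\n\nDocument chunk:\n{chunk}"


prompt = build_compression_prompt(
    "AI models learn from large data sets.",
    "Why is data volume risky?",
    compression_type="summary",
)
print(prompt)
```

Keeping the instructions in a single table makes it easy to add a mode or to compare modes on the same query.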

Now, to run the compression, we use the following code:

  
# Run RAG with contextual compression (using 'selective' mode)  
compression_result = rag_with_compression(pdf_path, query, compression_type="selective")  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{compression_result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
Evaluation Score 0.75

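Contextual compression is a powerful technique that balances breadth (the initial retrieval gathers a wide range of information) with focus (compression removes the noise).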

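12. Feedback Loop

All the techniques we have seen so far are "static": they do not learn from their mistakes. A feedback loop changes that. Users give feedback on the RAG system's responses (for example good/bad or relevant/irrelevant), the system stores each piece of feedback, and future retrievals use it to improve.

The feedback loop can be implemented by calling the full_rag_workflow function. Here is its definition: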

  
def full_rag_workflow(pdf_path, query, feedback_data=None, feedback_file="feedback_data.json", fine_tune=False):      
    """      
    Execute a complete RAG workflow with feedback integration for continuous improvement.      
      
    """      
    # Step 1: Load historical feedback for relevance adjustment if not explicitly provided      
    if feedback_data is None:      
        feedback_data = load_feedback_data(feedback_file)          
        print(f"Loaded {len(feedback_data)} feedback entries from {feedback_file}")          
    # Step 2: Process document through extraction, chunking and embedding pipeline      
    chunks, vector_store = process_document(pdf_path)          
      
    # Step 3: Fine-tune the vector index by incorporating high-quality past interactions      
    # This creates enhanced retrievable content from successful Q&A pairs      
    if fine_tune and feedback_data:     
        vector_store = fine_tune_index(vector_store, chunks, feedback_data)          
          
    # Step 4: Execute core RAG with feedback-aware retrieval      
    # Note: This depends on the rag_with_feedback_loop function which should be defined elsewhere      
    result = rag_with_feedback_loop(query, vector_store, feedback_data)          
      
    # Step 5: Collect user feedback to improve future performance      
    print("\n=== Would you like to provide feedback on this response? ===")      
    print("Rate relevance (1-5, with 5 being most relevant):")      
    relevance = input()          
      
    print("Rate quality (1-5, with 5 being highest quality):")      
    quality = input()          
      
    print("Any comments? (optional, press Enter to skip)")      
    comments = input()          
      
    # Step 6: Format feedback into structured data      
    feedback = get_user_feedback(      
        query=query,          
        response=result["response"],          
        relevance=int(relevance),          
        quality=int(quality),          
        comments=comments      
    )          
      
    # Step 7: Persist feedback to enable continuous system learning      
    store_feedback(feedback, feedback_file)      
    print("Feedback recorded. Thank you!")          
      
    return result

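The full_rag_workflow function does the following:

Load existing feedback: it checks the feedback_data.json file and loads any previous feedback.
Run the RAG pipeline: this part is similar to what we did before.
Ask for feedback: it prompts the user to rate the relevance and quality of the response.
Store the feedback: it saves the feedback to the feedback_data.json file.

How the feedback is actually used to improve retrieval is more involved and happens in functions such as fine_tune_index and adjust_relevance_scores (not shown here for brevity). The key idea is that good feedback raises the relevance of certain documents, while bad feedback lowers it.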
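
Since adjust_relevance_scores is not shown, the following is only a sketch of the idea that feedback raises or lowers a document's score. It assumes each feedback entry has the "response" and "relevance" fields collected by full_rag_workflow, and it takes a caller-supplied is_related predicate to decide which feedback applies to which document. The multiplier 0.5 + average / 5 (ranging from 0.7 to 1.5 for ratings 1 to 5) and the "relevance_score" key are assumptions of this sketch.

```python
def adjust_relevance_scores(results, feedback_data, is_related):
    """Scale each result's similarity by the average relevance rating of related feedback."""
    adjusted = []
    for result in results:
        ratings = [fb["relevance"] for fb in feedback_data if is_related(fb, result)]
        score = result["similarity"]
        if ratings:
            average = sum(ratings) / len(ratings)  # ratings are on a 1-5 scale
            score *= 0.5 + average / 5.0           # 0.7x for all-1s, 1.5x for all-5s
        adjusted.append({**result, "relevance_score": score})
    adjusted.sort(key=lambda r: r["relevance_score"], reverse=True)
    return adjusted


search_results = [
    {"text": "Doc A about bias", "similarity": 0.80},
    {"text": "Doc B about data volume", "similarity": 0.70},
]
past_feedback = [
    {"query": "q", "response": "Answer built from Doc B about data volume", "relevance": 5},
]


def text_appears_in_response(feedback, result):
    """Toy matching rule: the document text is quoted in the rated response."""
    return result["text"] in feedback["response"]


adjusted_results = adjust_relevance_scores(search_results, past_feedback, text_appears_in_response)
print([(r["text"], round(r["relevance_score"], 2)) for r in adjusted_results])
```

With no feedback the ranking is unchanged, which matches the first-run behavior described below.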

Let's run a simplified version, assuming we have no existing feedback:

  
# we don't have previous feedback, therefore "fine_tune=False"  
result = full_rag_workflow(pdf_path=pdf_path, query=query, fine_tune=False)  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
Evaluation score is 0.7 because ....

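This is not a huge jump, which is to be expected. A feedback loop improves the system over time through repeated interactions. This section only demonstrates the mechanism.


In practice, feedback has to be accumulated over a long period and put to use before it improves retrieval. That is what allows a RAG system to adapt and personalize itself to the kinds of queries it receives.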

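13. Adaptive RAG

We have explored various ways to improve RAG: better chunking, added context, query transformation, reranking, and even feedback.

But what if the best technique depends on the type of question being asked? That is the idea behind Adaptive RAG.

We use four different strategies here:

Factual Strategy: focuses on retrieving precise facts and data.
Analytical Strategy: aims for comprehensive coverage of a topic, exploring its different aspects.
Opinion Strategy: tries to gather different viewpoints on subjective questions.
Contextual Strategy: incorporates user-specific context to tailor retrieval.

Below, a function named rag_with_adaptive_retrieval handles the whole process: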

  
def rag_with_adaptive_retrieval(pdf_path, query, k=4, user_context=None):  
    """      
    Complete RAG pipeline with adaptive retrieval.      
      
    """      
    print("\n=== RAG WITH ADAPTIVE RETRIEVAL ===")      
    print(f"Query: {query}")          
      
    # Process the document to extract text, chunk it, and create embeddings      
    chunks, vector_store = process_document(pdf_path)          
      
    # Classify the query to determine its type      
    query_type = classify_query(query)      
    print(f"Query classified as: {query_type}")          
      
    # Retrieve documents using the adaptive retrieval strategy based on the query type      
    retrieved_docs = adaptive_retrieval(query, vector_store, k, user_context)          
      
    # Generate a response based on the query, retrieved documents, and query type      
    response = generate_response(query, retrieved_docs, query_type)          
      
    # Compile the results into a dictionary      
    result = {      
        "query": query,          
        "query_type": query_type,          
        "retrieved_documents": retrieved_docs,          
        "response": response      
    }          
    print("\n=== RESPONSE ===")      
    print(response)          
      
    return result

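The query is first classified with a function named classify_query. Based on the identified type, the matching specialized retrieval strategy is selected and executed (factual_retrieval_strategy, analytical_retrieval_strategy, opinion_retrieval_strategy, or contextual_retrieval_strategy).

Finally, generate_response produces the response from the retrieved documents.

The function returns a dictionary of results containing the query, the query type, the retrieved documents, and the generated response.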
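
The classify-then-dispatch step can be sketched with a lookup table from query type to strategy function. The four strategy names come from the article, but their bodies here are stubs that only label the query, and dispatch_retrieval and the fallback to the factual strategy are assumptions made for this illustration.

```python
def factual_retrieval_strategy(query, k=4, user_context=None):
    return [f"[factual] {query}"]


def analytical_retrieval_strategy(query, k=4, user_context=None):
    return [f"[analytical] {query}"]


def opinion_retrieval_strategy(query, k=4, user_context=None):
    return [f"[opinion] {query}"]


def contextual_retrieval_strategy(query, k=4, user_context=None):
    return [f"[contextual] {query} (context: {user_context})"]


# Map each query type to its retrieval strategy
STRATEGIES = {
    "factual": factual_retrieval_strategy,
    "analytical": analytical_retrieval_strategy,
    "opinion": opinion_retrieval_strategy,
    "contextual": contextual_retrieval_strategy,
}


def dispatch_retrieval(query, query_type, k=4, user_context=None):
    """Run the strategy matching query_type; fall back to factual for unknown types."""
    strategy = STRATEGIES.get(query_type.strip().lower(), factual_retrieval_strategy)
    return strategy(query, k=k, user_context=user_context)


docs = dispatch_retrieval("How is AI a double-edged sword?", "Analytical")
print(docs)
```

A fallback matters because the classifier is itself an LLM call and may return a label outside the expected four.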

Let's evaluate how well this approach works:

  
# Run the adaptive RAG pipeline  
result = rag_with_adaptive_retrieval(pdf_path, query)  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
Evaluation score is 0.86

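By adapting the retrieval strategy to the specific type of query, we get better results than with a one-size-fits-all approach. This highlights how important it is to understand user intent and tailor the RAG system accordingly.


Adaptive RAG is not a fixed procedure; it is a framework that picks the best strategy for each query.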

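14. Self RAG

So far our RAG systems have been largely passive: they take a query, retrieve information, and generate a response. Self-RAG takes a different approach: it is active and reflective. Beyond retrieving and generating, it considers whether to retrieve, what to retrieve, and how to use what it retrieved.

These "reflection" steps make Self-RAG more dynamic and adaptable than traditional RAG. It can decide to:

Skip retrieval entirely.
Retrieve multiple times with different strategies.
Discard irrelevant information.
Prioritize well-supported and useful information.

At the core of Self-RAG is its ability to generate "reflection tokens", special tokens the model uses to reason about its own process. For example, it uses different tokens for retrieval_needed, relevance, support_rating, and utility_ratings.


The model uses a combination of these tokens to decide when retrieval is needed, when it is not, and what the LLM should base its final response on.

First, determine whether retrieval is needed: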

  
def determine_if_retrieval_needed(query):      
    """      
    (Illustrative Example - NOT fully functional)      
    Determines if retrieval is necessary for the given query.      
    """      
    system_prompt = """You are an AI assistant that determines if retrieval is necessary to answer a query.      
    For factual questions, specific information requests, or questions about events, people, or concepts, answer "Yes".      
    For opinions, hypothetical scenarios, or simple queries with common knowledge, answer "No".      
    Answer with ONLY "Yes" or "No"."""      
      
    user_prompt = f"Query: {query}\n\nIs retrieval necessary to answer this query accurately?"      
      
    response = client.chat.completions.create(  
        model="meta-llama/Llama-3.2-3B-Instruct",  
        messages=[  
            {"role": "system", "content": system_prompt},  
            {"role": "user", "content": user_prompt}  
        ],  
        temperature=0  
    )  
    # Normalize the reply and treat any answer containing "yes" as a request to retrieve  
    answer = response.choices[0].message.content.strip().lower()  
    return "yes" in answer

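The determine_if_retrieval_needed function (again simplified) uses an LLM to judge whether external information is needed.


For a factual question such as "What is the capital of France?", it may return False (the LLM probably already knows the answer).


For a creative task such as "Write a poem...", it may also return False.


For more complex or niche queries, it returns True.

Here is a simplified example of relevance evaluation: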

  
def evaluate_relevance(query, context):      
    """      
    (Illustrative Example - NOT fully functional)      
    Evaluates the relevance of a context to the query.      
    """      
    system_prompt = """You are an AI assistant. Determine if a document is relevant to a query.      
    Answer with ONLY "Relevant" or "Irrelevant"."""      
      
    user_prompt = f"""Query: {query}      
    Document content:      
    {context[:500]}... [truncated]      
      
    Is this document relevant to the query? Answer with ONLY "Relevant" or "Irrelevant".      
    """      
      
    response = client.chat.completions.create(      
        model="meta-llama/Llama-3.2-3B-Instruct",          
        messages=[        
            {"role": "system", "content": system_prompt},              
            {"role": "user", "content": user_prompt}      
        ],          
        temperature=0      
    )      
    answer = response.choices[0].message.content.strip().lower()      
    return answer

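This evaluate_relevance function (again simplified) uses an LLM to judge whether a retrieved document is relevant to the query, which lets Self-RAG filter out irrelevant documents before generating a response.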
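
To make the overall control flow concrete, here is a minimal sketch of how the two decisions above might be wired together. It is not the notebook's `self_rag` implementation: `parse_yes_no`, `parse_relevance`, and `self_rag_sketch` are hypothetical names, and the LLM-backed helpers are replaced by plain callables so the example runs on its own. It also shows one parsing pitfall: a substring test for "relevant" would wrongly accept "irrelevant".

```python
from typing import Callable, List


def parse_yes_no(answer: str) -> bool:
    """Interpret a model reply as yes/no by looking only at its first word."""
    words = answer.strip().lower().split()
    return bool(words) and words[0].strip('.,!"\'') == "yes"


def parse_relevance(answer: str) -> bool:
    """Return True only for 'relevant'.

    A plain substring test would be wrong here, because the string
    'irrelevant' also contains 'relevant'.
    """
    return answer.strip().lower().strip('.,!"\'') == "relevant"


def self_rag_sketch(
    query: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, List[str]], str],
) -> dict:
    """Hypothetical control flow: decide, retrieve, filter, then generate."""
    if not needs_retrieval(query):
        # Retrieval skipped entirely: answer from the model's own knowledge.
        return {"retrieval_used": False, "contexts": [], "response": generate(query, [])}

    candidates = retrieve(query)
    # Discard chunks that the relevance judge rejects.
    contexts = [c for c in candidates if is_relevant(query, c)]
    return {"retrieval_used": True, "contexts": contexts, "response": generate(query, contexts)}


# Toy stand-ins for the LLM-backed helpers (assumptions, for illustration only).
demo = self_rag_sketch(
    "How does data volume affect bias?",
    needs_retrieval=lambda q: True,
    retrieve=lambda q: ["Large data sets can amplify bias.", "Paris is in France."],
    is_relevant=lambda q, c: "bias" in c.lower(),
    generate=lambda q, ctx: f"Answer based on {len(ctx)} chunk(s).",
)
print(demo["response"])
```

In a real pipeline the lambdas would be replaced by calls such as determine_if_retrieval_needed and evaluate_relevance, with their raw replies passed through the two parsers.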


Finally, everything can be invoked with a single call:

  
# Call the `self_rag` function; it automatically decides
# when to retrieve and when not to.
result = self_rag(query, vector_store)  
print(result["response"])  
  
### OUTPUT ###  
Evaluation score for the AI Response is 0.65

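The evaluation score of 0.65 reflects the following:

Self-RAG has great potential, but a complete implementation is complex.
Even the "Is retrieval needed?" step shown here can sometimes be wrong.
We did not show the full "reflection" process, so a higher score cannot be expected.

The key takeaway is that Self-RAG aims to make RAG systems more intelligent and adaptive. It is a step toward LLMs that can reason about their own knowledge and retrieval needs.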

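15. GraphRAG

So far, our RAG systems have treated documents as collections of independent chunks. But what if the information is interconnected? What if understanding one concept requires understanding related concepts? This is where Graph RAG comes in.

Instead of a flat list of chunks, Graph RAG organizes information as a knowledge graph. Think of it as a network:

Nodes: represent concepts, entities, or pieces of information (such as our text chunks).
Edges: represent the relationships between those nodes.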


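The core idea is that by traversing this graph we can find not only directly relevant information but also indirectly related information that provides key context.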
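
As an illustration of that idea, the sketch below performs a best-first traversal over a tiny hand-written graph. It is not the notebook's `traverse_graph`: the graph, the seed score, the `traverse` function, and the rule of scaling a score by the edge weight are all assumptions made for the example, and only the standard library is used.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical toy graph: node id -> {neighbor id: edge weight}.
GRAPH: Dict[int, Dict[int, float]] = {
    0: {1: 0.9},
    1: {0: 0.9, 2: 0.8},
    2: {1: 0.8},
    3: {},
}


def traverse(graph: Dict[int, Dict[int, float]],
             seeds: List[Tuple[int, float]],
             max_nodes: int = 3) -> List[int]:
    """Best-first expansion: start from seed nodes scored against the query,
    then follow the strongest edges to pull in connected context."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    heap = [(-score, node) for node, score in seeds]
    heapq.heapify(heap)
    visited: List[int] = []
    while heap and len(visited) < max_nodes:
        neg_score, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.append(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                # A neighbor inherits the current score scaled by the edge weight.
                heapq.heappush(heap, (neg_score * weight, neighbor))
    return visited


# Node 0 matches the query directly; node 2 is only reachable through node 1.
path = traverse(GRAPH, seeds=[(0, 0.95)])
print(path)
```

Node 2 never matched the query itself, yet it ends up in the result because it is connected to a node that did. Node 3 has no edges and is never reached.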

Let's look at some simplified code for the core steps.

First, build the knowledge graph:

  
import networkx as nx
import numpy as np

def build_knowledge_graph(chunks):
    """      
    Build a knowledge graph from text chunks using embeddings and concept extraction.      
    Args:          
        chunks (list of dict): List of text chunks, each containing a "text" field.      
    Returns:     
        tuple: (Graph with nodes as text chunks, list of embeddings)      
    """      
    graph, texts = nx.Graph(), [c["text"] for c in chunks]      
    embeddings = create_embeddings(texts)  # Compute embeddings      
      
    # Add nodes with extracted concepts and embeddings      
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):      
        graph.add_node(i, text=chunk["text"], concepts=extract_concepts(chunk["text"]), embedding=emb)

    # Create edges based on shared concepts and embedding similarity
    for i, j in ((i, j) for i in range(len(chunks)) for j in range(i + 1, len(chunks))):
        if shared_concepts := set(graph.nodes[i]["concepts"]) & set(graph.nodes[j]["concepts"]):
            # Cosine similarity between the two chunk embeddings
            sim = np.dot(embeddings[i], embeddings[j]) / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
            weight = 0.7 * sim + 0.3 * (len(shared_concepts) / min(len(graph.nodes[i]["concepts"]), len(graph.nodes[j]["concepts"])))              
            if weight > 0.6:          
                 graph.add_edge(i, j, weight=weight, similarity=sim, shared_concepts=list(shared_concepts))      
                   
    print(f"Graph built: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")      
    return graph, embeddings
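
To see how the edge rule in build_knowledge_graph behaves, here is a small worked example of the weight formula (0.7 times cosine similarity plus 0.3 times concept overlap, with an edge kept only above 0.6). The helper names `cosine` and `edge_weight`, the two-dimensional vectors, and the concept lists are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def edge_weight(emb_i, emb_j, concepts_i, concepts_j):
    """Same formula as build_knowledge_graph: 0.7 * similarity + 0.3 * concept overlap."""
    shared = set(concepts_i) & set(concepts_j)
    if not shared:
        return None  # no shared concept, so no edge is considered
    overlap = len(shared) / min(len(concepts_i), len(concepts_j))
    return 0.7 * cosine(emb_i, emb_j) + 0.3 * overlap


concepts_a = ["bias", "data"]
concepts_b = ["bias", "fairness", "ethics"]  # one shared concept, overlap = 1/2

# Similarity 0.6: 0.7 * 0.6 + 0.3 * 0.5 = 0.57, below the 0.6 threshold.
w_low = edge_weight([1.0, 0.0], [0.6, 0.8], concepts_a, concepts_b)
# Similarity 0.8: 0.7 * 0.8 + 0.3 * 0.5 = 0.71, above the threshold.
w_high = edge_weight([1.0, 0.0], [0.8, 0.6], concepts_a, concepts_b)

print(w_low > 0.6, w_high > 0.6)
```

With the same concept overlap, the edge appears only once the embeddings are similar enough, so the 0.7 weight on similarity dominates the decision.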

The graph_rag_pipeline function defines the complete RAG flow:

  
def graph_rag_pipeline(pdf_path, query, chunk_size=1000, chunk_overlap=200, top_k=3):  
    """      
    Complete Graph RAG pipeline from document to answer.      
    """      
    # Extract text from the PDF document      
    text = extract_text_from_pdf(pdf_path)          
      
    # Split the extracted text into overlapping chunks      
    chunks = chunk_text(text, chunk_size, chunk_overlap)          
      
    # Build a knowledge graph from the text chunks      
    graph, embeddings = build_knowledge_graph(chunks)          
      
    # Traverse the knowledge graph to find relevant information for the query      
    relevant_chunks, traversal_path = traverse_graph(query, graph, embeddings, top_k)          
      
    # Generate a response based on the query and the relevant chunks      
    response = generate_response(query, relevant_chunks)              
      
    # Return the query, response, relevant chunks, traversal path, and the graph      
    return {      
        "query": query,          
        "response": response,          
        "relevant_chunks": relevant_chunks,          
        "traversal_path": traversal_path,          
        "graph": graph      
    }

Let's evaluate it:

  
# Execute the Graph RAG pipeline to process the document and answer the query  
results = graph_rag_pipeline(pdf_path, query)  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{results['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
0.78

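The evaluation score is 0.78.

Graph RAG does not outperform the simpler methods here, but it captures the relationships between pieces of information, not just the individual pieces themselves.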


This is especially useful for complex queries that require understanding how concepts connect.

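16. Hierarchical Indexing

We have explored various ways to improve RAG: better chunking, context enrichment, query transformation, reranking, and even graph-based retrieval. But there is a fundamental trade-off:

Small chunks: good for precise matching, but they lose context.
Large chunks: preserve context, but can reduce retrieval relevance.

Hierarchical indexing offers a way out by creating two levels of representation:

Summaries: brief overviews of larger sections of the document.
Detailed chunks: smaller chunks within those sections.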


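First, search the summaries: this quickly narrows down the relevant sections of the document.

Then, search the detailed chunks only within those sections: this gives the precision of small chunks while preserving the context of the larger sections.

Let's use hierarchical_rag to see it in practice: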

  
import os
import pickle

def hierarchical_rag(query, pdf_path, chunk_size=1000, chunk_overlap=200,
                     k_summaries=3, k_chunks=5, regenerate=False):      
    """      
    Complete hierarchical Retrieval-Augmented Generation (RAG) pipeline.      
    Args:      
        query (str): The user query.          
        pdf_path (str): Path to the PDF document.          
        chunk_size (int): Size of text chunks for processing.          
        chunk_overlap (int): Overlap between consecutive chunks.          
        k_summaries (int): Number of top summaries to retrieve.          
        k_chunks (int): Number of detailed chunks to retrieve per summary.          
        regenerate (bool): Whether to reprocess the document.      
    Returns:          
        dict: Contains the query, generated response, retrieved chunks,                 
        and counts of summaries and detailed chunks.      
    """      
    # Define filenames for caching summary and detailed vector stores      
    summary_store_file = f"{os.path.basename(pdf_path)}_summary_store.pkl"      
    detailed_store_file = f"{os.path.basename(pdf_path)}_detailed_store.pkl"          
      
    # Process document if regeneration is required or cache files are missing      
    if regenerate or not os.path.exists(summary_store_file) or not os.path.exists(detailed_store_file):      
        print("Processing document and creating vector stores...")          
        summary_store, detailed_store = process_document_hierarchically(pdf_path, chunk_size, chunk_overlap)                  
          
        # Save processed stores for future use          
        with open(summary_store_file, 'wb') as f:        
             pickle.dump(summary_store, f)          
        with open(detailed_store_file, 'wb') as f:        
             pickle.dump(detailed_store, f)      
    else:          
        # Load existing vector stores from cache          
        print("Loading existing vector stores...")          
        with open(summary_store_file, 'rb') as f:        
            summary_store = pickle.load(f)          
        with open(detailed_store_file, 'rb') as f:        
            detailed_store = pickle.load(f)          
              
    # Retrieve relevant chunks using hierarchical search
    retrieved_chunks = retrieve_hierarchically(query, summary_store, detailed_store, k_summaries, k_chunks)

    # Generate a response based on the retrieved chunks
    response = generate_response(query, retrieved_chunks)

    # Return results with metadata
    return {
        "query": query,
        "response": response,
        "retrieved_chunks": retrieved_chunks,
        "summary_count": len(summary_store.texts),
        "detailed_count": len(detailed_store.texts)
    }

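The hierarchical_rag function performs a two-stage retrieval:

First, it finds the most relevant summaries in summary_store.
Then, it searches detailed_store only among the chunks that belong to those top summaries, which is far more efficient than searching all detailed chunks.

The function also takes a regenerate parameter that controls whether to build new vector stores or reuse the cached ones.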
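
The two-stage search can be sketched without any vector store at all. The example below is an assumption-laden toy, not the notebook's `retrieve_hierarchically`: `SECTIONS`, `retrieve_two_stage`, and the two-dimensional vectors are made up, and here k_chunks is the total number of chunks returned rather than a per-summary count.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy index: each section has one summary vector and several chunk vectors.
SECTIONS = {
    "ethics": {
        "summary": [1.0, 0.1],
        "chunks": {"bias in data": [0.9, 0.2], "privacy": [0.7, 0.6]},
    },
    "hardware": {
        "summary": [0.1, 1.0],
        "chunks": {"gpus": [0.0, 1.0], "tpus": [0.2, 0.9]},
    },
}


def retrieve_two_stage(query_vec, sections, k_summaries=1, k_chunks=1):
    # Stage 1: rank sections by summary similarity and keep the top k_summaries.
    top_sections = sorted(
        sections,
        key=lambda s: cosine(query_vec, sections[s]["summary"]),
        reverse=True,
    )[:k_summaries]
    # Stage 2: score only the chunks that belong to the selected sections.
    candidates = [
        (cosine(query_vec, vec), name)
        for s in top_sections
        for name, vec in sections[s]["chunks"].items()
    ]
    return [name for _, name in sorted(candidates, reverse=True)[:k_chunks]]


print(retrieve_two_stage([1.0, 0.0], SECTIONS))
```

With k_summaries=1 the "hardware" chunks are never scored, which is where the efficiency gain over a flat search comes from.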

Now let's run a query and evaluate the result:

  
# Run the hierarchical RAG pipeline  
result = hierarchical_rag(query, pdf_path)

  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT  
0.84

The evaluation score is 0.84, among the highest so far.

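17. HyDE

So far, we have always embedded the user's query directly, or a transformed version of it. HyDE (Hypothetical Document Embedding) takes a different approach: instead of embedding the query, it embeds a hypothetical document that answers the query.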


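The process is:

Generate a hypothetical document: use an LLM to generate a document that would answer the query, as if such a document existed;
Embed the hypothetical document: embed this generated document instead of the original query;
Retrieve: find documents whose embeddings are similar to that of the hypothetical document;
Generate: answer the query using the retrieved documents, not the hypothetical one.

The rationale is that a full document, even a hypothetical one, is semantically richer than a short query. This helps bridge the gap between queries and documents in the embedding space.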
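
A toy experiment can make this rationale tangible. The sketch below uses bag-of-words cosine similarity as a crude stand-in for a real embedding model, and the chunk, query, and hypothetical document are invented strings, so the numbers only illustrate the direction of the effect, not its size with real embeddings.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


chunk = "large training data sets accelerate learning but can amplify bias in models"
query = "why is data a double-edged sword"
# Stand-in for what generate_hypothetical_document might return (invented text).
hypothetical_doc = "large data sets accelerate learning but they can amplify bias so data quality matters"

sim_query = bow_cosine(query, chunk)
sim_hyde = bow_cosine(hypothetical_doc, chunk)
print(f"query vs chunk: {sim_query:.2f}, hypothetical doc vs chunk: {sim_hyde:.2f}")
```

The short query shares a single word with the chunk, while the answer-shaped text shares most of its vocabulary. The same mechanism can hurt when the generated document drifts away from the wording and content of the real corpus.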

Let's see how it works. First, we need a function that generates the hypothetical document.

  
def generate_hypothetical_document(query, desired_length=1000):  
    """      
    Generate a hypothetical document that answers the query.      
    """      
    # Define the system prompt to instruct the model on how to generate the document      
    system_prompt = f"""You are an expert document creator.       
    Given a question, generate a detailed document that would directly answer this question.      
    The document should be approximately {desired_length} characters long and provide an in-depth,       
    informative answer to the question. Write as if this document is from an authoritative source      
    on the subject. Include specific details, facts, and explanations.      
    Do not mention that this is a hypothetical document - just write the content directly."""      
      
    # Define the user prompt with the query      
    user_prompt = f"Question: {query}\n\nGenerate a document that fully answers this question:"          
      
    # Make a request to the OpenAI API to generate the hypothetical document      
    response = client.chat.completions.create(  
        model="meta-llama/Llama-3.2-3B-Instruct",  # Specify the model to use          
        messages=[        
            {"role": "system", "content": system_prompt},  # System message to guide the assistant              
            {"role": "user", "content": user_prompt}  # User message with the query          
        ],          
        temperature=0.1  # Set the temperature for response generation      
    )          
      
    # Return the generated document content      
    return response.choices[0].message.content

This function takes a query and asks the LLM to generate a document that attempts to answer it.

Now let's put everything together in a hyde_rag function:

  
def hyde_rag(query, vector_store, k=5, should_generate_response=True):  
    """      
    Perform RAG using Hypothetical Document Embedding.          
    """      
    print(f"\n=== Processing query with HyDE: {query} ===\n")          
      
    # Step 1: Generate a hypothetical document that answers the query      
    print("Generating hypothetical document...")      
    hypothetical_doc = generate_hypothetical_document(query)      
    print(f"Generated hypothetical document of {len(hypothetical_doc)} characters")          
      
    # Step 2: Create embedding for the hypothetical document      
    print("Creating embedding for hypothetical document...")      
    hypothetical_embedding = create_embeddings([hypothetical_doc])[0]          
      
    # Step 3: Retrieve similar chunks based on the hypothetical document      
    print(f"Retrieving {k} most similar chunks...")      
    retrieved_chunks = vector_store.similarity_search(hypothetical_embedding, k=k)          
      
    # Prepare the results dictionary      
    results = {      
        "query": query,          
        "hypothetical_document": hypothetical_doc,          
        "retrieved_chunks": retrieved_chunks      
    }
          
    # Step 4: Generate a response if requested      
    if should_generate_response:          
        print("Generating final response...")          
        response = generate_response(query, retrieved_chunks)          
        results["response"] = response          
    return results

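The hyde_rag function works roughly as follows:

Generate the hypothetical document;
Create an embedding of that document (rather than of the query);
Retrieve using that embedding;
Generate the answer from the retrieved content.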

  
# Run HyDE RAG  
hyde_result = hyde_rag(query, vector_store)  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{hyde_result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT  
0.5

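The evaluation score is 0.5. HyDE is a sound idea, but it did not perform better here. A likely reason is that the generated hypothetical document drifted in a direction slightly different from our actual document collection, which lowered retrieval relevance.

The takeaway so far is that there is no single "best" RAG technique. Different methods suit different queries and different data.

18. Fusion RAG

As we have seen, different retrieval methods have different strengths. Vector search excels at semantic similarity, while keyword search excels at finding exact matches. What if we could combine them? That is the idea behind Fusion RAG.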


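Rather than relying on a single retrieval method, Fusion RAG runs both and then combines and reranks the results. This lets us capture both semantic meaning and exact keyword matches.

The fusion_retrieval function performs vector-based and BM25-based retrieval, normalizes the scores from each, merges them with a weighted formula, and ranks the documents by the combined score.

Here is the fusion retrieval function: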

  
import numpy as np  
  
def fusion_retrieval(query, chunks, vector_store, bm25_index, k=5, alpha=0.5):  
    """Perform fusion retrieval by combining vector-based and BM25 search results."""          
      
    # Generate embedding for the query      
    query_embedding = create_embeddings(query)      
      
    # Perform vector search and store results in a dictionary (index -> similarity score)      
    vector_results = {      
        r["metadata"]["index"]: r["similarity"]           
        for r in vector_store.similarity_search_with_scores(query_embedding, len(chunks))      
    }      
      
    # Perform BM25 search and store results in a dictionary (index -> BM25 score)      
    bm25_results = {      
        r["metadata"]["index"]: r["bm25_score"]           
        for r in bm25_search(bm25_index, chunks, query, len(chunks))      
    }      
      
    # Retrieve all documents from the vector store      
    all_docs = vector_store.get_all_documents()      
      
    # Compute combined scores for each document using a weighted sum of vector and BM25 scores      
    scores = [      
        (i, alpha * vector_results.get(i, 0) + (1 - alpha) * bm25_results.get(i, 0))           
        for i in range(len(all_docs))      
    ]      
      
    # Sort documents by combined score in descending order and keep the top k results      
    top_docs = sorted(scores, key=lambda x: x[1], reverse=True)[:k]      
      
    # Return the top k documents with text, metadata, and combined score      
    return [      
        {"text": all_docs[i]["text"], "metadata": all_docs[i]["metadata"], "score": s}           
        for i, s in top_docs      
    ]

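It combines the strengths of both methods:

Vector search: uses the existing create_embeddings and SimpleVectorStore for semantic similarity.
BM25 search: implements keyword-based search with the BM25 algorithm, a standard information retrieval technique.
Combined search: merges the scores from both methods into a single, unified ranking.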
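
The condensed fusion_retrieval listing adds the raw scores directly, so the normalization step described in the text is not visible in it. One common way to do it is min-max scaling before the weighted sum, sketched below. The function names `min_max_normalize` and `fuse` and the sample scores are assumptions for illustration, and the notebook may normalize differently.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking information, map everything to 0.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(vector_scores, bm25_scores, alpha=0.5, k=2):
    """Weighted sum of normalized vector and BM25 scores; returns the top k."""
    v = min_max_normalize(vector_scores)
    b = min_max_normalize(bm25_scores)
    combined = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(v) | set(b)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


vector_scores = {0: 0.82, 1: 0.78, 2: 0.40}  # cosine similarities, bounded
bm25_scores = {0: 1.0, 1: 9.0, 2: 5.0}       # BM25 scores, unbounded

top = fuse(vector_scores, bm25_scores)
print([doc for doc, _ in top])
```

Without normalization the unbounded BM25 values would dominate the sum: the raw weighted scores rank the documents 1, 2, 0, while the normalized version ranks them 1, 0, 2, so the best semantic match is no longer pushed to the bottom.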

Now let's run the basic Fusion RAG logic:

  
# First, process the document to create chunks, vector store, and BM25 index  
chunks, vector_store, bm25_index = process_document(pdf_path)  
  
# Run RAG with fusion retrieval  
fusion_result = answer_with_fusion_rag(query, chunks, vector_store, bm25_index)  
print(fusion_result["response"])  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{fusion_result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT  
Evaluation score for AI Response is 0.83

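The final evaluation score is 0.83, which looks quite good. It is like having two experts work together: one good at understanding the meaning of the query, the other good at finding exact matches.

19. Multi-Modal RAG

So far we have only dealt with text. But a lot of information lives in images and charts. Multi-modal RAG unlocks this information and uses it to improve our responses.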


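The main changes compared with text-only RAG are:

Extract text and images: pull both text and images out of the PDF;
Generate image captions: use an LLM (specifically, a vision-capable model) to generate a text description (caption) for each image;
Create embeddings (text and captions): create embeddings for both the text chunks and the image captions;
Embedding model: in this notebook we use the BAAI/bge-en-icl embedding model;
LLM: to generate responses and image captions, we use the llava-hf/llava-1.5-7b-hf model.

This way, our vector store contains both textual and visual information, and we can search across both modalities.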

  
import os

def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """      
    Process a document for multi-modal RAG.      
    """      
    # Create a directory for extracted images      
    image_dir = "extracted_images"      
    os.makedirs(image_dir, exist_ok=True)          
      
    # Extract text and images from the PDF      
    text_data, image_paths = extract_content_from_pdf(pdf_path, image_dir)          
      
    # Chunk the extracted text      
    chunked_text = chunk_text(text_data, chunk_size, chunk_overlap)          
      
    # Process the extracted images to generate captions      
    image_data = process_images(image_paths)          
      
    # Combine all content items (text chunks and image captions)      
    all_items = chunked_text + image_data          
      
    # Extract content for embedding      
    contents = [item["content"] for item in all_items]          
      
    # Create embeddings for all content      
    print("Creating embeddings for all content...")      
    embeddings = create_embeddings(contents)          
      
    # Build the vector store and add items with their embeddings      
    vector_store = MultiModalVectorStore()      
    vector_store.add_items(all_items, embeddings)          
      
    # Prepare document info with counts of text chunks and image captions      
    doc_info = {      
        "text_count": len(chunked_text),          
        "image_count": len(image_data),          
        "total_items": len(all_items),      
    }          
      
    # Print summary of added items      
    print(f"Added {len(all_items)} items to vector store ({len(chunked_text)} text chunks, {len(image_data)} image captions)")          
      
    # Return the vector store and document info      
    return vector_store, doc_info

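This function handles image extraction and captioning, and builds the multi-modal vector store, MultiModalVectorStore.

We assume here that the image captions are reasonably good. (In a real-world scenario, caption quality needs to be evaluated carefully.)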
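
The MultiModalVectorStore class itself is not shown in this article. As a rough idea of what such a store could look like, the sketch below keeps text chunks and image captions in one list and tells them apart only through metadata. The class name `ToyMultiModalStore`, the metadata keys, the image path, and the two-dimensional vectors are all hypothetical.

```python
import math
from typing import Dict, List


class ToyMultiModalStore:
    """Hypothetical stand-in for MultiModalVectorStore: text chunks and image
    captions live side by side and are distinguished only by metadata."""

    def __init__(self) -> None:
        self.items: List[Dict] = []
        self.vectors: List[List[float]] = []

    def add_items(self, items: List[Dict], embeddings: List[List[float]]) -> None:
        """Store each item together with its embedding."""
        for item, emb in zip(items, embeddings):
            self.items.append(item)
            self.vectors.append(emb)

    def similarity_search(self, query_vec: List[float], k: int = 2) -> List[Dict]:
        """Return the k items whose embeddings are closest to the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            zip(self.items, self.vectors),
            key=lambda pair: cosine(query_vec, pair[1]),
            reverse=True,
        )
        return [item for item, _ in ranked[:k]]


store = ToyMultiModalStore()
store.add_items(
    [
        {"content": "The encoder has six identical layers.",
         "metadata": {"type": "text"}},
        {"content": "Diagram of the Transformer architecture.",
         "metadata": {"type": "image", "image_path": "extracted_images/fig1.png"}},
    ],
    [[1.0, 0.0], [0.6, 0.8]],
)
hits = store.similarity_search([0.5, 0.9], k=1)
print(hits[0]["metadata"]["type"])
```

Because captions are embedded like any other text, a query can surface an image purely through its description, which is why caption quality matters so much.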


Now let's evaluate multi-modal RAG with a query:

  
# Process the document to create vector store. we have a new pdf for this  
pdf_path = "data/attention_is_all_you_need.pdf"  
vector_store, doc_info = process_document(pdf_path)  
  
# Run the multi-modal RAG pipeline.  This is very similar to before!  
result = query_multimodal_rag(query, vector_store)  
  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT  
0.79

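The evaluation score is 0.79. Multi-modal RAG can be very powerful, especially for documents that contain images.

20. CRAG

So far, our RAG systems have been relatively passive: they retrieve information and generate a response. But what if the retrieved information is wrong? What if it is irrelevant, incomplete, or even contradictory? Corrective RAG (CRAG) aims to solve this problem.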


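CRAG adds a key step: evaluation. After the initial retrieval, it checks the relevance of the retrieved documents and, depending on the result, follows a different strategy:

High relevance: if the retrieved documents are good, proceed as usual;
Low relevance: if the retrieved documents are poor, fall back to web search;
Medium relevance: if the documents are only partially relevant, combine information from both the documents and the web.

This "corrective" mechanism makes CRAG more robust than standard RAG.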
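
The three-way decision can be written as a small routing function. This is only a sketch: the name `choose_knowledge_source`, the thresholds 0.7 and 0.3, and the rule of using the best score among the retrieved documents are assumptions, not values confirmed by the article.

```python
from typing import List


def choose_knowledge_source(relevance_scores: List[float],
                            high: float = 0.7,
                            low: float = 0.3) -> str:
    """Map the best relevance score to one of the three CRAG strategies.

    The thresholds are illustrative assumptions.
    """
    # With no retrieved documents at all, treat relevance as zero.
    best = max(relevance_scores, default=0.0)
    if best > high:
        return "documents"          # high relevance: use retrieved documents
    if best < low:
        return "web_search"         # low relevance: fall back to web search
    return "documents_and_web"      # in between: combine both sources


for scores in ([0.9, 0.2], [0.1, 0.2], [0.5]):
    print(scores, "->", choose_knowledge_source(scores))
```

In a full pipeline the scores would come from an LLM-based relevance judge, and the returned label would select which context is passed to the response generator.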

The following call runs it:

  
# Run CRAG  
crag_result = rag_with_compression(pdf_path, query, compression_type="selective")

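This single function call does a lot:

Initial retrieval: retrieves documents as usual;
Relevance evaluation: scores each document's relevance to the query;
Decision: decides whether to use the documents, perform a web search, or combine both;
Response generation: generates the response from the chosen knowledge source.

Evaluating the result: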

  
# Evaluate.  
evaluation_prompt = f"User Query: {query}\nAI Response:\n{crag_result['response']}\nTrue Response: {reference_answer}\n{evaluate_system_prompt}"  
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)  
print(evaluation_response.choices[0].message.content)  
  
### OUTPUT ###  
0.824

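The evaluation score is 0.824. CRAG's ability to detect and correct retrieval failures makes it more reliable than standard RAG.

By dynamically switching to web search when necessary, it can handle a wider range of queries and avoid getting stuck with irrelevant or insufficient information.

This "self-correcting" ability is an important step toward more robust and trustworthy RAG systems.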
