Before reading this article, you may want to read:
Improving Retrieval and Introducing Reranking to Boost the Quality of LLM Applications under the RAG Architecture
As noted in the previous article, taking a RAG architecture to production raises many issues around quality, safety, trustworthiness, and more. Looking at the overall pipeline, a RAG system has two parts: index construction and retrieval. Index construction lays the foundation, in the sense that it creates the room for later retrieval-strategy optimization.
Index construction itself can be broken down into three steps: loading documents, parsing them, and indexing, with the results stored in the appropriate backend.
Both langchain and llamaindex ship many document loaders, for web pages, PDFs, common databases, and so on. Their core job is to pull data in from external sources; the RAG system then treats everything uniformly as a Document. The parsing stage chunks large documents into smaller segments, and the index stage embeds those segments into an index that can be retrieved against.
In a simple RAG system, the vector database holds nothing but the chunks split from documents, so retrieval is limited to coarse top-k strategies. This article introduces attaching descriptive information to documents and chunks, which opens up more ways to improve and control indexing, retrieval, and generation. That information is metadata: auxiliary attributes describing a document or chunk, such as title, summary, or owner. Once this metadata exists, it can be used both when building the embedding index and at retrieval time, which substantially improves the quality and controllability of a RAG system. In the previous article, Improving Retrieval and Introducing Reranking to Boost the Quality of LLM Applications under the RAG Architecture, the recency-first reranking strategy relied on exactly this: a date attribute stored in metadata.
This article uses llamaindex as the running example to introduce the concept of metadata and its use cases.
Understanding Metadata
Concepts
llamaindex has two basic concepts: document and node. A document can be split into several nodes, also called chunks when the content is text. Besides its text content, a document carries two extra pieces of information: metadata and relationships. metadata is a dictionary describing the basic attributes of the document or node; its contents are free-form, and nodes (chunks) split from a document can inherit the document's metadata. relationships is also a dictionary, storing the document's or node's relations to other documents and nodes.
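The inheritance of document metadata by its chunks can be sketched in plain Python. This is a conceptual illustration only, not the llamaindex implementation; the chunk size and field names are made up for the example:

```python
# Conceptual sketch: a "document" carries metadata, and every chunk
# (node) split from it inherits a copy of that metadata.
def split_into_chunks(text, metadata, chunk_size=20):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append({
            "text": text[i:i + chunk_size],
            "metadata": dict(metadata),  # each node inherits the parent's metadata
        })
    return chunks

doc_metadata = {"filename": "report.pdf", "category": "finance"}
nodes = split_into_chunks("A" * 50, doc_metadata)
# every node now carries the document-level metadata alongside its own text
```

Because each chunk carries the parent's attributes, anything known at the document level (owner, date, category) automatically becomes filterable at the chunk level.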
How is metadata added to a document? In principle the dictionary can hold any data, but since the metadata is usually stored in the vector database, keys must be strings and values must be str, float, or int so they can be stored and queried easily. Here are the common ways to set metadata.
1. Set metadata when creating the document.
from llama_index import Document

document = Document(
    text='text',
    metadata={
        'filename': '<doc_file_name>',
        'category': '<category>'
    }
)
2. Set metadata after the document is created.
document.metadata = {'filename': '<doc_file_name>'}
3. Use SimpleDirectoryReader with the file_metadata hook to set the filename automatically. The hook runs on each file and fills in the metadata field:
from llama_index import SimpleDirectoryReader
filename_fn = lambda filename: {'file_name': filename}
# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader('./data', file_metadata=filename_fn).load_data()
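The key/value type constraint mentioned above can be made concrete with a small validation helper. This is a hypothetical sketch, not part of llamaindex:

```python
def validate_metadata(metadata):
    """Check that metadata keys are strings and values are str, float or int,
    so the dict can be stored and queried in a typical vector database."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        if not isinstance(value, (str, float, int)):
            raise TypeError(f"metadata value for {key!r} must be str, float or int")
    return metadata

# a page number (int), a score (float) and a filename (str) are all valid
validate_metadata({"filename": "report.pdf", "page": 3, "score": 0.9})
```

Running a check like this at ingestion time fails fast, instead of surfacing as an opaque vector-store error later.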
In practice, not all metadata has to be set by hand. llamaindex provides several built-in metadata extractors (Extractor) that take much of the work out of it:
TitleExtractor: extracts the document title from the context of each node and stores it in the document_title metadata field.
QuestionsAnsweredExtractor: a node-level extractor. It extracts the set of questions each node can answer and stores them in the questions_this_excerpt_can_answer metadata field. At query time the retriever can match the query against these stored questions to recall the most relevant chunks, which clearly improves recall relevance.
SummaryExtractor: a node-level extractor with cross-node awareness. It automatically extracts summaries over a group of nodes and stores them in the section_summary, prev_section_summary, and next_section_summary metadata fields. Note that SummaryExtractor does not directly answer questions like "Can you summarize the first page of the document?"; its role is to recall nodes related to a node's context. The final summary is still produced at the generation stage, for example with TreeSummarize.
KeywordExtractor: a node-level extractor. It extracts keywords from each node and stores them in the excerpt_keywords metadata field, which the retriever can use to recall the nodes whose keywords best match the query.
EntityExtractor: extracts entities into a metadata field using the SpanMarker library with the default model tomaarsen/span-marker-mbert-base-multinerd.
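Conceptually, each extractor maps a node's text to a dict of metadata fields, and the results are merged into the node. A plain-Python sketch of that pipeline (the extractor bodies here are trivial stand-ins, not the LLM-backed extractors above):

```python
def keyword_extractor(text):
    # stand-in for KeywordExtractor: pretend the longest words are keywords
    words = sorted(set(text.lower().split()), key=len, reverse=True)
    return {"excerpt_keywords": ", ".join(words[:3])}

def length_extractor(text):
    # hypothetical extractor: record the excerpt length
    return {"excerpt_length": len(text)}

def apply_extractors(node, extractors):
    """Run each extractor over the node's text and merge the fields it returns
    into the node's metadata dict."""
    for extractor in extractors:
        node["metadata"].update(extractor(node["text"]))
    return node

node = {"text": "The pandemic decreased demand for the platform", "metadata": {}}
apply_extractors(node, [keyword_extractor, length_extractor])
```

The real extractors follow the same shape, except that each field is produced by an LLM call over the node (and, for SummaryExtractor, its neighbors).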
Using Metadata at Retrieval Time
Using metadata at the retrieval stage requires support from the vector database; these days, most vector databases can filter on metadata.
Basic usage:
1. Define metadata and build the index.
from llama_index import StorageContext, VectorStoreIndex
from llama_index.schema import TextNode
from llama_index.vector_stores import PineconeVectorStore

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
        },
    ),
]
vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index, namespace="test_05_14"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
2. Define a metadata filter.
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters
filters = MetadataFilters(filters=[ExactMatchFilter(key="theme", value="Mafia")])
3. Retrieve with the filter.
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is inception about?")
The filter can also be passed directly to the underlying vector store:
retriever = index.as_retriever(vector_store_kwargs={"filter": {"theme": "Mafia"}})
retriever.retrieve("What is inception about?")
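What the vector store does with such a filter can be illustrated in plain Python: candidate nodes are matched on metadata before (or alongside) vector similarity. A minimal sketch, not the Pinecone or llamaindex implementation:

```python
def exact_match_filter(nodes, filters):
    """Keep only nodes whose metadata matches every key/value pair in filters,
    mimicking what a vector database's metadata filter does around similarity search."""
    return [
        node for node in nodes
        if all(node["metadata"].get(k) == v for k, v in filters.items())
    ]

nodes = [
    {"text": "The Godfather", "metadata": {"theme": "Mafia"}},
    {"text": "Inception", "metadata": {"director": "Christopher Nolan"}},
]
matched = exact_match_filter(nodes, {"theme": "Mafia"})
# only "The Godfather" survives the filter; similarity ranking then runs on the survivors
```

This is why the query "What is inception about?" above returns nothing about Inception when the theme filter is set to "Mafia": the filter removes the node before ranking ever sees it.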
At retrieval time, llamaindex can also use the LLM to generate the filter automatically (VectorIndexAutoRetriever).
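The idea behind auto-retrieval is that an LLM reads the query together with a schema of the available metadata fields and emits the filter itself. A rule-based stand-in for that LLM step (purely hypothetical; VectorIndexAutoRetriever makes a real LLM call):

```python
def auto_generate_filters(query, known_values):
    """Stand-in for the LLM step in auto-retrieval: scan the query for any
    known metadata values and turn the hits into exact-match filters."""
    filters = {}
    for key, values in known_values.items():
        for value in values:
            if value.lower() in query.lower():
                filters[key] = value
    return filters

# schema of metadata values the index is known to contain (made up for the sketch)
known = {"theme": ["Mafia", "Friendship"], "director": ["Christopher Nolan"]}
auto_generate_filters("movies about the Mafia", known)
```

The real auto-retriever is far more flexible, since the LLM can infer filters that are only implied by the query rather than stated verbatim.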
One more note: because of refactoring across llamaindex versions, metadata was called extra_info in earlier releases.
Use Cases
In terms of use cases, metadata can improve retrieval quality, provide citation and provenance for generated content, enforce document-level or even chunk-level access control, and drive business policies such as sorting, filtering, and post-processing.
Scenario 1: Improving recall relevance
The data for this case study is the executive summaries of the U.S. government's financial reports for 2022 and 2021, downloaded from the U.S. Treasury website (treasury.gov). These PDFs are only a dozen or so pages each, but they are dense with statistics.
Let's build the application using the built-in metadata extractors.
1. Create the metadata extractors and pass them to the node parser.
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
)
from llama_index.text_splitter import TokenTextSplitter

# define LLM service
llm = OpenAI(temperature=0.1, model_name="gpt-3.5-turbo", max_tokens=512)
service_context = ServiceContext.from_defaults(llm=llm)

# construct text splitter to split texts into chunks for processing
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

# set the global service context object, avoiding passing service_context when building the index
from llama_index import set_global_service_context
set_global_service_context(service_context)

# create metadata extractor
metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=1, llm=llm),  # the title is on the first page, so pass 1 to the nodes param
        QuestionsAnsweredExtractor(questions=3, llm=llm),  # extract 3 questions for each node; customizable
        SummaryExtractor(summaries=["prev", "self"], llm=llm),  # extract summaries for the previous and current node
        KeywordExtractor(keywords=10, llm=llm),  # extract 10 keywords for each node
    ],
)

# create node parser to parse nodes from documents
node_parser = SimpleNodeParser(
    text_splitter=text_splitter,
    metadata_extractor=metadata_extractor,
)
2. Use the node parser to extract nodes from the loaded PDF documents, then build the index.
import json

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# loading documents
documents_2022 = SimpleDirectoryReader(input_files=["data/executive-summary-2022.pdf"], filename_as_id=True).load_data()
print(f"loaded documents_2022 with {len(documents_2022)} pages")
documents_2021 = SimpleDirectoryReader(input_files=["data/executive-summary-2021.pdf"], filename_as_id=True).load_data()
print(f"loaded documents_2021 with {len(documents_2021)} pages")

# use node_parser to get nodes from documents
nodes_2022 = node_parser.get_nodes_from_documents(documents_2022)
nodes_2021 = node_parser.get_nodes_from_documents(documents_2021)
print(f"loaded nodes_2022 with {len(nodes_2022)} nodes")
print(f"loaded nodes_2021 with {len(nodes_2021)} nodes")

# print metadata in json format
for node in nodes_2022:
    metadata_json = json.dumps(node.metadata, indent=4)  # convert metadata to formatted JSON
    print(metadata_json)
for node in nodes_2021:
    metadata_json = json.dumps(node.metadata, indent=4)  # convert metadata to formatted JSON
    print(metadata_json)

# based on the nodes and service_context, create the index
index = VectorStoreIndex(nodes=nodes_2022 + nodes_2021, service_context=service_context)
The metadata generated for one of the nodes looks like this:
{
    "page_label": "9",
    "file_name": "executive-summary-2022.pdf",
    "document_title": "Comprehensive Overview of the U.S. Government's Financial Report for 2022",
    "questions_this_excerpt_can_answer": "1. What is the projected debt-to-GDP ratio for the U.S. government over the next 75 years if current policy is maintained?\n2. How do policy delays in addressing the debt-to-GDP ratio impact future generations?\n3. How can changes in fiscal policy be implemented in a way that minimizes the negative impact on economic growth?",
    "prev_section_summary": "The executive summary of the 2022 Financial Report of the U.S. Government highlights the fiscal gap and the importance of timely fiscal policy reform. The estimated fiscal gap for 2022 is 4.9 percent of GDP, indicating the need for spending reductions and receipt increases equivalent to 4.2 percent of GDP on average over the next 75 years to achieve fiscal sustainability. The timing of policy changes is crucial, as delaying reform would impose a greater burden on future generations. The longer the delay, the larger the post-reform primary surpluses required to achieve the target debt-to-GDP ratio. The report projects a rise in the government's debt-to-GDP ratio over the 75-year period.",
    "section_summary": "The section discusses the impact of delayed policy changes on future generations and the unsustainability of current fiscal policy. It also highlights the government's efforts to address climate change, as mandated by EO 14008. The Financial Report summarizes how various agencies are responding to the climate crisis and provides links to their Climate Adaptation and Resilience Plans.",
    "excerpt_keywords": "delayed, post-reform primary surpluses, target debt-to-GDP ratio, 75-year period, future generations, harmed, policy delay, higher primary surpluses, taxes, programmatic spending, projections, Financial Report, government's debt-to-GDP ratio, current policy, sustainable, fiscal policy, economic growth, revenue, spending, reporting, climate change, EO 14008, tackling the climate crisis, United States, international leadership, legislation, policy actions, CFO Act agencies, financial statements, climate adaptation, resilience plans."
}
Here document_title, questions_this_excerpt_can_answer, prev_section_summary, section_summary, and excerpt_keywords were all generated automatically by the extractors. It is worth pointing out that knowing how to lean on the LLM itself is an important skill in LLM application development, and this is a good example of it: every one of these fields was produced by a metadata extractor calling the LLM.
For comparison, let's also build an index without metadata, dropping every extractor-generated field and keeping only page_label.
from copy import deepcopy

nodes_2022 = node_parser.get_nodes_from_documents(documents_2022)
nodes_2021 = node_parser.get_nodes_from_documents(documents_2021)
print(f"loaded nodes_2022 with {len(nodes_2022)} nodes")
print(f"loaded nodes_2021 with {len(nodes_2021)} nodes")

nodes_no_metadata = deepcopy(nodes_2022) + deepcopy(nodes_2021)
for node in nodes_no_metadata:
    node.metadata = {
        k: node.metadata[k] for k in node.metadata if k in ["page_label"]
    }
index = VectorStoreIndex(nodes=nodes_no_metadata, service_context=service_context)
Comparing the results:
Question 1: How did the budget deficit and net operating cost change in the financial report for 2022 compared to previous years?
Although the question contains the keyword "compare", the answer lives entirely within the 2022 financial report itself, so there is no need for LlamaIndex's sub-question query engine (SubQuestionQueryEngine) to consult both reports. A plain query engine built on the index works fine.
# load the index built with metadata or without metadata
index = load_index()
# queries the index with the input text
response = index.as_query_engine().query(input_text)
Comparing the responses:
The answer from the metadata-enriched index is more concise and accurate: it lists the actual percentage decrease/increase of the budget deficit and net operating cost, matching the report. The answer without metadata is wrong: the 2022 budget deficit in fact decreased, yet that answer claims both the budget deficit and net operating cost increased in 2022.
Question 2: What percentage of the U.S. government's total tax and other revenues in the 2022 financial report came from taxes?
Again the two answers diverge sharply. Without metadata, the LLM replies that the question cannot be answered from the given context; with metadata, it returns the exact percentage of taxes within the U.S. government's total tax and other revenues for 2022. This shows metadata playing a decisive role in improving recall relevance.
Full source code: https://github.com/wenqiglantz/llamaindex-metadata-financial-reports.
Scenario 2: Adding citations
In llamaindex, SimpleDirectoryReader automatically generates the page_label and file_name metadata. Those two fields are enough to attach citations to an answer.
reader = SimpleDirectoryReader(input_files=["../data/10k/lyft_2021.pdf"])
data = reader.load_data()
index = VectorStoreIndex.from_documents(data, service_context=service_context)
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
response = query_engine.query(
    "What was the impact of COVID? Show statements in bullet form and show page reference after each statement."
)
response.print_response_stream()
• The ongoing COVID-19 pandemic continues to impact communities in the United States, Canada and globally (page 6).
• The pandemic and related responses caused decreased demand for our platform leading to decreased revenues as well as decreased earning opportunities for drivers on our platform (page 6).
• Our business continues to be impacted by the COVID-19 pandemic (page 6).
• The exact timing and pace of the recovery remain uncertain (page 6).
• The extent to which our operations will continue to be impacted by the pandemic will depend largely on future developments, which are highly uncertain and cannot be accurately predicted (page 6).
• An increase in cases due to variants of the virus has caused many businesses to delay employees returning to the office (page 6).
• We anticipate that continued social distancing, altered consumer behavior, reduced travel and commuting, and expected corporate cost cutting will be significant challenges for us (page 6).
• We have adopted multiple measures, including, but not limited, to establishing new health and safety requirements for ridesharing and updating workplace policies (page 6).
• We have had to take certain cost-cutting measures, including lay-offs, furloughs and salary reductions, which may have adversely affect employee morale, our culture and our ability to attract and retain employees (page 18).
• The ultimate impact of the COVID-19 pandemic on our users, customers, employees, business, operations and financial performance depends on many factors that are not within our control (page 18).
Inspecting the source nodes:
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
Text: Impact of COVID-19 to our BusinessThe ongoing COVID-19 pandemic continues to impact communities in the United States, Canada and globally. Since the pandemic began in March 2020,governments and private businesses - at the recommendation of public health officials - have enacted precautions to mitigate the spread of the virus, including travelrestrictions and social distancing measures in many regions of the United States and Canada, and many enterprises have instituted and maintained work from homeprograms and limited the number of employees on site. Beginning in the middle of March 2020, the pandemic and these related responses caused decreased demand for ourplatform leading to decreased revenues as well as decreased earning opportunities for drivers on our platform. Our business continues to be impacted by the COVID-19pandemic. Although we have seen some signs of demand improving, particularly compared to the dema ...
Metadata: {'page_label': '6', 'file_name': 'lyft_2021.pdf'}
Score: 0.821
Text: will continue to be impacted by the pandemic will depend largely on future developments, which are highly uncertain and cannot beaccurately predicted, including new information which may emerge concerning COVID-19 variants and the severity of the pandemic and actions by government authoritiesand private businesses to contain the pandemic or recover from its impact, among other things. For example, an increase in cases due to variants of the virus has causedmany businesses to delay employees returning to the office. Even as travel restrictions and shelter-in-place orders are modified or lifted, we anticipate that continued socialdistancing, altered consu mer behavior, reduced travel and commuting, and expected corporate cost cutting will be significant challenges for us. The strength and duration ofthese challenges cannot b e presently estimated.In response to the COVID-19 pandemic, we have adopted multiple measures, including, but not limited, to establishing ne ...
Metadata: {'page_label': '56', 'file_name': 'lyft_2021.pdf'}
Score: 0.808
Text: storing unrented and returned vehicles. These impacts to the demand for and operations of the different rental programs have and may continue to adversely affectour business, financial condi tion and results of operation.• The COVID-19 pandemic may delay or prevent us, or our current or prospective partners and suppliers, from being able to test, develop or deploy autonomousvehicle-related technology, including through direct impacts of the COVID-19 virus on employee and contractor health; reduced consumer demand forautonomous vehicle travel resulting from an overall reduced demand for travel; shelter-in-place orders by local, state or federal governments negatively impactingoperations, including our ability to test autonomous vehicle-related technology; impacts to the supply chains of our current or prospective partners and suppliers;or economic impacts limiting our or our current or prospective partners’ or suppliers’ ability to expend resources o ...
Metadata: {'page_label': '18', 'file_name': 'lyft_2021.pdf'}
Score: 0.805
As the output shows, given the prompt "What was the impact of COVID? Show statements in bullet form and show page reference after each statement.", the model uses the page_label metadata to attach a page citation to each statement.
llamaindex can also use CitationQueryEngine to produce answers with citations; langchain has a similar facility, for which see its official documentation.
from llama_index.query_engine import CitationQueryEngine

query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    # here we can control how granular citation sources are; the default is 512
    citation_chunk_size=512,
)
response = query_engine.query("What did the author do growing up?")
print(response)
Output:
Before college, the author worked on writing short stories and programming on an IBM 1401 using an early version of Fortran [1]. They later got a TRS-80 computer and wrote simple games, a program to predict rocket heights, and a word processor [2].
Scenario 3: Permission filtering
Permission markers can be written into metadata at document-ingestion time, in step with business logic. For example:
document = Document(
    text='text',
    metadata={
        'owner_id': '123',
        'accessRole': 'role1'
    }
)
When user A makes a request, the system can look up the user's roles and then, via a metadata filter, retrieve only the material the user is allowed to see before handing it to the LLM for generation.
filters = MetadataFilters(filters=[ExactMatchFilter(key="accessRole", value="role1")])
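The full permission flow, a role lookup followed by metadata-filtered retrieval, can be sketched in plain Python. The user and role names are made up for the example; in a real system the filter would be handed to the vector store as shown above:

```python
# hypothetical user -> roles mapping, looked up from the business system
USER_ROLES = {"user_a": ["role1"], "user_b": ["role2"]}

def retrieve_for_user(user_id, nodes):
    """Return only the nodes whose accessRole is among the user's roles."""
    roles = USER_ROLES.get(user_id, [])
    return [n for n in nodes if n["metadata"].get("accessRole") in roles]

nodes = [
    {"text": "internal memo", "metadata": {"accessRole": "role1"}},
    {"text": "hr record", "metadata": {"accessRole": "role2"}},
]
# user_a only ever sees role1 material; anything else never reaches the LLM
retrieve_for_user("user_a", nodes)
```

Because the filtering happens at retrieval time, unauthorized content is excluded from the prompt entirely rather than relying on the model to withhold it.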
Summary
Metadata opens the door to better quality and finer-grained control for applications built on the RAG architecture. RAG is no longer just vector retrieval plus generation: business-specific policies can be injected wherever the use case requires. That makes it much easier to land RAG broadly across industries, and metadata is bound to see ever wider use.