干货 | 如何系统性地评价RAG系统

本文翻译/改写自：https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG

聊天机器人作为大语言模型的一种典型应用形式，而这种应用的普及很大部分收到了检索增强生成（RAG）框架的发展。RAG框架融合了知识库和生成式模型的优势，通过将OpenQA转换为ClosedQA来降低模型的幻觉，减少错误信息的输出，同时又保障了输出信息的时效性，并且使得聊天机器人应用能够快速的在领域知识得到快速应用。

虽然RAG框架目前非常流行，但是如何有效的评价RAG的效果仍然是一个问题，因为这块目前还有没有统一的行业标准。目前的方法主要依赖人工评估，但是这种方法效率低，并且难以大规模推广。

为了解决这个问题，《LLM-auto-eval-best-practices-RAG》这篇文章通过一系列的时间和探索，制定了一套大语言模型 LLM+RAG的自动化评估实践。可以帮大家快速、科学的评价RAG应用，并投入到生产环境中。

自动化评估的难点在哪里？

目前在LLM模型评估这个领域，通用做法是利用LLM来充当自动化评估的“裁判”，比如利用GPT-4来评定大语言模型的输出质量。虽然这种方法目前使用的非常多，但是这块在实际应用的时候也是有一些问题的：

• 与人类评分的一致性：在文档问答聊天机器人的场景下，LLM 评判的打分是否能准确反映人类对于答案正确性、易读性和完整性的真实偏好？
• 通过few shot提高准确性：向 LLM 评判提供少量评分示例的方法效果如何？这种做法能在多大程度上提升 LLM 评判在不同评价标准上的可靠性和复用性？
• 合适的评分尺度：鉴于不同框架采用的评分尺度各异（如 AzureML 的 0 至 100 分制与 langchain 的二元制），我们应该如何选择适当的评分尺度？
• 评估指标的通用性：在同一评估指标（如正确性）下，这一指标在不同应用场景（如日常聊天、内容概括、检索增强生成等）之间的通用性和可复用性如何？

如何在RAG应用中进行自动化评估

为了解决以上问题，作者针对性的探讨了多种解决方案：

• LLM 作为评判，在超过 80%的情况下与人工评分达成一致 。在作者的RAG聊天机器人评估中，LLM 评判的效果接近人工评判，超过 80%的评价与人工评分完全一致，且在 95%以上的情况下，评分差距不超过 1 分（采用 0-3 分的评分制度）。
• 利用 GPT-3.5 及few shot降低成本 。通过为每个评分等级提供示例，GPT-3.5 能够充当 LLM 评判，考虑到上下文长度的限制，使用低精度评分制度更为实用。相比 GPT-4，采用few shot辅助的 GPT-3.5 能将 LLM 评判的成本减少 10 倍，并将评估速度提升 3 倍以上。
• 采用低精度评分制度，简化解读过程 。作者发现，低精度的评分等级如 0、1、2、3 甚至二元评分（0、1），与高精度的 0-10.0 或 0-100.0 相比，在保持相当精度的同时，极大地简化了向人类标注者和 LLM 评判提供评分标准的难度。低精度评分制度还有助于保持不同 LLM 评判间的评分一致性（比如 GPT-4 与 claude2 之间）。
• RAG 应用需定制专属基准 。一个模型在特定领域的基准测试中（如休闲聊天、数学或创意写作）表现优异，并不代表它在其他任务（如根据给定上下文回答问题）上同样出色 。基准测试应与用例相匹配，即 RAG 应用应仅使用 RAG 基准进行评估。

根据我们的研究，我们建议在使用 LLM 评判时遵循以下步骤：

• 采用 1 至 5 分的评分制度；
• 使用 GPT-4 作为 LLM 评判，无需示例，以便理解评分规则；
• 将 LLM 评判切换至 GPT-3.5，每个评分等级提供一个示例。

构建最佳实践的方法指南

实验架构设计

picture.image

实验包含三个步骤：

步骤一：构建评估数据集

基于Databricks文档中的100个问题及其相关上下文，编制了一份数据集。上下文是指与问题紧密相关的文档部分。

步骤二：制作答案表

借助评估数据集，让各类语言模型撰写答案，并将问题、上下文与答案的组合保存在名为“答案表”的数据集中。本次研究涉及的模型包括GPT-4、GPT-3.5、Claude-v1、Llama2-70b-chat、Vicuna-33b和mpt-30b-chat。

步骤三：打分与评价

依据答案表，我们利用不同的LLM为答案打出成绩，并提供得分的理由。打分是根据正确性（占比60%） 、完整性（占比20%） 以及易读性（占比20%） 综合评定的。我们之所以这样分配权重，是为了突显我们对答案正确性的重视。尽管不同应用可能会有不同的权重调整策略，但我们相信正确性始终是最关键的考量因素。

此外，运用了一些特殊技巧来减少定位偏差并增强评估的可信度 ：

• 采用低温设置（温度设为0.1），确保评分结果的一致性和可复现性。
• 采取单答案评分机制，跳过成对比较的复杂过程。
• 引入思维链路，允许LLM在给出最终评分前对评分流程进行深入思考。
• 运用少样本生成法，为LLM提供针对每个评分维度（正确性、完整性、易读性）的评分标准样例，帮助其做出判断。

实验一：与人为评分的契合度

这一阶段的实验目的是评估人工评分者和LLM评判之间的一致性程度，便将gpt-3.5-turbo及vicuna-33b生成的答案表（0至3分的评分制度）提交给专业标注团队获取人工评分，并将这些评分与GPT-4的评分结果进行对比。主要发现如下：

在正确性和可读性评分方面，人类评委与GPT-4评委的一致性可以超过80%。若我们将一致性标准放宽至1分以内的差异，这一比率则能提升至95%以上。

picture.image

全面性评分的匹配度不高，这正印证了业界利益相关者的看法，他们认为“全面性”这一标准相较于“正确性”或“可读性”等更具主观性。

实验二：透过示例提高准确度

lmsys的论文通过设定一系列标准，引导LLM评判依据回答的助益性、相关度、准确度、深度、创意及细节层次来打分。然而，论文并未透露具体的评分细节。我们的研究发现，诸多因素对最终评分有着显著影响，比如：

• 各项标准的相对重要性：助益性、相关度、准确度、深度、创意；
• 某些标准如助益性的界定颇为模糊；
• 当不同标准相互矛盾时，比如回答虽有助益却未必准确的情况；

为了针对特定的评分标准指导LLM评判，我们开发了一套评分规则，并尝试以下方法：

1、原始提示：以下是lmsys论文中原封不动采用的提示。

  
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format

我们对lmsys论文中原有提示进行了适配，使其能够输出关于正确性、全面性和可读性的度量标准，并要求评判在给出每个分数前附加一条简短的解释（以便利用连贯推理的优势）。

接下来是无示例的零样本提示和仅提供一个示例的少样本提示。使用了相同的答题数据作为输入，对比两种不同提示下的评分结果。

2、零样本学习：要求LLM评判根据我们设定的关于正确性、全面性和可读性的度量标准进行评价，并对每个分数给出一行解释性文字。

  
Please act as an impartial judge and evaluate the quality of the provided answer which attempts to answer the provided question based on a provided context.  
  You'll be given a function grading_function which you'll call for each provided context, question and answer to submit your reasoning and score for the correctness, comprehensiveness and readability of the answer.

3、少次样本学习：我们对原有的零样本提示进行了修改，为评分标准中的每个等级都配备了具体示例。升级后的提示：

  
Please act as an impartial judge and evaluate the quality of the provided answer which attempts to answer the provided question based on a provided context.  
  
  You'll be given a function grading_function which you'll call for each provided context, question and answer to submit your reasoning and score for the correctness, comprehensiveness and readability of the answer.   
  
    
  
  Below is your grading rubric:   
  
- Correctness: If the answer correctly answer the question, below are the details for different scores:  
  
  - Score 0: the answer is completely incorrect, doesn’t mention anything about the question or is completely contrary to the correct answer.  
  
      - For example, when asked “How to terminate a databricks cluster”, the answer is empty string, or content that’s completely irrelevant, or sorry I don’t know the answer.  
  
  - Score 1: the answer provides some relevance to the question and answers one aspect of the question correctly.  
  
      - Example:  
  
          - Question: How to terminate a databricks cluster  
  
          - Answer: Databricks cluster is a cloud-based computing environment that allows users to process big data and run distributed data processing tasks efficiently.  
  
          - Or answer:  In the Databricks workspace, navigate to the "Clusters" tab. And then this is a hard question that I need to think more about it  
  
  - Score 2: the answer mostly answer the question but is missing or hallucinating on one critical aspect.  
  
      - Example:  
  
          - Question: How to terminate a databricks cluster”  
  
          - Answer: “In the Databricks workspace, navigate to the "Clusters" tab.  
  
          Find the cluster you want to terminate from the list of active clusters.  
  
          And then you’ll find a button to terminate all clusters at once”  
  
  - Score 3: the answer correctly answer the question and not missing any major aspect  
  
      - Example:  
  
          - Question: How to terminate a databricks cluster  
  
          - Answer: In the Databricks workspace, navigate to the "Clusters" tab.  
  
          Find the cluster you want to terminate from the list of active clusters.  
  
          Click on the down-arrow next to the cluster name to open the cluster details.  
  
          Click on the "Terminate" button. A confirmation dialog will appear. Click "Terminate" again to confirm the action.”  
  
- Comprehensiveness: How comprehensive is the answer, does it fully answer all aspects of the question and provide comprehensive explanation and other necessary information. Below are the details for different scores:  
  
  - Score 0: typically if the answer is completely incorrect, then the comprehensiveness is also zero score.  
  
  - Score 1: if the answer is correct but too short to fully answer the question, then we can give score 1 for comprehensiveness.  
  
      - Example:  
  
          - Question: How to use databricks API to create a cluster?  
  
          - Answer: First, you will need a Databricks access token with the appropriate permissions. You can generate this token through the Databricks UI under the 'User Settings' option. And then (the rest is missing)  
  
  - Score 2: the answer is correct and roughly answer the main aspects of the question, but it’s missing description about details. Or is completely missing details about one minor aspect.  
  
      - Example:  
  
          - Question: How to use databricks API to create a cluster?  
  
          - Answer: You will need a Databricks access token with the appropriate permissions. Then you’ll need to set up the request URL, then you can make the HTTP Request. Then you can handle the request response.  
  
      - Example:  
  
          - Question: How to use databricks API to create a cluster?  
  
          - Answer: You will need a Databricks access token with the appropriate permissions. Then you’ll need to set up the request URL, then you can make the HTTP Request. Then you can handle the request response.  
  
  - Score 3: the answer is correct, and covers all the main aspects of the question  
  
- Readability: How readable is the answer, does it have redundant information or incomplete information that hurts the readability of the answer.  
  
  - Score 0: the answer is completely unreadable, e.g. fully of symbols that’s hard to read; e.g. keeps repeating the words that it’s very hard to understand the meaning of the paragraph. No meaningful information can be extracted from the answer.  
  
  - Score 1: the answer is slightly readable, there are irrelevant symbols or repeated words, but it can roughly form a meaningful sentence that cover some aspects of the answer.  
  
      - Example:  
  
          - Question: How to use databricks API to create a cluster?  
  
          - Answer: You you  you  you  you  you  will need a Databricks access token with the appropriate permissions. And then then you’ll need to set up the request URL, then you can make the HTTP Request. Then Then Then Then Then Then Then Then Then  
  
  - Score 2: the answer is correct and mostly readable, but there is one obvious piece that’s affecting the readability (mentioning of irrelevant pieces, repeated words)  
  
      - Example:  
  
          - Question: How to terminate a databricks cluster  
  
          - Answer: In the Databricks workspace, navigate to the "Clusters" tab.  
  
          Find the cluster you want to terminate from the list of active clusters.  
  
          Click on the down-arrow next to the cluster name to open the cluster details.  
  
          Click on the "Terminate" button…………………………………..  
  
          A confirmation dialog will appear. Click "Terminate" again to confirm the action.  
  
  - Score 3: the answer is correct and reader friendly, no obvious piece that affect readability.  
  
- Then final rating:  
    - Ratio: 60% correctness + 20% comprehensiveness + 20% readability

通过本次实验，作者发现：

采用GPT-4的少样本提示对结果的一致性并没有带来显著改善 。即便我们引入了附有示例的详尽评分细则，GPT-4在各类LLM模型上的评分成效并未有明显进步。

有趣的是，这一做法反而使得分数的分布范围出现了轻微波动。

picture.image

向GPT-3.5-turbo-16k提供少量示例能显著增强评分的稳定性 ，并使得评分结果变得可用。引入详尽的评分细则或示例对GPT-3.5的评分结果带来了显著提升。虽然GPT-4与GPT-3.5的得分平均值略有差异（3.0分对比2.6分），但两者在排名和精确度上保持了相当的一致性。相较之下，缺乏评分标准的GPT-3.5产生了极不稳定的结果，无法投入实际使用。需注意，由于提示长度可超过4k个令牌，我们选择了GPT-3.5-turbo-16k而非GPT-3.5-turbo。

picture.image

实验三：选择合适的评分尺度

在LLM-as-judge的论文中，采用了非整数的0到10的范围（即浮点数）作为评分标准，也就是说，对于最终得分，它采用了高精度的评分规则。我们发现，这些高精度的评分标准在后续应用中存在以下问题：

• 一致性问题：评估者，无论是人类还是LLM，在进行高精度评分时，很难对相同的分数保持统一的评判标准。因此，我们观察到，当评分尺度从低精度转变为高精度时，不同评判者给出的分数一致性有所下降。
• 可解释性挑战：此外，如果我们想要将LLM的评判结果与人类的评判结果进行相互验证，我们必须提供如何对答案进行评分的指导。在高精度评分尺度中，为每个“分数”提供准确的评分指导是非常困难的——例如，得分为5.1与5.6的答案，各自应该具备怎样的特点？

作者对不同的低精度评分尺度进行了实验，以确定哪一个是“最佳”选择，最终我们建议使用0-3或0-4的整数评分尺度（如果你倾向于使用Likert类型尺度）。我们尝试了0-10、1-5、0-3和0-1等不同的尺度，并从中认识到：

• 对于“可用性”或“好/坏”这类简单指标，二元评分法是有效的。
• 对于0-10这样的评分尺度，很难为每一个分数等级制定明确的区分标准。

picture.image

从上图可以看出，GPT-4与GPT-3.5能够在多种低精度评分尺度下保持结果排序的一致性，故采用03或15这类简化的评分尺度，既能保持评估的精确度，又便于解释和理解。

我们建议将0-3或1-5作为评分标准，这样做不仅有助于与人工评分保持一致，也便于对评分准则进行逻辑推理，并为该范围内的每个分数等级提供相应的示例。

实验四：不同应用场景的适用性

LLM-as-judge的论文指出，无论是通过LLM还是人类评判，Vicuna-13B模型都被认为与GPT-3.5模型实力相当：

picture.image

上图源自LLM-as-judge论文的图4：https://arxiv.org/pdf/2306.05685.pdf

但是，在针对文档问答应用场景对这些模型进行评估时，即便规模更大的Vicuna-33B模型，在根据上下文回答问题方面的表现也明显不及GPT-3.5。这一点也得到了GPT-4、GPT-3.5以及人类评判的一致确认（如实验一所述），他们都认同Vicuna-33B的表现要逊于GPT-3.5。

picture.image

我们仔细审视了论文建议的基准数据集，发现其包含的三大任务类别（写作、数学、知识）并不直接体现或促进模型依据上下文整合答案的能力。

实际上，文档问答这类用例需要的是阅读理解和指令遵循 方面的评估标准。这意味着不同用例间的评估结果无法通用，为了准确衡量模型是否能满足用户需求，我们必须开发针对具体用例定制的评估基准。

通往 AGI 的神秘代码

  
if like_this_article():  
    do_action('点赞')  
    do_action('再看')  
    add_wx_friend('iamxxn886')  
  
if like_all_arxiv_articles():  
    go_to_link('https://github.com/HuggingAGI/HuggingArxiv')    star_github_repo(''https://github.com/HuggingAGI/HuggingArxiv')