RAG Evaluation: The Needle in a Haystack Test

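Retrieval-augmented generation (RAG) underpins many of the LLM applications in use today, from companies generating headlines to solo developers solving problems for small businesses.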

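RAG evaluation has therefore become a critical part of developing and deploying these systems. One innovative new approach is the "Needle in a Haystack" test, first proposed by Greg Kamradt in a post and discussed in detail on his YouTube channel. The test is designed to evaluate how a RAG system performs across contexts of different sizes. It works by embedding a specific, targeted piece of information (the "needle") in a larger, more complex body of text (the "haystack"). The goal is to assess an LLM's ability to identify and use that specific piece of information amid a large amount of data.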

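In a typical RAG system, the context window is packed with information. Large chunks of context returned from a vector database are jumbled together with the instructions for the language model, the templating, and anything else that may be in the prompt. The Needle in a Haystack evaluation tests an LLM's ability to pick specific details out of this mess. Your RAG system might do an excellent job of retrieving the most relevant context, but what good is that if the fine-grained details inside it are overlooked?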

We ran this test multiple times across several major language models. Let's take a closer look at the process and the overall results.

Takeaways

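• Not all LLMs are the same. Models are trained with different objectives and requirements in mind. For example, Anthropic's Claude is known for being a slightly wordier model, which often stems from its goal of not making unsubstantiated claims.
• Because of this, minute differences in prompts can lead to drastically different outcomes across models. Some LLMs need more tailored prompting to perform well at specific tasks.
• When building on top of LLMs, especially when those models are connected to private data, it is necessary to evaluate retrieval and model performance throughout development and deployment. Seemingly insignificant differences can lead to remarkably large differences in performance.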

Understanding the Needle In a Haystack Test

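The Needle in a Haystack test was first used to evaluate the recall of two popular LLMs, OpenAI's GPT-4 and Anthropic's Claude 2.1. An out-of-place statement, "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day," was placed at varying depths within snippets of varying lengths taken from Paul Graham's essays, like this: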

Figure 1: About 120 tokens and 50% depth
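
The placement step can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the original test: it measures depth in whitespace-separated words rather than model tokens, and the function name `insert_needle` and the sentence-boundary rule are assumptions made for this example.

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Place `needle` roughly `depth_percent` of the way through `haystack`.

    Depth is measured in whitespace-separated words, a stand-in for the
    token counts a real harness would use (assumption for this sketch).
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    words = haystack.split()
    cut = round(len(words) * depth_percent / 100)
    # Back up to the nearest sentence end so the needle does not split a sentence.
    while 0 < cut < len(words) and not words[cut - 1].endswith((".", "!", "?")):
        cut -= 1
    return " ".join(words[:cut] + needle.split() + words[cut:])


essay = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
print(insert_needle(essay, "The needle sentence.", 50))
```

At 0% the needle becomes the first sentence and at 100% the last, which matches the depth convention used in the charts.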

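The models were then prompted to answer what the best thing to do in San Francisco is, using only the provided context. This was repeated at depths ranging from 0% (top of the document) to 100% (bottom of the document) and at context lengths ranging from 1K tokens up to each model's token limit (128K for GPT-4, 200K for Claude 2.1). The charts below document the performance of the two models: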

Figure 2: GPT-4's performance

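As you can see, GPT-4's performance starts to decline at 64K tokens and falls sharply at 100K and beyond. Interestingly, if the needle is positioned toward the beginning of the context, the model tends to overlook or "forget" it, whereas if it is positioned toward the end or is the very first sentence, the model's performance remains solid.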

Figure 3: Claude 2.1’s performance

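For Claude, the initial test did not go as smoothly, finishing with an overall retrieval accuracy of 27%. A similar pattern was observed: performance declined as context length increased, generally improved when the needle was hidden closer to the bottom of the document, and reached 100% retrieval accuracy when the needle was the first sentence of the context.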
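
The sweep behind these charts is a grid over context length and needle depth, with one retrieval attempt (or several) per cell. The sketch below shows one way to enumerate such a grid and aggregate misses per cell. The even spacing, the helper names `build_grid` and `miss_rate_by_cell`, and the result tuples are assumptions made for illustration, not details taken from the original experiments.

```python
from itertools import product


def build_grid(min_len: int, max_len: int, n_lengths: int, n_depths: int):
    """Return (context_length, depth_percent) pairs on an evenly spaced grid.

    Requires n_lengths >= 2 and n_depths >= 2. Even spacing is an assumption.
    """
    step = (max_len - min_len) / (n_lengths - 1)
    lengths = [round(min_len + i * step) for i in range(n_lengths)]
    depths = [round(100 * j / (n_depths - 1)) for j in range(n_depths)]
    return list(product(lengths, depths))


def miss_rate_by_cell(results):
    """Aggregate (context_length, depth_percent, found) records per grid cell."""
    totals, misses = {}, {}
    for length, depth, found in results:
        key = (length, depth)
        totals[key] = totals.get(key, 0) + 1
        misses[key] = misses.get(key, 0) + (0 if found else 1)
    return {key: misses[key] / totals[key] for key in totals}


grid = build_grid(1_000, 128_000, n_lengths=5, n_depths=5)
print(len(grid), grid[0], grid[-1])
```

Plotting the per-cell miss rate with context length on one axis and depth on the other gives a heatmap of the kind shown in the figures.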

Anthropic’s Response

In response to these findings, Anthropic published an article detailing their re-run of the test with a few key changes.

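First, they changed the needle to more closely match the topic of the haystack. Claude 2.1 was trained to "not [answer] a question based on a document if it doesn't contain enough information to justify that answer." Claude may therefore well correctly identify eating a sandwich in Dolores Park as the best thing to do in San Francisco. Placed alongside an essay about doing great work, however, this small piece of information can appear unsubstantiated. That can lead to a verbose response explaining that Claude cannot confirm that eating a sandwich is the best thing to do in San Francisco, or to the detail being omitted altogether. When they re-ran the experiment, Anthropic's researchers found that changing the needle to a small detail originally mentioned in the essay led to significantly better results.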

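Second, a small edit was made to the prompt template used to query the model. A single line, "here is the most relevant sentence in the context:", was added to the end of the template, directing the model to simply return the most relevant sentence provided in the context. Like the first change, this allows us to sidestep the model's tendency to avoid unsubstantiated claims, because the model is directed to return a sentence rather than make an assertion.

PROMPT = """

HUMAN: <context>
{context}
</context>

What is the most fun thing to do in San Francisco based on the context? Don't give information outside the document or repeat our findings

A: here is the most relevant sentence in the context:"""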
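
The technique used here, ending the prompt with the first words of the assistant's reply, can be sketched as follows. This is an illustrative helper only: the function name `build_prompt` and the exact whitespace are assumptions, and no model is called.

```python
PREFILL = "Here is the most relevant sentence in the context:"


def build_prompt(context: str, question: str, prefill: str = "") -> str:
    """Assemble a Human/Assistant text prompt, optionally seeding the reply."""
    prompt = (
        f"\n\nHuman: <context>\n{context}\n</context>\n\n"
        f"{question}\n\nAssistant:"
    )
    # With a prefill, the model continues from this text instead of starting fresh.
    return f"{prompt} {prefill}" if prefill else prompt


question = "What is the most fun thing to do in San Francisco based on the context?"
plain = build_prompt("...essay text...", question)
guided = build_prompt("...essay text...", question, PREFILL)
print(guided)
```

The two prompts are identical except for the trailing prefill, so any difference in retrieval accuracy between them can be attributed to that one line.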

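These changes produced a dramatic improvement in Claude's overall retrieval accuracy: from 27% to 98%! Finding this initial research fascinating, we decided to run our own set of experiments using the Needle in a Haystack test.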

Further Experiments

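In running a new series of tests, we made several modifications to the original experiments. The needle we used was a random number that changed on every iteration, eliminating the possibility of caching. We also used the open source Phoenix evals library (full disclosure) to cut down testing time, and used rails to search the output directly for the random number, cutting through wordiness that could otherwise lower the retrieval score. Finally, we considered the negative case in which the system fails to retrieve a result, marking it as unanswerable. We ran a separate test for this negative case to assess how well the system recognizes when it cannot retrieve the data. These modifications allowed us to conduct a more rigorous and comprehensive evaluation.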
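
The random-number needle and the output check can be illustrated with plain Python. This sketch does not use the Phoenix evals API; the needle wording, the function names `make_needle` and `score_output`, and the "correct"/"incorrect" labels are all assumptions standing in for the rails described above.

```python
import random
import re
from typing import Optional, Tuple


def make_needle(rng: random.Random) -> Tuple[str, str]:
    """Return (needle_sentence, number) with a fresh 7-digit random number."""
    number = str(rng.randint(1_000_000, 9_999_999))
    # Hypothetical needle wording, chosen for this example.
    return f"The special magic number is: {number}", number


def score_output(output: str, expected_number: Optional[str]) -> str:
    """Label a model output as 'correct' or 'incorrect'.

    expected_number=None is the negative case: no needle was inserted,
    so the only correct response is UNANSWERABLE.
    """
    if expected_number is None:
        return "correct" if "UNANSWERABLE" in output.upper() else "incorrect"
    # Look for the number as a standalone run of digits, ignoring any verbosity.
    pattern = rf"(?<!\d){re.escape(expected_number)}(?!\d)"
    return "correct" if re.search(pattern, output) else "incorrect"


needle, number = make_needle(random.Random())
print(score_output(f"Sure! The number you asked about is {number}.", number))
```

Because the check searches for the number itself, a long-winded but correct answer still scores as a hit, and a fresh number per iteration means a cached response cannot pass by accident.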

The updated tests were run across several configurations using four large language models: GPT-4, Claude 2.1 (with and without the prompt changes Anthropic suggested, described above), and Mistral AI's Mixtral-8X7B-v0.1 and 7B Instruct. Since small nuances in prompting can lead to vastly different results across models, we used several prompt templates in an attempt to compare these models performing at their best. The simple template we used for GPT-4 and Mixtral was as follows:

SIMPLE_TEMPLATE = '''
   You are a helpful AI bot that answers questions for a user. Keep your responses short and direct.
   The following is a set of context and a question that will relate to the context.
   #CONTEXT
   {context}
   #ENDCONTEXT

   #QUESTION
   {question} Don’t give information outside the document or repeat your findings. If the information is not available in the context respond UNANSWERABLE
   '''

For Claude, we tested both of the templates discussed previously.

ANTHROPIC_TEMPLATE_ORIGINAL = ''' Human: You are a close-reading bot with a great memory who answers questions for users. I’m going to give you the text of some essays. Amidst the essays (“the haystack”) I’ve inserted a sentence (“the needle”) that contains an answer to the user’s question.
Here's the question:
   <question>{question}</question>
   Here’s the text of the essays. The answer appears in it somewhere.
   <haystack>
   {context}
   </haystack>
   Now that you’ve read the context, please answer the user's question, repeated one more time for reference:
   <question>{question}</question>

   To do so, first find the sentence from the haystack that contains the answer (there is such a sentence, I promise!) and put it inside <most_relevant_sentence> XML tags. Then, put your answer in <answer> tags. Base your answer strictly on the context, without reference to outside information. Thank you.
   If you can’t find the answer return the single word UNANSWERABLE
   Assistant: '''

ANTHROPIC_TEMPLATE_REV2 = ''' Human: You are a close-reading bot with a great memory who answers questions for users. I'm going to give you the text of some essays. Amidst the essays ("the haystack") I've inserted a sentence ("the needle") that contains an answer to the user's question.
Here's the question:
   <question>{question}</question>
   Here's the text of the essays. The answer appears in it somewhere.
   <haystack>
   {context}
   </haystack>
   Now that you've read the context, please answer the user's question, repeated one more time for reference:
   <question>{question}</question>

   To do so, first find the sentence from the haystack that contains the answer (there is such a sentence, I promise!) and put it inside <most_relevant_sentence> XML tags. Then, put your answer in <answer> tags. Base your answer strictly on the context, without reference to outside information. Thank you.
   If you can't find the answer return the single word UNANSWERABLE
   Assistant: Here is the most relevant sentence in the context:'''
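
Both templates ask the model to wrap its output in XML tags, which makes the response easy to parse. A minimal sketch of that parsing step follows. The helper name `extract_tag` and the sample response are invented for illustration and are not taken from the repository.

```python
import re
from typing import Optional


def extract_tag(response: str, tag: str) -> Optional[str]:
    """Return the text inside the first <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else None


# Made-up response in the shape the templates request.
sample = (
    "<most_relevant_sentence>The special magic number is: 4821973"
    "</most_relevant_sentence>\n<answer>4821973</answer>"
)
print(extract_tag(sample, "answer"))
print(extract_tag("UNANSWERABLE", "answer"))
```

A response with no tags, such as the single word UNANSWERABLE, yields `None`, so the caller can handle the negative case separately.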

All of the code used to run these tests can be found in this GitHub repository: https://github.com/Arize-ai/LLMTest_NeedleInAHaystack

Results

Figure 7: Comparison of GPT-4 results between the initial research (Run #1) and our testing (Run #2)

Figure 8: Comparison of Claude 2.1 (without prompting guidance) results between Run #1 and Run #2

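Our results for GPT-4 and Claude (without prompting guidance) did not stray far from Mr. Kamradt's findings, and the generated charts look relatively similar: the upper right (long context, needle near the beginning of the context) is where LLM information retrieval suffers.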

Figure 9: Comparison of Claude 2.1 results with and without prompting guidance

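Although we were not able to replicate Anthropic's result of 98% retrieval accuracy for Claude 2.1, we did see a significant decrease in total misses once the prompt was updated (from 165 to 74). This jump was achieved simply by adding a roughly 10-word instruction to the end of the existing prompt, highlighting that minor differences in prompts can produce drastically different outcomes for LLMs.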
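
To put the two miss counts in proportion, the relative reduction can be computed directly. The total number of test cases is not stated here, so this gives only the change in misses, not an accuracy figure.

```python
misses_before, misses_after = 165, 74

absolute_drop = misses_before - misses_after
relative_drop = absolute_drop / misses_before

print(f"{absolute_drop} fewer misses ({relative_drop:.1%} reduction)")
# prints "91 fewer misses (55.2% reduction)"
```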

Figure 10: Mixtral results | Image by author

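Last but not least, it is interesting to see just how well Mixtral performed at this task despite these being by far the smallest models tested. The Mixture of Experts (MoE) model was far better than 7B-Instruct, and we are finding that MoE models do much better on retrieval evaluations.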

Conclusion

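The Needle in a Haystack test is a clever way to quantify an LLM's ability to parse context to find needed information. Our research led to a few main takeaways. First, GPT-4 is the industry leader in this arena, as it is in many other evaluations that we and others have carried out. Second, Claude 2.1 seemed to underperform at first, but with tweaks to the prompt structure the model showed significant improvement. Claude is a bit wordier than some other models, and taking extra care to direct it can go a long way in terms of results. Finally, Mixtral MoE greatly outperformed our expectations, and we are excited to see Mixtral models continually exceed expectations.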
