Prompt Optimization: Unlocking the Full Potential of LLMs - Advanced Prompting Techniques

Advanced Prompting Techniques

By this point, it should be obvious that improving prompts helps you get better results on different tasks. That's the whole idea behind prompt engineering.

While those examples were fun, let's cover a few concepts more formally before jumping into more advanced ideas.

Topics:

  • Zero-Shot Prompting
  • Few-Shot Prompting
  • Chain-of-Thought Prompting
  • Zero-Shot CoT
  • Self-Consistency
  • Generated Knowledge Prompting
  • Automatic Prompt Engineer

Zero-Shot Prompting

Today's large language models, trained on large amounts of data and tuned to follow instructions, are capable of performing tasks "zero-shot". We tried a few zero-shot examples in the previous section. Here is one of the examples we used:

Prompt:


          
Classify the text into neutral, negative or positive.

Text: I think the vacation is okay.
Sentiment:

      

Output:


          
Neutral

      

Note that in the prompt above we didn't provide the model with any examples; that's the zero-shot capability at work. When zero-shot doesn't work, it's recommended to provide demonstrations or examples in the prompt, which leads to few-shot prompting, discussed below.
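
As a concrete illustration, here is a minimal sketch of issuing this zero-shot prompt through a chat-completion API. It assumes the OpenAI Python SDK and an illustrative model name; any comparable LLM API would work the same way.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Zero-shot: the prompt contains only the instruction and the input, no examples.
prompt = (
    "Classify the text into neutral, negative or positive.\n"
    "Text: I think the vacation is okay.\n"
    "Sentiment:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the classification as deterministic as possible
)
print(response.choices[0].message.content)  # e.g. "Neutral"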


Few-Shot Prompting

While large language models already demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. To improve on this, few-shot prompting is used as a technique to enable in-context learning: we provide demonstrations in the prompt to steer the model toward better performance. The demonstrations serve as conditioning examples of the kind of answer we want the model to produce.

Let's demonstrate few-shot prompting with an example introduced by Brown et al. 2020. In this example, the task is to correctly use a new word in a sentence.

Prompt:


          
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses  
the word whatpu is:  
We were traveling in Africa and we saw these very cute whatpus.  
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses  
the word farduddle is:  

      

Output:


          
An example of a sentence that uses the word farduddle is:  
  
We were so excited to win the game that we started to farduddle around the field.  

      

We can observe that the model learned how to perform the task from just one example (i.e., 1-shot). For more difficult tasks, we can experiment with increasing the number of demonstrations (e.g., 3-shot, 5-shot, 10-shot, and so on).
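
To make this pattern reusable, a small helper can assemble demonstrations and a new query into a single few-shot prompt. This is a minimal sketch; the helper name and separator format are illustrative choices, and what matters is keeping the format consistent.

def build_few_shot_prompt(demonstrations, query):
    """Join (input, completion) demonstrations and a new query into one prompt."""
    blocks = [f"{text}\n{completion}" for text, completion in demonstrations]
    blocks.append(query)  # the new input the model should complete
    return "\n".join(blocks)

demonstrations = [
    ('A "whatpu" is a small, furry animal native to Tanzania. An example of a '
     "sentence that uses the word whatpu is:",
     "We were traveling in Africa and we saw these very cute whatpus."),
]
query = ('To do a "farduddle" means to jump up and down really fast. An example '
         "of a sentence that uses the word farduddle is:")

print(build_few_shot_prompt(demonstrations, query))  # 1-shot; append pairs for k-shot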

Following findings from Min et al. (2022), here are a few tips about demonstrations/exemplars when doing few-shot prompting:

  • "Both the label space and the distribution of the input text specified by the demonstrations are important (regardless of whether the labels are correct for individual inputs)."
  • The format you use also plays a key role in performance; even using random labels is much better than using no labels at all (see the sketch after this list).
  • Additional results show that selecting random labels from a true distribution of labels (instead of a uniform distribution) also helps.
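
The "random labels still help" finding is easy to reproduce: the sketch below keeps the "text // label" format fixed while assigning labels at random. The texts and labels are the illustrative ones from this section.

import random

# Fixed "text // label" format, labels assigned at random (per Min et al. 2022,
# a consistent format matters more than label correctness).
texts = ["This is awesome!", "This is bad!", "Wow that movie was rad!"]
labels = ["Positive", "Negative"]

lines = [f"{text} // {random.choice(labels)}" for text in texts]
prompt = "\n".join(lines) + "\nWhat a horrible show! //"
print(prompt)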

Let's try out a few examples. First, an example with randomly assigned labels (meaning the labels Negative and Positive are assigned to the inputs at random):

Prompt:


          
This is awesome! // Negative  
This is bad! // Positive  
Wow that movie was rad! // Positive  
What a horrible show! //  

      

Output:


          
Negative

      

Note: here is the same prompt translated into Chinese, along with the model's output:

Prompt:


          
这太棒了!// 消极  
这很糟糕!// 积极  
哇,那部电影真棒!// 消极  
多么可怕的表演!//  

      

Output:


          
这太棒了!// 积极  
这很糟糕!// 消极  
哇,那部电影真棒!// 积极  
多么可怕的表演!// 消极  

      

We still get the correct answer even though the labels have been randomized. Note that we also kept the format consistent, which helps. In fact, with further experimentation, it seems the newer GPT models are becoming more robust even to inconsistent formats. For example:

Prompt:


          
Positive This is awesome! 
This is bad! Negative
Wow that movie was rad!
Positive
What a horrible show! --

      

Output:


          
Negative

      

There is no consistent format above, but the model still predicted the correct label. We would have to conduct a more thorough analysis to confirm whether this holds for different and more complex tasks, including different variations of prompts.

Limitations of Few-Shot Prompting

Standard few-shot prompting works well for many tasks but is still not a perfect technique, especially when dealing with more complex reasoning tasks. Let's demonstrate why this is the case. Do you recall the following task from earlier?


          
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

A:

      

If we try this again, the model outputs the following:


          
The odd numbers in this group add up to an even number. 15 + 5 + 13 + 7 + 1 = 41, which is an odd number. 32 and 82 are even numbers, so they add up to an even number. Therefore, 15 + 32 + 5 + 13 + 82 + 7 + 1 = 155, which is an odd number.

      

That is not the correct response, which not only highlights the limitations of these systems but also the need for more advanced prompt engineering.
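
For reference, the ground truth is easy to verify with a few lines of Python:

numbers = [15, 32, 5, 13, 82, 7, 1]
odd_sum = sum(n for n in numbers if n % 2 == 1)
print(odd_sum)                                # 41
print("even" if odd_sum % 2 == 0 else "odd")  # odd, so the claim is False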

Let's try adding some examples to see if few-shot prompting improves the results.

Prompt:


          
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.

A: The answer is False.

The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.

A: The answer is True.

The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.

A: The answer is True.

The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.

A: The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

A:

      

Output:


          
The answer is True.

      

That didn't work. It seems few-shot prompting is not enough to get reliable responses for this type of reasoning problem. The example above provides basic information about the task, but if you take a closer look, the task we introduced involves a few more reasoning steps. In other words, it might help if we break the problem down into steps and demonstrate that to the model. More recently, chain-of-thought (CoT) prompting has been popularized to address more complex arithmetic, commonsense, and symbolic reasoning tasks.

Overall, it seems that providing examples is useful for solving some tasks. When zero-shot and few-shot prompting are not sufficient, it might mean that whatever the model learned isn't enough to do well at the task. From here it is recommended to start thinking about fine-tuning your models or experimenting with more advanced prompting techniques. Up next we talk about one of the popular prompting techniques, called chain-of-thought prompting, which has gained a lot of attention.


Chain-of-Thought Prompting

Introduced in Wei et al. (2022), chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

Prompt:


          
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.

A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.

A: Adding all the odd numbers (17, 19) gives 36. The answer is True.

The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.

A: Adding all the odd numbers (11, 13) gives 24. The answer is True.

The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.

A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

A:

      

Output:


          
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.

      

Wow! We get a perfect result when we provide the reasoning steps. In fact, we can solve this task by providing even fewer examples; a single example seems to be enough:

Prompt:


          
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.

A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

      

Output:


          
A: Adding all the odd numbers (15, 5, 13, 7, 1) gives 41, which is an odd number. Therefore, the answer is False.

      

Keep in mind that the authors claim this is an emergent ability that arises only with sufficiently large language models.
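
In code, few-shot CoT is just a prompt that includes worked reasoning in the demonstrations. Below is a minimal sketch using the 1-shot exemplar above; the SDK and model name are illustrative assumptions, as before.

from openai import OpenAI

client = OpenAI()

# One CoT demonstration: the exemplar shows the reasoning, not just the answer.
cot_prompt = (
    "The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.\n"
    "A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.\n\n"
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.\n"
    "A:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)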


Zero-Shot CoT Prompting

One recent idea is zero-shot CoT (Kojima et al. 2022), which essentially involves adding "Let's think step by step" to the original prompt. Let's try a simple problem and see how the model performs, first with the prompt translated into Chinese:

Prompt:


          
我去市场买了10个苹果。我给了2个苹果给邻居给了2个给修理工。然后我又买了5个苹果并吃了1个。我还剩下多少个苹果?  

      

Output:


          
你最开始买了10个苹果,给了2个,那么剩下8个。  
  
然后你又买了5个苹果,那么总共有8+5=13个苹果。  
  
最后你吃了1个,那么你还剩下12个苹果。  

      

Note: the Chinese-language prompt receives a correct answer here, so we continue with the English example:

Prompt:


          
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?  

      

Output:


          
You remained with 8 apples.  
  
Initially, you bought 10 apples, and gave away 2+2=4 apples, so you were left with 10-4=6 apples.  
  
After buying 5 more apples, you had 6+5=11 apples.  
  
After eating 1 apple, you were left with 11-1=10 apples.  
  
So in the end, you had 10 apples remaining.  

      

The answer is incorrect! Now let's try with the special prompt.

Prompt:


          
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?  
  
Let's think step by step.  

      

Output:


          
Sure, let's break it down step by step.  
  
You started with 10 apples. You gave away 2 to the neighbor and 2 to the repairman, so you had 6 apples left.  
  
Then you went to buy 5 more apples, which brought your total to 11 apples. But you also ate 1, so you are left with 10 apples again.  
  
Therefore, you remained with 10 apples in the end.  

      

It's impressive that this simple prompt is effective at this task. Zero-shot CoT is particularly useful where you don't have too many examples to use in the prompt.
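
Programmatically, zero-shot CoT is a one-line change: append the trigger phrase to whatever question you already have. A minimal sketch:

question = (
    "I went to the market and bought 10 apples. I gave 2 apples to the neighbor "
    "and 2 to the repairman. I then went and bought 5 more apples and ate 1. "
    "How many apples did I remain with?"
)
# Zero-shot CoT: no demonstrations, just the reasoning trigger appended.
zero_shot_cot_prompt = question + "\n\nLet's think step by step."
print(zero_shot_cot_prompt)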


Self-Consistency

Perhaps one of the more advanced techniques for prompt engineering is self-consistency. Proposed by Wang et al. (2022), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting". The idea is to sample multiple, diverse reasoning paths through few-shot CoT, and use the generations to select the most consistent answer. This helps to boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.

Let's try the following example for arithmetic reasoning:

Prompt:


          
When I was 6 my sister was half my age. Now
I'm 70 how old is my sister?

      

Output:


          
When you were 6, your sister was half again as old as you, i.e., 9 years old. Now you are 70, so after 64 years your sister should now be 6 + 64 = 70 years old. Therefore your sister's age should be less than yours, but more information is needed to determine it exactly.

      

The output is incorrect! How can we improve this with self-consistency? Let's try it out, using the few-shot exemplars from Wang et al. 2022 (Table 17).

Prompt:


          
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,  
there will be 21 trees. How many trees did the grove workers plant today?  
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.  
So, they must have planted 21 - 15 = 6 trees. The answer is 6.  
  
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?  
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.  
  
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?  
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74  
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.  
  
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops  
did Jason give to Denny?  
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of  
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.  
  
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does  
he have now?  
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so  
in total he has 7 + 2 = 9 toys. The answer is 9.  
  
Q: There were nine computers in the server room. Five more computers were installed each day, from  
monday to thursday. How many computers are now in the server room?  
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =  
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.  
The answer is 29.  
  
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many  
golf balls did he have at the end of wednesday?  
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On  
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.  
  
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?  
A: She bought 5 bagels for $3 each. This means she spent 5 * 3 = $15 on bagels. She had $23, so she has $23 - $15 = $8 left. The answer is 8.
  
Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?  
A:  

      

Output 1:


          
When you were 6, your sister was half your age, which means she was 6 / 2 = 3 years old.  
The age difference between you and your sister will always be the same, which is 70 - 6 = 64 years.  
So your sister's current age is 3 + 64 = 67 years old. The answer is 67.  

      

Output 2:


          
When you were 6, your sister was half your age, which means she was 6 / 2 = 3 years old.  
Now you are 70, which means the age gap between you and your sister is still the same as when you were 6.  
So, your sister is now 70 - (6 - 3) = 67 years old. The answer is 67.  

      

Output 3:


          
When you were 6, your sister was half your age, so she was 6/2 = 3 years old.  
Now you are 70, which means that the age difference between you and your sister has remained the same throughout the years. The age difference is 70 - 6 = 64 years.  
Therefore, your sister's age must be 3 + 64 = 67 years old. The answer is 67.  

      

Computing the final answer involves a few steps (check out the paper for the details), but for the sake of simplicity, we can see that a majority answer has already emerged, so that essentially becomes the final answer.
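
For illustration, here is a simplified sketch of that procedure in Python: sample several reasoning paths at a non-zero temperature, extract a final answer from each, and take a majority vote. The answer-extraction heuristic and the SDK/model are illustrative assumptions; the paper aggregates answers more carefully.

import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(prompt, n=5):
    """Sample n CoT reasoning paths and majority-vote on the extracted answers."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",   # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,         # non-zero temperature yields diverse paths
        )
        text = response.choices[0].message.content
        found = re.findall(r"-?\d+", text)
        if found:
            answers.append(found[-1])  # naive heuristic: last number is the answer
    return Counter(answers).most_common(1)[0][0] if answers else None

# Calling self_consistent_answer(few_shot_cot_prompt) on the prompt above
# would return "67", the majority answer across the sampled paths.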


Generated Knowledge Prompting

LLMs continue to be improved, and one popular technique is the ability to incorporate knowledge or information into the model to help it make more accurate predictions.

Using a similar idea, could the model also be used to generate knowledge before making a prediction? That's what Liu et al. 2022 attempt in their paper: generating knowledge to be used as part of the prompt. In particular, how helpful is this for tasks such as commonsense reasoning?

Let's try a simple prompt:

Prompt:


          
Part of golf is trying to get a higher point total than others. Yes or No?

      

Output:


          
No. The objective of golf is to get the lowest score possible. The player with the lowest score at the end of the game or round is the winner.

      

Note: for this example, ChatGPT answers correctly whether the prompt is given in the original English or in Chinese.

Mistakes of this kind reveal the limitations of LLMs on tasks that require more knowledge about the world. How do we improve this with knowledge generation?

First, we generate a few knowledge statements:

Prompt:


          
Input: Greece is larger than mexico.  
Knowledge: Greece is approximately 131,957 sq km, while Mexico is approximately 1,964,375 sq km, making Mexico 1,389% larger than Greece.  
  
Input: Glasses always fog up.  
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your sweat, breath, and ambient humidity lands on a cold surface, cools, and then changes into tiny drops of liquid, forming a film that you see as fog. Your lenses will be relatively cool compared to your breath, especially when the outside air is cold.  
  
Input: A fish is capable of thinking.  
Knowledge: Fish are more intelligent than they appear. In many areas, such as memory, their cognitive powers match or exceed those of ’higher’ vertebrates including non-human primates. Fish’s long-term memories help them keep track of complex social relationships.  
  
Input: A common effect of smoking lots of cigarettes in one’s lifetime is a higher than normal chance of getting lung cancer.  
Knowledge: Those who consistently averaged less than one cigarette per day over their lifetime had nine times the risk of dying from lung cancer than never smokers. Among people who smoked between one and 10 cigarettes per day, the risk of dying from lung cancer was nearly 12 times higher than that of never smokers.  
  
Input: A rock is the same size as a pebble.  
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64 millimetres based on the Udden-Wentworth scale of sedimentology. Pebbles are generally considered larger than granules (2 to 4 millimetres diameter) and smaller than cobbles (64 to 256 millimetres diameter).  
  
Input: Part of golf is trying to get a higher point total than others.  
Knowledge:  

      

Knowledge 1:


          
The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.  

      

Knowledge 2:


          
Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.  

      

Note: because the Chinese translation of this example produces answers that differ significantly from those of the English prompt, the English prompt is kept here.

The next step is to integrate the knowledge and get a prediction, using the prompt format provided in the Liu et al. 2022 paper.

Prompt:


          
Question: Part of golf is trying to get a higher point total than others. Yes or No?  
  
Knowledge: The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.  
  
Explain and Answer:  

      

Answer 1 (very high confidence):

No, the objective of golf is not to get a higher point total than others. Rather, the objective is to play a set of holes in the least number of strokes. The total number of strokes is used to determine the winner of the game, not the total number of points.

Prompt:

Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

Explain and Answer:

Answer 2 (much lower confidence):


          
Yes, part of golf is trying to get a higher point total than others. Each player tries to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.  

      

Some really interesting things happened with this example. In the first answer, the model was very confident, but in the second, not so much. The process is simplified here for demonstration purposes, but there are a few more details to consider when arriving at the final answer. Check out the Liu et al. 2022 paper for more.
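
To make the two-stage flow concrete, here is a minimal sketch of the pipeline: first ask the model to generate knowledge about the question, then answer conditioned on each knowledge statement. The prompt templates, the helper name (ask), and the model are illustrative assumptions; the paper's few-shot knowledge-generation prefix and its answer aggregation are omitted for brevity.

from openai import OpenAI

client = OpenAI()

def ask(prompt, temperature=0.7):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

question = "Part of golf is trying to get a higher point total than others. Yes or No?"

# Stage 1: generate knowledge statements (the few-shot prefix from the paper
# is omitted here for brevity).
knowledge = [
    ask(f"Generate some knowledge about the input.\n\nInput: {question}\nKnowledge:")
    for _ in range(2)
]

# Stage 2: answer the question once per knowledge statement, then compare.
for k in knowledge:
    print(ask(f"Question: {question}\n\nKnowledge: {k}\n\nExplain and Answer:",
              temperature=0))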


Automatic Prompt Engineer (APE)

[Figure: the APE framework for automatic instruction generation and selection, from Zhou et al. (2022)]

Zhou et al. (2022) propose Automatic Prompt Engineer (APE), a framework for automatic instruction generation and selection. The instruction-generation problem is framed as natural language synthesis and addressed as a black-box optimization problem, using LLMs to generate and search over candidate solutions.

The first step involves a large language model (used as an inference model) that is given output demonstrations and generates instruction candidates for a task. These candidates guide the search procedure. The instructions are then executed using a target model, and the most appropriate instruction is selected based on computed evaluation scores.

APE discovers a zero-shot CoT prompt that performs better than the human-engineered "Let's think step by step" prompt (Kojima et al., 2022).

The prompt "Let's work this out in a step by step way to be sure we have the right answer." elicits chain-of-thought reasoning and improves performance on the MultiArith and GSM8K benchmarks.

[Figure: zero-shot CoT prompts discovered by APE and their benchmark performance, from Zhou et al. (2022)]
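
In spirit, the APE loop can be sketched in a few lines: propose candidate instructions from input-output demonstrations, score each candidate on held-out examples with the target model, and keep the best one. This simplified sketch reuses the illustrative ask helper from the previous section, uses toy demonstrations, and scores by exact match, whereas the paper scores candidates with log-probabilities and iteratively resamples.

# Simplified APE-style loop: propose instructions, score them, keep the best.
demos = [("2 + 2", "4"), ("7 + 3", "10")]  # toy input-output demonstrations
demo_text = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)

# Step 1: an LLM proposes candidate instructions from the demonstrations.
candidates = [
    ask("I gave a friend an instruction. Based on the input-output pairs below, "
        f"what was the instruction?\n{demo_text}\nInstruction:")
    for _ in range(3)
]

# Step 2: score each candidate by exact-match accuracy on held-out examples.
def score(instruction, examples):
    """Fraction of examples answered correctly under this instruction."""
    hits = sum(y in ask(f"{instruction}\n{x}", temperature=0) for x, y in examples)
    return hits / len(examples)

best = max(candidates, key=lambda c: score(c, demos))
print("Selected instruction:", best)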

The paper touches on an important topic related to prompt engineering: the idea of automatically optimizing prompts. While we don't go deep into this topic in this guide, here are a few key papers if you are interested:

  • AutoPrompt: proposes an approach for automatically creating prompts for a diverse set of tasks based on gradient-guided search.
  • Prefix Tuning: a lightweight alternative to fine-tuning that prepends a trainable continuous prefix for natural language generation (NLG) tasks.
  • Prompt Tuning: proposes a mechanism for learning soft prompts through backpropagation.
