- 引言
- 准备工作
- R1训练速览
- 数据处理
- Reward函数
- R1-Zero
日长篱落无人过,惟有蜻蜓蛱蝶飞。
小伙伴们好,我是微信公众号"小窗幽记机器学习"的小编卖铁观音的小男孩。承接上文《LLM推理中的强化学习及其实战:以GRPO为例(上篇)》中对DeepSeek-R1理论的介绍,本文将从实战角度出发,重点阐述如何一步步训练出R1-Zero模型。下一篇则会进一步讲解R1类模型的训练细节。
环境配置
conda create -n r1_from_scratch-env-py311 "llvmdev>=15" "cmake>=3.24" git python=3.11
source activate
conda deactivate
conda activate r1_from_scratch-env-py311
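除了conda环境,本文代码还会用到若干Python库(如torch、transformers、datasets、trl,以及后文奖励函数里用于答案校验的math_verify)。以下给出一个简单的自检脚本,用来确认这些依赖是否可用;依赖清单是根据后文代码推断的,并非固定的官方要求,具体版本以实际环境为准:
# check_env.py:粗略检查本文假设用到的依赖是否可用(示意脚本)
import importlib

# 根据后文代码推断出的依赖清单,并非固定的官方要求
packages = ["torch", "transformers", "datasets", "trl", "math_verify"]

for name in packages:
    try:
        mod = importlib.import_module(name)
        print(f"[OK] {name} {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"[MISSING] {name},请先通过pip安装")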
准备数据
数据出处:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
本地数据存储位置:
/your_dir/share_data_zoo/AI-MO/NuminaMath-TIR
/your_dir/share_data_zoo/bespokelabs/Bespoke-Stratos-17k
AI-MO/NuminaMath-TIR
数据下载: https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
该数据集包含约70K个数学问题,其中messages列给出了解决方案背后的CoT(思维链)推理过程。
| Field | Description |
| --- | --- |
| problem | The math problem |
| solution | Step-by-step solution |
| messages | Chat to solve the problem |
加载数据,查看样本:
from datasets import load_dataset
# Load the "AI-MO/NuminaMath-TIR" dataset from local dir
data_dir = "/your_dir/share_data_zoo/AI-MO/NuminaMath-TIR/data"
MATH_le = load_dataset("parquet", data_dir=data_dir)
# Access the first sample in the training set
print(MATH_le['train'][0])
打印输出结果如下:
{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.',
'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem. ...",
'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.',
'role': 'user'},
{'content': "To determine the coefficient of \\(x^2y^6\\) in the expansion ...",
'role': 'assistant'}]}
bespokelabs/Bespoke-Stratos-17k
数据下载: https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
Bespoke-Stratos包含17K个专注于数学和代码的问题。
| Field | Description |
| --- | --- |
| system | Guidelines for math and code problems |
| conversations | Chat to solve the problem |
数据存储位置:
/your_dir/share_data_zoo/bespokelabs/Bespoke-Stratos-17k
加载数据,查看样本:
# Load the "Bespoke-Stratos-17k" dataset from bespokelabs
from datasets import load_dataset
data_dir = "/your_dir/share_data_zoo/bespokelabs/Bespoke-Stratos-17k/data"
bespoke_rl = load_dataset("parquet", data_dir=data_dir)
# Access the first sample in the training set
print(bespoke_rl['train'][0])
打印输出结果如下:
{'system': "Your role as an assistant involves thoroughly exploring XXX, 'conversations': [{'from': 'user', 'value': 'Return your final response within \\boxed{}. ....}]}
R1训练速览
在深入探讨R1各步骤细节前,先进行简要概述。更多详尽的内容可以参考此前的解读文章:
- LLM推理中的强化学习及其实战:以GRPO为例(上篇)
- 一文纵览DeepSeek模型家族:从LLM到R1
- 深度揭秘DeepSeek R1 背后的强化学习:开启大模型训练新纪元
- DeepSeek-R1如何用强化学习、冷启动和蒸馏,开启大模型训练新思路?
图1:DeepSeek R1实现概览
为让模型获得强悍的推理能力,DeepSeek R1采用了强化学习(RL), 当模型推理正确时给予奖励,反之则予以惩罚。这并非单一训练环节,而是一整套的"流水线"步骤。先用纯强化学习测试推理能力是否自然形成,这就是实验性质的DeepSeek-R1-Zero。而真正的DeepSeek-R1则更加系统化,分为多个阶段:先提供初始数据,再进行强化学习,然后是更多数据,更多强化学习...就像逐级提升的过程。这一切都是为了大幅提高语言模型的思考解题能力。
选择Base模型
DeepSeek团队选择DeepSeek-V3作为创建R1-Zero和R1的基础模型,但DeepSeek-V3的大小高达685 GB,显然超出了我们的硬件能力范围。为此,这里将使用一个小得多的基础模型Qwen/Qwen2.5-0.5B-Instruct(大小约0.9 GB)。当然,如果你有更大的GPU显存,也可以选择更大的模型,比如Qwen/Qwen2.5-7B-Instruct,甚至可以加载未量化的LLM。
以下对所选用的模型进行加载和基本的试用:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2025/3/23 13:36
# @Author : <小窗幽记机器学习>
# @File   : check_model.py
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    TrainingArguments,
    set_seed,
    TrainerCallback,
    TrainerControl,
    TrainerState,
)

MODEL_DIR = "/your_dir/share_model_zoo/"
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "/your_dir/model_results/"
OUTPUT_FILE = "GRPO-training"  # For saving our trained model
OUTPUT_DIR = os.path.join(OUTPUT_DIR, MODEL_NAME, OUTPUT_FILE)
print("Model Save dir=", OUTPUT_DIR)
MODEL_NAME = os.path.join(MODEL_DIR, MODEL_NAME)

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize tokenizer with chat template
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    padding_side="right"
)
# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Vocabulary size: {len(tokenizer)}")
print(f"Model max length: {tokenizer.model_max_length}")
print(f"Pad token: {tokenizer.pad_token}")
print(f"EOS token: {tokenizer.eos_token}")

# Initialize base model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)
print(f"Model parameters: {model.num_parameters():,}")
"""
Vocabulary size: 151665
Model max length: 131072
Pad token: <|endoftext|>
EOS token: <|im_end|>
Model parameters: 494,032,768
"""

# Check CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Move model to the appropriate device
model.to(device)

# Test basic inference
def test_model_inference(user_input: str):
    """Test basic model inference with the loaded model and tokenizer."""
    messages = [
        {"role": "system", "content": "你是微信公众号<小窗幽记机器学习>的智能助理,你叫卖打火机的小男孩"},
        {"role": "user", "content": user_input}
    ]
    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Tokenize and generate
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test the model
test_input = "你好啊,请问你是?"
response = test_model_inference(test_input)
print(f"Test Input: {test_input}")
print(f"Model Response: {response}")
"""
Test Input: 你好啊,请问你是?
Model Response: system
你是微信公众号<小窗幽记机器学习>的智能助理,你叫卖打火机的小男孩
user
你好啊,请问你是?
assistant
我是微信公众号<小窗幽记机器学习>的智能助理,专门回答用户的问题。有什么问题我可以帮你解答。
"""
可以看出所选用的 "Qwen/Qwen2.5-0.5B-Instruct" 模型效果还不错,作为一个base模型应该是相对可靠的。
强化学习中的策略模型
以上已经选择了基础模型,接下来需要了解如何为训练大语言模型(LLM)设置基本的强化学习(RL)环境。
对于DeepSeek R1,官方选用的基础模型是DeepSeek-V3,而这里我们以Qwen2.5-0.5B-Instruct作为起点,即基础模型。后续将基于它创建R1-Zero版本。
R1-Zero是使用强化学习创建的,其中基础模型(DeepSeek-V3/Qwen2.5-0.5B)充当RL agent(执行动作的行动者)。以下首先可视化它是如何工作的。
图2:Qwen 2.5作为agent的workflow
RL Agent(DeepSeek-V3/Qwen2.5-0.5B)首先执行一个动作:针对给定的问题生成答案和相应的推理过程,而这个问题就构成了它所处的环境。在这里,环境本质上就是推理任务本身。
执行动作后,环境会返回一个奖励。这个奖励相当于反馈,它告诉基础模型(DeepSeek-V3/Qwen2.5-0.5B)这次动作有多好:正面奖励意味着它做对了某些事,可能是得到了正确答案或推理得很好。这个反馈信号会返回给基础模型,帮助它学习并调整未来的行动方式,以获得更好的奖励。
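为了更直观地理解这一"动作-奖励-更新"循环,下面用一段极简的Python示意代码勾勒数据流向。其中generate_answer、compute_reward、update_policy都是本文自拟的占位函数(真实训练中分别对应LLM生成、奖励函数打分和GRPO参数更新),并非实际训练代码:
# 强化学习基本循环的极简示意:用占位函数展示"动作 -> 奖励 -> 更新"的数据流
import random

def generate_answer(policy, problem):
    # 占位:真实场景中由LLM针对问题生成<think>推理 + <answer>答案
    return f"<think>reasoning about {problem}</think><answer>42</answer>"

def compute_reward(problem, completion):
    # 占位:真实场景中由准确性、格式等奖励函数共同打分
    return random.random()

def update_policy(policy, completion, reward):
    # 占位:真实场景中由GRPO等算法根据奖励信号更新模型参数
    policy["cumulative_reward"] = policy.get("cumulative_reward", 0.0) + reward
    return policy

policy_model = {"name": "Qwen2.5-0.5B-Instruct"}
for step in range(3):
    completion = generate_answer(policy_model, "1+1=?")             # Agent执行动作
    reward = compute_reward("1+1=?", completion)                    # 环境给出奖励
    policy_model = update_policy(policy_model, completion, reward)  # 反馈用于调整策略
    print(f"step={step}, reward={reward:.3f}")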
在下一部分,我们将更详细地讨论这种方法。
R1-Zero中的GRPO算法
现在我们已经理解了基本的强化学习流程,接下来需要了解DeepSeek用于R1-Zero的具体强化学习算法。
有许多强化学习算法可用,但传统强化学习通常使用所谓的"评论家"(critic)来帮助主要的决策部分(actor,即这里的DeepSeek-V3/Qwen2.5-0.5B)。这个critic通常与actor一样大且复杂,这使得计算成本几乎翻倍。
但DeepSeek使用GRPO来训练他们的初始模型(R1-Zero)。GRPO的做法不同:它直接从一组动作(actions)的结果中计算出一个基准线,作为衡量"好动作"的参考点,因此完全不需要单独的critic模型。这节省了大量计算并提高了效率。
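GRPO之所以可以不用critic,关键在于它对同一个问题采样一组回答,用组内奖励的均值(和标准差)作为基准线来计算每个回答的相对优势。下面用一个很小的数值例子示意这一"组内基准线"的计算方式(仅为概念演示,数字是随意假设的,并非GRPO的完整实现):
# 组相对优势(group-relative advantage)的数值示意
# 假设对同一道题采样了4个回答,奖励函数给出的分数如下
rewards = [1.0, 0.0, 1.0, 0.5]

mean_r = sum(rewards) / len(rewards)  # 组内均值充当基准线(代替critic的价值估计)
std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5

# 每个回答的相对优势 = (自身奖励 - 组内均值) / 组内标准差
advantages = [(r - mean_r) / (std_r + 1e-8) for r in rewards]

print("baseline (mean reward):", mean_r)
print("advantages:", [round(a, 3) for a in advantages])
# 高于组内平均水平的回答得到正优势,会在策略更新中被强化;低于平均水平的则被抑制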
让我们绘制一个GRPO如何用于R1 Zero训练的流程图,然后我们将对其进行解释。
图3:DeepSeek R1 Zero的GRPO流程
以下用Qwen2.5-0.5B这个基础模型来说明DeepSeek GRPO实现的工作原理。
首先,问题输入(A)被提供给Qwen模型(B),Qwen尝试通过生成补全(C)来给出答案。最终结果称为补全输出(D),其中包含<think>标签中的推理步骤和<answer>标签中的最终解决方案。
接下来,问题输入(A)和真实解决方案(E)被输入到奖励函数(F)中,这些函数充当智能评分器。这些函数将Qwen补全输出(D)与正确的解决方案进行比较,并评估不同方面,例如:
- 准确性(答案是否正确?)
- 格式(<think>和<answer>标签是否正确使用?)
- 推理步骤(逻辑是否清晰?)
- 余弦缩放(Cosine Scaling,回答是否简洁?)
- 重复惩罚(是否有不必要的重复?)
这些评估产生奖励分数(G),然后传递给GRPO训练器(H)。训练器利用梯度来调整Qwen模型(B),微调其生成答案的方式。这个过程使用的算法是GRPO(Group Relative Policy Optimization,组相对策略优化),它结合梯度、奖励反馈和策略调整来优化Qwen的响应,以最大化整体表现。
最后,更新后的Qwen模型(B)会在新问题上再次测试,通过重复循环不断完善自己。每次迭代,Qwen都会成为更好的问题解决者。
在接下来的部分中,我们将开始为GRPO训练预处理我们的训练数据集。
Prompt模板
构建R1-Zero时,我们沿用DeepSeek在GRPO训练中所使用的思考型提示模板,先把它定义出来:
# DeepSeek system prompt for GRPO based training
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)
这个系统提示告诉基础模型(Qwen2.5-0.5B)它作为一个有帮助的助手的角色:在给出答案前要先进行推理。<think>和<answer>标签用于约束模型响应的结构,把内部推理与最终答案分开,以便更好地评估和奖励。
训练数据预处理
现在我们的系统提示已准备好,我们需要根据我们的模板转换我们的训练数据。
图4:数据预处理流程
我们需要创建make_conversation函数来处理对话格式。它会读取训练数据集中每一行的problem列,并为每一行返回一个包含系统提示与该问题的字典。让我们创建这个准备数据集的函数。
# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

# Load and prepare dataset
def load_math_dataset():
    """Load and prepare the mathematics dataset."""
    dataset = load_dataset(
        "AI-MO/NuminaMath-TIR",
        name="default",
        split=['train', 'test']
    )
    # Convert splits into dictionary
    dataset = {
        'train': dataset[0],
        'test': dataset[1]
    }
    # Apply conversation format
    for split in dataset:
        dataset[split] = dataset[split].map(make_conversation)
        # Remove 'messages' column if exists
        if "messages" in dataset[split].column_names:
            dataset[split] = dataset[split].remove_columns("messages")
    return dataset
我们已经准备好了一切,让我们将我们的训练数据转换为所需格式并打印训练和测试集的大小。
# Load our training dataset and print train/test size
dataset = load_math_dataset()
print(f"Train set size: {len(dataset['train'])}")
print(f"Test set size: {len(dataset['test'])}")
打印结果如下:
Train set size: 72441
Test set size: 99
现在我们已经分割了训练数据集,在进行下一步之前,我们需要验证我们的数据集(检查用户/助手对话是否存在)。
def validate_dataset(dataset):
    """Perform basic validation checks on the dataset."""
    # Define the required fields for the dataset
    required_fields = ["problem", "prompt"]
    # Loop through the 'train' and 'test' splits of the dataset
    for split in ['train', 'test']:
        print(f"\nValidating {split} split:")
        # Retrieve column names from the dataset
        fields = dataset[split].column_names
        # Check if any required fields are missing
        missing = [field for field in required_fields if field not in fields]
        if missing:
            print(f"Warning: Missing fields: {missing}")  # Warn if fields are missing
        else:
            print("✓ All required fields present")  # Confirm all fields are present
        # Retrieve the first sample from the dataset split
        sample = dataset[split][0]
        # Extract the 'prompt' field, which contains a list of messages
        messages = sample['prompt']
        # Validate the prompt format:
        # - It should contain at least two messages
        # - The first message should be from the 'system' role
        # - The second message should be from the 'user' role
        if (len(messages) >= 2 and
                messages[0]['role'] == 'system' and
                messages[1]['role'] == 'user'):
            print("✓ Prompt format is correct")  # Confirm correct format
        else:
            print("Warning: Incorrect prompt format")  # Warn if format is incorrect

# Validate dataset
validate_dataset(dataset)
打印输出结果如下:
Validating train split:
✓ All required fields present
✓ Prompt format is correct
Validating test split:
✓ All required fields present
✓ Prompt format is correct
从上述结果可以看出,训练数据集已成功通过验证,这意味着我们已经成功地将原始数据转换为可用于训练的数据集。
Reward函数
在R1训练速览章节的GRPO部分已经介绍过Reward函数,它通过五种不同的方式评估基础模型的回答:
- 准确性(回答是否正确?)
- 格式(包括标签是否正确使用?)
- 推理步骤(逻辑是否清晰?)
- 余弦缩放(回答是否简洁?)
- 重复惩罚(是否有不必要的重复?)。
这些都是计算每个回答奖励的函数,因此以下会先实现Reward Functions这部分代码。
图5:奖励函数
Reward的Accuracy
准确性奖励(accuracy reward)虽然概念上最容易理解,但实现起来代码稍微复杂一些。我们想从数学上检查基础模型的回答是否与真实解决方案等价:如果模型答案在数学上正确,则给予1.0的奖励;如果不正确,奖励为0.0;在无法解析真实解决方案的情况下,则给予0.5的中性奖励,以避免不公平的惩罚。
以下是这个函数的实现:
# Imports assumed here (open-r1 style usage of the parsing/verification libraries):
# from latex2sympy2_extended import NormalizationConfig
# from math_verify import LatexExtractionConfig, parse, verify
def accuracy_reward(completions, solution, **kwargs):
    """
    Reward function to check if the model's response is mathematically
    equivalent to the ground truth solution.
    Uses latex2sympy2 for parsing and math_verify for validation.
    """
    # Extract responses
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        # Parse the ground truth solution
        gold_parsed = parse(sol, extraction_mode="first_match",
                            extraction_config=[LatexExtractionConfig()])
        if gold_parsed:  # Check if parsing was successful
            # Parse the model's answer with relaxed normalization
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )
            # Reward 1.0 if correct, 0.0 if incorrect
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            # If ground truth cannot be parsed, assign neutral reward (0.5)
            reward = 0.5
            print("Warning: Failed to parse gold solution:", sol)
        rewards.append(reward)
    return rewards
在这个函数中,检查模型回答是否等同于正确答案。这不是简单地比较原始文本,而是:
- 使用latex2sympy2将解决方案转换为结构化的数学格式。
- 如果解析失败,分配0.5的中性奖励。
- 提取模型输出并对其进行标准化,以提高稳健性。
- 使用math_verify检查解析后的回答是否与解析后的解决方案匹配。
- 如果正确则给1分,不正确则给0分。
这确保了准确性评估不仅仅是关于文本相似性,而是真正的数学正确性。
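下面给出一个简单的调用示例(仅作示意:假设accuracy_reward及其依赖的math_verify等解析库已按前文安装并导入,completions沿用GRPOTrainer传入的嵌套消息格式):
# accuracy_reward的调用示意(toy例子,答案用\boxed{}给出以便解析)
toy_completions = [
    [{"content": r"<think>8 - 3 = 5</think><answer>\boxed{5}</answer>"}],
    [{"content": r"<think>算错了</think><answer>\boxed{6}</answer>"}],
]
toy_solutions = [r"The answer is \boxed{5}.", r"The answer is \boxed{5}."]

print(accuracy_reward(toy_completions, toy_solutions))
# 预期输出类似 [1.0, 0.0]:第一个回答与标准答案数学等价,第二个不等价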
格式化Reward
格式奖励用于确保模型遵循指令并正确构造输出:我们要求把推理放在<think>标签中,把最终答案放在<answer>标签中,而这个奖励函数检查的正是这一点。如果模型正确使用了这些标签,就给它1.0的奖励,否则奖励为0。这使得模型更加关注我们想要的输出结构。具体实现代码如下:
# Implement Format Reward Function
import re

def format_reward(completions, **kwargs):
    """
    Reward function to check if the completion has the correct format:
    <think>...</think> <answer>...</answer>.
    """
    # Define the regex pattern for the desired format
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    # Extract the content from each completion
    completion_contents = [completion[0]["content"] for completion in completions]
    # Check if each completion matches the pattern
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE)
               for content in completion_contents]
    # Reward 1.0 for correct format, 0.0 otherwise
    return [1.0 if match else 0.0 for match in matches]
在这个函数中:
- 使用正则表达式定义一个pattern(模式)。这个模式的大致含义是:内容应当以<think>开始,直到</think>,中间允许一些空白,然后是<answer>,直到</answer>,并在此结束。
- 从每个模型输出的补全内容(即常说的completion)中取出实际的文本。
- 然后使用re.match检查每段内容是否完全匹配上述模式。re.DOTALL使regex中的.也能匹配换行符;re.MULTILINE则让^和$也能匹配每一行的开始/结束。
- 最后,如果完全匹配了格式,则给予1.0的奖励,否则给0。这是对格式正确性的严格奖励,如下面的调用示例所示。
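以下调用仅作示意(假设format_reward已如上定义):
# format_reward的调用示意
good = [[{"content": "<think>先展开二项式</think> <answer>-63/400</answer>"}]]
bad = [[{"content": "答案是-63/400,这里故意不写标签"}]]

print(format_reward(good))  # 期望输出 [1.0]:标签齐全且顺序正确
print(format_reward(bad))   # 期望输出 [0.0]:缺少<think>/<answer>标签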
Reward中的推理
推理步骤奖励的设计则有点巧妙。我们希望鼓励模型展示其"思考过程"。因此,如果输出结果包含看起来像推理步骤的内容则对模型进行奖励。为此需要寻找通常出现在逐步推理中的关键词和模式,比如:
- 步骤1,步骤2等。
- 编号列表,如1、2
- 项目符号,如-或*
- 过渡词,如第一(First)、第二(Second)、最后(Finally)等这类词汇
回答中包含这些越多,奖励就越多。编写这个鼓励展示推理过程的函数:
def reasoning_steps_reward(completions, **kwargs):
    r"""
    Reward function to encourage clear step-by-step reasoning.
    It looks for patterns like "Step 1:", numbered lists, bullet points,
    and transition words.
    """
    # Regex pattern to find indicators of reasoning steps
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    # Extract completion contents
    completion_contents = [completion[0]["content"] for completion in completions]
    # Count the number of reasoning step indicators in each completion
    matches = [len(re.findall(pattern, content, re.MULTILINE))
               for content in completion_contents]
    # Reward is proportional to the number of reasoning steps, maxing out at 1.0
    # We're using a "magic number" 3 here - encourage at least 3 steps for full reward
    return [min(1.0, count / 3) for count in matches]
在上面代码中创建了一个更复杂的正则表达式,用来找出上面列出的所有推理指示词。使用re.findall在每段内容中找到该模式的所有匹配项,len(re.findall(...))给出这些指示词的数量(记为count)。奖励计算为min(1.0, count / 3),这意味着:
- 如果找到3个或更多推理指示器(count >= 3),奖励为1.0(最大奖励)。
- 如果找到更少(例如,count = 1或2),它获得部分奖励(如1/3或2/3)。
- 如果一个也没找到(count = 0),奖励为0.0。
count为啥除以3?3在这里是一个大概的经验值。使用3是说"瞄准大约3个推理步骤来获得全额奖励"。如想鼓励更多或更少的步骤,可以调整这个数字。
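下面用两个例子直观感受该奖励的取值(仅作示意,假设reasoning_steps_reward已如上定义):
# reasoning_steps_reward的调用示意
detailed = [[{"content": "Step 1: 写出二项式定理。\nStep 2: 取k=6对应x^2y^6。\nFinally, 合并系数。"}]]
terse = [[{"content": "答案是-63/400。"}]]

print(reasoning_steps_reward(detailed))  # 期望为 [1.0]:匹配到3个推理指示词
print(reasoning_steps_reward(terse))     # 期望为 [0.0]:没有任何推理步骤指示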
余弦缩放Reward
余弦缩放奖励(Cosine Scaled Reward)是一种更加高级的奖励。它旨在鼓励简洁的正确答案, 同时对较长的错误答案不那么苛刻。
- 对于正确答案:我们希望更多地奖励简短、直接的回答,而不是冗长、啰嗦的答案。简短、正确的答案通常更好。
- 对于错误答案:简短的错误答案可能比至少尝试推理的较长错误答案更糟糕。因此,我们希望对简短的错误答案的惩罚大于长篇错误答案。
以下代码实现这种巧妙的余弦缩放:
# Implement Cosine Scaled Reward Function
import math

def get_cosine_scaled_reward(
    min_value_wrong: float = -0.5,
    max_value_wrong: float = -0.1,
    min_value_correct: float = 0.8,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
):
    """
    Returns a cosine scaled reward function. This function scales the accuracy reward
    based on completion length. Shorter correct solutions get higher rewards,
    longer incorrect solutions get less penalty.
    """
    def cosine_scaled_reward(completions, solution, accuracy_rewards, **kwargs):
        """
        Cosine scaled reward function that adjusts accuracy rewards based on completion length.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        for content, sol, acc_reward in zip(contents, solution, accuracy_rewards):
            gen_len = len(content)  # Length of the generated answer
            progress = gen_len / max_len  # How far we are to max length
            cosine = math.cos(progress * math.pi)  # Cosine value based on progress
            if acc_reward > 0.5:  # Assuming accuracy_reward gives ~1.0 for correct answers
                min_value = min_value_correct
                max_value = max_value_correct
            else:  # Incorrect answer
                min_value = max_value_wrong  # Note the swap!
                max_value = min_value_wrong
            # Cosine scaling formula!
            reward = min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)
            rewards.append(float(reward))
        return rewards
    return cosine_scaled_reward
get_cosine_scaled_reward(...)用于生成训练时使用的余弦缩放奖励函数,可以通过min_value_wrong/max_value_wrong(错误答案的惩罚范围)和min_value_correct/max_value_correct(正确答案的奖励范围)等参数自定义缩放,max_len则设置参与缩放的最大长度。
内部的cosine_scaled_reward(...)函数根据completions、solution和accuracy_rewards计算奖励。它先计算生成长度gen_len,将其归一化为progress = gen_len / max_len,再据此得到一个余弦值:回答越短越接近1,随长度增加逐渐降到-1。
如果acc_reward > 0.5,就使用正确答案的奖励范围;否则应用错误答案的范围,但交换最小/最大值,从而减轻对较长错误答案的惩罚。下面用几组不同长度的回答直观演示这一缩放效果。
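示意用法如下(假设get_cosine_scaled_reward已如上定义,accuracy_rewards为人为指定的取值,用重复字符模拟不同长度的回答):
# 余弦缩放奖励的行为演示:对比不同长度、不同对错的回答
cosine_fn = get_cosine_scaled_reward()  # 使用默认参数,max_len=1000

short_ans = [[{"content": "x" * 100}]]  # 较短的回答
long_ans = [[{"content": "x" * 900}]]   # 接近max_len的长回答

print(cosine_fn(short_ans, ["sol"], [1.0]))  # 正确且简短:约0.99,接近上限1.0
print(cosine_fn(long_ans, ["sol"], [1.0]))   # 正确但冗长:约0.80,被压向下限0.8
print(cosine_fn(short_ans, ["sol"], [0.0]))  # 错误且简短:约-0.49,惩罚最重
print(cosine_fn(long_ans, ["sol"], [0.0]))   # 错误但较长:约-0.11,惩罚较轻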
Reward中的重复惩罚系数
重复惩罚主要是为了阻止模型陷入循环并重复自己。我们希望它生成新鲜、多样的推理和答案,而不仅仅是复制粘贴相同的短语!
这个奖励函数会对模型使用相同词序列(n-gram)过多的情况进行惩罚。在以下的例子中,将使用大小为3的n-gram(三元组),当然这个值是可以调整的。如果模型大量重复自己,它会得到一个负奖励(惩罚)。如果它更加多样化并避免重复,惩罚就会减少。以下是实现惩罚重复的代码:
def get_repetition_penalty_reward(ngram_size: int = 3, max_penalty: float = -0.1):
    """
    Returns a repetition penalty reward function. Penalizes repetitions of n-grams
    in the generated text.
    """
    if max_penalty > 0:
        raise ValueError(f"max_penalty {max_penalty} should not be positive")

    def zipngram(text: str, ngram_size: int):
        """Helper function to generate n-grams from text."""
        words = text.lower().split()  # Lowercase and split into words
        return zip(*[words[i:] for i in range(ngram_size)])  # Create n-grams

    def repetition_penalty_reward(completions, **kwargs) -> list[float]:
        """
        Repetition penalty reward function.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        for completion in contents:
            if completion == "":  # No penalty for empty completions
                rewards.append(0.0)
                continue
            if len(completion.split()) < ngram_size:  # No penalty for short completions
                rewards.append(0.0)
                continue
            ngrams = set()  # Use a set to store unique n-grams
            total = 0
            for ng in zipngram(completion, ngram_size):  # Generate n-grams
                ngrams.add(ng)  # Add n-gram to the set (duplicates are ignored)
                total += 1  # Count total n-grams
            # Calculate scaling factor: more repetition -> higher scaling
            scaling = 1 - len(ngrams) / total
            reward = scaling * max_penalty  # Apply penalty based on scaling
            rewards.append(reward)
        return rewards

    return repetition_penalty_reward
上面的get_repetition_penalty_reward(...)会创建一个惩罚重复的奖励函数,参数包括ngram_size(默认为3,即三元组)和max_penalty(一个负值,例如-0.1)。辅助函数zipngram(text, ngram_size)先把文本转为小写并按空格切分成单词,再用zip(*[words[i:] for i in range(ngram_size)])高效地生成n-gram。
在repetition_penalty_reward(...)内部,对每个补全(completion)计算惩罚:如果文本为空或太短,奖励为0.0;否则按scaling = 1 - len(ngrams) / total进行缩放,其中total是n-gram总数,len(ngrams)是去重后的n-gram数量。重复越多,scaling越接近1,惩罚越大。
最终奖励为scaling * max_penalty:重复较少时惩罚很小,重复严重时则得到更强的负奖励。下面用两段文本对比一下效果。
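示意用法如下(假设get_repetition_penalty_reward已如上定义):
# 重复惩罚的调用示意:高度重复 vs 基本不重复
rep_fn = get_repetition_penalty_reward(ngram_size=3, max_penalty=-0.1)

repetitive = [[{"content": "the answer is five the answer is five the answer is five"}]]
diverse = [[{"content": "we expand the binomial and pick the x^2 y^6 term"}]]

print(rep_fn(repetitive))  # 期望约 [-0.06]:大量重复的三元组触发惩罚
print(rep_fn(diverse))     # 期望为 [0.0]:几乎没有重复的三元组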
至此,以上已经实现了所有5个奖励函数,从下一章节开始将进入下一阶段:定义训练参数。
R1-Zero的训练配置
现在需要编写相应的配置,使上面定义的奖励函数在实际微调时能够正常工作。为此,先定义一个配置类:
# Define GRPOScriptArguments for reward function parameters
from dataclasses import dataclass, field

@dataclass
class GRPOScriptArguments:
    """
    Script arguments for GRPO training, specifically related to reward functions.
    """
    reward_funcs: list[str] = field(
        default_factory=lambda: ["accuracy", "format"],
        metadata={
            "help": "List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty'"
        },
    )
    cosine_min_value_wrong: float = field(
        default=-0.5,
        metadata={"help": "Minimum reward for cosine scaling for wrong answers"},
    )
    cosine_max_value_wrong: float = field(
        default=-0.1,
        metadata={"help": "Maximum reward for cosine scaling for wrong answers"},
    )
    cosine_min_value_correct: float = field(
        default=0.8,
        metadata={"help": "Minimum reward for cosine scaling for correct answers"},
    )
    cosine_max_value_correct: float = field(
        default=1.0,
        metadata={"help": "Maximum reward for cosine scaling for correct answers"},
    )
    cosine_max_len: int = field(
        default=1000,
        metadata={"help": "Maximum length for cosine scaling"},
    )
    repetition_n_grams: int = field(
        default=3,
        metadata={"help": "Number of n-grams for repetition penalty reward"},
    )
    repetition_max_penalty: float = field(
        default=-0.1,
        metadata={"help": "Maximum (negative) penalty for repetition penalty reward"},
    )
GRPOScriptArguments类保存奖励相关的设置。reward_funcs列表决定使用哪些奖励,默认为["accuracy", "format"],当然也可以加入"reasoning_steps"、"cosine"、"repetition_penalty"等。其余设置项则控制cosine_scaled_reward和repetition_penalty_reward的行为,从而调整奖励的给法。
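顺带一提,由于GRPOScriptArguments是一个dataclass,也可以借助transformers的HfArgumentParser从命令行读取这些奖励配置,方便在不同实验间切换。以下是一个简单示意(命令行参数取值仅为随意举例):
# 用HfArgumentParser从命令行解析奖励相关配置的示意
from transformers import HfArgumentParser

parser = HfArgumentParser(GRPOScriptArguments)  # 假设GRPOScriptArguments已如上定义
# 这里显式传入参数列表用于演示;实际脚本中可省略args,直接读取sys.argv
(script_args,) = parser.parse_args_into_dataclasses(
    args=["--cosine_max_len", "512", "--repetition_max_penalty", "-0.2"]
)
print(script_args.cosine_max_len)          # 512
print(script_args.repetition_max_penalty)  # -0.2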
接下来,使用transformers库中的TrainingArguments。这是控制训练过程的配置对象。
# Define TrainingArguments from transformers
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,           # Output directory for checkpoints and logs
    overwrite_output_dir=True,
    num_train_epochs=1,              # Total number of training epochs
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    gradient_accumulation_steps=2,   # Accumulate gradients to simulate larger batch size
    learning_rate=5e-5,              # Initial learning rate for AdamW optimizer
    warmup_ratio=0.1,                # Linear warmup over warmup_ratio fraction of training steps
    weight_decay=0.01,               # Apply weight decay to all layers except bias and LayerNorm weights
    logging_steps=10,                # Log every X updates steps
    evaluation_strategy="steps",     # Evaluate every `eval_steps`
    eval_steps=50,                   # Evaluation and logging steps
    save_strategy="steps",           # Save checkpoint every `save_steps`
    save_steps=50,                   # Save checkpoint every X updates steps
    save_total_limit=2,              # Limit the total amount of checkpoints. Deletes the older checkpoints.
    dataloader_num_workers=2,        # Number of subprocesses to use for data loading
    seed=42,                         # Random seed for reproducibility
    bf16=True,                       # Use mixed precision BF16 training
    push_to_hub=False,               # Whether to push the final model to Hugging Face Hub
    gradient_checkpointing=True,     # Enable gradient checkpointing
    report_to="none",                # Reporting to no one
)
最后,还需要一个ModelConfig。这里配置模型本身的设置,比如使用哪个预训练模型、使用什么数据类型(如bfloat16)等。定义如下:
from typing import Optional

@dataclass
class ModelConfig:
    """
    Configuration for the model.
    """
    model_name_or_path: str = field(
        default=MODEL_NAME, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    model_revision: Optional[str] = field(
        default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}
    )
    torch_dtype: Optional[str] = field(
        default="bfloat16", metadata={"help": "Override the default `torch_dtype` and load the model under this dtype."}
    )
    trust_remote_code: bool = field(
        default=True, metadata={"help": "Trust remote code when loading model and tokenizer."}
    )
    attn_implementation: Optional[str] = field(
        default="flash_attention_2", metadata={"help": "Attention implementation to use. 'flash_attention_2' or None"}
    )
上述的ModelConfig类保存关键设置,包括model_name_or_path。同时使用torch_dtype="bfloat16"以提高效率,并设置trust_remote_code=True以允许加载带有自定义代码的模型。此外,如果环境支持flash_attention_2,还可以设置attn_implementation="flash_attention_2"以获得潜在的更快训练速度。
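ModelConfig本身只是一个普通的dataclass,真正加载模型时还需要把其中的字符串配置转换成from_pretrained可接受的参数。下面是一个简单的转换示意(load_model_from_config是本文自拟的辅助函数;flash_attention_2仅在装有flash-attn且硬件支持时可用,否则可把attn_implementation改为None):
# 把ModelConfig中的字段转换为AutoModelForCausalLM.from_pretrained的参数(示意)
import torch
from transformers import AutoModelForCausalLM

def load_model_from_config(model_args):
    """根据ModelConfig实例加载模型(自拟的辅助函数)。"""
    dtype = model_args.torch_dtype
    # 把"bfloat16"这类字符串映射为torch dtype;"auto"或None则原样传入
    if isinstance(dtype, str) and dtype != "auto":
        dtype = getattr(torch, dtype)
    return AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        torch_dtype=dtype,
        attn_implementation=model_args.attn_implementation,
    )

# model = load_model_from_config(model_args)  # 需要时再取消注释执行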
现在需要实际创建这些配置类的实例,以便后续使用:
# Instantiate configuration objects
script_args = GRPOScriptArguments()
model_args = ModelConfig()
接下来,需要获取奖励函数列表,以及希望在训练过程中使用的"回调"(callback)。回调就像小助手,可以在训练过程的不同时点做一些事情(比如记录进度、保存模型等)。目前只使用一个简单的日志记录回调。先把奖励函数集中到一起:
# Utility function to get reward functions based on script arguments
def get_reward_functions(script_args):
    """
    Returns a list of reward functions based on the script arguments.
    """
    reward_funcs_list = []
    reward_funcs_registry = {
        "accuracy": accuracy_reward,                # Assuming accuracy_reward is defined in previous steps
        "format": format_reward,                    # Assuming format_reward is defined in previous steps
        "reasoning_steps": reasoning_steps_reward,  # Assuming reasoning_steps_reward is defined
        "cosine": get_cosine_scaled_reward(         # Assuming get_cosine_scaled_reward is defined
            min_value_wrong=script_args.cosine_min_value_wrong,
            max_value_wrong=script_args.cosine_max_value_wrong,
            min_value_correct=script_args.cosine_min_value_correct,
            max_value_correct=script_args.cosine_max_value_correct,
            max_len=script_args.cosine_max_len,
        ),
        "repetition_penalty": get_repetition_penalty_reward(  # Assuming get_repetition_penalty_reward is defined
            ngram_size=script_args.repetition_n_grams,
            max_penalty=script_args.repetition_max_penalty,
        ),
    }
    for func_name in script_args.reward_funcs:
        if func_name not in reward_funcs_registry:
            raise ValueError(f"Reward function '{func_name}' not found in registry.")
        reward_funcs_list.append(reward_funcs_registry[func_name])
    return reward_funcs_list
我们的回调函数将跟踪损失和其他重要信息。
import logging

logger = logging.getLogger(__name__)

class LoggingCallback(TrainerCallback):
    """
    A simple callback for logging training information at specific steps.
    """
    def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        # Only log when a logging step is reached and log history is available
        if state.global_step % args.logging_steps == 0 and state.log_history:
            logger.info(f"Step {state.global_step}: Loss = {state.log_history[-1].get('loss', None)}, Learning Rate = {state.log_history[-1].get('learning_rate', None)}")

def get_callbacks(training_args, model_args, script_args):
    """
    Returns a list of callbacks to be used during training.
    For now, it includes only the LoggingCallback. You can extend this to add more callbacks.
    """
    callbacks = [LoggingCallback()]  # Instantiate our LoggingCallback
    return callbacks
最后,初始化这些函数。
# Get reward functions and callbacks
reward_functions = get_reward_functions(script_args)
callbacks = get_callbacks(training_args, model_args, script_args)
GRPO训练
至此,GRPO的训练已经万事俱备,只需要初始化GRPO训练的引擎,并为其提供我们已经准备的所有组件:模型、奖励函数、训练参数、数据集和回调函数! 初始化GRPO Trainer:
# GRPOConfig and GRPOTrainer come from trl: from trl import GRPOConfig, GRPOTrainer
# Create GRPOConfig from TrainingArguments
grpo_config = GRPOConfig(
    **training_args.to_dict()  # Convert TrainingArguments to dictionary and unpack
    # No model_init_kwargs here: we pass the instantiated 'model' object to GRPOTrainer directly
)
grpo_trainer = GRPOTrainer(
    model=model,                     # Our initialized Qwen model
    reward_funcs=reward_functions,   # List of reward functions from previous step
    args=grpo_config,                # GRPOConfig (created from TrainingArguments)
    train_dataset=dataset['train'],  # Training dataset
    eval_dataset=dataset['test'],    # Evaluation dataset
    callbacks=callbacks              # List of callbacks
)
现在可以开始训练了!只需要在grpo_trainer上调用train()方法即可。
# Start the GRPO Training Loop
train_result = grpo_trainer.train()
运行上述命令之后,应该就能看到训练过程开始。这里为了演示设置了num_train_epochs = 1,但对于实际的GRPO DeepSeek R1 Zero训练,可能需要训练更多的轮次和步骤。最终训练完成的日志信息如下:
{'eval_loss': 0.09589701145887375, 'eval_runtime': 237.893, 'eval_samples_per_second': 0.416, 'eval_steps_per_second': 0.029, 'eval_rewards/accuracy_reward': 0.2175, 'eval_rewards/format_reward': 0.96375, 'eval_reward': 1.18125, 'eval_reward_std': 0.24513342082500458, 'eval_completion_length': 40.237916717529295, 'eval_kl': 2.3843612051010132, 'epoch': 1.0}
{'train_runtime': 195268.8548, 'train_samples_per_second': 0.371, 'train_steps_per_second': 0.046, 'train_loss': 0.08140588973737459, 'rewards/accuracy_reward': 0.153125, 'rewards/format_reward': 0.9625, 'reward': 1.115625, 'reward_std': 0.22041844427585602, 'completion_length': 42.109375, 'kl': 2.380528378486633, 'epoch': 1.0}
GRPO Training Success
此时结果模型存于:/your_dir/model_results/Qwen/Qwen2.5-0.5B-Instruct/GRPO-training
此时的文件清单如下:
-rw-r--r-- 1 root root  758 Mar 26 16:00 config.json
-rw-r--r-- 1 root root  242 Mar 26 16:00 generation_config.json
-rw-r--r-- 1 root root 943M Mar 26 16:00 model.safetensors
-rw-r--r-- 1 root root 7.2K Mar 26 16:00 tokenizer_config.json
-rw-r--r-- 1 root root  613 Mar 26 16:00 special_tokens_map.json
-rw-r--r-- 1 root root  605 Mar 26 16:00 added_tokens.json
-rw-r--r-- 1 root root 2.7M Mar 26 16:00 vocab.json
-rw-r--r-- 1 root root 1.6M Mar 26 16:00 merges.txt
-rw-r--r-- 1 root root  11M Mar 26 16:00 tokenizer.json
-rw-r--r-- 1 root root 5.8K Mar 26 16:00 training_args.bin
-rw-r--r-- 1 root root 1.9G Mar 26 16:00 optimizer.pt
-rw-r--r-- 1 root root 1.1K Mar 26 16:00 scheduler.pt
-rw-r--r-- 1 root root  14K Mar 26 16:00 rng_state.pth
-rw-r--r-- 1 root root 423K Mar 26 16:00 trainer_state.json
保存Tiny R1 Zero LLM
一旦训练完成,可以保存我们的训练模型,用于推理。
# Define the path to your trained model (same as OUTPUT_DIR)
TRAINED_MODEL_PATH = "data/Qwen-GRPO-training"
# Save the tokenizer
tokenizer.save_pretrained(TRAINED_MODEL_PATH)
# Save the trained model
grpo_trainer.save_model(TRAINED_MODEL_PATH)
print(f"GRPO Trained model saved to {TRAINED_MODEL_PATH}")
然后可以简单地使用以下方式加载训练好的模型:
# Load the tokenizer - make sure to use trust_remote_code=True if needed
tokenizer = AutoTokenizer.from_pretrained(
    TRAINED_MODEL_PATH,
    trust_remote_code=True,  # If your model config requires it
    padding_side="right"     # Ensure consistent padding side
)
# Set pad token if it wasn't saved or loaded correctly
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the trained model itself
trained_model = AutoModelForCausalLM.from_pretrained(
    TRAINED_MODEL_PATH,
    trust_remote_code=True,      # If your model architecture requires it
    torch_dtype=torch.bfloat16   # Keep the same dtype as training for consistency
)
# Move the loaded model to your device (GPU if available)
trained_model.to(device)  # 'device' is still our CUDA device from before
然后用它进行推理测试:
# Testing Inference with the Trained Model
def test_trained_model_inference(user_input: str):
    """Test inference with the loaded trained model and tokenizer."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},  # Re-use our system prompt
        {"role": "user", "content": user_input}
    ]
    # Apply chat template using our tokenizer
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Generate output using our *trained_model*
    outputs = trained_model.generate(
        **inputs,
        max_new_tokens=200,  # Maybe generate a bit longer now
        do_sample=True,
        temperature=0.7
    )
    # Decode the generated tokens back to text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test the trained model (this invocation corresponds to the printed output below)
test_input = "A是C的爸爸,B是C的妈妈,那么A和B是什么关系?"
response = test_trained_model_inference(test_input)
print(f"Test Input: {test_input}")
print(f"Trained Model Response: {response}")
输出结果如下:
Test Input: A是C的爸爸,B是C的妈妈,那么A和B是什么关系?
Trained Model Response: system
A conversation between User and Assistant. The user asks a question,
and the Assistant solves it. The assistant
first thinks about the reasoning process in the mind and
then provides the user with the answer. The reasoning
process and answer are enclosed within <think> </think>
and <answer> </answer> tags, respectively, i.e.,
<think> reasoning process here </think><answer> answer here </answer>
user
A是C的爸爸,B是C的妈妈,那么A和B是什么关系?
assistant
<think> A 和 B 是 C 的父女关系。因为题干中提到 "A 是 C 的爸爸" 和 "B 是 C 的妈妈",所以根据这些信息,我们可以推断出 A 和 B 是 C 的父子关系。</thought>
<answer> 父母关系</answer>
R1 Zero中的2个主要问题
现在,我们已经用自己选择的基础模型Qwen2.5-0.5B(而不是官方的DeepSeek-V3原始基础模型)走完了R1-Zero的训练流程。
DeepSeek的研究人员发现,R1-Zero模型在推理测试中表现非常好,甚至在AIME 2024这样的任务上取得了与更高级的模型(如OpenAI-o1-0912)相近的分数。
这表明使用强化学习(RL)鼓励语言模型进行推理是一种有前景的方法。
但他们也注意到DeepSeek-R1-Zero有一些关键问题需要解决,以便在实际应用和更广泛的研究中使用。
DeepSeek的研究人员表示,模板故意设计得简单且注重结构,从而避免对推理过程本身施加任何特定的内容约束。例如,它不会说:
- "你必须使用逐步推理"(它只说"推理过程",让模型自己定义这意味着什么)。
- "你必须使用反思性推理"
- "你必须使用特定的问题解决策略"
主要问题是<think>标签内的推理过程难以阅读,使人类难以跟踪和分析。
另一个问题是语言混合,当被问到多语言问题时,模型有时会在同一回答中混合语言,导致输出不一致和混乱。
如果你用西班牙语提问,突然间,它的"思考"会变成英语和西班牙语的混合体!这些问题,推理混乱和语言混淆,显然是需要克服的。
这也是官方将初始R1 Zero模型进一步升级为R1的两个主要原因。