- 引言
- 准备工作
- R1训练速览
- 数据处理
- Reward函数
- R1-Zero
日长篱落无人过,惟有蜻蜓蛱蝶飞。
小伙伴们好,我是微信公众号"小窗幽记机器学习"的小编卖铁观音的小男孩。承接上文《LLM推理中的强化学习及其实战:以GRPO为例(上篇)》中对DeepSeek-R1理论的介绍,本文将从实战角度出发,重点阐述如何一步步训练出R1-Zero模型。下一篇则会进一步讲解R1类模型的训练细节。
环境配置
conda create -n r1_from_scratch-env-py311 "llvmdev>=15" "cmake>=3.24" git python=3.11
source activate
conda deactivate
conda activate r1_from_scratch-env-py311
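除了conda环境,本文代码还会用到若干Python库(如torch、transformers、datasets、trl,以及后文奖励函数里用于答案校验的math_verify)。以下给出一个简单的自检脚本,用来确认这些依赖是否可用;依赖清单是根据后文代码推断的,并非固定的官方要求,具体版本以实际环境为准:
# check_env.py:粗略检查本文假设用到的依赖是否可用(示意脚本)
import importlib

# 根据后文代码推断出的依赖清单,并非固定的官方要求
packages = ["torch", "transformers", "datasets", "trl", "math_verify"]

for name in packages:
    try:
        mod = importlib.import_module(name)
        print(f"[OK] {name} {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"[MISSING] {name},请先通过pip安装")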
准备数据
数据出处:
https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
本地数据存储位置:
/your_dir/share_data_zoo/AI-MO/NuminaMath-TIR
/your_dir/share_data_zoo/bespokelabs/Bespoke-Stratos-17k
AI-MO/NuminaMath-TIR
数据下载: https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
该数据集包含约70K个数学问题,其中messages列给出了解决方案背后的CoT(思维链)推理过程。
| Field | Description |
| --- | --- |
| problem | The math problem |
| solution | Step-by-step solution |
| messages | Chat to solve the problem |
加载数据,查看样本:
from datasets import load_dataset
# Load the "AI-MO/NuminaMath-TIR" dataset from local dir
data_dir = "/your_dir/share_data_zoo/AI-MO/NuminaMath-TIR/data"
MATH_le = load_dataset("parquet", data_dir=data_dir)
# Access the first sample in the training set
print(MATH_le['train'][0])
打印输出结果如下:
{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.',
'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem. ...",
'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.',
'role': 'user'},
{'content': "To determine the coefficient of \\(x^2y^6\\) in the expansion ...",
'role': 'assistant'}]}
bespokelabs/Bespoke-Stratos-17k
数据下载: https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
Bespoke-Stratos包含17K个专注于数学和代码的问题。
| Field | Description |
| --- | --- |
| system | Guidelines for math and code problems |
| conversations | Chat to solve the problem |
数据存储位置:
/your_dir/share_data_zoo/bespokelabs/Bespoke-Stratos-17k
加载数据,查看样本:
# Load the "Bespoke-Stratos-17k" dataset from bespokelabs
from datasets import load_dataset
data_dir = "/your_dir/share_data_zoo/bespokelabs/Bespoke-Stratos-17k/data"
bespoke_rl = load_dataset("parquet", data_dir=data_dir)
# Access the first sample in the training set
print(bespoke_rl['train'][0])
打印输出结果如下:
{'system': "Your role as an assistant involves thoroughly exploring XXX, 'conversations': [{'from': 'user', 'value': 'Return your final response within \\boxed{}. ....}]}
R1训练速览
在深入探讨R1各步骤细节前,先进行简要概述。更多详尽的内容可以参考此前的解读文章:
- LLM推理中的强化学习及其实战:以GRPO为例(上篇)
- 一文纵览DeepSeek模型家族:从LLM到R1
- 深度揭秘DeepSeek R1 背后的强化学习:开启大模型训练新纪元
- DeepSeek-R1如何用强化学习、冷启动和蒸馏,开启大模型训练新思路?
图1:DeepSeek R1实现概览
为让模型获得强悍的推理能力,DeepSeek R1采用了强化学习(RL), 当模型推理正确时给予奖励,反之则予以惩罚。这并非单一训练环节,而是一整套的"流水线"步骤。先用纯强化学习测试推理能力是否自然形成,这就是实验性质的DeepSeek-R1-Zero。而真正的DeepSeek-R1则更加系统化,分为多个阶段:先提供初始数据,再进行强化学习,然后是更多数据,更多强化学习...就像逐级提升的过程。这一切都是为了大幅提高语言模型的思考解题能力。
选择Base模型
DeepSeek团队选择DeepSeek-V3作为创建R1-Zero和R1的基础模型,但DeepSeek-V3的大小高达685 GB,显然超出了我们的硬件能力范围。为此,这里将使用一个小得多的基础模型Qwen/Qwen2.5-0.5B-Instruct(大小约0.9 GB)。当然,如果你有更大的GPU显存,也可以选择更大的模型,比如Qwen/Qwen2.5-7B-Instruct,甚至可以加载未量化的LLM。
以下对所选用的模型进行加载和基本的试用:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2025/3/23 13:36
# @Author : <小窗幽记机器学习>
# @File   : check_model.py
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    TrainingArguments,
    set_seed,
    TrainerCallback,
    TrainerControl,
    TrainerState,
)

MODEL_DIR = "/your_dir/share_model_zoo/"
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "/your_dir/model_results/"
OUTPUT_FILE = "GRPO-training"  # For saving our trained model
OUTPUT_DIR = os.path.join(OUTPUT_DIR, MODEL_NAME, OUTPUT_FILE)
print("Model Save dir=", OUTPUT_DIR)
MODEL_NAME = os.path.join(MODEL_DIR, MODEL_NAME)

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize tokenizer with chat template
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    padding_side="right"
)
# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Vocabulary size: {len(tokenizer)}")
print(f"Model max length: {tokenizer.model_max_length}")
print(f"Pad token: {tokenizer.pad_token}")
print(f"EOS token: {tokenizer.eos_token}")

# Initialize base model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)
print(f"Model parameters: {model.num_parameters():,}")
"""
Vocabulary size: 151665
Model max length: 131072
Pad token: <|endoftext|>
EOS token: <|im_end|>
Model parameters: 494,032,768
"""

# Check CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Move model to the appropriate device
model.to(device)

# Test basic inference
def test_model_inference(user_input: str):
    """Test basic model inference with the loaded model and tokenizer."""
    messages = [
        {"role": "system", "content": "你是微信公众号<小窗幽记机器学习>的智能助理,你叫卖打火机的小男孩"},
        {"role": "user", "content": user_input}
    ]
    # Apply chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Tokenize and generate
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test the model
test_input = "你好啊,请问你是?"
response = test_model_inference(test_input)
print(f"Test Input: {test_input}")
print(f"Model Response: {response}")
"""
Test Input: 你好啊,请问你是?
Model Response: system
你是微信公众号<小窗幽记机器学习>的智能助理,你叫卖打火机的小男孩
user
你好啊,请问你是?
assistant
我是微信公众号<小窗幽记机器学习>的智能助理,专门回答用户的问题。有什么问题我可以帮你解答。
"""
可以看出所选用的 "Qwen/Qwen2.5-0.5B-Instruct" 模型效果还不错,作为一个base模型应该是相对可靠的。
强化学习中的策略模型
以上已经选择了基础模型,接下来需要了解如何为训练大语言模型(LLM)设置基本的强化学习(RL)环境。
对于DeepSeek R1,官方选用的基础模型是DeepSeek-V3,而这里我们以Qwen2.5-0.5B-Instruct作为起点,即基础模型。后续将基于它创建R1-Zero版本。
R1-Zero是使用强化学习创建的,其中基础模型(DeepSeek-V3/Qwen2.5-0.5B)充当RL agent(执行动作的行动者)。以下首先可视化它是如何工作的。
图2:Qwen 2.5作为agent的workflow
RL Agent(DeepSeek-V3/Qwen2.5-0.5B)首先执行一个动作:针对给定的问题生成答案和相应的推理过程,而这个问题就构成了它所处的环境。在这里,环境本质上就是推理任务本身。
执行动作后,环境会返回一个奖励。这个奖励相当于反馈,它告诉基础模型(DeepSeek-V3/Qwen2.5-0.5B)这次动作有多好:正面奖励意味着它做对了某些事,可能是得到了正确答案或推理得很好。这个反馈信号会返回给基础模型,帮助它学习并调整未来的行动方式,以获得更好的奖励。
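为了更直观地理解这一"动作-奖励-更新"循环,下面用一段极简的Python示意代码勾勒数据流向。其中generate_answer、compute_reward、update_policy都是本文自拟的占位函数(真实训练中分别对应LLM生成、奖励函数打分和GRPO参数更新),并非实际训练代码:
# 强化学习基本循环的极简示意:用占位函数展示"动作 -> 奖励 -> 更新"的数据流
import random

def generate_answer(policy, problem):
    # 占位:真实场景中由LLM针对问题生成<think>推理 + <answer>答案
    return f"<think>reasoning about {problem}</think><answer>42</answer>"

def compute_reward(problem, completion):
    # 占位:真实场景中由准确性、格式等奖励函数共同打分
    return random.random()

def update_policy(policy, completion, reward):
    # 占位:真实场景中由GRPO等算法根据奖励信号更新模型参数
    policy["cumulative_reward"] = policy.get("cumulative_reward", 0.0) + reward
    return policy

policy_model = {"name": "Qwen2.5-0.5B-Instruct"}
for step in range(3):
    completion = generate_answer(policy_model, "1+1=?")             # Agent执行动作
    reward = compute_reward("1+1=?", completion)                    # 环境给出奖励
    policy_model = update_policy(policy_model, completion, reward)  # 反馈用于调整策略
    print(f"step={step}, reward={reward:.3f}")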
在下一部分,我们将更详细地讨论这种方法。
R1-Zero中的GRPO算法
现在我们已经理解了基本的强化学习流程,接下来需要了解DeepSeek用于R1-Zero的具体强化学习算法。
有许多强化学习算法可用,但传统强化学习通常使用所谓的"评论家"(critic)来帮助主要的决策部分(actor,即这里的DeepSeek-V3/Qwen2.5-0.5B)。这个critic通常与actor一样大且复杂,这使得计算成本几乎翻倍。
但DeepSeek使用GRPO来训练他们的初始模型(R1-Zero)。GRPO的做法不同:它直接从一组动作(actions)的结果中计算出一个基准线,作为衡量"好动作"的参考点,因此完全不需要单独的critic模型。这节省了大量计算并提高了效率。
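GRPO之所以可以不用critic,关键在于它对同一个问题采样一组回答,用组内奖励的均值(和标准差)作为基准线来计算每个回答的相对优势。下面用一个很小的数值例子示意这一"组内基准线"的计算方式(仅为概念演示,数字是随意假设的,并非GRPO的完整实现):
# 组相对优势(group-relative advantage)的数值示意
# 假设对同一道题采样了4个回答,奖励函数给出的分数如下
rewards = [1.0, 0.0, 1.0, 0.5]

mean_r = sum(rewards) / len(rewards)  # 组内均值充当基准线(代替critic的价值估计)
std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5

# 每个回答的相对优势 = (自身奖励 - 组内均值) / 组内标准差
advantages = [(r - mean_r) / (std_r + 1e-8) for r in rewards]

print("baseline (mean reward):", mean_r)
print("advantages:", [round(a, 3) for a in advantages])
# 高于组内平均水平的回答得到正优势,会在策略更新中被强化;低于平均水平的则被抑制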
让我们绘制一个GRPO如何用于R1 Zero训练的流程图,然后我们将对其进行解释。
图3:DeepSeek R1 Zero的GRPO流程
以下用Qwen2.5-0.5B这个基础模型来说明DeepSeek GRPO实现的工作原理。
首先,问题输入(A)被提供给Qwen模型(B),Qwen尝试通过生成补全(C)来给出答案。最终结果称为补全输出(D),其中包含<think>标签中的推理步骤和<answer>标签中的最终解决方案。
接下来,问题输入(A)和真实解决方案(E)被输入到奖励函数(F)中,这些函数充当智能评分器。这些函数将Qwen补全输出(D)与正确的解决方案进行比较,并评估不同方面,例如:
- 准确性(答案是否正确?)
- 格式(<think>和<answer>标签是否正确使用?)
- 推理步骤(逻辑是否清晰?)
- 余弦缩放(Cosine Scaling,回答是否简洁?)
- 重复惩罚(是否有不必要的重复?)
这些评估产生奖励分数(G),然后传递给GRPO训练器(H)。训练器利用梯度来调整Qwen模型(B),微调其生成答案的方式。这个过程使用的算法是GRPO(Group Relative Policy Optimization,组相对策略优化),它结合梯度、奖励反馈和策略调整来优化Qwen的响应,以最大化整体表现。
最后,更新后的Qwen模型(B)会在新问题上再次测试,通过重复循环不断完善自己。每次迭代,Qwen都会成为更好的问题解决者。
在接下来的部分中,我们将开始为GRPO训练预处理我们的训练数据集。
Prompt模板
构建R1-Zero时,我们沿用DeepSeek在GRPO训练中所使用的思考型提示模板,先把它定义出来:
# DeepSeek system prompt for GRPO based training
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)
这个系统提示告诉基础模型(Qwen2.5-0.5B)它作为一个有帮助的助手的角色:在给出答案前要先进行推理。<think>和<answer>标签用于约束模型响应的结构,把内部推理与最终答案分开,以便更好地评估和奖励。
训练数据预处理
现在我们的系统提示已准备好,我们需要根据我们的模板转换我们的训练数据。
图4:数据预处理流程
我们需要创建make_conversation函数来处理对话格式。它会读取训练数据集中每一行的problem列,并为每一行返回一个包含系统提示与该问题的字典。让我们创建这个准备数据集的函数。
# Function to structure the training data
def make_conversation(example):
    """Convert dataset examples into conversation format."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

# Load and prepare dataset
def load_math_dataset():
    """Load and prepare the mathematics dataset."""
    dataset = load_dataset(
        "AI-MO/NuminaMath-TIR",
        name="default",
        split=['train', 'test']
    )
    # Convert splits into dictionary
    dataset = {
        'train': dataset[0],
        'test': dataset[1]
    }
    # Apply conversation format
    for split in dataset:
        dataset[split] = dataset[split].map(make_conversation)
        # Remove 'messages' column if exists
        if "messages" in dataset[split].column_names:
            dataset[split] = dataset[split].remove_columns("messages")
    return dataset
我们已经准备好了一切,让我们将我们的训练数据转换为所需格式并打印训练和测试集的大小。
# Load our training dataset and print train/test size
dataset = load_math_dataset()
print(f"Train set size: {len(dataset['train'])}")
print(f"Test set size: {len(dataset['test'])}")
打印结果如下:
Train set size: 72441
Test set size: 99
现在我们已经分割了训练数据集,在进行下一步之前,我们需要验证我们的数据集(检查用户/助手对话是否存在)。
def validate_dataset(dataset):
    """Perform basic validation checks on the dataset."""
    # Define the required fields for the dataset
    required_fields = ["problem", "prompt"]
    # Loop through the 'train' and 'test' splits of the dataset
    for split in ['train', 'test']:
        print(f"\nValidating {split} split:")
        # Retrieve column names from the dataset
        fields = dataset[split].column_names
        # Check if any required fields are missing
        missing = [field for field in required_fields if field not in fields]
        if missing:
            print(f"Warning: Missing fields: {missing}")  # Warn if fields are missing
        else:
            print("✓ All required fields present")  # Confirm all fields are present
        # Retrieve the first sample from the dataset split
        sample = dataset[split][0]
        # Extract the 'prompt' field, which contains a list of messages
        messages = sample['prompt']
        # Validate the prompt format:
        # - It should contain at least two messages
        # - The first message should be from the 'system' role
        # - The second message should be from the 'user' role
        if (len(messages) >= 2 and
                messages[0]['role'] == 'system' and
                messages[1]['role'] == 'user'):
            print("✓ Prompt format is correct")  # Confirm correct format
        else:
            print("Warning: Incorrect prompt format")  # Warn if format is incorrect

# Validate dataset
validate_dataset(dataset)
打印输出结果如下:
Validating train split:
✓ All required fields present
✓ Prompt format is correct
Validating test split:
✓ All required fields present
✓ Prompt format is correct
从上述结果可以看出,训练数据集已成功通过验证,这意味着我们已经成功地将原始数据转换为可用于训练的数据集。
Reward函数
在R1训练速览章节的GRPO部分已经介绍过Reward函数,它通过五种不同的方式评估基础模型的回答:
- 准确性(回答是否正确?)
- 格式(包括标签是否正确使用?)
- 推理步骤(逻辑是否清晰?)
- 余弦缩放(回答是否简洁?)
- 重复惩罚(是否有不必要的重复?)。
这些都是计算每个回答奖励的函数,因此以下会先实现Reward Functions这部分代码。
图5:奖励函数
Reward的Accuracy
准确性奖励(accuracy reward)虽然概念上最容易理解,但实现起来代码稍微复杂一些。我们想从数学上检查基础模型的回答是否与真实解决方案等价:如果模型答案在数学上正确,则给予1.0的奖励;如果不正确,奖励为0.0;在无法解析真实解决方案的情况下,则给予0.5的中性奖励,以避免不公平的惩罚。
以下是这个函数的实现:
# Imports assumed here (open-r1 style usage of the parsing/verification libraries):
# from latex2sympy2_extended import NormalizationConfig
# from math_verify import LatexExtractionConfig, parse, verify
def accuracy_reward(completions, solution, **kwargs):
    """
    Reward function to check if the model's response is mathematically
    equivalent to the ground truth solution.
    Uses latex2sympy2 for parsing and math_verify for validation.
    """
    # Extract responses
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        # Parse the ground truth solution
        gold_parsed = parse(sol, extraction_mode="first_match",
                            extraction_config=[LatexExtractionConfig()])
        if gold_parsed:  # Check if parsing was successful
            # Parse the model's answer with relaxed normalization
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )
            # Reward 1.0 if correct, 0.0 if incorrect
            reward = float(verify(answer_parsed, gold_parsed))
        else:
            # If ground truth cannot be parsed, assign neutral reward (0.5)
            reward = 0.5
            print("Warning: Failed to parse gold solution:", sol)
        rewards.append(reward)
    return rewards
在这个函数中,检查模型回答是否等同于正确答案。这不是简单地比较原始文本,而是:
- 使用latex2sympy2将解决方案转换为结构化的数学格式。
- 如果解析失败,分配0.5的中性奖励。
- 提取模型输出并对其进行标准化,以提高稳健性。
- 使用math_verify检查解析后的回答是否与解析后的解决方案匹配。
- 如果正确则给1分,不正确则给0分。
这确保了准确性评估不仅仅是关于文本相似性,而是真正的数学正确性。
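下面给出一个简单的调用示例(仅作示意:假设accuracy_reward及其依赖的math_verify等解析库已按前文安装并导入,completions沿用GRPOTrainer传入的嵌套消息格式):
# accuracy_reward的调用示意(toy例子,答案用\boxed{}给出以便解析)
toy_completions = [
    [{"content": r"<think>8 - 3 = 5</think><answer>\boxed{5}</answer>"}],
    [{"content": r"<think>算错了</think><answer>\boxed{6}</answer>"}],
]
toy_solutions = [r"The answer is \boxed{5}.", r"The answer is \boxed{5}."]

print(accuracy_reward(toy_completions, toy_solutions))
# 预期输出类似 [1.0, 0.0]:第一个回答与标准答案数学等价,第二个不等价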
格式化Reward
格式奖励用于确保模型遵循指令并正确构造输出:我们要求把推理放在<think>标签中,把最终答案放在<answer>标签中,而这个奖励函数检查的正是这一点。如果模型正确使用了这些标签,就给它1.0的奖励,否则奖励为0。这使得模型更加关注我们想要的输出结构。具体实现代码如下:
# Implement Format Reward Function
import re

def format_reward(completions, **kwargs):
    """
    Reward function to check if the completion has the correct format:
    <think>...</think> <answer>...</answer>.
    """
    # Define the regex pattern for the desired format
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    # Extract the content from each completion
    completion_contents = [completion[0]["content"] for completion in completions]
    # Check if each completion matches the pattern
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE)
               for content in completion_contents]
    # Reward 1.0 for correct format, 0.0 otherwise
    return [1.0 if match else 0.0 for match in matches]
在这个函数中:
- 使用正则表达式定义一个pattern(模式)。这个模式的大致含义是:内容应当以<think>开始,直到</think>,中间允许一些空白,然后是<answer>,直到</answer>,并在此结束。
- 从每个模型输出的补全内容(即常说的completion)中取出实际的文本。
- 然后使用re.match检查每段内容是否完全匹配上述模式。re.DOTALL使regex中的.也能匹配换行符;re.MULTILINE则让^和$也能匹配每一行的开始/结束。
- 最后,如果完全匹配了格式,则给予1.0的奖励,否则给0。这是对格式正确性的严格奖励,如下面的调用示例所示。
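以下调用仅作示意(假设format_reward已如上定义):
# format_reward的调用示意
good = [[{"content": "<think>先展开二项式</think> <answer>-63/400</answer>"}]]
bad = [[{"content": "答案是-63/400,这里故意不写标签"}]]

print(format_reward(good))  # 期望输出 [1.0]:标签齐全且顺序正确
print(format_reward(bad))   # 期望输出 [0.0]:缺少<think>/<answer>标签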
Reward中的推理
推理步骤奖励的设计则有点巧妙。我们希望鼓励模型展示其"思考过程"。因此,如果输出结果包含看起来像推理步骤的内容则对模型进行奖励。为此需要寻找通常出现在逐步推理中的关键词和模式,比如:
- 步骤1,步骤2等。
- 编号列表,如1、2
- 项目符号,如-或*
- 过渡词,如第一(First)、第二(Second)、最后(Finally)等这类词汇
回答中包含这些越多,奖励就越多。编写这个鼓励展示推理过程的函数:
def reasoning_steps_reward(completions, **kwargs):
    r"""
    Reward function to encourage clear step-by-step reasoning.
    It looks for patterns like "Step 1:", numbered lists, bullet points,
    and transition words.
    """
    # Regex pattern to find indicators of reasoning steps
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    # Extract completion contents
    completion_contents = [completion[0]["content"] for completion in completions]
    # Count the number of reasoning step indicators in each completion
    matches = [len(re.findall(pattern, content, re.MULTILINE))
               for content in completion_contents]
    # Reward is proportional to the number of reasoning steps, maxing out at 1.0
    # We're using a "magic number" 3 here - encourage at least 3 steps for full reward
    return [min(1.0, count / 3) for count in matches]
在上面代码中创建了一个更复杂的正则表达式,用来找出上面列出的所有推理指示词。使用re.findall在每段内容中找到该模式的所有匹配项,len(re.findall(...))给出这些指示词的数量(记为count)。奖励计算为min(1.0, count / 3),这意味着:
- 如果找到3个或更多推理指示器(count >= 3),奖励为1.0(最大奖励)。
- 如果找到更少(例如,count = 1或2),它获得部分奖励(如1/3或2/3)。
- 如果一个也没找到(count = 0),奖励为0.0。
count为啥除以3?3在这里是一个大概的经验值。使用3是说"瞄准大约3个推理步骤来获得全额奖励"。如想鼓励更多或更少的步骤,可以调整这个数字。
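下面用两个例子直观感受该奖励的取值(仅作示意,假设reasoning_steps_reward已如上定义):
# reasoning_steps_reward的调用示意
detailed = [[{"content": "Step 1: 写出二项式定理。\nStep 2: 取k=6对应x^2y^6。\nFinally, 合并系数。"}]]
terse = [[{"content": "答案是-63/400。"}]]

print(reasoning_steps_reward(detailed))  # 期望为 [1.0]:匹配到3个推理指示词
print(reasoning_steps_reward(terse))     # 期望为 [0.0]:没有任何推理步骤指示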
余弦缩放Reward
余弦缩放奖励(Cosine Scaled Reward)是一种更加高级的奖励。它旨在鼓励简洁的正确答案, 同时对较长的错误答案不那么苛刻。
- 对于正确答案:我们希望更多地奖励简短、直接的回答,而不是冗长、啰嗦的答案。简短、正确的答案通常更好。
- 对于错误答案:简短的错误答案可能比至少尝试推理的较长错误答案更糟糕。因此,我们希望对简短的错误答案的惩罚大于长篇错误答案。
以下代码实现这种巧妙的余弦缩放:
# Implement Cosine Scaled Reward Function
import math

def get_cosine_scaled_reward(
    min_value_wrong: float = -0.5,
    max_value_wrong: float = -0.1,
    min_value_correct: float = 0.8,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
):
    """
    Returns a cosine scaled reward function. This function scales the accuracy reward
    based on completion length. Shorter correct solutions get higher rewards,
    longer incorrect solutions get less penalty.
    """
    def cosine_scaled_reward(completions, solution, accuracy_rewards, **kwargs):
        """
        Cosine scaled reward function that adjusts accuracy rewards based on completion length.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        for content, sol, acc_reward in zip(contents, solution, accuracy_rewards):
            gen_len = len(content)  # Length of the generated answer
            progress = gen_len / max_len  # How far we are to max length
            cosine = math.cos(progress * math.pi)  # Cosine value based on progress
            if acc_reward > 0.5:  # Assuming accuracy_reward gives ~1.0 for correct answers
                min_value = min_value_correct
                max_value = max_value_correct
            else:  # Incorrect answer
                min_value = max_value_wrong  # Note the swap!
                max_value = min_value_wrong
            # Cosine scaling formula!
            reward = min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)
            rewards.append(float(reward))
        return rewards
    return cosine_scaled_reward
get_cosine_scaled_reward(...)用于生成训练时使用的余弦缩放奖励函数,可以通过min_value_wrong/max_value_wrong(错误答案的惩罚范围)和min_value_correct/max_value_correct(正确答案的奖励范围)等参数自定义缩放,max_len则设置参与缩放的最大长度。
内部的cosine_scaled_reward(...)函数根据completions、solution和accuracy_rewards计算奖励。它先计算生成长度gen_len,将其归一化为progress = gen_len / max_len,再据此得到一个余弦值:回答越短越接近1,随长度增加逐渐降到-1。
如果acc_reward > 0.5,就使用正确答案的奖励范围;否则应用错误答案的范围,但交换最小/最大值,从而减轻对较长错误答案的惩罚。下面用几组不同长度的回答直观演示这一缩放效果。
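示意用法如下(假设get_cosine_scaled_reward已如上定义,accuracy_rewards为人为指定的取值,用重复字符模拟不同长度的回答):
# 余弦缩放奖励的行为演示:对比不同长度、不同对错的回答
cosine_fn = get_cosine_scaled_reward()  # 使用默认参数,max_len=1000

short_ans = [[{"content": "x" * 100}]]  # 较短的回答
long_ans = [[{"content": "x" * 900}]]   # 接近max_len的长回答

print(cosine_fn(short_ans, ["sol"], [1.0]))  # 正确且简短:约0.99,接近上限1.0
print(cosine_fn(long_ans, ["sol"], [1.0]))   # 正确但冗长:约0.80,被压向下限0.8
print(cosine_fn(short_ans, ["sol"], [0.0]))  # 错误且简短:约-0.49,惩罚最重
print(cosine_fn(long_ans, ["sol"], [0.0]))   # 错误但较长:约-0.11,惩罚较轻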
Reward中的重复惩罚系数
重复惩罚主要是为了阻止模型陷入循环并重复自己。我们希望它生成新鲜、多样的推理和答案,而不仅仅是复制粘贴相同的短语!
这个奖励函数会对模型使用相同词序列(n-gram)过多的情况进行惩罚。在以下的例子中,将使用大小为3的n-gram(三元组),当然这个值是可以调整的。如果模型大量重复自己,它会得到一个负奖励(惩罚)。如果它更加多样化并避免重复,惩罚就会减少。以下是实现惩罚重复的代码:
def get_repetition_penalty_reward(ngram_size: int = 3, max_penalty: float = -0.1):
    """
    Returns a repetition penalty reward function. Penalizes repetitions of n-grams
    in the generated text.
    """
    if max_penalty > 0:
        raise ValueError(f"max_penalty {max_penalty} should not be positive")

    def zipngram(text: str, ngram_size: int):
        """Helper function to generate n-grams from text."""
        words = text.lower().split()  # Lowercase and split into words
        return zip(*[words[i:] for i in range(ngram_size)])  # Create n-grams

    def repetition_penalty_reward(completions, **kwargs) -> list[float]:
        """
        Repetition penalty reward function.
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        for completion in contents:
            if completion == "":  # No penalty for empty completions
                rewards.append(0.0)
                continue
            if len(completion.split()) < ngram_size:  # No penalty for short completions
                rewards.append(0.0)
                continue
            ngrams = set()  # Use a set to store unique n-grams
            total = 0
            for ng in zipngram(completion, ngram_size):  # Generate n-grams
                ngrams.add(ng)  # Add n-gram to the set (duplicates are ignored)
                total += 1  # Count total n-grams
            # Calculate scaling factor: more repetition -> higher scaling
            scaling = 1 - len(ngrams) / total
            reward = scaling * max_penalty  # Apply penalty based on scaling
            rewards.append(reward)
        return rewards

    return repetition_penalty_reward
上面的get_repetition_penalty_reward(...)会创建一个惩罚重复的奖励函数,参数包括ngram_size(默认为3,即三元组)和max_penalty(一个负值,例如-0.1)。辅助函数zipngram(text, ngram_size)先把文本转为小写并按空格切分成单词,再用zip(*[words[i:] for i in range(ngram_size)])高效地生成n-gram。
在repetition_penalty_reward(...)内部,对每个补全(completion)计算惩罚:如果文本为空或太短,奖励为0.0;否则按scaling = 1 - len(ngrams) / total进行缩放,其中total是n-gram总数,len(ngrams)是去重后的n-gram数量。重复越多,scaling越接近1,惩罚越大。
最终奖励为scaling * max_penalty:重复较少时惩罚很小,重复严重时则得到更强的负奖励。下面用两段文本对比一下效果。
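示意用法如下(假设get_repetition_penalty_reward已如上定义):
# 重复惩罚的调用示意:高度重复 vs 基本不重复
rep_fn = get_repetition_penalty_reward(ngram_size=3, max_penalty=-0.1)

repetitive = [[{"content": "the answer is five the answer is five the answer is five"}]]
diverse = [[{"content": "we expand the binomial and pick the x^2 y^6 term"}]]

print(rep_fn(repetitive))  # 期望约 [-0.06]:大量重复的三元组触发惩罚
print(rep_fn(diverse))     # 期望为 [0.0]:几乎没有重复的三元组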
至此,以上已经实现了所有5个奖励函数,从下一章节开始将进入下一阶段:定义训练参数。
R1-Zero的训练配置
现在需要编写相应的配置,使上面定义的奖励函数在实际微调时能够正常工作。为此,先定义一个配置类:
# Define GRPOScriptArguments for reward function parameters
from dataclasses import dataclass, field

@dataclass
class GRPOScriptArguments:
    """
    Script arguments for GRPO training, specifically related to reward functions.
    """
    reward_funcs: list[str] = field(
        default_factory=lambda: ["accuracy", "format"],
        metadata={
            "help": "List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty'"
        },
    )
    cosine_min_value_wrong: float = field(
        default=-0.5,
        metadata={"help": "Minimum reward for cosine scaling for wrong answers"},
    )
    cosine_max_value_wrong: float = field(
        default=-0.1,
        metadata={"help": "Maximum reward for cosine scaling for wrong answers"},
    )
    cosine_min_value_correct: float = field(
        default=0.8,
        metadata={"help": "Minimum reward for cosine scaling for correct answers"},
    )
    cosine_max_value_correct: float = field(
        default=1.0,
        metadata={"help": "Maximum reward for cosine scaling for correct answers"},
    )
    cosine_max_len: int = field(
        default=1000,
        metadata={"help": "Maximum length for cosine scaling"},
    )
    repetition_n_grams: int = field(
        default=3,
        metadata={"help": "Number of n-grams for repetition penalty reward"},
    )
    repetition_max_penalty: float = field(
        default=-0.1,
        metadata={"help": "Maximum (negative) penalty for repetition penalty reward"},
    )
GRPOScriptArguments类保存奖励相关的设置。reward_funcs列表决定使用哪些奖励,默认为["accuracy", "format"],当然也可以加入"reasoning_steps"、"cosine"、"repetition_penalty"等。其余设置项则控制cosine_scaled_reward和repetition_penalty_reward的行为,从而调整奖励的给法。
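顺带一提,由于GRPOScriptArguments是一个dataclass,也可以借助transformers的HfArgumentParser从命令行读取这些奖励配置,方便在不同实验间切换。以下是一个简单示意(命令行参数取值仅为随意举例):
# 用HfArgumentParser从命令行解析奖励相关配置的示意
from transformers import HfArgumentParser

parser = HfArgumentParser(GRPOScriptArguments)  # 假设GRPOScriptArguments已如上定义
# 这里显式传入参数列表用于演示;实际脚本中可省略args,直接读取sys.argv
(script_args,) = parser.parse_args_into_dataclasses(
    args=["--cosine_max_len", "512", "--repetition_max_penalty", "-0.2"]
)
print(script_args.cosine_max_len)          # 512
print(script_args.repetition_max_penalty)  # -0.2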
接下来,使用transformers库中的TrainingArguments。这是控制训练过程的配置对象。
# Define TrainingArguments from transformers
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,           # Output directory for checkpoints and logs
    overwrite_output_dir=True,
    num_train_epochs=1,              # Total number of training epochs
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    gradient_accumulation_steps=2,   # Accumulate gradients to simulate larger batch size
    learning_rate=5e-5,              # Initial learning rate for AdamW optimizer
    warmup_ratio=0.1,                # Linear warmup over warmup_ratio fraction of training steps
    weight_decay=0.01,               # Apply weight decay to all layers except bias and LayerNorm weights
    logging_steps=10,                # Log every X updates steps
    evaluation_strategy="steps",     # Evaluate every `eval_steps`
    eval_steps=50,                   # Evaluation and logging steps
    save_strategy="steps",           # Save checkpoint every `save_steps`
    save_steps=50,                   # Save checkpoint every X updates steps
    save_total_limit=2,              # Limit the total amount of checkpoints. Deletes the older checkpoints.
    dataloader_num_workers=2,        # Number of subprocesses to use for data loading
    seed=42,                         # Random seed for reproducibility
    bf16=True,                       # Use mixed precision BF16 training
    push_to_hub=False,               # Whether to push the final model to Hugging Face Hub
    gradient_checkpointing=True,     # Enable gradient checkpointing
    report_to="none",                # Reporting to no one
)
最后,还需要一个ModelConfig。这里配置模型本身的设置,比如使用哪个预训练模型、使用什么数据类型(如bfloat16)等。定义如下:
from typing import Optional

@dataclass
class ModelConfig:
    """
    Configuration for the model.
    """
    model_name_or_path: str = field(
        default=MODEL_NAME, metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    model_revision: Optional[str] = field(
        default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}
    )
    torch_dtype: Optional[str] = field(
        default="bfloat16", metadata={"help": "Override the default `torch_dtype` and load the model under this dtype."}
    )
    trust_remote_code: bool = field(
        default=True, metadata={"help": "Trust remote code when loading model and tokenizer."}
    )
    attn_implementation: Optional[str] = field(
        default="flash_attention_2", metadata={"help": "Attention implementation to use. 'flash_attention_2' or None"}
    )
上述的ModelConfig类保存关键设置,包括model_name_or_path。同时使用torch_dtype="bfloat16"以提高效率,并设置trust_remote_code=True以允许加载带有自定义代码的模型。此外,如果环境支持flash_attention_2,还可以设置attn_implementation="flash_attention_2"以获得潜在的更快训练速度。
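ModelConfig本身只是一个普通的dataclass,真正加载模型时还需要把其中的字符串配置转换成from_pretrained可接受的参数。下面是一个简单的转换示意(load_model_from_config是本文自拟的辅助函数;flash_attention_2仅在装有flash-attn且硬件支持时可用,否则可把attn_implementation改为None):
# 把ModelConfig中的字段转换为AutoModelForCausalLM.from_pretrained的参数(示意)
import torch
from transformers import AutoModelForCausalLM

def load_model_from_config(model_args):
    """根据ModelConfig实例加载模型(自拟的辅助函数)。"""
    dtype = model_args.torch_dtype
    # 把"bfloat16"这类字符串映射为torch dtype;"auto"或None则原样传入
    if isinstance(dtype, str) and dtype != "auto":
        dtype = getattr(torch, dtype)
    return AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        torch_dtype=dtype,
        attn_implementation=model_args.attn_implementation,
    )

# model = load_model_from_config(model_args)  # 需要时再取消注释执行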
现在需要实际创建这些配置类的实例,以便后续使用:
# Instantiate configuration objects
script_args = GRPOScriptArguments()
model_args = ModelConfig()
接下来,需要获取奖励函数列表,以及希望在训练过程中使用的"回调"(callback)。回调就像小助手,可以在训练过程的不同时点做一些事情(比如记录进度、保存模型等)。目前只使用一个简单的日志记录回调。先把奖励函数集中到一起:
# Utility function to get reward functions based on script arguments
def get_reward_functions(script_args):
    """
    Returns a list of reward functions based on the script arguments.
    """
    reward_funcs_list = []
    reward_funcs_registry = {
        "accuracy": accuracy_reward,                # Assuming accuracy_reward is defined in previous steps
        "format": format_reward,                    # Assuming format_reward is defined in previous steps
        "reasoning_steps": reasoning_steps_reward,  # Assuming reasoning_steps_reward is defined
        "cosine": get_cosine_scaled_reward(         # Assuming get_cosine_scaled_reward is defined
            min_value_wrong=script_args.cosine_min_value_wrong,
            max_value_wrong=script_args.cosine_max_value_wrong,
            min_value_correct=script_args.cosine_min_value_correct,
            max_value_correct=script_args.cosine_max_value_correct,
            max_len=script_args.cosine_max_len,
        ),
        "repetition_penalty": get_repetition_penalty_reward(  # Assuming get_repetition_penalty_reward is defined
            ngram_size=script_args.repetition_n_grams,
            max_penalty=script_args.repetition_max_penalty,
        ),
    }
    for func_name in script_args.reward_funcs:
        if func_name not in reward_funcs_registry:
            raise ValueError(f"Reward function '{func_name}' not found in registry.")
        reward_funcs_list.append(reward_funcs_registry[func_name])
    return reward_funcs_list
我们的回调函数将跟踪损失和其他重要信息。
import logging

logger = logging.getLogger(__name__)

class LoggingCallback(TrainerCallback):
    """
    A simple callback for logging training information at specific steps.
    """
    def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        # Only log when a logging step is reached and log history is available
        if state.global_step % args.logging_steps == 0 and state.log_history:
            logger.info(f"Step {state.global_step}: Loss = {state.log_history[-1].get('loss', None)}, Learning Rate = {state.log_history[-1].get('learning_rate', None)}")

def get_callbacks(training_args, model_args, script_args):
    """
    Returns a list of callbacks to be used during training.
    For now, it includes only the LoggingCallback. You can extend this to add more callbacks.
    """
    callbacks = [LoggingCallback()]  # Instantiate our LoggingCallback
    return callbacks
最后,初始化这些函数。
# Get reward functions and callbacks
reward_functions = get_reward_functions(script_args)
callbacks = get_callbacks(training_args, model_args, script_args)
GRPO训练
至此,GRPO的训练已经万事俱备,只需要初始化GRPO训练的引擎,并为其提供我们已经准备的所有组件:模型、奖励函数、训练参数、数据集和回调函数! 初始化GRPO Trainer:
# GRPOConfig and GRPOTrainer come from trl: from trl import GRPOConfig, GRPOTrainer
# Create GRPOConfig from TrainingArguments
grpo_config = GRPOConfig(
    **training_args.to_dict()  # Convert TrainingArguments to dictionary and unpack
    # No model_init_kwargs here: we pass the instantiated 'model' object to GRPOTrainer directly
)
grpo_trainer = GRPOTrainer(
    model=model,                     # Our initialized Qwen model
    reward_funcs=reward_functions,   # List of reward functions from previous step
    args=grpo_config,                # GRPOConfig (created from TrainingArguments)
    train_dataset=dataset['train'],  # Training dataset
    eval_dataset=dataset['test'],    # Evaluation dataset
    callbacks=callbacks              # List of callbacks
)
现在可以开始训练了!只需要在grpo_trainer上调用train()方法即可。
# Start the GRPO Training Loop
train_result = grpo_trainer.train()
运行上述命令之后,应该就能看到训练过程开始。这里为了演示设置了num_train_epochs = 1,但对于实际的GRPO DeepSeek R1 Zero训练,可能需要训练更多的轮次和步骤。最终训练完成的日志信息如下:
{'eval_loss': 0.09589701145887375, 'eval_runtime': 237.893, 'eval_samples_per_second': 0.416, 'eval_steps_per_second': 0.029, 'eval_rewards/accuracy_reward': 0.2175, 'eval_rewards/format_reward': 0.96375, 'eval_reward': 1.18125, 'eval_reward_std': 0.24513342082500458, 'eval_completion_length': 40.237916717529295, 'eval_kl': 2.3843612051010132, 'epoch': 1.0}
{'train_runtime': 195268.8548, 'train_samples_per_second': 0.371, 'train_steps_per_second': 0.046, 'train_loss': 0.08140588973737459, 'rewards/accuracy_reward': 0.153125, 'rewards/format_reward': 0.9625, 'reward': 1.115625, 'reward_std': 0.22041844427585602, 'completion_length': 42.109375, 'kl': 2.380528378486633, 'epoch': 1.0}
GRPO Training Success
此时结果模型存于:/your_dir/model_results/Qwen/Qwen2.5-0.5B-Instruct/GRPO-training
此时的文件清单如下:
-rw-r--r-- 1 root root  758 Mar 26 16:00 config.json
-rw-r--r-- 1 root root  242 Mar 26 16:00 generation_config.json
-rw-r--r-- 1 root root 943M Mar 26 16:00 model.safetensors
-rw-r--r-- 1 root root 7.2K Mar 26 16:00 tokenizer_config.json
-rw-r--r-- 1 root root  613 Mar 26 16:00 special_tokens_map.json
-rw-r--r-- 1 root root  605 Mar 26 16:00 added_tokens.json
-rw-r--r-- 1 root root 2.7M Mar 26 16:00 vocab.json
-rw-r--r-- 1 root root 1.6M Mar 26 16:00 merges.txt
-rw-r--r-- 1 root root  11M Mar 26 16:00 tokenizer.json
-rw-r--r-- 1 root root 5.8K Mar 26 16:00 training_args.bin
-rw-r--r-- 1 root root 1.9G Mar 26 16:00 optimizer.pt
-rw-r--r-- 1 root root 1.1K Mar 26 16:00 scheduler.pt
-rw-r--r-- 1 root root  14K Mar 26 16:00 rng_state.pth
-rw-r--r-- 1 root root 423K Mar 26 16:00 trainer_state.json
保存Tiny R1 Zero LLM
一旦训练完成,可以保存我们的训练模型,用于推理。
# Define the path to your trained model (same as OUTPUT_DIR)
TRAINED_MODEL_PATH = "data/Qwen-GRPO-training"
# Save the tokenizer
tokenizer.save_pretrained(TRAINED_MODEL_PATH)
# Save the trained model
grpo_trainer.save_model(TRAINED_MODEL_PATH)
print(f"GRPO Trained model saved to {TRAINED_MODEL_PATH}")
然后可以简单地使用以下方式加载训练好的模型:
# Load the tokenizer - make sure to use trust_remote_code=True if needed
tokenizer = AutoTokenizer.from_pretrained(
    TRAINED_MODEL_PATH,
    trust_remote_code=True,  # If your model config requires it
    padding_side="right"     # Ensure consistent padding side
)
# Set pad token if it wasn't saved or loaded correctly
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the trained model itself
trained_model = AutoModelForCausalLM.from_pretrained(
    TRAINED_MODEL_PATH,
    trust_remote_code=True,      # If your model architecture requires it
    torch_dtype=torch.bfloat16   # Keep the same dtype as training for consistency
)
# Move the loaded model to your device (GPU if available)
trained_model.to(device)  # 'device' is still our CUDA device from before
然后用它进行推理测试:
# Testing Inference with the Trained Model
def test_trained_model_inference(user_input: str):
    """Test inference with the loaded trained model and tokenizer."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},  # Re-use our system prompt
        {"role": "user", "content": user_input}
    ]
    # Apply chat template using our tokenizer
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Generate output using our *trained_model*
    outputs = trained_model.generate(
        **inputs,
        max_new_tokens=200,  # Maybe generate a bit longer now
        do_sample=True,
        temperature=0.7
    )
    # Decode the generated tokens back to text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test the trained model (this invocation corresponds to the printed output below)
test_input = "A是C的爸爸,B是C的妈妈,那么A和B是什么关系?"
response = test_trained_model_inference(test_input)
print(f"Test Input: {test_input}")
print(f"Trained Model Response: {response}")
输出结果如下:
Test Input: A是C的爸爸,B是C的妈妈,那么A和B是什么关系?
Trained Model Response: system
A conversation between User and Assistant. The user asks a question,
and the Assistant solves it. The assistant
first thinks about the reasoning process in the mind and
then provides the user with the answer. The reasoning
process and answer are enclosed within <think> </think>
and <answer> </answer> tags, respectively, i.e.,
<think> reasoning process here </think><answer> answer here </answer>
user
A是C的爸爸,B是C的妈妈,那么A和B是什么关系?
assistant
<think> A 和 B 是 C 的父女关系。因为题干中提到 "A 是 C 的爸爸" 和 "B 是 C 的妈妈",所以根据这些信息,我们可以推断出 A 和 B 是 C 的父子关系。</thought>
<answer> 父母关系</answer>
R1 Zero中的2个主要问题
现在,我们已经用自己选择的基础模型Qwen2.5-0.5B(而不是官方的DeepSeek-V3原始基础模型)走完了R1-Zero的训练流程。
DeepSeek的研究人员发现,R1-Zero模型在推理测试中表现非常好,甚至在AIME 2024这样的任务上取得了与更高级的模型(如OpenAI-o1-0912)相近的分数。
这表明使用强化学习(RL)鼓励语言模型进行推理是一种有前景的方法。
但他们也注意到DeepSeek-R1-Zero有一些关键问题需要解决,以便在实际应用和更广泛的研究中使用。
DeepSeek的研究人员表示,模板故意设计得简单且注重结构,从而避免对推理过程本身施加任何特定的内容约束。例如,它不会说:
- "你必须使用逐步推理"(它只说"推理过程",让模型自己定义这意味着什么)。
- "你必须使用反思性推理"
- "你必须使用特定的问题解决策略"
主要问题是<think>标签内的推理过程难以阅读,使人类难以跟踪和分析。
另一个问题是语言混合,当被问到多语言问题时,模型有时会在同一回答中混合语言,导致输出不一致和混乱。
如果你用西班牙语提问,突然间,它的"思考"会变成英语和西班牙语的混合体!这些问题,推理混乱和语言混淆,显然是需要克服的。
这也是官方将初始R1 Zero模型进一步升级为R1的两个主要原因。