open-r1: a code walkthrough

A close look at Hugging Face's open-source code for reproducing R1.

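Background

DeepSeek R1 is trained in two stages and refined continually through curriculum learning; part of the second-stage data comes from the first stage.

Stage one is pure RL; stage two is SFT + RL.

picture.image

Once R1 is trained, distilling it into smaller models gives those models very strong reasoning performance, better than what they reach when reinforcement learning is applied to them directly.

picture.image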

OpenR1

Reproducing the distillation step: use R1 to build chain-of-thought reasoning data, then fine-tune a small model on it with SFT.

picture.image

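The dataset draws on Chinese high-school math exercises and on US and international mathematical olympiad problems. Most of the data comes from online exam-paper PDFs and math discussion forums. The processing steps are (a) OCR of the original PDFs, (b) splitting into problem-solution pairs, (c) translation into English, (d) rearranging the solutions into a CoT reasoning format, and (e) formatting the final answer.

Building the chain-of-thought data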

picture.image

SFT training

picture.image https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L554

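The code linked above is the label-smoothing loss that transformers can apply to generative models. It is only used when label_smoothing_factor is non-zero; with the default value of 0 the loss is ordinary cross-entropy.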

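Label smoothing sets the target probability of the correct label to 1 - epsilon and spreads the remaining epsilon evenly over the other labels.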
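To make the definition concrete, here is a minimal pure-Python sketch of a label-smoothed cross-entropy for a single token. It follows the description above literally (epsilon shared among the incorrect classes); the function names are hypothetical, and library implementations may differ in detail, for example by spreading epsilon over all classes.

```python
import math


def smoothed_targets(num_classes, correct_index, epsilon):
    """Target distribution: 1 - epsilon on the correct class,
    epsilon split evenly over the remaining classes."""
    off_value = epsilon / (num_classes - 1)
    return [
        1.0 - epsilon if i == correct_index else off_value
        for i in range(num_classes)
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_loss(logits, correct_index, epsilon=0.1):
    """Cross-entropy between the smoothed targets and the model distribution."""
    targets = smoothed_targets(len(logits), correct_index, epsilon)
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))
```

With epsilon set to 0 this reduces to the usual negative log-likelihood of the correct token.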

picture.image

GRPO

picture.image
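The core idea of GRPO is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value model is needed. A minimal sketch of that group-relative advantage follows; the function name, the small eps term, and the use of the population standard deviation are assumptions for illustration, and real implementations may normalise slightly differently.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise the rewards of one prompt's group of completions.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.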

Reward:

Accuracy

def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
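The excerpt above only shows the signature, so here is a deliberately simplified sketch of what such a reward can look like. It assumes each completion is a list of chat messages with a "content" field, and it uses a plain string comparison as a stand-in for a proper mathematical equivalence check; the name exact_match_reward is hypothetical.

```python
def exact_match_reward(completions, solution, **kwargs):
    """Return 1.0 when a completion equals its reference answer, else 0.0.

    Assumption: completions is a list of conversations, each a list of
    message dicts, and the generated text is in the first message.
    """
    contents = [completion[0]["content"] for completion in completions]
    return [
        1.0 if content.strip() == sol.strip() else 0.0
        for content, sol in zip(contents, solution)
    ]
```

A string match would treat "1/2" and "0.5" as different answers, which is why a real accuracy reward needs to parse and compare the mathematical expressions instead.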

Overall format

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
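As a sketch of how the pattern above can be turned into a reward, the function below gives 1.0 to completions that consist of a think block followed by an answer block and 0.0 otherwise. The function name, the message layout (a list of dicts with a "content" field), and the choice of re.DOTALL are assumptions for illustration.

```python
import re


def format_reward_sketch(completions, **kwargs):
    """Score 1.0 if the whole completion is <think>...</think><answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, since reasoning usually spans lines.
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]
```

Because the pattern is anchored at both ends, any text before the think block or after the answer block makes the completion fail the check.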

Step-by-step format

def reasoning_steps_reward(completions, **kwargs):
    r"""Reward function that checks for clear step-by-step reasoning.
    Regex pattern:
        Step \d+: - matches "Step 1:", "Step 2:", etc.
        ^\d+\. - matches numbered lists like "1.", "2.", etc. at start of line
        \n- - matches bullet points with hyphens
        \n\* - matches bullet points with asterisks
        First,|Second,|Next,|Finally, - matches transition words
    """
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
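One way to turn this pattern into a score is to count the step markers and saturate the reward once enough of them appear. The sketch below does that; the function name and the target_steps threshold of 3 are assumptions, not values shown in the excerpt.

```python
import re


def reasoning_steps_reward_sketch(completions, target_steps=3, **kwargs):
    """Reward grows with the number of step markers, capped at 1.0."""
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    contents = [completion[0]["content"] for completion in completions]
    counts = [len(re.findall(pattern, c)) for c in contents]
    return [min(1.0, n / target_steps) for n in counts]
```

Note that without re.MULTILINE the "^" in the pattern only matches at the very start of the text, so numbered lists further down are counted only when they also hit another alternative.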

Length

    def cosine_scaled_reward(completions, solution, **kwargs):
        """Reward function that scales based on completion length using a cosine schedule.

        Shorter correct solutions are rewarded more than longer ones.
        Longer incorrect solutions are penalized less than shorter ones.
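The two properties in the docstring can be obtained with a single cosine interpolation over the completion length. The sketch below works on one completion at a time; the function name, the parameter names, and all default values (including max_len) are assumptions chosen for illustration.

```python
import math


def cosine_length_reward(length, is_correct, max_len=1000,
                         min_value_wrong=-1.0, max_value_wrong=-0.5,
                         min_value_correct=0.5, max_value_correct=1.0):
    """Interpolate the reward along a cosine as the completion gets longer."""
    progress = min(length, max_len) / max_len       # 0.0 (empty) .. 1.0 (max_len)
    cosine = math.cos(progress * math.pi)           # 1.0 .. -1.0
    if is_correct:
        low, high = min_value_correct, max_value_correct
    else:
        # Swapped on purpose: short wrong answers get the harshest penalty.
        low, high = max_value_wrong, min_value_wrong
    return low + 0.5 * (high - low) * (1.0 + cosine)
```

With these defaults a correct answer slides from 1.0 down to 0.5 as it gets longer, while a wrong answer moves from -1.0 up to -0.5.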

Repeated n-gram penalty

def get_repetition_penalty_reward(ngram_size: int, max_penalty: float):
    """
    Computes N-gram repetition penalty as described in Appendix C.2 of https://arxiv.org/abs/2502.03373.
    Reference implementation from: https://github.com/eddycmu/demystify-long-cot/blob/release/openrlhf/openrlhf/reward/repetition.py
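A minimal sketch of an n-gram repetition penalty: measure what fraction of the n-grams in a completion are repeats, and scale a negative max_penalty by that fraction. Whitespace tokenisation, lower-casing, the function name, and the default values are assumptions for illustration; this is not a copy of the reference implementation linked above.

```python
def repetition_penalty(text, ngram_size=3, max_penalty=-1.0):
    """Return a value between max_penalty (all repeats) and 0 (no repeats)."""
    words = text.lower().split()
    # Sliding window of n-grams, e.g. (w0, w1, w2), (w1, w2, w3), ...
    ngrams = list(zip(*[words[i:] for i in range(ngram_size)]))
    if not ngrams:
        return 0.0  # text shorter than one n-gram: nothing to penalise
    repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return repeated_fraction * max_penalty
```

A completion with no repeated n-grams is left unpenalised, and one that loops on the same phrase approaches max_penalty.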

picture.image