Reasoning Models in Practice | Training Your Own R1 Model (Part 1): SFT Pre-Finetuning as a Prelude to GRPO

  1. Introduction

Given the remarkable results reasoning models have achieved across many domains, we continue our existing series of articles with a deeper look at the leading models and techniques in this area.

A quick recap of earlier articles in the reasoning-model series:

Reasoning Models | The DeepSeek model family at a glance: from LLM to R1

Reasoning Models | A deep dive into the reinforcement learning behind DeepSeek R1

Reasoning Models | How DeepSeek-R1 uses reinforcement learning, cold starts, and distillation to rethink LLM training

Reasoning Models | DeepSeek-R1 rivals OpenAI's full o1 (technical report walkthrough)

Reasoning Models | Reinforcement learning for LLM reasoning in practice: GRPO as an example

Today we take a hands-on approach and show how to train a private, customized R1-style reasoning model with Unsloth. The concrete training goal: use OpenR1's Math dataset and GRPO to turn Qwen3-4B-Base into a reasoning model.

Since the full walkthrough is long, it is split into two parts. This part (Part 1) focuses on using supervised fine-tuning (SFT) to train a base model that strictly follows our custom GRPO format. This step matters: it provides a high-quality initialization for the later training. In Part 2, we will start from this model and apply GRPO to optimize it further, ultimately producing our own target R1-style model.

The complete code for Part 1 is available from the WeChat account "小窗幽记机器学习": reply "推理模型实战1-上篇" to receive it. If you would like to discuss further, you can also add the author on WeChat via the same account.

For more hands-on LLM content, follow the WeChat account "小窗幽记机器学习".

  2. Unsloth Overview

Unsloth project repository: https://github.com/unslothai/unsloth

Unsloth is an open-source acceleration framework designed specifically for fine-tuning large language models (LLMs). It optimizes the training process to significantly speed up fine-tuning and reduce memory consumption while preserving model accuracy. Unsloth supports many popular LLMs such as Llama, Mistral, and Gemma, and integrates seamlessly with the Hugging Face Transformers library.

Unsloth's main features and advantages include:

  • Efficient fine-tuning performance:

Through optimized kernels and a hand-written backpropagation engine, Unsloth achieves roughly 2x faster training.

  • Lower memory consumption:

Unsloth uses about 70%-80% less memory than conventional fine-tuning approaches, which means you can work with larger datasets and more complex models on the same hardware.

  • Preserved model accuracy:

Unsloth introduces dynamic 4-bit quantization, which selectively leaves certain parameters unquantized. This noticeably improves accuracy while using less than 10% more VRAM than plain BnB 4-bit quantization, so the speed and memory savings do not degrade the quality of the fine-tuned model. (A minimal loading sketch follows this list.)

  • Broad model support:

Unsloth supports many popular LLMs, including Llama 3.3, Mistral, Phi-4, Qwen 2.5, and Gemma. Whether you want to fine-tune a 70B-parameter Llama 3.3 or a 9B-parameter Gemma, Unsloth can handle it.
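
As a quick illustration of the dynamic 4-bit path mentioned above, here is a minimal loading sketch. It reuses the unsloth/Qwen3-4B-Base checkpoint from later in this article and simply flips load_in_4bit; treat it as an illustration rather than the configuration we actually train with (the training code below uses 16-bit LoRA).

from unsloth import FastLanguageModel

# Minimal sketch: load a model with Unsloth's dynamic 4-bit quantization.
# The checkpoint name and flags mirror the training code later in this article.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = 2048,
    load_in_4bit = True,  # dynamic 4-bit quantization to cut VRAM usage
)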


  3. Choosing a Base Model


First, a quick explanation of the difference between a Base model and an Instruct model:

Base model

  • Definition: a Base model is a foundation model that has not been fine-tuned for any specific task. It is usually obtained by self-supervised learning on large-scale data: it learns language structure from the statistical patterns in text, but it is not optimized for any particular task.
  • Training objective: a Base model is trained to predict the next token (a word or sub-word piece). It learns general language features from broad text data, but it has no built-in ability to follow task instructions.

Instruct model

  • Definition: an Instruct model is obtained from a Base model via instruction tuning. It is designed to carry out tasks according to user instructions, such as generation, question answering, or translation.
  • Training objective: an Instruct model not only learns language patterns but is also trained to understand and follow explicit user instructions. This is done with supervised learning: the model is fine-tuned on large numbers of human instructions paired with the desired outputs, so it handles explicit task requests much better.
  • Characteristics (a minimal sketch contrasting the two follows this list):
    ◦ Task-oriented: the model understands not just the language but the task requirement, and generates the corresponding output.
    ◦ Strong instruction following: it can complete requests such as text generation and question answering, clearly outperforming a Base model at following instructions.
    ◦ User-friendly: compared with a Base model, it is designed for real application scenarios and can usually be used directly for tasks without further fine-tuning.
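
To make the distinction concrete, here is a minimal sketch using the plain Transformers API (the small Qwen checkpoints are only illustrative placeholders): a Base model is prompted with raw text and simply continues it, while an Instruct model expects the conversation to be wrapped in its chat template.

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Translate to French: I love machine learning."

# Base model: plain next-token continuation of the raw text.
base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")           # illustrative checkpoint
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
base_out = base_model.generate(**base_tok(prompt, return_tensors="pt"), max_new_tokens=32)

# Instruct model: the same request is first wrapped in the chat template.
inst_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative checkpoint
inst_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
chat = inst_tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize = False, add_generation_prompt = True,
)
inst_out = inst_model.generate(**inst_tok(chat, return_tensors="pt"), max_new_tokens=32)
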
  4. Format Pre-Finetuning Setup

We first pre-finetune the model so that the subsequent GRPO training is faster and more stable.

  
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

# Load the Qwen3-4B-Base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)
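
If you want to confirm that only the LoRA adapter weights will be updated, PEFT-style models expose print_trainable_parameters(); a minimal sketch, assuming the object returned by get_peft_model behaves like a standard PEFT model:

# Optional sanity check: only the LoRA adapter parameters should be trainable.
model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...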

DeepSeek uses <think></think> to wrap the reasoning in its data, but these tags are not set in stone. Here we use <start_working_out> and <end_working_out> instead, and you can define whatever HTML-style tags you like. Finally, we recommend putting this instruction into the system_prompt.

  
# Define the fixed formatting tags
reasoning_start = "<start_working_out>" # Acts as <think>
reasoning_end   = "<end_working_out>"   # Acts as </think>
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

"""
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>
"""

Below we create a simple chat template. Note that add_generation_prompt appends the prefix <start_working_out>, which nudges the model to begin its reasoning.

  
# Chat template
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace the placeholders with our specific values:
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template

Let's look at an example generated with the template above.

  
tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"{reasoning_start}I think it's 2.{reasoning_end}{solution_start}2{solution_end}"},
    {"role" : "user", "content" : "What is 2+2?"},
], tokenize = False, add_generation_prompt = True)

"""
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION><|endoftext|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>What is 2+2?<start_working_out>
"""

  5. Training Data Preprocessing

We now use a subset (unsloth/OpenMathReasoning-mini) of NVIDIA's open math reasoning dataset (nvidia/OpenMathReasoning), which has been filtered to contain only high-quality DeepSeek R1 reasoning traces.

We filter out roughly 59 examples by rule and use them to "prime" the model during pre-finetuning, helping it internalize our custom GRPO format. The code that builds the data in this format is shown below:

  
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Try converting to number - if not, replace with NaN
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
# Select only numbers
dataset = dataset.iloc[np.where(is_number)[0]]

# Reformat each row into the GRPO training format
def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]

    # Remove generated <think> and </think>
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Strip newlines on left and right
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : problem},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis = 1)
tokenizer.apply_chat_template(dataset["Messages"][0], tokenize = False)

dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))
# Keep only examples shorter than max_seq_length/2, since we do not want overly long reasoning traces
dataset = dataset.loc[dataset["N"] <= max_seq_length/2].copy()

At this point we have the filtered dataset for the subsequent fine-tuning.

Next, we convert it into a Hugging Face-compatible dataset format.

  
from datasets import Dataset

dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize = False)
dataset = Dataset.from_pandas(dataset)
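
Before launching training, it is worth a quick sanity check that the formatted text carries our custom tags; a minimal sketch:

# Optional sanity check: dataset size and one fully formatted training example.
print(len(dataset))
print(dataset[0]["text"][:300])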

  6. Pre-Finetuning (SFT)

We can now train the model to follow the GRPO format we defined.

  
from unsloth import is_bfloat16_supported

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        bf16 = is_bfloat16_supported(),
        fp16 = not is_bfloat16_supported(),
        gradient_accumulation_steps = 1, # Use gradient accumulation to simulate a larger batch size
        warmup_steps = 5,
        num_train_epochs = 2, # Number of training epochs
        learning_rate = 2e-4, # Reduce to 2e-5 for longer training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Set to e.g. "wandb" to log to an experiment tracker; "none" disables reporting
    ),
)

trainer.train()

As configured in the training arguments, a log entry is printed every 5 steps. The figure below shows the training loss decreasing steadily until it finally levels off.

[Figure: SFT training log, with the training loss printed every 5 steps]
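
Since Part 2 will resume GRPO training from this SFT checkpoint, it is worth persisting the LoRA adapter and tokenizer once training finishes. A minimal sketch, using the standard save_pretrained interface and a hypothetical output directory name:

# Save the LoRA adapter and tokenizer so Part 2 (GRPO) can start from this checkpoint.
# "qwen3-4b-grpo-format-sft" is just a hypothetical directory name.
model.save_pretrained("qwen3-4b-grpo-format-sft")
tokenizer.save_pretrained("qwen3-4b-grpo-format-sft")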

  7. Validating the Model

After training, we need to verify whether the SFT model follows the output format we specified.

  
text = tokenizer.apply_chat_template(
    dataset[0]["Messages"][:2],
    tokenize = False,
    add_generation_prompt = True, # Appends the <start_working_out> generation prefix
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0,
    max_new_tokens = 1024,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

After these steps, the model's output is shown below. The model indeed follows the thinking format we specified: the generation contains the <start_working_out> and <end_working_out> tags, and Qwen3-4B-Base now exhibits a "reasoning" mode:

  
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION><|endoftext|>Jenifer has 82 cents in pennies and nickels. Her younger brother mistook all her nickels for dimes and counted the total as $1.47. How many pennies does Jenifer have?<start_working_out>Okay, let's see. Jenifer has 82 cents in pennies and nickels. Her brother thought all the nickels were dimes and counted the total as $1.47. I need to find out how many pennies she has. Hmm, let's break this down.
  
First, I need to set up some equations. Let's say the number of pennies is P and the number of nickels is N. Since pennies are worth 1 cent each and nickels are 5 cents each, the total value Jenifer has is:  
  
P + 5N = 82 cents.  
  
Now, her brother thought all the nickels were dimes. Dimes are 10 cents each. So, he counted the total as $1.47, which is 147 cents. So, the equation based on his mistaken count would be:  
  
P + 10N = 147 cents.  
  
Now I have two equations:  
  
1) P + 5N = 82  
2) P + 10N = 147  
  
I need to solve these two equations to find P. Let me subtract the first equation from the second to eliminate P. So, subtracting equation 1 from equation 2:  
  
(P + 10N) - (P + 5N) = 147 - 82  
  
That simplifies to:  
  
5N = 65  
  
Then, dividing both sides by 5:  
  
N = 13  
  
So, there are 13 nickels. Now, plug N back into the first equation to find P:  
  
P + 5(13) = 82  
  
5*13 is 65, so:  
  
P + 65 = 82  
  
Subtract 65 from both sides:  
  
P = 17  
  
So, Jenifer has 17 pennies. Let me check that. If she has 17 pennies and 13 nickels, that's 17 + 65 = 82 cents, which matches. Her brother thought the nickels were dimes, so 13 dimes would be 130 cents. Adding the 17 pennies gives 147 cents, which is $1.47. That checks out. So, the answer should be 17 pennies.  
To solve the problem, we start by defining the variables:  
- Let \( P \) be the number of pennies.  
- Let \( N \) be the number of nickels.  
  
We know the following:  
1. The total value of the pennies and nickels is 82 cents.  
2. The brother mistakenly counted the nickels as dimes and the total as $1.47 (147 cents).  
  
We can set up the following system of equations based on the given information:  
1. \( P + 5N = 82 \) (total value in cents)  
2. \( P + 10N = 147 \) (mistaken total value in cents)  
  
To find the number of pennies \( P \), we subtract the first equation from the second:  
\[  
(P + 10N) - (P + 5N) = 147 - 82  
\]  
This simplifies to:  
\[  
5N = 65  
\]  
Solving for \( N \):  
\[  
N = \frac{65}{5} = 13  
\]  
  
Now that we know \( N = 13 \), we substitute this value back into the first equation to find \( P \):  
\[  
P + 5(13) = 82  
\]  
\[  
P + 65 = 82  
\]  
\[  
P = 82 - 65 = 17  
\]  
  
Thus, Jenifer has \(\boxed{17}\) pennies.<end_working_out><SOLUTION>17</SOLUTION><|endoftext|>

As we can see, the entire reasoning trace is wrapped in <start_working_out> and <end_working_out>, and the final answer is wrapped in <SOLUTION></SOLUTION>.
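
Rather than checking the output by eye, you can also verify the format programmatically; the GRPO reward functions in Part 2 will need exactly this kind of parsing. A minimal sketch (the regex and helper below are our own, not part of Unsloth):

import re

# Extract the reasoning trace and the solution from a completion, if present.
format_pattern = re.compile(
    r"<start_working_out>(.*?)<end_working_out>.*?<SOLUTION>(.*?)</SOLUTION>",
    flags = re.DOTALL,
)

def check_format(completion: str):
    match = format_pattern.search(completion)
    if match is None:
        return None  # the model did not follow the required format
    thoughts, answer = match.group(1), match.group(2)
    return thoughts.strip(), answer.strip()

# For the generation above, check_format(...) returns the working-out text and "17".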

  8. Summary

This article took a hands-on look at building your own reasoning model with the Unsloth framework, covering the full pipeline: choosing a base model, the pre-finetuning strategy, the custom reasoning format, training data construction, and final model validation. We showed how to fine-tune Qwen3-4B-Base into a model with R1-style thinking that structures its reasoning output with explicit tags such as <start_working_out> and <SOLUTION>.

Thanks to Unsloth's efficient fine-tuning machinery, we reduced resource consumption while gaining controllable generation of complex chains of thought.
