- Introduction
Given the remarkable results reasoning models have achieved across many domains, we continue our existing series of articles with a deeper look at the frontier models and techniques in this area.
Previous articles in the reasoning-model series:
Reasoning Model Series | An Overview of the DeepSeek Model Family: From LLM to R1
Reasoning Model Series | A Deep Dive into the Reinforcement Learning Behind DeepSeek R1
Reasoning Model Series | How DeepSeek-R1 Uses Reinforcement Learning, Cold Start, and Distillation to Open a New Path for LLM Training
Reasoning Model Series | DeepSeek-R1 Rivals OpenAI's Full o1 (Technical Report Walkthrough)
Reasoning Model Series | Reinforcement Learning for LLM Reasoning in Practice: GRPO as an Example
Today's article takes a hands-on perspective and shows how to train a private, customized R1-style reasoning model with Unsloth. The concrete training goal: use OpenR1's Math dataset and GRPO to turn Qwen3-4B-Base into a reasoning model.
Because the full walkthrough is long, it is split into two parts. This article (part one) focuses on using supervised fine-tuning (SFT) to train a base model that strictly follows our custom GRPO format. This step matters: it provides a high-quality initial model for the subsequent training. In part two, starting from this model, we will apply the GRPO algorithm for further optimization and build our own target R1-style model.
The complete code for part one is available from the WeChat public account "小窗幽记机器学习": reply "推理模型实战1-上篇" to get it. To discuss further, you can also add the author's WeChat via the same account.
- Unsloth Overview
Unsloth project: https://github.com/unslothai/unsloth
Unsloth is an open-source acceleration framework designed specifically for fine-tuning large language models (LLMs). By optimizing the training process, it substantially speeds up fine-tuning and cuts memory consumption while keeping model accuracy unchanged. Unsloth supports many popular LLMs such as Llama, Mistral, and Gemma, and integrates seamlessly with the Hugging Face Transformers library.
Unsloth's main features and advantages include:
- Efficient fine-tuning performance:
Through optimized kernels and a hand-written backpropagation engine, Unsloth achieves roughly a 2x training speedup.
- Lower memory consumption:
Unsloth uses about 70-80% less memory than conventional fine-tuning, which means you can handle larger datasets and more complex models on the same hardware.
- Preserved model accuracy:
Unsloth introduces dynamic 4-bit quantization, which selectively leaves certain parameters unquantized. This markedly improves accuracy while using less than 10% more VRAM than plain BnB 4-bit quantization, so the speed and memory gains come without hurting the fine-tuned model's quality (a loading sketch with this option toggled on follows this list).
- Support for many models:
Unsloth supports many popular LLMs, including Llama 3.3, Mistral, Phi-4, Qwen 2.5, and Gemma. Whether you want to fine-tune a 70B-parameter Llama 3.3 model or a 9B-parameter Gemma model, Unsloth has you covered.
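As a quick illustration, here is a minimal loading sketch (not taken from this article's setup) showing how the 4-bit option is typically toggled; the actual training below loads the model in 16-bit for LoRA instead.

# Minimal sketch: loading a model with Unsloth's dynamic 4-bit quantization enabled.
# The article's real setup in the next section uses load_in_4bit = False (16-bit LoRA).
from unsloth import FastLanguageModel

quant_model, quant_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = 2048,
    load_in_4bit = True,   # dynamic 4-bit quantization to reduce VRAM usage
)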
- Choosing a Base Model
First, a quick look at the difference between a Base model and an Instruct model:
Base model
- Definition: a Base model is a foundation model that has not been fine-tuned for any specific task. It is usually obtained through self-supervised learning on large-scale data and learns language structure from statistical patterns in text, but it is not optimized for any particular task.
- Training objective: a Base model is trained to predict the next token (a word or sub-word piece). It learns general language features from broad text data but has no built-in ability to follow task instructions.
Instruct model
- Definition: an Instruct model is obtained from a Base model through instruction tuning. Such models are designed to carry out tasks according to user instructions, for example generation, question answering, and translation.
- Training objective: an Instruct model learns not only language patterns but also how to understand and follow explicit user instructions. This is achieved with supervised learning: the model is fine-tuned on large numbers of human instructions paired with the corresponding outputs, so it handles explicit task requests much better.
- Characteristics: task-oriented (it understands the task as well as the language and produces relevant output); strong instruction following (it can complete tasks such as text generation and question answering on request, clearly outperforming a Base model at instruction execution); user friendly (it is designed for real application scenarios and can usually be used directly, without further fine-tuning). The prompting sketch below illustrates the difference in practice.
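The difference is easiest to see in how the two kinds of models are prompted. Below is an illustrative sketch (the model names are examples chosen for demonstration, not models used in this article): a Base model is given raw text and simply continues it, while an Instruct model expects its chat template around the user message.

# Sketch: prompting a Base model vs. an Instruct model (model names are illustrative only).
from transformers import AutoTokenizer

# Base model: a plain continuation prompt, no special structure
base_prompt = "Question: What is 1 + 1?\nAnswer:"

# Instruct model: the user message is wrapped by the model's chat template
inst_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example instruct checkpoint
inst_prompt = inst_tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 1 + 1?"}],
    tokenize = False,
    add_generation_prompt = True,
)
print(base_prompt)
print(inst_prompt)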
- Format Pre-Fine-Tuning Setup
We first pre-fine-tune the model, which speeds up and stabilizes the subsequent GRPO training.
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Can increase for longer reasoning traces
lora_rank = 32         # Larger rank = smarter, but slower

# Load the Qwen3-Base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False,            # False for LoRA 16bit
    fast_inference = True,           # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7,    # Reduce if out of memory
)
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2,                  # *2 speeds up training
    use_gradient_checkpointing = "unsloth",    # Reduces memory usage
    random_state = 3407,
)
DeepSeek uses <think></think> tags to build its reasoning data, but these tags are not set in stone. Here we substitute <start_working_out> and <end_working_out>; you can define whatever HTML-style tags you want the model to emit. Finally, it is recommended to put this instruction into the system_prompt.
# Define the fixed tags used to mark the output format
reasoning_start = "<start_working_out>"  # Acts as <think>
reasoning_end = "<end_working_out>"      # Acts as </think>
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""
"""
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>
"""
Below we create a simple chat template. Note that when add_generation_prompt is set, the template appends the prefix <start_working_out>, which nudges the model to begin its reasoning.
# Chat template
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace with our specific template:
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template
Let's look at an example rendered with this template.
tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : f"{reasoning_start}I think it's 2.{reasoning_end}{solution_start}2{solution_end}"},
    {"role" : "user", "content" : "What is 2+2?"},
], tokenize = False, add_generation_prompt = True)
"""
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION><|endoftext|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|endoftext|>What is 2+2?<start_working_out>
"""
- Training Data Preprocessing
We now use a subset (unsloth/OpenMathReasoning-mini) of NVIDIA's open math reasoning dataset (nvidia/OpenMathReasoning), which has been filtered to keep only high-quality DeepSeek R1 reasoning traces.
From it we filter out roughly 59 examples by rule to "initialize" the pre-fine-tuned model and help it learn our custom GRPO format. The code that builds the data into this format is shown below:
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]
# Try converting to number - if not, replace with NaN
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
# Select only numbers
dataset = dataset.iloc[np.where(is_number)[0]]

# Build each row into the GRPO training format
def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]
    # Remove generated <think> and </think>
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")
    # Strip newlines on left and right
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : problem},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis = 1)
tokenizer.apply_chat_template(dataset["Messages"][0], tokenize = False)
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))
# Truncate the pre-fine-tuning dataset to max_seq_length/2, since we do not want overly long reasoning traces
dataset = dataset.loc[dataset["N"] <= max_seq_length/2].copy()
At this point we have the filtered dataset (dataset) for the subsequent fine-tuning.
Next, we convert it into a Hugging Face-compatible Dataset.
from datasets import Dataset

dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize = False)
dataset = Dataset.from_pandas(dataset)
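Before launching SFT, it can be worth spot-checking one formatted sample. A small sketch using the dataset and tokenizer built above:

# Sanity check: print one rendered training string and its token count.
sample = dataset[0]["text"]
print(sample[:500])                                    # start of the rendered conversation
print("tokens:", len(tokenizer(sample)["input_ids"]))  # should be <= max_seq_length/2 after the filter above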
- Pre-Fine-Tuning (SFT)
Now we can pre-fine-tune the model so that it follows the GRPO format we defined.
from unsloth import is_bfloat16_supported
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        bf16 = is_bfloat16_supported(),
        fp16 = not is_bfloat16_supported(),
        gradient_accumulation_steps = 1,  # Use gradient accumulation to mimic a larger batch size
        warmup_steps = 5,
        num_train_epochs = 2,             # Number of training epochs
        learning_rate = 2e-4,             # Can be lowered to 2e-5 for a longer, more stable run
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",  # Set to e.g. "wandb" to log to an experiment tracker; "none" disables it
    ),
)
trainer.train()
We set logging every 5 steps in the training arguments, and the logs show the training loss decreasing steadily before leveling off; the sketch below shows one way to plot this curve from the trainer's log history.
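A small plotting sketch (assuming matplotlib is available): the Transformers trainer keeps every logged metric in trainer.state.log_history, so the loss curve can be reconstructed after trainer.train() returns.

# Sketch: recover and plot the logged training loss after trainer.train().
import matplotlib.pyplot as plt

history = [(h["step"], h["loss"]) for h in trainer.state.log_history if "loss" in h]
steps, losses = zip(*history)
plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.show()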
- Verifying the Model
After training, we need to verify that the SFT'd model can follow the output format we specified.
text = tokenizer.apply_chat_template(
    dataset[0]["Messages"][:2],
    tokenize = False,
    add_generation_prompt = True,  # Appends the generation prefix (<start_working_out>) for inference
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0,
    max_new_tokens = 1024,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)
After the steps above, the model's output is shown below. The model indeed follows the thinking format we specified: the generation includes the <start_working_out> and <end_working_out> tags, and Qwen3-4B-Base displays a "reasoning" mode:
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION><|endoftext|>Jenifer has 82 cents in pennies and nickels. Her younger brother mistook all her nickels for dimes and counted the total as $1.47. How many pennies does Jenifer have?<start_working_out>Okay, let's see. Jenifer has 82 cents in pennies and nickels. Her brother thought all the nickels were dimes and counted the total as $1.47. I need to find out how many pennies she has. Hmm, let's break this down.
First, I need to set up some equations. Let's say the number of pennies is P and the number of nickels is N. Since pennies are worth 1 cent each and nickels are 5 cents each, the total value Jenifer has is:
P + 5N = 82 cents.
Now, her brother thought all the nickels were dimes. Dimes are 10 cents each. So, he counted the total as $1.47, which is 147 cents. So, the equation based on his mistaken count would be:
P + 10N = 147 cents.
Now I have two equations:
1) P + 5N = 82
2) P + 10N = 147
I need to solve these two equations to find P. Let me subtract the first equation from the second to eliminate P. So, subtracting equation 1 from equation 2:
(P + 10N) - (P + 5N) = 147 - 82
That simplifies to:
5N = 65
Then, dividing both sides by 5:
N = 13
So, there are 13 nickels. Now, plug N back into the first equation to find P:
P + 5(13) = 82
5*13 is 65, so:
P + 65 = 82
Subtract 65 from both sides:
P = 17
So, Jenifer has 17 pennies. Let me check that. If she has 17 pennies and 13 nickels, that's 17 + 65 = 82 cents, which matches. Her brother thought the nickels were dimes, so 13 dimes would be 130 cents. Adding the 17 pennies gives 147 cents, which is $1.47. That checks out. So, the answer should be 17 pennies.
To solve the problem, we start by defining the variables:
- Let \( P \) be the number of pennies.
- Let \( N \) be the number of nickels.
We know the following:
1. The total value of the pennies and nickels is 82 cents.
2. The brother mistakenly counted the nickels as dimes and the total as $1.47 (147 cents).
We can set up the following system of equations based on the given information:
1. \( P + 5N = 82 \) (total value in cents)
2. \( P + 10N = 147 \) (mistaken total value in cents)
To find the number of pennies \( P \), we subtract the first equation from the second:
\[
(P + 10N) - (P + 5N) = 147 - 82
\]
This simplifies to:
\[
5N = 65
\]
Solving for \( N \):
\[
N = \frac{65}{5} = 13
\]
Now that we know \( N = 13 \), we substitute this value back into the first equation to find \( P \):
\[
P + 5(13) = 82
\]
\[
P + 65 = 82
\]
\[
P = 82 - 65 = 17
\]
Thus, Jenifer has \(\boxed{17}\) pennies.<end_working_out><SOLUTION>17</SOLUTION><|endoftext|>
As you can see, the entire thinking process is wrapped between <start_working_out> and <end_working_out>, and the final answer is wrapped between <SOLUTION> and </SOLUTION>. A simple programmatic check of this format is sketched below.
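Since the GRPO stage in part two will reward outputs that follow this exact tag structure, a plain regex over the generated text is enough to check format compliance and pull out the final answer. A minimal sketch using the tags defined earlier (the helper name check_format is ours, for illustration only):

import re

# Sketch: verify the custom format and extract the answer between the <SOLUTION> tags.
format_regex = re.compile(
    rf"{reasoning_start}(.+?){reasoning_end}.*?{solution_start}(.+?){solution_end}",
    flags = re.DOTALL,
)

def check_format(text):
    match = format_regex.search(text)
    if match is None:
        return None              # the model did not follow the format
    return match.group(2).strip()  # the text between the <SOLUTION> tags

print(check_format("<start_working_out>1 + 1 = 2<end_working_out><SOLUTION>2</SOLUTION>"))  # -> 2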
- Summary
This article (part one) took a hands-on look at building your own reasoning model with the Unsloth framework, covering the full pipeline: choosing a base model, the pre-fine-tuning strategy, a custom reasoning format, training-data construction, and final model verification. We showed how to fine-tune Qwen3-4B-Base into a model with R1-style thinking that emits its reasoning in explicit, structured tags such as <start_working_out> and <SOLUTION>.
With Unsloth's efficient fine-tuning machinery, we reduced resource consumption while gaining controllable generation of complex chains of thought.
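Before moving on to part two, you will likely want to persist the SFT'd LoRA adapter so the GRPO stage can start from it. A minimal sketch using the standard save calls (the directory name is just an example):

# Sketch: save the LoRA adapter and tokenizer produced by this SFT stage,
# so the GRPO training in part two can load them as its starting point.
model.save_pretrained("qwen3-4b-grpo-format-sft")       # adapter weights only
tokenizer.save_pretrained("qwen3-4b-grpo-format-sft")   # keeps the custom chat template with it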