In my testing, SFT fine-tuning the latest Qwen3 models with the latest LLaMA-Factory takes only the few steps below. (LLaMA-Factory shipped in-depth optimization for Qwen3, covering both training and inference logic, right away, before May 1.)
The previous post, "Fine-tuning the latest Qwen3 model with the latest LLaMA-Factory on a single 4090", only tested the 0.6B Qwen3; today we test the 14B model.
The LLaMA-Factory project provides a Dockerfile for building the image: https://github.com/hiyouga/LLaMA-Factory/blob/main/docker/docker-cuda/Dockerfile
You only need to adjust it slightly for your own environment. For example, my CUDA driver is 12.1, so I changed the Dockerfile's base image to:
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
Since QLoRA will be used later, the bitsandbytes package also needs to be installed in the image:
pip install bitsandbytes
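For completeness, here is a minimal sketch of building the image and starting the container; the image tag, mount paths, and port mapping are my own choices, not from the original setup:
# build the image from the (modified) CUDA Dockerfile in the repo root
docker build -f docker/docker-cuda/Dockerfile -t llamafactory:qwen3 .
# start the container with GPU access, mount the ModelScope cache and the dataset
# directory, and expose the LLaMA Board WebUI port (Gradio defaults to 7860)
docker run -it --gpus all \
  -v "$HOME/.cache/modelscope:/root/.cache/modelscope" \
  -v "$PWD/datasets:/datasets" \
  -p 7860:7860 \
  llamafactory:qwen3 \
  llamafactory-cli webui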
In a container started from the built image, the main library versions are:
accelerate 1.6.0
datasets 3.5.0
llamafactory 0.9.3.dev0 /app
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
opencv-python-headless 4.5.5.64
peft 0.15.1
stack-data 0.6.2
torch 2.5.1+cu121
torchaudio 2.5.1+cu121
torchelastic 0.2.2
torchvision 0.20.1+cu121
transformers 4.51.3
trl 0.9.6
types-dataclasses 0.6.6
tzdata 2025.2
uvicorn 0.34.2
bitsandbytes 0.45.5
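To double-check these versions in your own container, a quick pip query is enough:
pip list | grep -Ei "accelerate|datasets|llamafactory|peft|torch|transformers|trl|bitsandbytes"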
This post uses the 14B Qwen3 model: https://modelscope.cn/models/Qwen/Qwen3-14B
The dataset stays the same: https://huggingface.co/datasets/qihoo360/Light-R1-SFTData/stage2-3k.json
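A minimal sketch of downloading both ahead of time, assuming the modelscope and huggingface_hub command-line tools are installed; the local target paths are my own choices:
# download Qwen3-14B into the default ModelScope cache
modelscope download --model Qwen/Qwen3-14B
# download the SFT file from Hugging Face so that its relative path matches
# the file_name entry in dataset_info.json below
huggingface-cli download qihoo360/Light-R1-SFTData stage2-3k.json \
  --repo-type dataset --local-dir ./datasets/qihoo360/Light-R1-SFTData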
Write a dataset_info.json in LLaMA-Factory's format and place it at the root of the /datasets directory mounted into the container:
{
  "Light-R1-SFT-stage2": {
    "file_name": "qihoo360/Light-R1-SFTData/stage2-3k.json",
    "file_sha1": "481cd356262d36b9d16ac49f7fc8ff3d4c9f349c",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "user",
      "assistant_tag": "assistant"
    },
    "ranking": false,
    "field": "auto"
  }
}
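With this mapping, each record is expected to follow the ShareGPT layout. An illustrative (made-up) sample record looks roughly like this; the real stage2-3k.json answers are long chain-of-thought solutions:
{
  "conversations": [
    {"from": "user", "value": "What is 12 * 13?"},
    {"from": "assistant", "value": "12 * 13 = 156."}
  ]
}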
After the container starts, select the following in the WebUI form:
- Model path: /root/.cache/modelscope/hub/qwen/Qwen3-14B
- Data dir: data
- Dataset: Light-R1-SFT-stage2
- Fine-tuning method: lora
- Quantization bit (enables QLoRA): 4
- Quantization method: bnb
- Chat template: qwen3
- Training stage: Supervised Fine-Tuning
For the 14B model we enable QLoRA; otherwise training runs out of GPU memory (OOM).
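A rough back-of-the-envelope check: 14B parameters in bf16 take about 28 GB for the weights alone, already more than the 24 GB of a 4090, while 4-bit quantization shrinks them to roughly 7 GB, leaving headroom for activations, the LoRA adapter, and optimizer state.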
Once the form is set, click the "Preview command" button; the displayed command should look like this:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /root/.cache/modelscope/hub/qwen/Qwen3-14B \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen3 \
--flash_attn auto \
--dataset_dir data \
--dataset Light-R1-SFT-stage2 \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--output_dir saves/Qwen3-14B-Instruct/lora/train_2025-05-16-13-17-41 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--quantization_bit 4 \
--quantization_method bnb \
--double_quantization True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all
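Equivalently, the WebUI can be skipped: LLaMA-Factory also accepts a YAML config file. Below is a sketch of the key options from the previewed command (the file name and output_dir are my own; omitted options fall back to their defaults), saved as qwen3_14b_qlora_sft.yaml:
stage: sft
do_train: true
model_name_or_path: /root/.cache/modelscope/hub/qwen/Qwen3-14B
template: qwen3
finetuning_type: lora
quantization_bit: 4
quantization_method: bnb
lora_rank: 8
lora_alpha: 16
lora_target: all
dataset_dir: data
dataset: Light-R1-SFT-stage2
cutoff_len: 2048
learning_rate: 5.0e-5
num_train_epochs: 3.0
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
logging_steps: 5
save_steps: 100
bf16: true
output_dir: saves/Qwen3-14B/lora/sft-qlora
Then run it inside the container with: llamafactory-cli train qwen3_14b_qlora_sft.yaml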
Training log:
...
[INFO|configuration_utils.py:691] 2025-05-16 13:20:36,257 >> loading configuration file /root/.cache/modelscope/hub/qwen/Qwen3-14B/config.json
[INFO|configuration_utils.py:765] 2025-05-16 13:20:36,258 >> Model config Qwen3Config {
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 40960,
"max_window_layers": 40,
"model_type": "qwen3",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
...
Loading checkpoint shards: 100%|██████████| 8/8 [02:25<00:00, 18.15s/it]
...
[INFO|trainer.py:748] 2025-05-16 13:23:03,174 >> Using auto half precision backend
[INFO|trainer.py:2414] 2025-05-16 13:23:03,496 >> ***** Running training *****
[INFO|trainer.py:2415] 2025-05-16 13:23:03,496 >> Num examples = 3,259
[INFO|trainer.py:2416] 2025-05-16 13:23:03,496 >> Num Epochs = 3
[INFO|trainer.py:2417] 2025-05-16 13:23:03,496 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2420] 2025-05-16 13:23:03,496 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2421] 2025-05-16 13:23:03,496 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2422] 2025-05-16 13:23:03,496 >> Total optimization steps = 609
[INFO|trainer.py:2423] 2025-05-16 13:23:03,500 >> Number of trainable parameters = 32,112,640
[INFO|2025-05-16 13:25:21] llamafactory.train.callbacks:143 >> {'loss': 0.4967, 'learning_rate': 4.9995e-05, 'epoch': 0.02, 'throughput': 1188.24}
{'loss': 0.4967, 'grad_norm': 0.14969857037067413, 'learning_rate': 4.999467794024707e-05, 'epoch': 0.02, 'num_input_tokens_seen': 163840}
...
{'loss': 0.4186, 'grad_norm': 0.08530048280954361, 'learning_rate': 8.315552395404824e-09, 'epoch': 2.98, 'num_input_tokens_seen': 19829952}
{'train_runtime': 16766.8078, 'train_samples_per_second': 0.583, 'train_steps_per_second': 0.036, 'train_loss': 0.40401273466683374, 'epoch': 3.0, 'num_input_tokens_seen': 19961024}
***** train metrics *****
epoch = 2.9963
num_input_tokens_seen = 19961024
total_flos = 1564083299GF
train_loss = 0.404
train_runtime = 4:39:26.80
train_samples_per_second = 0.583
train_steps_per_second = 0.036
Figure saved at: saves/Qwen3-14B-Instruct/lora/train_2025-05-16-13-17-41/training_loss.png
With the same 3,000+ samples, the 0.6B model needed only about 40 minutes, while the 14B model took nearly 5 hours even with QLoRA enabled. The loss curve:
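Once training finishes, the adapter weights sit under the output_dir shown above. A quick way to smoke-test them is an interactive chat with the adapter loaded on top of the base model; the YAML below is a sketch modeled on LLaMA-Factory's inference examples (the file name is my own):
model_name_or_path: /root/.cache/modelscope/hub/qwen/Qwen3-14B
adapter_name_or_path: saves/Qwen3-14B-Instruct/lora/train_2025-05-16-13-17-41
template: qwen3
finetuning_type: lora
Save it as chat_qwen3_14b_lora.yaml and start the chat inside the container with: llamafactory-cli chat chat_qwen3_14b_lora.yaml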
With the same environment and parameters, QLoRA fine-tuning of Qwen3-32B or DeepSeek-R1-Distill-Qwen-32B both failed in my tests with torch.OutOfMemoryError: CUDA out of memory. Getting a 32B model to fine-tune on a card with 24 GB of VRAM is still not easy; next I will bring out unsloth and give it a try. I have previously used unsloth to fine-tune DeepSeek-R1-Distill-Qwen-32B and QwQ-32B on a 24 GB 4090 and put the results into commercial use; see "20 RMB and 5 hours: fine-tuning a commercially usable vertical-domain model on a single 4090 that beats the full-strength version".