In my testing, SFT fine-tuning the latest Qwen3 models with the latest LLaMA-Factory takes only the few steps below. (LLaMA-Factory shipped in-depth optimization for Qwen3, covering both training and inference logic, right away, before May 1.)
The previous post, "Fine-tuning the latest Qwen3 model with the latest LLaMA-Factory on a single 4090", only tested the 0.6B Qwen3; today we test the 14B model.
The LLaMA-Factory project provides a Dockerfile for building the image: https://github.com/hiyouga/LLaMA-Factory/blob/main/docker/docker-cuda/Dockerfile
You only need to adjust it slightly for your own environment. For example, my CUDA driver is 12.1, so I changed the Dockerfile's base image to:
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
Since QLoRA will be used later, the bitsandbytes package also needs to be installed in the image:
pip install bitsandbytes
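For completeness, here is a minimal sketch of building the image and starting the container; the image tag, mount paths, and port mapping are my own choices, not from the original setup:
# build the image from the (modified) CUDA Dockerfile in the repo root
docker build -f docker/docker-cuda/Dockerfile -t llamafactory:qwen3 .
# start the container with GPU access, mount the ModelScope cache and the dataset
# directory, and expose the LLaMA Board WebUI port (Gradio defaults to 7860)
docker run -it --gpus all \
  -v "$HOME/.cache/modelscope:/root/.cache/modelscope" \
  -v "$PWD/datasets:/datasets" \
  -p 7860:7860 \
  llamafactory:qwen3 \
  llamafactory-cli webui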
In a container started from the built image, the main library versions are:
accelerate 1.6.0
datasets 3.5.0
llamafactory 0.9.3.dev0 /app
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
opencv-python-headless 4.5.5.64
peft 0.15.1
stack-data 0.6.2
torch 2.5.1+cu121
torchaudio 2.5.1+cu121
torchelastic 0.2.2
torchvision 0.20.1+cu121
transformers 4.51.3
trl 0.9.6
types-dataclasses 0.6.6
tzdata 2025.2
uvicorn 0.34.2
bitsandbytes 0.45.5
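To double-check these versions in your own container, a quick pip query is enough:
pip list | grep -Ei "accelerate|datasets|llamafactory|peft|torch|transformers|trl|bitsandbytes"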
This post uses the 14B Qwen3 model: https://modelscope.cn/models/Qwen/Qwen3-14B
The dataset stays the same: https://huggingface.co/datasets/qihoo360/Light-R1-SFTData/stage2-3k.json
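A minimal sketch of downloading both ahead of time, assuming the modelscope and huggingface_hub command-line tools are installed; the local target paths are my own choices:
# download Qwen3-14B into the default ModelScope cache
modelscope download --model Qwen/Qwen3-14B
# download the SFT file from Hugging Face so that its relative path matches
# the file_name entry in dataset_info.json below
huggingface-cli download qihoo360/Light-R1-SFTData stage2-3k.json \
  --repo-type dataset --local-dir ./datasets/qihoo360/Light-R1-SFTData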
Write a dataset_info.json in LLaMA-Factory's format and place it at the root of the /datasets directory mounted into the container:
{
  "Light-R1-SFT-stage2": {
    "file_name": "qihoo360/Light-R1-SFTData/stage2-3k.json",
    "file_sha1": "481cd356262d36b9d16ac49f7fc8ff3d4c9f349c",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "user",
      "assistant_tag": "assistant"
    },
    "ranking": false,
    "field": "auto"
  }
}
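With this mapping, each record is expected to follow the ShareGPT layout. An illustrative (made-up) sample record looks roughly like this; the real stage2-3k.json answers are long chain-of-thought solutions:
{
  "conversations": [
    {"from": "user", "value": "What is 12 * 13?"},
    {"from": "assistant", "value": "12 * 13 = 156."}
  ]
}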
After the container starts, select the following in the WebUI form:
- Model path: /root/.cache/modelscope/hub/qwen/Qwen3-14B
- Data dir: data
- Dataset: Light-R1-SFT-stage2
- Fine-tuning method: lora
- Quantization bit (enables QLoRA): 4
- Quantization method: bnb
- Chat template: qwen3
- Training stage: Supervised Fine-Tuning
For the 14B model we enable QLoRA; otherwise training runs out of GPU memory (OOM).
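A rough back-of-the-envelope check: 14B parameters in bf16 take about 28 GB for the weights alone, already more than the 24 GB of a 4090, while 4-bit quantization shrinks them to roughly 7 GB, leaving headroom for activations, the LoRA adapter, and optimizer state.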
Once the form is set, click the "Preview command" button; the displayed command should look like this:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /root/.cache/modelscope/hub/qwen/Qwen3-14B \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen3 \
--flash_attn auto \
--dataset_dir data \
--dataset Light-R1-SFT-stage2 \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 100 \
--warmup_steps 0 \
--packing False \
--report_to none \
--output_dir saves/Qwen3-14B-Instruct/lora/train_2025-05-16-13-17-41 \
--bf16 True \
--plot_loss True \
--trust_remote_code True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--optim adamw_torch \
--quantization_bit 4 \
--quantization_method bnb \
--double_quantization True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all
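Equivalently, the WebUI can be skipped: LLaMA-Factory also accepts a YAML config file. Below is a sketch of the key options from the previewed command (the file name and output_dir are my own; omitted options fall back to their defaults), saved as qwen3_14b_qlora_sft.yaml:
stage: sft
do_train: true
model_name_or_path: /root/.cache/modelscope/hub/qwen/Qwen3-14B
template: qwen3
finetuning_type: lora
quantization_bit: 4
quantization_method: bnb
lora_rank: 8
lora_alpha: 16
lora_target: all
dataset_dir: data
dataset: Light-R1-SFT-stage2
cutoff_len: 2048
learning_rate: 5.0e-5
num_train_epochs: 3.0
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
logging_steps: 5
save_steps: 100
bf16: true
output_dir: saves/Qwen3-14B/lora/sft-qlora
Then run it inside the container with: llamafactory-cli train qwen3_14b_qlora_sft.yaml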
Training log:
...
[INFO|configuration_utils.py:691] 2025-05-16 13:20:36,257 >> loading configuration file /root/.cache/modelscope/hub/qwen/Qwen3-14B/config.json
[INFO|configuration_utils.py:765] 2025-05-16 13:20:36,258 >> Model config Qwen3Config {
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 40960,
"max_window_layers": 40,
"model_type": "qwen3",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
...
Loading checkpoint shards: 100%|██████████| 8/8 [02:25<00:00, 18.15s/it]
...
[INFO|trainer.py:748] 2025-05-16 13:23:03,174 >> Using auto half precision backend
[INFO|trainer.py:2414] 2025-05-16 13:23:03,496 >> ***** Running training *****
[INFO|trainer.py:2415] 2025-05-16 13:23:03,496 >> Num examples = 3,259
[INFO|trainer.py:2416] 2025-05-16 13:23:03,496 >> Num Epochs = 3
[INFO|trainer.py:2417] 2025-05-16 13:23:03,496 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2420] 2025-05-16 13:23:03,496 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2421] 2025-05-16 13:23:03,496 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2422] 2025-05-16 13:23:03,496 >> Total optimization steps = 609
[INFO|trainer.py:2423] 2025-05-16 13:23:03,500 >> Number of trainable parameters = 32,112,640
[INFO|2025-05-16 13:25:21] llamafactory.train.callbacks:143 >> {'loss': 0.4967, 'learning_rate': 4.9995e-05, 'epoch': 0.02, 'throughput': 1188.24}
{'loss': 0.4967, 'grad_norm': 0.14969857037067413, 'learning_rate': 4.999467794024707e-05, 'epoch': 0.02, 'num_input_tokens_seen': 163840}
...
{'loss': 0.4186, 'grad_norm': 0.08530048280954361, 'learning_rate': 8.315552395404824e-09, 'epoch': 2.98, 'num_input_tokens_seen': 19829952}
{'train_runtime': 16766.8078, 'train_samples_per_second': 0.583, 'train_steps_per_second': 0.036, 'train_loss': 0.40401273466683374, 'epoch': 3.0, 'num_input_tokens_seen': 19961024}
***** train metrics *****
epoch = 2.9963
num_input_tokens_seen = 19961024
total_flos = 1564083299GF
train_loss = 0.404
train_runtime = 4:39:26.80
train_samples_per_second = 0.583
train_steps_per_second = 0.036
Figure saved at: saves/Qwen3-14B-Instruct/lora/train_2025-05-16-13-17-41/training_loss.png
With the same 3,000+ samples, the 0.6B model needed only about 40 minutes, while the 14B model took nearly 5 hours even with QLoRA enabled. The loss curve:
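Once training finishes, the adapter weights sit under the output_dir shown above. A quick way to smoke-test them is an interactive chat with the adapter loaded on top of the base model; the YAML below is a sketch modeled on LLaMA-Factory's inference examples (the file name is my own):
model_name_or_path: /root/.cache/modelscope/hub/qwen/Qwen3-14B
adapter_name_or_path: saves/Qwen3-14B-Instruct/lora/train_2025-05-16-13-17-41
template: qwen3
finetuning_type: lora
Save it as chat_qwen3_14b_lora.yaml and start the chat inside the container with: llamafactory-cli chat chat_qwen3_14b_lora.yaml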
With the same environment and parameters, QLoRA fine-tuning of Qwen3-32B or DeepSeek-R1-Distill-Qwen-32B both failed in my tests with torch.OutOfMemoryError: CUDA out of memory. Getting a 32B model to fine-tune on a card with 24 GB of VRAM is still not easy; next I will bring out unsloth and give it a try. I have previously used unsloth to fine-tune DeepSeek-R1-Distill-Qwen-32B and QwQ-32B on a 24 GB 4090 and put the results into commercial use; see "20 RMB and 5 hours: fine-tuning a commercially usable vertical-domain model on a single 4090 that beats the full-strength version".