Some time ago I reproduced 360's Light-R1 project on an 8-GPU H20 machine. Below is a brief record of the run:
(For an introduction to the Light-R1 project, see the earlier article: DeepSeek-R1复现之集大成者.)
- 360-LLaMA-Factory image
- Light-R1 GitHub source code
- Base model (Qwen2.5-32B-Instruct, per the training command below)
- Datasets
- https://huggingface.co/datasets/qihoo360/Light-R1-SFTData
- https://huggingface.co/datasets/qihoo360/Light-R1-DPOData
Datasets
stage1-76k data:
root@H20:/data/ai/datasets/qihoo360/Light-R1-SFTData/stage1
# head -16 stage1-76k.json
[
{
"conversations": [
{
"from": "user",
"value": "130 trees are planted in a circle: birches and lindens (both types are present). Each tree has a sign that reads: \"Two different trees are growing nearby.\" It is known that the sign is false on all lindens and exactly on one birch. How many birches could have been planted? Specify all possible options."
},
{
"from": "assistant",
"value": "<think>\nOkay, let's see. ..."
}
]
},
{
"conversations": [
{
"from": "user",
"value": "3-5. The distance from \\(A\\) to \\(B\\) is 999 km. Along the road, there are kilometer markers indicating the distances to \\(A\\) and to \\(B: 0\\) ।999, 1 ।998, \\(\\ldots, 999\\) ।0. How many of these markers have only two different digits?"
},
{
"from": "assistant",
"value": "<think>\nOkay, so I need to。。。"
}
]
},
Based on the dataset contents, the dataset_info.json required by LLaMA-Factory is generated as follows:
root@e2d84af62b56:/app
# cat /datasets/qihoo360/Light-R1-SFTData/stage1/dataset_info.json
{
  "Light-R1-SFT-stage1": {
    "file_name": "stage1-76k.json",
    "file_sha1": "c1f12ddad36373958b8a8c7055671edb03ef12df",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
The value of file_sha1 comes from the output of this command:
sha1sum stage1-76k.json | awk '{print $1}'
stage2-3k data:
# head -30 stage2-3k.json
[
{
"source": "aime-amc",
"stdans": "x(y-1)=0",
"conversations": [
{
"from": "user",
"value": "If $|x-\\log y|=x+\\log y$ where $x$ and $\\log y$ are real, then"
},
{
"from": "assistant",
"value": "<think>\nOkay, so I...]"
}
]
},
{
"source": "",
"conversations": [
{
"from": "user",
"value": "There are 100 boxers, each of them having different strengths, who participate in a tournament. Any of them fights each other only once. Several boxers form a plot. In one of their matches, they hide in their glove a horse shoe. If in a fight, only one of the boxers has a horse shoe hidden, he wins the fight; otherwise, the stronger boxer wins. It is known that there are three boxers who obtained (strictly) more wins than the strongest three boxers. What is the minimum number of plotters ?\n\n[i]Proposed by N. Kalinin[/i]"
},
{
"from": "assistant",
"value": "<think>\nOkay, let's try to...)."
}
]
},
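Only the stage-1 dataset_info.json is shown above. The stage-2 run later in this post (--dataset Light-R1-SFT-stage2) needs a matching entry in /datasets/qihoo360/Light-R1-SFTData/stage2/dataset_info.json. A minimal sketch, assuming the data file is named stage2-3k.json and filling file_sha1 from sha1sum as before; the extra source/stdans fields are not mapped and should simply be ignored by the ShareGPT converter, though that was not verified here:

{
  "Light-R1-SFT-stage2": {
    "file_name": "stage2-3k.json",
    "file_sha1": "<output of: sha1sum stage2-3k.json | awk '{print $1}'>",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "from",
      "content_tag": "value",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}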
DPO data:
[
{
"conversations": [
{
"from": "human",
"value": "Given that line $l$ is both the tangent line of curve C$_{1}$: $y=e^{x}$ and the tangent line of curve C$_{2}$: $y=\\frac{1}{4}e^{2x^{2}}$, determine the x-intercept of line $l$ ( ).\nA: 2\nB: 1\nC: $e^{2}$\nD: $-e^{2}$."
}
],
"chosen": {
"from": "gpt",
"value": "<think>\nAlright, let me try to。。。)."
},
"rejected": {
"from": "gpt",
"value": "\nOkay, so I need to find the。。。]"
}
},
{
"conversations": [
{
"from": "human",
"value": "Given that line $l$ is both the tangent line of curve C$_{1}$: $y=e^{x}$ and the tangent line of curve C$_{2}$: $y=\\frac{1}{4}e^{2x^{2}}$, determine the x-intercept of line $l$ ( ).\nA: 2\nB: 1\nC: $e^{2}$\nD: $-e^{2}$."
}
],
"chosen": {
"from": "gpt",
"value": "<think>\nAlright, let me try to。。。)."
},
"rejected": {
"from": "gpt",
"value": "\nOkay, so I need to。。。\]"
}
},
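DPO training is not run in this post, but for reference the DPO file could be registered using LLaMA-Factory's pairwise (ranking) ShareGPT format. This is an untested sketch; the actual file name is not shown above, so it is left as a placeholder, and since the DPO samples already use the default human/gpt role tags, no tags override should be needed:

{
  "Light-R1-DPO": {
    "file_name": "<the DPO json file>",
    "formatting": "sharegpt",
    "ranking": true,
    "columns": {
      "messages": "conversations",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}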
Training image
root@1e7cd7c067d4:/workspace/Light-R1/train-scripts
# pip list | grep -E 'cuda|torch|trans|tensor|dataset|peft|trl|flash|deep|numpy|scipy'
datasets 3.1.0
deepspeed 0.14.4
flash_attn 2.7.4.post1
numpy 1.26.4
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
peft 0.12.0
ring-flash-attn 0.1.3
safetensors 0.5.3
scipy 1.15.2
tensorboard 2.19.0
tensorboard-data-server 0.7.2
torch 2.5.1+cu121
torchaudio 2.5.1+cu121
torchelastic 0.2.2
torchvision 0.20.1+cu121
transformers 4.45.2
trl 0.9.6
Training container
Container launch command:
docker run --name llama-factory-360 -itd \
--gpus all --shm-size=600gb -p 7860:7860 \
-v /data/ai/models:/models \
-v /data/ai/datasets:/datasets \
-v /data/ai/workspace/llama-factory:/workspace \
llama-factory-360:250311_cu121 bash
After it starts, enter the container:
# docker exec -it llama-factory-360 bash
Training command
Modify Light-R1/train-scripts/train-sft-stage1.sh as follows. (Note that --num_train_epochs appears twice in the script, 100 and then 3 near the end; the later value takes effect, so this run trains for 3 epochs, which the training log confirms.)
root@H20:/data/ai/workspace/llama-factory/Light-R1/train-scripts# cat train-sft-stage1.sh
# Light-R1 SFT used a slightly different internal version codebase. This script is the closest counterpart in 360-LLaMA-Factory
# Light-R1 DPO used 360-LLaMA-Factory directly
hostfile="hostfile.12nodes"
#deepspeed --hostfile $hostfile /app/src/train.py \
deepspeed /app/src/train.py \
--stage sft \
--do_train \
--max_steps -1 \
--model_name_or_path /models/qwen/Qwen2.5-32B-Instruct \
--template qwen \
--dataset Light-R1-SFT-stage1 \
--dataset_dir /datasets/qihoo360/Light-R1-SFTData/stage1 \
--preprocessing_num_workers 16 \
--finetuning_type full \
--sequence_parallel_size 8 \
--gradient_checkpointing True \
--flash_attn fa2 \
--cache_dir .cache \
--overwrite_cache \
--cutoff_len 20000 \
--output_dir /workspace/output \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--save_strategy epoch \
--logging_steps 1 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--weight_decay 0.1 \
--warmup_ratio 0.01 \
--save_total_limit 10 \
--learning_rate 5e-5 \
--save_only_model True \
--num_train_epochs 100 \
--bf16 true \
--plot_loss \
--seed 42 \
--do_eval false \
--deepspeed /app/examples/deepspeed/ds_z3_config.json \
--remove_unused_columns False \
--report_to tensorboard \
--overwrite_output_dir \
--ddp_timeout 180000000 \
--packing True \
--num_train_epochs 3 \
--enable_liger_kernel
The contents of /app/examples/deepspeed/ds_z3_config.json:
# cat /app/examples/deepspeed/ds_z3_config.json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
Main processes once training is running:
root@e2d84af62b56:/workspace# ps axu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5879 0.0 0.0 2884 960 pts/1 S Mar19 0:00 /bin/sh ./train-sft-stage1.sh
root 5880 0.0 0.0 48859008 687248 pts/1 Sl Mar19 0:10 /opt/conda/bin/python3.11 /opt/conda/bin/deepspeed /app/src/train.py --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qw
root 5964 0.0 0.0 48858860 689584 pts/1 Sl Mar19 0:12 /opt/conda/bin/python3.11 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0
root 6049 99.9 0.1 168629012 3360288 pts/1 Sl Mar19 885:34 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=0 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6050 100 0.1 170005752 3312924 pts/1 Sl Mar19 887:21 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=1 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6051 100 0.1 170136788 3414108 pts/1 Sl Mar19 887:22 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=2 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6052 100 0.1 170005612 3316392 pts/1 Sl Mar19 887:20 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=3 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6053 100 0.1 170136856 3361536 pts/1 Sl Mar19 887:27 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=4 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6054 100 0.1 169940224 3363380 pts/1 Sl Mar19 887:20 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=5 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6055 100 0.1 169874652 3348140 pts/1 Sl Mar19 887:20 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=6 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
root 6056 100 0.1 169743528 3333312 pts/1 Sl Mar19 887:20 /opt/conda/bin/python3.11 -u /app/src/train.py --local_rank=7 --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-3
Full parameters:
root 5880 0.0 0.0 48859008 687248 pts/1 Sl Mar19 0:10 /opt/conda/bin/python3.11 /opt/conda/bin/deepspeed /app/src/train.py --stage sft --do_train --max_steps -1 --model_name_or_path /models/qwen/Qwen2.5-32B-Instruct --template qwen --dataset Light-R1-SFT-stage1 --dataset_dir /datasets/qihoo360/Light-R1-SFTData/stage1 --preprocessing_num_workers 32 --finetuning_type full --sequence_parallel_size 8 --gradient_checkpointing True --flash_attn fa2 --cache_dir /dev/shm/cache --overwrite_cache --cutoff_len 20000 --output_dir /workspace/output --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --save_strategy epoch --logging_steps 1 --adam_beta1 0.9 --adam_beta2 0.95 --adam_epsilon 1e-8 --max_grad_norm 1.0 --weight_decay 0.1 --warmup_ratio 0.01 --save_total_limit 10 --learning_rate 5e-5 --save_only_model True --num_train_epochs 3 --bf16 true --plot_loss --seed 42 --do_eval false --deepspeed /app/examples/deepspeed/ds_z3_config.json --report_to tensorboard --overwrite_output_dir --ddp_timeout 180000000 --packing True --enable_liger_kernel
...
loss:
[INFO|trainer.py:2243] 2025-03-19 11:30:24,555 >> ***** Running training *****
[INFO|trainer.py:2244] 2025-03-19 11:30:24,555 >> Num examples = 197,064
[INFO|trainer.py:2245] 2025-03-19 11:30:24,555 >> Num Epochs = 3
[INFO|trainer.py:2246] 2025-03-19 11:30:24,555 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2249] 2025-03-19 11:30:24,555 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2250] 2025-03-19 11:30:24,555 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2251] 2025-03-19 11:30:24,555 >> Total optimization steps = 18,474
[INFO|trainer.py:2252] 2025-03-19 11:30:24,557 >> Number of trainable parameters = 32,763,876,352
...
{'loss': 1.3224, 'grad_norm': 0.9693541365411733, 'learning_rate': 4.999907792302487e-05, 'epoch': 0.04}
1%|▏ | 235/18474 [1:28:13<113:33:28, 22.41s/it]
At this pace it would take nearly 5 days to finish.
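A quick sanity check on that estimate, assuming the ~22.4 s/it rate holds for the whole run: 18,474 steps × 22.41 s/step ≈ 414,000 s ≈ 115 hours ≈ 4.8 days.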
Startup log
Startup log:
root@1e7cd7c067d4:/workspace/Light-R1/train-scripts# nohup ./train-sft-stage1.sh > train-sft-stage1.log 2>&1 &
[1] 8950
root@1e7cd7c067d4:/workspace/Light-R1/train-scripts# tail -100f train-sft-stage1.log
nohup: ignoring input
[2025-03-19 06:20:31,408] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
...
[2025-03-19 06:34:32,898] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 771, num_elems = 32.76B
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00, 1.05s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00, 1.05s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00, 1.06s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00, 1.06s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:18<00:00, 1.06s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00, 1.06s/it]
Loading checkpoint shards: 100%|██████████| 17/17 [00:17<00:00, 1.06s/it]
...
[INFO|trainer.py:2243] 2025-03-19 11:30:24,555 >> ***** Running training *****
[INFO|trainer.py:2244] 2025-03-19 11:30:24,555 >> Num examples = 197,064
[INFO|trainer.py:2245] 2025-03-19 11:30:24,555 >> Num Epochs = 3
[INFO|trainer.py:2246] 2025-03-19 11:30:24,555 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2249] 2025-03-19 11:30:24,555 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2250] 2025-03-19 11:30:24,555 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2251] 2025-03-19 11:30:24,555 >> Total optimization steps = 18,474
[INFO|trainer.py:2252] 2025-03-19 11:30:24,557 >> Number of trainable parameters = 32,763,876,352
0%| | 0/18474 [00:00<?, ?it/s]Outputs keys: odict_keys(['loss'])
Resource snapshot:
Wed Mar 19 14:26:28 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H20 On | 00000000:65:02.0 Off | 0 |
| N/A 31C P0 114W / 500W | 3135MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
...
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:6B:03.0 Off | 0 |
| N/A 32C P0 118W / 500W | 2943MiB / 97871MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
Wed Mar 19 14:33:20 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H20 On | 00000000:65:02.0 Off | 0 |
| N/A 31C P0 116W / 500W | 7591MiB / 97871MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
...
| 7 NVIDIA H20 On | 00000000:6B:03.0 Off | 0 |
| N/A 32C P0 117W / 500W | 2943MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
While the model and data are loading, GPU 0 and the other 7 GPUs take turns running at full utilization, as the two nvidia-smi snapshots above show. Training log (note the progress bar below shows 9,243 total steps rather than the 18,474 printed earlier: the full parameter dump above uses per_device_train_batch_size 4 instead of the script's 2, which roughly halves the number of optimization steps):
{'loss': 1.0637, 'grad_norm': 0.4386183881484553, 'learning_rate': 4.8895456262071824e-05, 'epoch': 0.31}
{'loss': 1.0155, 'grad_norm': 0.45154923810655184, 'learning_rate': 4.889293163732299e-05, 'epoch': 0.31}
{'loss': 0.9491, 'grad_norm': 0.4136980526729538, 'learning_rate': 4.889040419596236e-05, 'epoch': 0.31}
{'loss': 0.9714, 'grad_norm': 0.40281781322617083, 'learning_rate': 4.8887873938287875e-05, 'epoch': 0.31}
{'loss': 0.91, 'grad_norm': 0.4171647622140142, 'learning_rate': 4.888534086459783e-05, 'epoch': 0.31}
{'loss': 0.9549, 'grad_norm': 0.42573545631189347, 'learning_rate': 4.888280497519082e-05, 'epoch': 0.31}
{'loss': 0.9208, 'grad_norm': 0.39488083756725223, 'learning_rate': 4.8880266270365795e-05, 'epoch': 0.31}
{'loss': 0.9535, 'grad_norm': 0.3966213881760763, 'learning_rate': 4.887772475042203e-05, 'epoch': 0.31}
{'loss': 0.9771, 'grad_norm': 0.38810767320142764, 'learning_rate': 4.887518041565913e-05, 'epoch': 0.31}
{'loss': 0.9425, 'grad_norm': 0.39328975921555004, 'learning_rate': 4.887263326637704e-05, 'epoch': 0.32}
{'loss': 1.0212, 'grad_norm': 0.41029549783792807, 'learning_rate': 4.887008330287602e-05, 'epoch': 0.32}
{'loss': 0.9748, 'grad_norm': 0.41659411982655514, 'learning_rate': 4.886753052545667e-05, 'epoch': 0.32}
{'loss': 0.9247, 'grad_norm': 0.3666894281257893, 'learning_rate': 4.8864974934419937e-05, 'epoch': 0.32}
{'loss': 1.1383, 'grad_norm': 0.5280706054656945, 'learning_rate': 4.886241653006708e-05, 'epoch': 0.32}
{'loss': 1.0298, 'grad_norm': 0.4266698337452691, 'learning_rate': 4.885985531269969e-05, 'epoch': 0.32}
{'loss': 0.967, 'grad_norm': 0.4851980198998547, 'learning_rate': 4.88572912826197e-05, 'epoch': 0.32}
{'loss': 0.8977, 'grad_norm': 0.38936129097253375, 'learning_rate': 4.8854724440129376e-05, 'epoch': 0.32}
{'loss': 0.9614, 'grad_norm': 0.37238612880755884, 'learning_rate': 4.885215478553129e-05, 'epoch': 0.32}
{'loss': 0.9459, 'grad_norm': 0.3700623729742305, 'learning_rate': 4.884958231912838e-05, 'epoch': 0.32}
{'loss': 0.9013, 'grad_norm': 0.3572491524540146, 'learning_rate': 4.8847007041223915e-05, 'epoch': 0.32}
{'loss': 1.0043, 'grad_norm': 0.37856370889720986, 'learning_rate': 4.8844428952121444e-05, 'epoch': 0.32}
{'loss': 0.9846, 'grad_norm': 0.4175190481912249, 'learning_rate': 4.884184805212492e-05, 'epoch': 0.32}
{'loss': 0.9447, 'grad_norm': 0.40868635898854067, 'learning_rate': 4.883926434153857e-05, 'epoch': 0.32}
{'loss': 0.9582, 'grad_norm': 0.4061283897093283, 'learning_rate': 4.883667782066698e-05, 'epoch': 0.32}
{'loss': 0.9616, 'grad_norm': 0.39678823029270827, 'learning_rate': 4.8834088489815065e-05, 'epoch': 0.32}
{'loss': 1.1108, 'grad_norm': 0.4483004634664135, 'learning_rate': 4.883149634928807e-05, 'epoch': 0.32}
{'loss': 0.9341, 'grad_norm': 0.38085532316667214, 'learning_rate': 4.8828901399391544e-05, 'epoch': 0.32}
{'loss': 0.9254, 'grad_norm': 0.3987444754437665, 'learning_rate': 4.8826303640431425e-05, 'epoch': 0.32}
{'loss': 0.9458, 'grad_norm': 0.37851719520079863, 'learning_rate': 4.8823703072713936e-05, 'epoch': 0.32}
{'loss': 0.9498, 'grad_norm': 0.41974496606635453, 'learning_rate': 4.882109969654564e-05, 'epoch': 0.32}
{'loss': 1.004, 'grad_norm': 0.4047618224522193, 'learning_rate': 4.8818493512233445e-05, 'epoch': 0.32}
{'loss': 0.9833, 'grad_norm': 0.44724685020706495, 'learning_rate': 4.881588452008456e-05, 'epoch': 0.32}
11%|█ | 993/9243 [12:05:19<100:07:29, 43.69s/it]
Visualization dashboard
Because the launch command included --report_to tensorboard, we can use TensorBoard to visualize the training run. In the parent directory of output, run:
# tensorboard --logdir=./output --port=7860 --host=0.0.0.0
...
TensorBoard 2.19.0 at http://0.0.0.0:7860/ (Press CTRL+C to quit)
Open http://the-host-ip:7860 in a browser to view the TensorBoard training dashboard.
Training results
The final log lines:
{'loss': 0.3636, 'grad_norm': 0.3910459459567325, 'learning_rate': 1.4735591247205805e-12, 'epoch': 3.0}
{'loss': 0.2997, 'grad_norm': 0.35167990426012613, 'learning_rate': 0.0, 'epoch': 3.0}
100%|██████████| 9243/9243 [112:27:03<00:00, 44.03s/it][INFO|trainer.py:3705] 2025-03-24 06:59:52,793 >> Saving model checkpoint to /workspace/output/checkpoint-9243
[INFO|configuration_utils.py:410] 2025-03-24 06:59:52,796 >> Configuration saved in /workspace/output/checkpoint-9243/config.json
[INFO|configuration_utils.py:868] 2025-03-24 06:59:52,796 >> Configuration saved in /workspace/output/checkpoint-9243/generation_config.json
...
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 405499.6608, 'train_samples_per_second': 1.459, 'train_steps_per_second': 0.023, 'train_loss': 0.6830891583448981, 'epoch': 3.0}
100%|██████████| 9243/9243 [112:38:19<00:00, 43.87s/it]
[INFO|trainer.py:3705] 2025-03-24 07:11:08,947 >> Saving model checkpoint to /workspace/output
[INFO|configuration_utils.py:410] 2025-03-24 07:11:08,950 >> Configuration saved in /workspace/output/config.json
[INFO|configuration_utils.py:868] 2025-03-24 07:11:08,951 >> Configuration saved in /workspace/output/generation_config.json
[2025-03-24 07:11:11,024] [INFO] [launch.py:351:main] Process 6051 exits successfully.
[2025-03-24 07:11:12,024] [INFO] [launch.py:351:main] Process 6056 exits successfully.
[2025-03-24 07:11:12,025] [INFO] [launch.py:351:main] Process 6054 exits successfully.
[2025-03-24 07:11:12,025] [INFO] [launch.py:351:main] Process 6053 exits successfully.
[2025-03-24 07:11:12,025] [INFO] [launch.py:351:main] Process 6055 exits successfully.
[2025-03-24 07:11:12,025] [INFO] [launch.py:351:main] Process 6050 exits successfully.
[2025-03-24 07:11:12,025] [INFO] [launch.py:351:main] Process 6052 exits successfully.
[INFO|modeling_utils.py:2844] 2025-03-24 07:12:16,141 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 14 checkpoint shards. You can find where each parameters has been saved in the index located at /workspace/output/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2641] 2025-03-24 07:12:16,144 >> tokenizer config file saved in /workspace/output/tokenizer_config.json
[INFO|tokenization_utils_base.py:2650] 2025-03-24 07:12:16,145 >> Special tokens file saved in /workspace/output/special_tokens_map.json
***** train metrics *****
epoch = 3.0
total_flos = 3231408045GF
train_loss = 0.6831
train_runtime = 4 days, 16:38:19.66
train_samples_per_second = 1.459
train_steps_per_second = 0.023
Figure saved at: /workspace/output/training_loss.png
[WARNING|2025-03-24 07:12:16] llamafactory.extras.ploting:162 >> No metric eval_loss to plot.
[WARNING|2025-03-24 07:12:16] llamafactory.extras.ploting:162 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-03-24 07:12:16,969 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
[2025-03-24 07:12:19,045] [INFO] [launch.py:351:main] Process 6049 exits successfully.
Training command
stage2:
deepspeed /app/src/train.py \
--stage sft \
--do_train \
--max_steps -1 \
--model_name_or_path /workspace/output \
--template qwen \
--dataset Light-R1-SFT-stage2 \
--dataset_dir /datasets/qihoo360/Light-R1-SFTData/stage2 \
--preprocessing_num_workers 32 \
--finetuning_type full \
--sequence_parallel_size 8 \
--gradient_checkpointing True \
--flash_attn fa2 \
--cache_dir /dev/shm/cache \
--overwrite_cache \
--cutoff_len 20000 \
--output_dir /outputs \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--save_strategy epoch \
--logging_steps 1 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--weight_decay 0.1 \
--warmup_ratio 0.01 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--save_only_model True \
--num_train_epochs 3 \
--bf16 true \
--plot_loss \
--seed 42 \
--do_eval false \
--deepspeed /app/examples/deepspeed/ds_z3_config.json \
--report_to tensorboard \
--overwrite_output_dir \
--ddp_timeout 180000000 \
--enable_liger_kernel
Training log
Training log:
{'loss': 0.1366, 'grad_norm': 0.5624666486370892, 'learning_rate': 1.2321255536962285e-08, 'epoch': 2.93}
{'loss': 0.3242, 'grad_norm': 0.5439143403995895, 'learning_rate': 1.1425812857557838e-08, 'epoch': 2.93}
{'loss': 0.3826, 'grad_norm': 0.6079594180502511, 'learning_rate': 1.0564109944228851e-08, 'epoch': 2.93}
...
{'loss': 0.1528, 'grad_norm': 0.5872353917953322, 'learning_rate': 6.763397267628424e-11, 'epoch': 2.99}
{'loss': 0.2237, 'grad_norm': 0.5587437086655467, 'learning_rate': 1.69085217588405e-11, 'epoch': 2.99}
{'loss': 0.3469, 'grad_norm': 0.6298207631005864, 'learning_rate': 0.0, 'epoch': 3.0}
100%|██████████| 1221/1221 [14:39:32<00:00, 43.15s/it][INFO|trainer.py:3705] 2025-03-24 23:49:56,702 >> Saving model checkpoint to /outputs/checkpoint-1221
...
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 52933.0715, 'train_samples_per_second': 1.478, 'train_steps_per_second': 0.023, 'train_loss': 0.35941831754744785, 'epoch': 3.0}
100%|██████████| 1221/1221 [14:42:13<00:00, 43.35s/it]
...
***** train metrics *****
epoch = 2.9963
total_flos = 426868462GF
train_loss = 0.3594
train_runtime = 14:42:13.07
train_samples_per_second = 1.478
train_steps_per_second = 0.023
You can see that in the final epoch the loss bounces around between 0.1 and 0.3. I decided to run one more epoch, changing the following 3 items relative to the previous command:
deepspeed /app/src/train.py \
...
--model_name_or_path /outputs/output-stage2-epoch3 \
--output_dir /outputs/output-stage2-epoch4 \
--learning_rate 5e-6 \
That is: point the model at the previous round's output, lower the learning rate from 1e-5 to 5e-6, and change output_dir to keep the runs separate; all other parameters stay the same.
Training log of this final round:
{'loss': 0.0944, 'grad_norm': 0.4513410927913028, 'learning_rate': 4.989014955054746e-06, 'epoch': 0.04}
{'loss': 0.0985, 'grad_norm': 0.48359447678842327, 'learning_rate': 4.9871094696878e-06, 'epoch': 0.04}
{'loss': 0.0532, 'grad_norm': 0.3290373941867023, 'learning_rate': 4.9850520904219406e-06, 'epoch': 0.05}
{'loss': 0.1828, 'grad_norm': 0.3683830455867503, 'learning_rate': 4.982842942906386e-06, 'epoch': 0.05}
...
{'loss': 0.1539, 'grad_norm': 0.7096213739948747, 'learning_rate': 6.870372254602631e-10, 'epoch': 0.99}
{'loss': 0.2249, 'grad_norm': 0.6267022563977058, 'learning_rate': 3.0535764836747696e-10, 'epoch': 0.99}
{'loss': 0.3483, 'grad_norm': 0.6598710005084351, 'learning_rate': 7.63405776685322e-11, 'epoch': 1.0}
{'loss': 0.2063, 'grad_norm': 0.6315841265679273, 'learning_rate': 0.0, 'epoch': 1.0}
100%|██████████| 407/407 [4:52:23<00:00, 43.32s/it][INFO|trainer.py:3705] 2025-03-25 06:35:08,416 >> Saving model checkpoint to /outputs/output-stage2-epoch4/checkpoint-407
...
{'train_runtime': 17704.263, 'train_samples_per_second': 1.473, 'train_steps_per_second': 0.023, 'train_loss': 0.14680606485584738, 'epoch': 1.0}
100%|██████████| 407/407 [4:55:04<00:00, 43.50s/it]
...
***** train metrics *****
epoch = 0.9988
total_flos = 142289153GF
train_loss = 0.1468
train_runtime = 4:55:04.26
train_samples_per_second = 1.473
train_steps_per_second = 0.023
Figure saved at: /outputs/output-stage2-epoch4/training_loss.png
It looks like overfitting shows up here as well.
After discussing it with others, the likely cause is the epoch setting. The official team set the epoch count to 100 for both stages, but stopped stage 1 after its 4th epoch and stage 2 after its 8th. This is essentially early stopping: the 100 is there only to stretch the cosine learning-rate schedule, so that the learning rate is still high when training is cut off early.
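A rough illustration of why the scheduled epoch count matters (warmup ignored; the exact scheduler implementation may differ slightly). The cosine schedule is

lr(t) = lr_max · 0.5 · (1 + cos(π · t / T)), where T is the scheduled total length.

With T = 100 epochs, stopping at the end of epoch 4 gives lr ≈ 0.5 · (1 + cos(0.04π)) · lr_max ≈ 0.996 · lr_max, i.e. the learning rate has barely decayed. With T = 3 epochs, as in this run, the schedule decays all the way to 0 by epoch 3, which matches the learning_rate values in the logs above.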
The results fell short of expectations, and due to time constraints I did not debug further.
Conclusion
The H20 is better suited to inference than to full-parameter fine-tuning. Across the two SFT stages the bottleneck was compute: the math is slow and the wall-clock time long, so the total cost ended up roughly double the training cost 360 officially reported on H100s.