[LLM Training Series 01] How to Load, Train, and Merge Large Models with QLoRA


Example 1: Training Qwen with QLoRA

Reference script: https://github.com/QwenLM/Qwen/blob/main/recipes/finetune/deepspeed/finetune_qlora_multi_gpu.ipynb

The training command is:

```shell
!torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6601 ../../finetune.py \
    --model_name_or_path "Qwen/Qwen-1_8B-Chat-Int4/" \
    --data_path "Belle_sampled_qwen.json" \
    --bf16 True \
    --output_dir "output_qwen" \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed "../../finetune/ds_config_zero2.json" \
    --use_lora \
    --q_lora
```
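To make the batch-related flags concrete, the sketch below computes the effective global batch size implied by `--nproc_per_node 2`, `--per_device_train_batch_size 1`, and `--gradient_accumulation_steps 16`. The helper name `effective_batch_size` is hypothetical, and the calculation assumes plain data parallelism in which every process consumes different samples.

```python
def effective_batch_size(num_processes: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return num_processes * per_device_batch * grad_accum_steps


# Values taken from the torchrun command above.
global_batch = effective_batch_size(num_processes=2, per_device_batch=1, grad_accum_steps=16)
print(global_batch)  # 32
```

With a per-device batch of 1, gradient accumulation is what brings the optimizer step up to 32 samples without increasing peak activation memory.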

Choosing the base model

The command above uses `Qwen/Qwen-1_8B-Chat-Int4/`, a checkpoint that is already quantized to 4 bits.

Loading the model

```python
import transformers
from transformers import GPTQConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Excerpt from the training script: model_args, training_args, lora_args, config,
# device_map, model_load_kwargs, and is_chat_model are defined earlier in that script.

# Load model and tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=training_args.cache_dir,
    device_map=device_map,
    trust_remote_code=True,
    # Pass a GPTQ config only for Q-LoRA; plain LoRA loads the model unquantized
    quantization_config=GPTQConfig(
        bits=4, disable_exllama=True
    )
    if training_args.use_lora and lora_args.q_lora
    else None,
    **model_load_kwargs,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=training_args.cache_dir,
    model_max_length=training_args.model_max_length,
    padding_side="right",
    use_fast=False,
    trust_remote_code=True,
)
tokenizer.pad_token_id = tokenizer.eod_id

if training_args.use_lora:
    if lora_args.q_lora or is_chat_model:
        modules_to_save = None
    else:
        modules_to_save = ["wte", "lm_head"]
    lora_config = LoraConfig(
        r=lora_args.lora_r,
        lora_alpha=lora_args.lora_alpha,
        target_modules=lora_args.lora_target_modules,
        lora_dropout=lora_args.lora_dropout,
        bias=lora_args.lora_bias,
        task_type="CAUSAL_LM",
        modules_to_save=modules_to_save  # This argument serves for adding new tokens.
    )
    if lora_args.q_lora:
        # Freeze the quantized base model and set it up for training
        model = prepare_model_for_kbit_training(
            model, use_gradient_checkpointing=training_args.gradient_checkpointing
        )

    model = get_peft_model(model, lora_config)

    # Print peft trainable params
    model.print_trainable_parameters()

    if training_args.gradient_checkpointing:
        model.enable_input_require_grads()
```
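To give a feel for the numbers that `print_trainable_parameters()` reports, the following sketch counts the parameters one LoRA adapter adds to a single linear layer. The function names and the layer width of 2048 are illustrative assumptions, not values read from the Qwen configuration.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A has shape (r, in), B has shape (out, r)."""
    return r * in_features + out_features * r


def percent_of(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Hypothetical square projection of width 2048 adapted with rank 64.
added = lora_param_count(in_features=2048, out_features=2048, r=64)
dense = 2048 * 2048
print(added)                     # 262144
print(percent_of(added, dense))  # 6.25
```

Because the adapter size grows linearly with `r` while the dense layer grows with the product of its dimensions, the trainable fraction stays small for wide layers.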

About the prepare_model_for_kbit_training function

Call `prepare_model_for_kbit_training()` to preprocess the quantized model for training. Its source code in the peft library is:

```python
import inspect
import warnings

import torch


def prepare_model_for_kbit_training(model, use_gradient_checkpointing=True, gradient_checkpointing_kwargs=None):
    r"""
    Note this method only works for `transformers` models.

    This method wraps the entire protocol for preparing a model before running a training. This includes:
        1- Cast the layernorm in fp32 2- making output embedding layer require grads 3- Add the upcasting of the lm
        head to fp32

    Args:
        model (`transformers.PreTrainedModel`):
            The loaded model from `transformers`
        use_gradient_checkpointing (`bool`, *optional*, defaults to `True`):
            If True, use gradient checkpointing to save memory at the expense of slower backward pass.
        gradient_checkpointing_kwargs (`dict`, *optional*, defaults to `None`):
            Keyword arguments to pass to the gradient checkpointing function, please refer to the documentation of
            `torch.utils.checkpoint.checkpoint` for more details about the arguments that you can pass to that method.
            Note this is only available in the latest transformers versions (> 4.34.1).
    """
    loaded_in_kbit = getattr(model, "is_loaded_in_8bit", False) or getattr(model, "is_loaded_in_4bit", False)
    is_gptq_quantized = getattr(model, "quantization_method", None) == "gptq"
    is_aqlm_quantized = getattr(model, "quantization_method", None) == "aqlm"
    is_eetq_quantized = getattr(model, "quantization_method", None) == "eetq"
    is_hqq_quantized = getattr(model, "quantization_method", None) == "hqq" or getattr(model, "hqq_quantized", False)

    if gradient_checkpointing_kwargs is None:
        gradient_checkpointing_kwargs = {}

    for name, param in model.named_parameters():
        # freeze base model's layers
        param.requires_grad = False

    if not is_gptq_quantized and not is_aqlm_quantized and not is_eetq_quantized and not is_hqq_quantized:
        # cast all non INT8 parameters to fp32
        for param in model.parameters():
            if (
                (param.dtype == torch.float16) or (param.dtype == torch.bfloat16)
            ) and param.__class__.__name__ != "Params4bit":
                param.data = param.data.to(torch.float32)

    if (
        loaded_in_kbit or is_gptq_quantized or is_aqlm_quantized or is_eetq_quantized or is_hqq_quantized
    ) and use_gradient_checkpointing:
        # When having `use_reentrant=False` + gradient_checkpointing, there is no need for this hack
        if "use_reentrant" not in gradient_checkpointing_kwargs or gradient_checkpointing_kwargs["use_reentrant"]:
            # For backward compatibility
            if hasattr(model, "enable_input_require_grads"):
                model.enable_input_require_grads()
            else:

                def make_inputs_require_grad(module, input, output):
                    output.requires_grad_(True)

                model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

        # To support older transformers versions, check if the model supports gradient_checkpointing_kwargs
        _supports_gc_kwargs = "gradient_checkpointing_kwargs" in list(
            inspect.signature(model.gradient_checkpointing_enable).parameters
        )

        if not _supports_gc_kwargs and len(gradient_checkpointing_kwargs) > 0:
            warnings.warn(
                "gradient_checkpointing_kwargs is not supported in this version of transformers. The passed kwargs will be ignored."
                " if you want to use that feature, please upgrade to the latest version of transformers.",
                FutureWarning,
            )

        gc_enable_kwargs = (
            {} if not _supports_gc_kwargs else {"gradient_checkpointing_kwargs": gradient_checkpointing_kwargs}
        )

        # enable gradient checkpointing for memory efficiency
        model.gradient_checkpointing_enable(**gc_enable_kwargs)
    return model
```

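`prepare_model_for_kbit_training` prepares a `transformers` pretrained model (`PreTrainedModel`) for low-bit (k-bit) quantized training. It applies a handful of setup steps that make the quantized model suitable for training.

Core functions, as listed in the docstring:

Cast LayerNorm parameters to FP32 (32-bit floating point).
Make the embedding output require gradients. In the code shown this is done by `enable_input_require_grads()` or a forward hook on the input embeddings; it does not unfreeze the embedding weights, it only lets gradients flow through the frozen base model when gradient checkpointing is on.
Upcast the language-model head (lm head) to FP32 for better numerical stability.

In the version shown, the first and third items are covered by casting every FP16/BF16 parameter to FP32, a step that is skipped for GPTQ, AQLM, EETQ, and HQQ models.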

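Parameters:

model:

A pretrained model object loaded with `transformers` (for example a GPT- or BERT-style model).

use_gradient_checkpointing:

Whether to enable gradient checkpointing, which trades compute for memory: lower memory usage at the cost of a slower backward pass.

gradient_checkpointing_kwargs:

An optional dict of keyword arguments forwarded to the gradient checkpointing function. It requires a `transformers` version newer than 4.34.1.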

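Step-by-step walkthrough:

Detect how the model is quantized:

Check whether the model was loaded in a low-bit format (8-bit or 4-bit) and whether it uses a specific quantization method (GPTQ, AQLM, EETQ, or HQQ).

Freeze all parameters:

Iterate over every parameter and set `requires_grad = False`, so that no base-model layer receives gradients. This is the usual first step in low-bit training; the parameters that are actually trained (the LoRA adapters) are added afterwards.

Cast to FP32 (skipped for GPTQ, AQLM, EETQ, and HQQ models):

If the model is not quantized with one of these four methods (that is, it is a bitsandbytes k-bit model or an unquantized model), every FP16 or BF16 parameter other than `Params4bit` is cast to FP32 for numerical stability.

Prepare for gradient checkpointing (optional):

This branch runs only when the model is a low-bit quantized model and `use_gradient_checkpointing` is enabled.
Unless `use_reentrant=False` was passed, make the input tensors require gradients. Older `transformers` versions lack `enable_input_require_grads`, so a `forward_hook` on the input embeddings sets `requires_grad` explicitly instead.
Check whether the model supports `gradient_checkpointing_kwargs`, and warn that the kwargs will be ignored if the installed version is too old.

Enable gradient checkpointing:

Call the model's `gradient_checkpointing_enable` method, passing the extra kwargs only when they are supported, to reduce memory usage.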
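The branching described above can be condensed into a small, dependency-free sketch. The function `plan_kbit_preparation` and its step names are hypothetical; it only mirrors the control flow of the peft source quoted earlier and does not touch a real model. As a simplification, it ignores the extra `hqq_quantized` attribute check and treats a missing `use_reentrant` key as `True`, as the source does.

```python
from typing import List, Optional

# Quantization methods for which the FP32 cast is skipped in the quoted source.
SKIP_FP32_CAST = {"gptq", "aqlm", "eetq", "hqq"}


def plan_kbit_preparation(
    quantization_method: Optional[str],
    loaded_in_kbit: bool,
    use_gradient_checkpointing: bool = True,
    use_reentrant: bool = True,
) -> List[str]:
    """Return the ordered list of steps the function would apply."""
    steps = ["freeze_all_parameters"]
    special = quantization_method in SKIP_FP32_CAST
    if not special:
        steps.append("cast_fp16_bf16_params_to_fp32")
    if (loaded_in_kbit or special) and use_gradient_checkpointing:
        if use_reentrant:
            steps.append("make_inputs_require_grad")
        steps.append("enable_gradient_checkpointing")
    return steps


# A GPTQ checkpoint such as the Int4 model in Example 1: no FP32 cast.
print(plan_kbit_preparation("gptq", loaded_in_kbit=False))
# A bitsandbytes 4-bit model such as the one in Example 2: FP32 cast included.
print(plan_kbit_preparation("bitsandbytes", loaded_in_kbit=True))
```

Note that an unquantized model (`loaded_in_kbit=False`, no special method) only gets frozen and cast; gradient checkpointing is not enabled for it by this function.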

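Output:

After this function runs, the model:

Is better suited to training in a quantized or low-precision (FP16/BF16) setting.
Has its FP16/BF16 parameters cast to FP32 for stability, unless it is a GPTQ, AQLM, EETQ, or HQQ model.
Has every existing parameter frozen, so that only parameters added later (such as LoRA adapters) are trained.
Has gradient checkpointing enabled when requested, which reduces GPU memory usage.

When to use it:

This function is particularly useful when:

Training a low-bit (8-bit or 4-bit) model.
Reducing memory consumption with gradient checkpointing while fine-tuning a large model.
Fine-tuning only a small set of added parameters while the rest of the model stays frozen.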

Merging the model

Note: merge into the non-quantized base model, not the quantized model.

```python
from modelscope.hub.snapshot_download import snapshot_download

# Download the non-quantized chat model into the current directory
snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in FP16, attach the trained adapter, then fold it into the weights
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat/", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, "output_qwen/")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("output_qwen_merged", max_shard_size="2048MB", safe_serialization=True)
```

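During LoRA and Q-LoRA training, only the adapter parameters are saved, not the full model weights. The adapter weights cannot be merged directly into a quantized model; they have to be merged into the original non-quantized model instead.

In other words, merging means loading the original base model, combining the fine-tuned adapter parameters with it, and writing out a new set of model weights, as the sample code above does.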
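What "combining the adapter with the base weights" means can be shown on a toy example. The sketch below assumes the standard LoRA update, W' = W + (alpha / r) * B A, and uses plain Python lists instead of tensors; `matmul` and `merge_lora` are hypothetical helpers, not peft functions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, the dense weight after merging."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r=1, in=2)
lora_b = [[0.5], [-0.5]]   # shape (out=2, r=1)
merged = merge_lora(w, lora_a, lora_b, alpha=2.0, r=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]

# The merged weight gives the same output as base path + adapter path.
x = [[3.0], [4.0]]
print(matmul(merged, x))  # [[14.0], [-7.0]]
```

The merge adds a dense floating-point delta to every weight, which is why it needs the weights in a floating-point format rather than as 4-bit codes.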

Example 2: Fine-tuning Llama with QLoRA

Reference article: fine-tuning-llama-2-using-lora-and-qlora-a-comprehensive-guide

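Choosing the model

```python
model_name = "NousResearch/Llama-2-7b-chat-hf"       # base model to fine-tune
dataset_name = "mlabonne/guanaco-llama2-1k"          # training dataset
new_model = "Llama-2-7b-chat-finetune-qlora"         # name for the trained adapter
```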

Parameter settings

```python
lora_r = 64         # LoRA attention dimension (rank)
lora_alpha = 16     # LoRA scaling parameter
lora_dropout = 0.1  # LoRA dropout probability
```
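As a quick check on how `lora_r` and `lora_alpha` interact, the sketch below computes the factor by which the adapter update is scaled. It assumes the standard LoRA scaling of `lora_alpha / lora_r` (not the rank-stabilized variant); `lora_scaling` is a hypothetical helper name.

```python
def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    """Factor applied to the low-rank update B @ A under standard LoRA scaling."""
    return lora_alpha / lora_r


# Settings from this example: alpha = 16, r = 64.
print(lora_scaling(16, 64))  # 0.25
# Keeping alpha fixed while lowering the rank increases the scale.
print(lora_scaling(16, 8))   # 2.0
```

Under this assumption, changing the rank without adjusting `lora_alpha` also changes the effective magnitude of the update, so the two are usually tuned together.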

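Quantization settings

```python
use_4bit = True                     # load the base model in 4-bit precision
bnb_4bit_compute_dtype = "float16"  # dtype used for computation
bnb_4bit_quant_type = "nf4"         # 4-bit quantization type
use_nested_quant = False            # nested (double) quantization
```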

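BitsAndBytes configuration options explained

**use_4bit = True**

Function: enables 4-bit quantization to reduce the model's memory footprint.
Effect: compresses the model weights from their usual higher precision (FP32 or FP16) to a 4-bit representation, which significantly lowers GPU memory usage.

**bnb_4bit_compute_dtype = "float16"**

Function: sets the dtype used for computation during training, here `float16` (16-bit floating point).
Effect: although the weights are stored in 4 bits, computation still runs at higher precision (FP16), which preserves numerical stability and performance during training.

**bnb_4bit_quant_type = "nf4"**

Function: sets the 4-bit quantization type; `nf4` (4-bit NormalFloat) is a common choice.
Effect: `nf4` is a quantization format designed for this purpose. Compared with conventional quantization types, it preserves the distribution of the model weights better and so improves the quality of the quantized model.

**use_nested_quant = False**

Function: whether to enable nested quantization, also called double quantization.
Effect: nested quantization goes one step further and reduces memory usage again, but it may have some impact on model performance. It is disabled here.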
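The idea of storing weights as 4-bit codes while computing in higher precision can be illustrated with a toy blockwise quantizer. This is a deliberately simplified sketch: it uses evenly spaced integer levels in [-7, 7] with one absmax scale per block, which is an assumption for illustration and is not the NF4 codebook or the bitsandbytes implementation.

```python
from typing import List, Tuple


def quantize_block(values: List[float]) -> Tuple[List[int], float]:
    """Map a block of floats to integer codes in [-7, 7] plus one absmax scale."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid division by zero for an all-zero block
    codes = [round(v / absmax * 7) for v in values]
    return codes, absmax


def dequantize_block(codes: List[int], absmax: float) -> List[float]:
    """Recover approximate floats; this is what the higher-precision compute step sees."""
    return [c / 7 * absmax for c in codes]


block = [0.7, -0.3, 0.33, 0.0]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
print(codes)     # [7, -3, 3, 0]
print(restored)  # close to the inputs; 0.33 comes back as roughly 0.3
```

Each code fits in 4 bits, and the only extra storage is one scale per block. Nested quantization, as described above, targets those per-block scales.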

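Together, these settings use the BitsAndBytes library for 4-bit quantization, so that a large model can be trained with limited GPU memory while keeping as much model quality as possible. Specifically:

Enable 4-bit quantization to compress the model weights.
Compute in FP16 to balance speed and precision.
Use the **nf4** quantization type to improve the quality of the quantized model.
Disable nested quantization to avoid extra complexity or a loss in performance.

This configuration is a good fit for efficient training in low-resource environments.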
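A rough back-of-the-envelope estimate shows why 4-bit loading matters. The sketch below covers weight storage only; it ignores quantization constants, activations, gradients, and optimizer state. The parameter count of 7 billion is an assumed round figure for a "7b" model, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # assumption: round figure for a 7B-parameter model

print(weight_memory_gb(num_params, 32))  # FP32:  28.0
print(weight_memory_gb(num_params, 16))  # FP16:  14.0
print(weight_memory_gb(num_params, 4))   # 4-bit:  3.5
```

Under these assumptions the 4-bit weights take about a quarter of the FP16 footprint, which is what leaves room for activations and the LoRA optimizer state on a single GPU.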

Loading the model

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: device_map is not defined in the original snippet; this places the whole model on GPU 0
device_map = {"": 0}

# Load dataset
dataset = load_dataset(dataset_name, split="train")

# Build the QLoRA quantization config
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model, quantized on the fly by bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)

model.config.use_cache = False
model.config.pretraining_tp = 1
```
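The capability check above only prints a hint. A minimal sketch of acting on it is shown below: a hypothetical helper `pick_compute_dtype` returns the dtype name to use, based on the same `major >= 8` threshold as the snippet.

```python
def pick_compute_dtype(cuda_major: int) -> str:
    """Return 'bfloat16' when the CUDA compute capability major version is at least 8, else 'float16'."""
    return "bfloat16" if cuda_major >= 8 else "float16"


print(pick_compute_dtype(8))  # bfloat16
print(pick_compute_dtype(7))  # float16

# Possible use with the settings above (requires a CUDA device):
#   major, _ = torch.cuda.get_device_capability()
#   bnb_4bit_compute_dtype = pick_compute_dtype(major)
```

Returning a string keeps the helper compatible with the `getattr(torch, bnb_4bit_compute_dtype)` lookup used when building the config.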

Merging the model

As in Example 1, merge the adapter into the base model.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```

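Summary

When the checkpoint is already quantized

Training: call `prepare_model_for_kbit_training(model)` before adding the adapter.
Merging: load the non-quantized base model and merge the QLoRA adapter into it.
Inference: either load the base model and then the QLoRA weights, or load the merged model.

When the checkpoint is a base model

Training: quantize the base model with bitsandbytes (bnb) while loading it.
Merging: load the non-quantized base model and merge the QLoRA adapter into it.
Inference: either load the base model and then the QLoRA weights, or load the merged model.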