LLM Notes (10) | A Guide to Fine-Tuning DeepSeek R1

In this article, we walk through the process of fine-tuning the DeepSeek R1 model with Python.

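1. Prerequisites

unsloth: according to the project, it makes fine-tuning large language models such as Llama-3, Mistral, Phi-4, and Gemma about 2x faster while using 70% less memory, with no loss in accuracy.
torch: the basic building block for deep learning with PyTorch. It provides a powerful tensor library similar to NumPy, with the added benefit of GPU acceleration, which is essential when working with LLMs.
transformers: a powerful and popular open-source library for natural language processing (NLP). It offers an easy-to-use interface to a wide range of state-of-the-art pretrained models. Since a pretrained model is the starting point of any fine-tuning task, this package gives convenient access to trained models.
trl: a library specialized in reinforcement learning (RL) with transformer models. It is built on top of the Hugging Face transformers library and makes RL with transformers more accessible and efficient. It also provides the SFTTrainer used later in this article.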
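
Before going further, it can help to confirm which of these packages are already present in the environment. The following is a minimal sketch that uses only the Python standard library; the helper name `check_packages` is an illustrative assumption and is not part of any of the libraries listed above.

```python
import importlib.util

# Import names of the packages listed above ("datasets" is used later for data loading).
REQUIRED = ["unsloth", "torch", "transformers", "trl", "datasets"]

def check_packages(names):
    """Return {name: True/False} depending on whether each top-level package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:<14}{'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed with pip before continuing.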

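2. Compute requirements

Fine-tuning makes an LLM's responses more structured and better adapted to a specific domain. With parameter-efficient methods, it does not require training all of the model's parameters.

Even so, fine-tuning larger LLMs remains impractical on most consumer hardware, because both the trainable parameters and the base model's weights must fit in the GPU's VRAM (video RAM), and the sheer size of large LLMs is the main obstacle.

We will therefore fine-tune a small LLM, DeepSeek-R1-Distill, cited here at 4.74 billion parameters (note that the checkpoint loaded in section 4.1 is the 4-bit quantized Llama-8B distillation), which needs at least 8-12 GB of VRAM. To keep this accessible to everyone, we will use Google Colab's free T4 GPU, which has 16 GB of VRAM.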
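
To make the VRAM argument concrete, here is a rough back-of-the-envelope sketch of how much memory the model weights alone occupy at different precisions. The 8-billion parameter count is an assumption taken from the "8B" in the checkpoint name, and the function name `weight_memory_gb` is hypothetical. The estimate deliberately ignores activations, gradients, optimizer state, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumption: nominal size of an 8B-parameter model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:<10}{weight_memory_gb(NUM_PARAMS, bits):6.1f} GB")
```

Under these assumptions, 16-bit weights alone would already fill a 16 GB card, which is why the 4-bit checkpoint is the practical choice on a T4.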

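3. Data preparation

Fine-tuning an LLM requires structured, task-specific data. There are many ways to obtain it, such as scraping social media platforms, websites, books, or research papers.

In this article we use the datasets library to load data hosted on the Hugging Face Hub, specifically the yahma/alpaca-cleaned dataset.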

4. Fine-tuning

A major benefit of using Google Colab for this task is that most of the required packages are already installed. We only need to install one package, unsloth.

!pip install unsloth

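4.1 Initializing the model and tokenizer

We use the unsloth package to load the pretrained model, because it provides a number of useful techniques that help us download and fine-tune the LLM faster.

The code for loading the model and tokenizer is shown below: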


          
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

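Here we pass the model name 'unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit' to access the pretrained DeepSeek-R1-Distill model.

max_seq_length is set to 2048, the maximum length of input sequence the model can process. Choosing it sensibly balances memory use and processing speed.

dtype is set to None, which lets unsloth pick a data type compatible with the available hardware, so we do not have to check and specify it explicitly.

load_in_4bit = True loads the model quantized to 4-bit precision, which reduces memory use.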
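
To give a feel for what storing a weight in 4 bits means, the following toy sketch quantizes a few floats to integer codes in the range -7 to 7 with a single shared scale, then reconstructs them. This is a simplified absmax illustration only; it is not the scheme the bitsandbytes checkpoint actually uses, and the function names and sample weights are made up for the example.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed integer codes in [-7, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.97, -0.08, 0.41]  # made-up values for illustration
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored as a small integer instead of a 16- or 32-bit float, at the cost of a bounded rounding error; that trade is what makes the model fit in limited VRAM.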

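4.2 Adding LoRA adapters

We add LoRA matrices to the pretrained LLM; these are the weights that get trained to adjust the model's responses. With unsloth, the whole step takes only a few lines.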


          
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0.05, # Can be set to any, but = 0 is optimized
    bias = "none",    # Can be set to any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3977,
    use_rslora = False,  # unsloth also supports rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

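Code explanation

We have re-initialized the model with get_peft_model from FastLanguageModel so that it is trained with PEFT (parameter-efficient fine-tuning).

The first argument is the pretrained model obtained in the previous step.

r = 64 sets the rank of the low-rank matrices in the LoRA adaptation. Ranks between 8 and 128 are the range typically recommended.

lora_dropout = 0.05 applies dropout to the low-rank matrices while the LoRA adapters are trained, which helps prevent overfitting.

target_modules lists the names of the modules in the model to which LoRA adapters are applied.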
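
The effect of the rank r can be sketched with simple arithmetic: for one weight matrix, LoRA trains two small factors instead of the full matrix. The 4096 x 4096 shape below is an illustrative assumption (real layer shapes differ between modules), and `lora_param_count` is a hypothetical helper; r and lora_alpha are the values from the call above.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumption: an illustrative 4096 x 4096 projection; real layer shapes vary by module.
d_in, d_out = 4096, 4096
r, lora_alpha = 64, 32  # values from the get_peft_model call above

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # standard LoRA scaling when use_rslora is False

print(f"full matrix:   {full:,} parameters")
print(f"LoRA factors:  {lora:,} parameters ({lora / full:.2%} of full)")
print(f"update scale:  {scaling}")
```

A larger r gives the adapters more capacity but also more trainable parameters, while lora_alpha / r controls how strongly the learned update is scaled when added to the frozen weights.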

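4.3 Data preparation

With the LoRA adapters set up on the pretrained LLM, we can now build the data that will be used to train the model.

To do so, we define a prompt with three parts: instruction, input, and response.

Instruction: the query to the LLM, that is, the question we ask it.

Input: additional data passed along with the instruction or query for the model to analyze.

Response: the output expected from the LLM. It shows how the response should be tailored to a given instruction (query), whether or not any input (data) is passed.

The prompt is structured as follows: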


          
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

We then define a function that formats every record in the dataset with alpaca_prompt:


          
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

Next, we load the dataset that will be used to fine-tune the model, which in our case is "yahma/alpaca-cleaned".


          
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
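
With batched = True, the mapping function receives a dict of lists (one list per column) rather than a single row. The sketch below shows this on a tiny hand-made batch, without downloading the dataset or loading a tokenizer. The rows are made up, and the "<eos>" string is a placeholder assumption standing in for tokenizer.eos_token.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # placeholder assumption; the real value comes from tokenizer.eos_token

def formatting_prompts_func(examples):
    """Receive a batch as a dict of lists and return one formatted string per row."""
    texts = [
        ALPACA_PROMPT.format(instruction, input_text, output) + EOS_TOKEN
        for instruction, input_text, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# Made-up rows in the same column layout as the dataset (instruction / input / output).
toy_batch = {
    "instruction": ["Add the two numbers.", "Name a primary color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Red"],
}

formatted = formatting_prompts_func(toy_batch)
print(formatted["text"][0])
```

The returned "text" column is what the trainer later reads, so each string must contain the full prompt, the expected response, and the end-of-sequence token.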

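4.4 Training the model

We now have both the structured data and the model with LoRA adapters, so we can move on to training.

Training requires setting a number of hyperparameters, which drive the training process and also affect the model's accuracy to some extent.

We initialize the trainer with SFTTrainer and these hyperparameters.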


          
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model, # The model with LoRA adapters
    tokenizer = tokenizer, # The tokenizer of the model
    train_dataset = dataset, # The dataset to use for training
    dataset_text_field = "text", # The field in the dataset that contains the structured data
    max_seq_length = 2048, # Max input sequence length; same value passed to from_pretrained
    dataset_num_proc = 2, # Number of processes to use for loading and processing the data
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Batch size per GPU
        gradient_accumulation_steps = 4, # Number of batches accumulated before each weight update
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 120, # Maximum steps of training
        learning_rate = 2e-4, # Initial learning rate
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit", # The optimizer that will be used for updating the weights
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
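
Two of these settings are easier to reason about with a little arithmetic: the effective batch size that results from gradient accumulation, and the shape of the linear learning-rate schedule with warmup. The sketch below is a plain-Python illustration under the assumption of a single GPU; `linear_schedule_lr` is a hypothetical helper that mirrors the usual linear warmup and decay rule, not a function from transformers or trl.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, max_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

# Values from the TrainingArguments above; a single GPU is assumed.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps, warmup_steps, learning_rate = 120, 5, 2e-4

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
for step in (0, 5, 60, 120):
    print(step, linear_schedule_lr(step, learning_rate, warmup_steps, max_steps))
```

With these values each optimizer step uses 8 examples, so a 120-step run is a short demonstration rather than a full epoch; the commented-out num_train_epochs = 1 setting is the alternative for a complete pass over the data.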

We now start training with this trainer:

trainer_stats = trainer.train()

This starts training and logs each step with its Training Loss in the notebook output.


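4.5 Running inference with the fine-tuned model

With training complete, what remains is to run inference on the fine-tuned model and evaluate its responses.

The code for running inference is shown below: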


          
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

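Code explanation

We call FastLanguageModel.for_inference from the unsloth package to switch the fine-tuned model to inference mode, which produces results faster.

To run inference, we first turn the query into the structured prompt and then tokenize it.

We set return_tensors = "pt" so that the tokenizer returns PyTorch tensors, then move them to the GPU with .to("cuda") for faster processing.

We then call model.generate() to generate a response to the query.

max_new_tokens = 64 caps the number of new tokens the model may generate.

use_cache = True also speeds up generation, especially for longer sequences.

Finally, we decode the fine-tuned model's output from tensors back into text.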
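
Because the decoded text contains the prompt followed by the generated continuation, it is often convenient to keep only the part after the response marker. The following is a minimal sketch of that post-processing step; `extract_response` is a hypothetical helper, and the decoded string is a stand-in that only mimics the prompt layout, not real model output.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded_text, eos_token=""):
    """Return only the text after the last response marker, without the EOS token."""
    _, _, response = decoded_text.rpartition(RESPONSE_MARKER)
    if eos_token:
        response = response.replace(eos_token, "")
    return response.strip()

# Stand-in string, NOT real model output: it only mimics the prompt layout.
decoded = (
    "### Instruction:\nContinue the Fibonacci sequence.\n\n"
    "### Input:\n1, 1, 2, 3, 5, 8\n\n"
    "### Response:\n<generated text would appear here><eos>"
)

print(extract_response(decoded, eos_token="<eos>"))
```

In practice the helper would be applied to each string returned by tokenizer.batch_decode, with tokenizer.eos_token passed as eos_token.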


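4.6 Saving the fine-tuned model

This completes the fine-tuning process, and we can now save the fine-tuned model for inference or later use.

The tokenizer must be saved together with the model. The following pushes the fine-tuned model to the Hugging Face Hub.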


          
# Pushing with 4bit precision
model.push_to_hub_merged("<YOUR_HF_ID>/<MODEL_NAME>", tokenizer, save_method = "merged_4bit", token = "<YOUR_HF_TOKEN>")

# Pushing with 16 bit precision
model.push_to_hub_merged("<YOUR_HF_ID>/<MODEL_NAME>", tokenizer, save_method = "merged_16bit", token = "<YOUR_HF_TOKEN>")

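Here you must set the model name, which becomes the model's id on the Hub.

The full merged model can be uploaded at either 4-bit or 16-bit precision. A merged model means the pretrained model together with the LoRA matrices is pushed to the Hub; there are also options that push only the LoRA matrices without the base model.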