Super Simple: Fine-Tune Your Own Llama 2 Model with Colab


Title | Fine-Tune Your Own Llama 2 Model in a Colab Notebook

Author | Maxime Labonne

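To get the most out of this article, the following are recommended as prior reading:

Exploring LLM Application Development (5): Fine-Tuning (Background and Challenges)

Exploring LLM Application Development (6): Fine-Tuning (Approaches and Theory)

Exploring LLM Application Development (7): Fine-Tuning (Tools in Practice)

Essential Developer Tools: An Introduction to Colab

Essential Developer Tools: 21 Advanced Colab Tips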

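With the release of LLaMA 1, fine-tuned models appeared in a Cambrian explosion, including Alpaca, Vicuna, and WizardLM (although LLaMA 1 could not be used commercially). This pushed companies to release their own base models under licenses suitable for commercial use, such as OpenLLaMA, Falcon, and XGen. The newly released Llama 2 combines the best of both: an efficient base model and a more permissive license.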

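In the first half of 2023, the widespread use of LLM APIs (such as the OpenAI API), built on top of large language model (LLM) infrastructure, substantially changed the software landscape. Libraries such as LangChain and LlamaIndex played a key role in this trend. In the second half of the year, fine-tuning (or instruction tuning) these models is set to become a standard step in the LLMOps workflow. Several factors drive this trend: the potential for cost savings, the ability to process confidential private data, and even the potential to develop models that outperform prominent models such as ChatGPT and GPT-4 on certain specific tasks.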

In this article, we will look at why instruction tuning works and how to implement it in a Google Colab notebook to train your own Llama 2 model.

Background on fine-tuning


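LLMs are pretrained on large text corpora. In the case of Llama 2, we know very little about the composition of the training set, apart from its size of 2 trillion tokens. By comparison, BERT (2018) was trained only on BookCorpus (800M words) and English Wikipedia (2,500M words). From experience, pretraining is a very costly and lengthy process that runs into many engineering problems, which is why most companies neither need to nor can train a model from scratch. Once pretraining is complete, an autoregressive model such as Llama 2 can predict the next token in a sequence. That alone does not make it a particularly useful assistant, because it does not follow human instructions. This is why we use instruction tuning: to align its answers with what humans expect. There are two main types of fine-tuning: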

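Supervised fine-tuning (SFT): the model is trained on a dataset of instructions and responses. It adjusts the weights of the LLM to minimize the difference between the generated answers and the ground-truth (label) responses.
Reinforcement learning from human feedback (RLHF): the model learns by interacting with its environment and receiving feedback. It is trained to maximize a reward signal (using PPO), which is usually derived from human evaluations of the model's outputs.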
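To make the SFT objective more concrete, here is a minimal sketch that assumes the usual token-level cross-entropy loss (the text above only says that the difference from the reference answer is minimized). The three-token vocabulary, the logits, and the function names are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the reference tokens."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Hypothetical toy case: vocabulary of 3 tokens, reference answer of 2 tokens.
targets = [0, 1]
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # puts most mass on the reference
uncertain = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform over the vocabulary

print(f"confident model loss: {sft_loss(confident, targets):.4f}")
print(f"uncertain model loss: {sft_loss(uncertain, targets):.4f}")
```

The loss drops as the model assigns more probability to the reference tokens, which is the quantity that gradient descent pushes down during SFT.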

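In general, RLHF can capture more complex and nuanced human preferences, but it is also harder to implement effectively. It requires careful design of the reward system and is very sensitive to the quality and consistency of the human feedback. The Direct Preference Optimization (DPO) algorithm is a possible future alternative that runs preference learning directly on the SFT model. In this example we will perform SFT, which raises a question: why does fine-tuning work? As highlighted in the Orca paper, our understanding is that fine-tuning leverages knowledge learned during pretraining. In other words, if the model has never seen the kind of data you care about, fine-tuning will not help much. When the model has seen such data, however, SFT can perform extremely well (it fills in the domain background, which is also why fine-tuning can be used to build enterprise-specific large models).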

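For example, the LIMA paper showed how fine-tuning a LLaMA (v1) model with 65 billion parameters on 1,000 high-quality samples could outperform GPT-3 (DaVinci003). The quality of the instruction dataset is essential to reach this level of performance, which is why a lot of work (such as evol-instruct, Orca, or phi-1) focuses on this issue. Note that the size of the LLM (65b rather than 13b or 7b) is also fundamental to leveraging pre-existing knowledge effectively.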

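Another important point related to data quality is the prompt template. Prompts are made up of similar elements: a system prompt (optional) to guide the model, a user prompt (required) to give the instruction, additional inputs (optional) to take into account, and the model's answer (required). For Llama 2, Meta used the following template: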

<s>[INST] <<SYS>>  
System prompt  
<</SYS>>  
  
User prompt [/INST] Model answer </s>

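There are other templates, such as those of Alpaca and Vicuna, and their impact is not very clear. In this example, we will reformat our instruction dataset to follow Llama 2's template. For the purposes of this tutorial, I have already done this with the timdettmers/openassistant-guanaco dataset (https://huggingface.co/datasets/timdettmers/openassistant-guanaco). The dataset actually used for training is a reduced version that keeps 1,000 samples, named mlabonne/guanaco-llama2-1k.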
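As an illustration of that reformatting step, the sketch below wraps one instruction/answer pair in the Llama 2 template shown above. The function name and the sample record are hypothetical; this is not the script that produced mlabonne/guanaco-llama2-1k.

```python
from typing import Optional


def format_llama2_prompt(user_prompt: str, answer: str,
                         system_prompt: Optional[str] = None) -> str:
    """Wrap one training sample in the Llama 2 chat template."""
    if system_prompt:
        # The optional system prompt sits inside the [INST] block.
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        instruction = user_prompt
    return f"<s>[INST] {instruction} [/INST] {answer} </s>"


# Hypothetical record, only to show the shape of the output.
sample = {"instruction": "Name a prime number.", "response": "7 is a prime number."}
text = format_llama2_prompt(sample["instruction"], sample["response"])
print(text)
```

Each formatted string can then be stored in a single text column, which is the layout a trainer reading a `text` field would expect.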


How to fine-tune Llama 2

Colab notebook: https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing

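In this section, we will fine-tune a Llama 2 model with 7 billion parameters on a T4 GPU with high RAM, using Google Colab (2.21 credits/hour). Note that a T4 has only 16 GB of VRAM, which is barely enough to store the weights of Llama 2-7b (7b × 2 bytes = 14 GB in FP16). On top of that, we need to account for the overhead of optimizer states, gradients, and forward activations. This means that full fine-tuning is not possible here: we need a parameter-efficient fine-tuning (PEFT) technique such as LoRA or QLoRA.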
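The memory arithmetic can be written out as a short back-of-the-envelope sketch. The 16 bytes per parameter used for full fine-tuning is an assumption (FP16 weights and gradients plus FP32 master weights and two AdamW moments) and ignores activations; the 7 billion parameter count is the round figure used in the text.

```python
def gigabytes(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold n_params values, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9        # round figure from the text
T4_VRAM_GB = 16

fp16_weights = gigabytes(N_PARAMS, 2)    # 7b x 2 bytes = 14 GB
int4_weights = gigabytes(N_PARAMS, 0.5)  # 4 bits = half a byte per weight

# Assumption: mixed-precision AdamW keeps FP16 weights (2) and gradients (2),
# plus FP32 master weights (4) and two FP32 optimizer moments (4 + 4).
full_finetune = gigabytes(N_PARAMS, 16)

print(f"FP16 weights:     {fp16_weights:.1f} GB")
print(f"4-bit weights:    {int4_weights:.1f} GB")
print(f"Full fine-tuning: {full_finetune:.1f} GB (before activations)")
print(f"Fits on a T4?     {full_finetune < T4_VRAM_GB}")
```

Under these assumptions full fine-tuning needs several times the VRAM of a T4, while the 4-bit weights leave room for adapters and activations.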

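To drastically reduce VRAM usage, we must fine-tune the model in 4-bit precision, which is why we use QLoRA here. Fortunately, we can rely on the Hugging Face ecosystem with the transformers, accelerate, peft, trl, and bitsandbytes libraries. The code below is based on Younes Belkada's GitHub Gist (https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da).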
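To give a feel for what storing weights in 4 bits involves, here is a deliberately simplified sketch. It uses plain absmax integer quantization, not the NF4 data type that bitsandbytes applies in QLoRA, and the weight values are invented.

```python
def quantize_absmax_4bit(values):
    """Map floats to signed 4-bit codes in [-8, 7] using one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical block of weights.
weights = [0.12, -0.40, 0.05, 0.33, -0.07, 0.21]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4))
```

Each weight is stored as one of 16 codes plus a shared scale per block, so the rounding error stays within half a quantization step.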

First, install and load these libraries.


        
            

          
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

We will load a llama-2-7b-chat-hf model and train it on mlabonne/guanaco-llama2-1k (1,000 samples), which produces our fine-tuned model llama-2-7b-miniguanaco. Colab notebook: https://colab.research.google.com/drive/1Ad7a9zMmkxuXTOh1Z7-rNSICA4dybpM2?usp=sharing. You can also switch to another dataset from Hugging Face (including Chinese datasets), such as databricks/databricks-dolly-15k.

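Next, set the following training parameters. QLoRA uses a rank of 64 with a scaling parameter of 16. The Llama 2 model is loaded directly in 4-bit precision using the NF4 type and trained for 1 epoch. For more information about the other parameters, see the TrainingArguments, PeftModel, and SFTTrainer documentation on Hugging Face.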


          
# The model that you want to train from the Hugging Face hub
model_name = "daryl149/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 2

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 10

# Log every X updates steps
logging_steps = 1

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
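To see what a rank of 64 and a scaling parameter of 16 imply, the following sketch estimates the number of trainable LoRA parameters. It assumes the adapters are attached to the query and value projections of every layer (which I believe is the peft default for Llama-style models, but treat it as an assumption) and uses the Llama 2 7B shape of 32 layers with a hidden size of 4096.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


hidden_size = 4096        # Llama 2 7B hidden dimension
num_layers = 32           # Llama 2 7B decoder layers
adapted_per_layer = 2     # assumption: q_proj and v_proj only
r, alpha = 64, 16         # lora_r and lora_alpha from the settings above

per_matrix = lora_params(hidden_size, hidden_size, r)
total = per_matrix * adapted_per_layer * num_layers
scaling = alpha / r       # factor applied to the low-rank update

print(f"per adapted matrix: {per_matrix:,}")          # 524,288
print(f"total trainable:    {total:,}")               # 33,554,432
print(f"share of 7e9:       {total / 7e9:.2%}")
print(f"LoRA scaling:       {scaling}")               # 0.25
```

Under these assumptions well under 1% of the weights receive gradients, which is what makes training feasible next to a frozen 4-bit base model.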

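Now everything can be loaded and the fine-tuning process can start. Because several wrappers are involved, some patience is needed. The steps are: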

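Load the dataset defined above. Here the dataset has already been preprocessed, but in general this is where you would reformat the prompts, filter out bad data, merge multiple datasets, and so on.
Configure bitsandbytes for 4-bit quantization.
Load the Llama 2 model in 4-bit precision on the GPU, along with the corresponding tokenizer.
Load the QLoRA configuration and the regular training parameters, and pass everything to SFTTrainer.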


          
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(output_dir)
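A rough way to anticipate how long such a run lasts is to count the optimizer updates it performs. The sketch below assumes a single GPU and one plausible rounding convention (a partial final batch counts as a batch, and leftover batches that do not fill an accumulation window are dropped); the exact behaviour of the trainer may differ slightly.

```python
import math


def count_updates(num_samples: int, per_device_batch: int,
                  grad_accum: int, epochs: int) -> dict:
    """Estimate effective batch size and optimizer updates for one run."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return {
        "effective_batch_size": per_device_batch * grad_accum,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


# Values taken from the settings used in this tutorial.
schedule = count_updates(num_samples=1000, per_device_batch=4,
                         grad_accum=2, epochs=1)
print(schedule)
```

With 1,000 samples, a batch size of 4, and 2 accumulation steps, each update sees 8 samples and one epoch amounts to 125 updates.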

Start training!


Depending on the size of the dataset, training can take a long time. Here, it took less than an hour on a T4 GPU. We can view the curves in TensorBoard, as follows:


          
%load_ext tensorboard
%tensorboard --logdir results/runs


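How can we make sure the fine-tuned model behaves as expected? That requires a more thorough evaluation, but we can use the text generation pipeline from Hugging Face transformers to ask a question such as "What is a large language model?". Note that the input is formatted according to Llama 2's prompt template.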


          
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])


          

The model outputs the following response:

A large language model is a type of artificial intelligence that is trained on a large dataset of text to generate human-like language. It is typically trained on a dataset of text that is much larger than the dataset used for smaller language models. The large dataset allows the model to learn more complex patterns in language, which can result in more accurate and natural-sounding language generation.  
  
Large language models are often used for tasks such as text summarization, language translation, and chatbots. They are also used for more complex tasks such as writing articles, generating creative content, and even composing music.  
  
Large language models are trained using a technique called deep learning, which involves using many layers of artificial neural networks to learn complex patterns in the data. The model is trained on a large dataset of text, and the neural networks are trained to predict the next word in a sequence of text given
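The text generation pipeline returns the prompt followed by the completion, so the answer has to be separated from the `[INST] ... [/INST]` prefix before it can be shown on its own. A minimal sketch of that step, with a hypothetical helper name:

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return the part of the generated text that follows the instruction block."""
    _, found, answer = generated_text.partition(marker)
    # If the marker is missing, fall back to the full text.
    return answer.strip() if found else generated_text.strip()


raw = "<s>[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence"
print(extract_answer(raw))
```

Splitting on the first `[/INST]` is sufficient for a single-turn prompt like this one; a multi-turn conversation would need to split on the last occurrence instead.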

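From experience, this is very good for a model with only 7 billion parameters. You can play with it and ask harder questions from evaluation datasets such as BigBench-Hard (https://github.com/suzgunmirac/BIG-Bench-Hard). Guanaco is an excellent dataset that has produced high-quality models in the past. You can use mlabonne/guanaco-llama2 (https://huggingface.co/datasets/mlabonne/guanaco-llama2) to train a Llama 2 model on the entire dataset.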

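How do we now store the new llama-2-7b-miniguanaco model? We need to merge the LoRA weights with the base model. Unfortunately, as far as I know, there is no straightforward way to do this: the base model has to be reloaded in FP16 precision and then merged using the peft library. Alas, this also causes VRAM problems (even after emptying it), so it is advisable to restart the notebook, re-run the first three cells, and then run the next one.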


          
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
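Conceptually, merging folds each low-rank update into the frozen weight it was attached to, following the standard LoRA formulation W' = W + (alpha / r) · B · A. The toy sketch below illustrates that arithmetic with tiny hand-written matrices; it is not the peft implementation, and all values are invented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Hypothetical 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in  = 1 x 2
B = [[0.5],
     [1.0]]           # d_out x r = 2 x 1

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like a regular checkpoint.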

The weights are now merged and the tokenizer has been reloaded. Everything can now be pushed to the Hugging Face Hub to save the model.


          
!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


        
            

CommitInfo(commit_url='https://huggingface.co/mlabonne/llama-2-7b-guanaco/commit/0f5ed9581b805b659aec68484820edb5e3e6c3f5', commit_message='Upload tokenizer', commit_description='', oid='0f5ed9581b805b659aec68484820edb5e3e6c3f5', pr_url=None, pr_revision=None, pr_num=None)

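You can now use this model for inference by loading it from the Hub like any other Llama 2 model. You can also reload it for further fine-tuning, for example on another dataset. Alternatively, you can fine-tune directly with the following script: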


          
pip install trl
git clone https://github.com/lvwerra/trl
python trl/examples/scripts/sft_trainer.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --dataset_name timdettmers/openassistant-guanaco \
    --load_in_4bit \
    --use_peft \
    --batch_size 4 \
    --gradient_accumulation_steps 2

Summary

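We hope this article helps those who want to put Llama 2 to work in their own industry as soon as possible. With compute resources and a dataset ready, you can train your own large model. The era of the large-model free-for-all, much like the scramble among Android phone makers, has arrived!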
