Multilabel Classification using Mistral-7B on a single GPU with quantization and LoRA

https://hf-mirror.com/blog/sirluk/multilabel-llm

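LLMs (large language models) have impressed with their ability to solve a wide variety of tasks, not only in natural language processing but also in multimodal settings. Because of their size (even "smaller" LLMs have more than 1 billion parameters) and hardware requirements, fine-tuning them directly is not easy for people without a large compute budget. There are, however, techniques that reduce the number of trainable parameters and the memory footprint of these models, such as LoRA and quantization. This post demonstrates how to use the Huggingface (HF) libraries transformers, bitsandbytes and peft, which provide Python implementations of these methods, to apply Mistral 7b, a state-of-the-art LLM, to a multilabel classification task.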

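This post is by no means the first to cover this kind of topic, and there are other excellent resources. However, we did not find any resource specifically on multilabel classification, so hopefully this post is of some help. The Python script for the code examples below can be found here:

https://github.com/sirluk/llm_finetuning/tree/main/llm_multilabel_clf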

Imports

All imports required for the code snippets below.

import os
import random
import functools
import csv
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import f1_score
from skmultilearn.model_selection import iterative_train_test_split
from datasets import Dataset, DatasetDict
from peft import (
    LoraConfig,
    prepare_model_for_kbit_training,
    get_peft_model
)
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)

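Dataset

We will use this Kaggle dataset: topic modeling for research articles based on titles and abstracts.

Data example

ID: 5

Title: A comparison of discrete wavelet transform and wavelet tensor train decomposition for feature extraction from FTIR data of medicinal plants

Abstract: Fourier-transform infrared (FTIR) spectra of samples from 7 plant species were used to explore the influence of preprocessing and feature extraction on the efficiency of machine learning algorithms.

Computer Science: 1

Physics: 0

Mathematics / Statistics: 0

Quantitative Biology: 0

Quantitative Finance: 0

We create an HF dataset from train.csv, because it is needed later when using other functions and classes from the HF libraries.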


          
# set random seed
random.seed(0)

# load data
with open('train.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile, delimiter=','))
    header_row = data.pop(0)

# shuffle data
random.shuffle(data)

# reshape: column 0 is the id, columns 1-2 are title and abstract, the rest are labels
idx, text, labels = list(zip(*[(int(row[0]), f'Title: {row[1].strip()}\n\nAbstract: {row[2].strip()}', row[3:]) for row in data]))
labels = np.array(labels, dtype=int)

# create label weights
label_weights = 1 - labels.sum(axis=0) / labels.sum()

# stratified train test split for multilabel ds
row_ids = np.arange(len(labels))
train_idx, y_train, val_idx, y_val = iterative_train_test_split(row_ids[:,np.newaxis], labels, test_size = 0.1)
x_train = [text[i] for i in train_idx.flatten()]
x_val = [text[i] for i in val_idx.flatten()]

# create hf dataset
ds = DatasetDict({
    'train': Dataset.from_dict({'text': x_train, 'labels': y_train}),
    'val': Dataset.from_dict({'text': x_val, 'labels': y_val})
})

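In this snippet we use the somewhat exotic package skmultilearn, solely for its iterative_train_test_split function. It creates an even (stratified) split for imbalanced multilabel datasets, which is what we have in this example. For the same reason we also generate label weights, which are used later when computing the loss, because we want to give underrepresented classes a higher weight. Whether a weighted loss is appropriate depends heavily on your use case and on the tradeoff between global accuracy and per-class accuracy.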
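To make the weighting concrete, here is a minimal sketch that applies the same formula to a tiny, made-up label matrix (the values in `toy_labels` are hypothetical and not taken from the Kaggle data). It only illustrates that frequent labels end up with smaller weights.

```python
import numpy as np

# Hypothetical toy label matrix: 4 samples (rows), 3 labels (columns).
toy_labels = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])

# Same formula as in the snippet above: one minus each label's share
# of all positive entries, so frequent labels get smaller weights.
positives_per_label = toy_labels.sum(axis=0)   # [4, 1, 1]
toy_weights = 1 - positives_per_label / toy_labels.sum()

print(toy_weights.round(3))
```

The first label is positive in every row and receives the smallest weight, while the two rare labels share the same, larger weight.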

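Initializing the model

Next, we initialize the model and the tokenizer. As mentioned in the introduction, we will use Mistral 7b, which performs very well on a range of NLP benchmarks. The code below should work with any decoder-only LLM on the HF hub.

For fine-tuning we use LoRA, which learns two low-rank diff matrices instead of fine-tuning the full parameter matrix. Since the pretrained parameters do not need to change during LoRA fine-tuning, we can quantize them with the bitsandbytes library from HF. Besides the model, we of course also need to initialize a tokenizer to preprocess the dataset.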
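The idea of the two low-rank matrices can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than how peft implements it: the sizes are toy values, the names `W`, `A`, `B` and `lora_forward` are chosen here for illustration, and the zero initialization of `B` follows the usual LoRA convention so that training starts from the unchanged pretrained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4      # toy sizes, chosen only for illustration

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x, W, A, B, alpha, r):
    """Frozen projection plus the scaled low-rank update B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(3, d_in))
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, alpha, r)

# With B initialised to zero, the adapted layer reproduces the frozen layer.
print(np.allclose(base_out, lora_out))
```

Only `A` and `B` would receive gradients, which is why `W` can stay frozen and quantized.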


          
# model name
model_name = 'mistralai/Mistral-7B-v0.1'

# preprocess dataset with tokenizer
def tokenize_examples(examples, tokenizer):
    tokenized_inputs = tokenizer(examples['text'])
    tokenized_inputs['labels'] = examples['labels']
    return tokenized_inputs

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenized_ds = ds.map(functools.partial(tokenize_examples, tokenizer=tokenizer), batched=True)
tokenized_ds = tokenized_ds.with_format('torch')

# quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, # enable 4-bit quantization
    bnb_4bit_quant_type = 'nf4', # information theoretically optimal dtype for normally distributed weights
    bnb_4bit_use_double_quant = True, # also quantize the quantization constants //insert xzibit meme
    bnb_4bit_compute_dtype = torch.bfloat16 # optimized fp format for ML
)

# lora config
lora_config = LoraConfig(
    r = 16, # the dimension of the low-rank matrices
    lora_alpha = 8, # scaling factor for LoRA activations vs pre-trained weight activations
    target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout = 0.05, # dropout probability of the LoRA layers
    bias = 'none', # whether to train bias weights, set to 'none' for attention layers
    task_type = 'SEQ_CLS'
)

# load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=labels.shape[1]
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.config.pad_token_id = tokenizer.pad_token_id

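As you can see from target_modules in the LoraConfig, we only fine-tune the attention weights. This works well and is more parameter efficient, because the largest share of the parameters in a transformer layer comes from the feed-forward network, which we freeze and quantize. r is the rank of the LoRA matrices; in our case they are 4096x16 and 16x4096, far smaller than a full 4096x4096 weight matrix in Mistral's attention layers (q_proj and o_proj have this shape; k_proj and v_proj are smaller still, because Mistral uses grouped-query attention).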
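The saving can be checked with a quick back-of-the-envelope calculation. The helper `lora_param_count` below is a hypothetical name introduced for illustration; it simply counts the entries of the two low-rank matrices for one 4096x4096 projection with r = 16.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

d_model, r = 4096, 16
full = d_model * d_model                      # full weight matrix
lora = lora_param_count(d_model, d_model, r)  # the two low-rank matrices

print(full, lora, f"{lora / full:.2%}")
# prints: 16777216 131072 0.78%
```

For a matrix of this shape, the LoRA pair therefore holds less than one percent of the parameters of the full weight matrix.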

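The HF class AutoModelForSequenceClassification initializes an additional (untrained) linear classification layer on top of the token embeddings of the last layer. This layer is automatically excluded from quantization, and we fine-tune it together with the LoRA weights.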
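Conceptually, such a head takes one hidden vector per sequence and maps it to one logit per label. The NumPy sketch below illustrates this for the common decoder-only convention of using the last non-padding token. It is a simplified illustration, not the transformers implementation: the function name `classify_last_token`, the use of the attention mask to locate the last token, and the right-padding assumption are all choices made here.

```python
import numpy as np

def classify_last_token(hidden, attention_mask, head_weight):
    """hidden: (batch, seq, dim); attention_mask: (batch, seq), 1 = real token.
    Assumes right padding, so the last real token sits at index sum(mask) - 1."""
    last_idx = attention_mask.sum(axis=1) - 1
    pooled = hidden[np.arange(hidden.shape[0]), last_idx]   # (batch, dim)
    return pooled @ head_weight.T                           # (batch, num_labels)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))            # toy hidden states
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])  # second sequence has 2 pad tokens
head = rng.normal(size=(6, 8))                 # 6 labels is an arbitrary choice here

logits = classify_last_token(hidden, mask, head)
print(logits.shape)
# prints: (2, 6)
```

This also shows why setting pad_token_id on the model config matters: the model has to know which positions are padding in order to pick the right token.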

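Training

With the dataset prepared and the model configured, we are almost ready to fine-tune the model with the HF Trainer class. Before doing so, we have to define a few custom functions that the trainer will use.

Data Collator

We need to tell the trainer how to preprocess batches of the dataset before passing them to the model.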
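Because the tokenized examples have different lengths, the main job of this preprocessing step is padding. As a plain-Python sketch of the idea (the helper `pad_batch` and the pad id 2 are placeholders for illustration), each sequence is right-padded to the longest one in the batch and the attention mask marks the padded positions with 0:

```python
def pad_batch(sequences, pad_id):
    """Right-pad token id lists to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=2)
print(ids)   # [[5, 6, 7], [8, 2, 2]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Padding per batch rather than to a global maximum length keeps short batches small, which saves memory and compute.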

Metrics

We also need to pass the trainer a function that defines which evaluation metrics to compute in addition to the loss.


          
# define custom batch preprocessor
def collate_fn(batch, tokenizer):
    dict_keys = ['input_ids', 'attention_mask', 'labels']
    d = {k: [dic[k] for dic in batch] for k in dict_keys}
    d['input_ids'] = torch.nn.utils.rnn.pad_sequence(
        d['input_ids'], batch_first=True, padding_value=tokenizer.pad_token_id
    )
    d['attention_mask'] = torch.nn.utils.rnn.pad_sequence(
        d['attention_mask'], batch_first=True, padding_value=0
    )
    d['labels'] = torch.stack(d['labels'])
    return d

# define which metrics to compute for evaluation
def compute_metrics(p):
    predictions, labels = p
    f1_micro = f1_score(labels, predictions > 0, average = 'micro')
    f1_macro = f1_score(labels, predictions > 0, average = 'macro')
    f1_weighted = f1_score(labels, predictions > 0, average = 'weighted')
    return {
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted
    }
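The comparison `predictions > 0` works because the trainer hands raw logits to compute_metrics, and a logit above 0 corresponds to a sigmoid probability above 0.5. The following sketch, on made-up logits and labels, checks that equivalence and computes micro and macro F1 by hand to show how the two averages differ. It is an illustration only; the snippet above relies on sklearn's f1_score.

```python
import numpy as np

# Hypothetical logits and labels: 3 samples, 3 labels.
logits = np.array([[2.0, -1.0,  0.5],
                   [0.3,  1.2, -2.0],
                   [0.1, -0.4,  3.0]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

preds = logits > 0                    # same decision as sigmoid(logits) > 0.5
probs = 1 / (1 + np.exp(-logits))

# Per-label true positives, false positives and false negatives.
tp = (preds & (labels == 1)).sum(axis=0)
fp = (preds & (labels == 0)).sum(axis=0)
fn = (~preds & (labels == 1)).sum(axis=0)

f1_micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())  # pool counts, then F1
f1_macro = np.mean(2 * tp / (2 * tp + fp + fn))                 # F1 per label, then mean

print(round(float(f1_micro), 3), round(float(f1_macro), 3))
# prints: 0.8 0.822
```

Micro F1 is dominated by frequent labels, while macro F1 gives every label the same influence, which is why reporting both is useful for imbalanced data.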

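In addition, we need to define a custom trainer class so that we can compute the multilabel loss, which treats each output neuron as a binary classification instance. To use our label weights in the loss computation, we also store them as an attribute in the __init__ method so that compute_loss can access them.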


            
# create custom trainer class to be able to pass label weights and calculate multilabel loss
class CustomTrainer(Trainer):

    def __init__(self, label_weights, **kwargs):
        super().__init__(**kwargs)
        self.label_weights = label_weights

    # num_items_in_batch is passed by newer transformers releases; it is unused here
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop("labels")

        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")

        # compute custom loss: one weighted binary cross-entropy term per label
        loss = F.binary_cross_entropy_with_logits(logits, labels.to(torch.float32), pos_weight=self.label_weights)
        return (loss, outputs) if return_outputs else loss
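To see what the weighted loss computes, here is a plain-Python sketch of binary cross-entropy with logits and a per-label positive weight. The function name `weighted_bce_with_logits` and the toy numbers are made up for illustration; the formula follows the documented behaviour of pos_weight, which scales only the term for positive targets.

```python
import math

def weighted_bce_with_logits(logits, targets, pos_weight):
    """Mean over all samples and labels of
    -[w_j * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y, w in zip(row_logits, row_targets, pos_weight):
            p = 1.0 / (1.0 + math.exp(-x))
            total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# One sample, two labels: the first is a positive target, the second a negative one.
loss = weighted_bce_with_logits(
    logits=[[0.0, 2.0]], targets=[[1, 0]], pos_weight=[0.5, 0.9]
)
print(round(loss, 4))
# prints: 1.2368
```

Note that the weight of the second label has no effect here, because its target is 0 and pos_weight only applies to positive targets.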

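Now everything is ready and we can let HF work its magic. (Depending on your GPU memory you may need or want to adjust the batch size; this was tested on a GPU with 16GB of RAM.)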


            
# define training args
training_args = TrainingArguments(
    output_dir = 'multilabel_classification',
    learning_rate = 1e-4,
    per_device_train_batch_size = 8,
    per_device_eval_batch_size = 8,
    num_train_epochs = 10,
    weight_decay = 0.01,
    evaluation_strategy = 'epoch', # named eval_strategy in newer transformers releases
    save_strategy = 'epoch',
    load_best_model_at_end = True
)

# train
trainer = CustomTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_ds['train'],
    eval_dataset = tokenized_ds['val'],
    tokenizer = tokenizer,
    data_collator = functools.partial(collate_fn, tokenizer=tokenizer),
    compute_metrics = compute_metrics,
    # cast to float32 so the weights match the dtype of the float32 labels in the loss
    label_weights = torch.tensor(label_weights, dtype=torch.float32, device=model.device)
)

trainer.train()

# save model
peft_model_id = 'multilabel_mistral'
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

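That's it! We have just fine-tuned a state-of-the-art LLM for multilabel classification. The saved model can be loaded with the following snippet from the HF documentation.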


            
# load model
peft_model_id = 'multilabel_mistral'
model = AutoModelForSequenceClassification.from_pretrained(peft_model_id)
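Once a model is loaded, its logits still have to be turned into label names. The sketch below shows one way to do that in plain Python, mirroring the sigmoid and 0.5 threshold used during evaluation. The helper `decode_predictions`, the example logits and the shortened label list are hypothetical and not part of the original script.

```python
import math

def decode_predictions(logits, label_names, threshold=0.5):
    """Return the label names whose sigmoid probability exceeds the threshold."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [name for name, p in zip(label_names, probs) if p > threshold]

# Placeholder subset of label names, for illustration only.
label_names = ["Computer Science", "Physics", "Quantitative Biology"]

predicted = decode_predictions([1.7, -0.4, 0.2], label_names)
print(predicted)
# prints: ['Computer Science', 'Quantitative Biology']
```

Because each label is decided independently, a text can receive several labels or none, which is the defining difference from multiclass classification.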

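Hopefully this post has helped you see how to use HF's implementations of compute- and memory-efficient techniques such as LoRA and quantization for fine-tuning tasks. There is plenty of documentation for the most straightforward use cases, but as soon as your requirements deviate a little, some adjustments are needed, such as defining a custom Trainer class.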