LoRA is like a style patch for an LLM: lightweight and hot-swappable.
Viewed from the forward pass, LoRA does slightly more computation than FineTune (the extra low-rank bypass); viewed from the backward pass, it does far less than FT.
An important paradigm in natural language processing consists of large-scale pre-training on general-domain data followed by adaptation to particular tasks or domains. As we pre-train larger and larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.
Take GPT-3 as an example: its 175 billion parameters require 700GB to store in FP32. Full fine-tuning needs roughly 3x that much memory, i.e. 2100GB, because Adam/AdamW keeps a first- and a second-moment statistic for every parameter. Training with that much VRAM on GPUs is very challenging.
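A quick back-of-the-envelope check of these numbers (a sketch; it assumes FP32 storage, 1 GB = 10^9 bytes, and counts only the weights plus Adam's two moment buffers, ignoring gradients and activations):

```python
# Rough memory estimate for full fine-tuning of GPT-3 175B with Adam/AdamW.
params = 175e9          # number of parameters
bytes_per_fp32 = 4

weights_gb = params * bytes_per_fp32 / 1e9          # the model weights
adam_moments_gb = 2 * weights_gb                    # first + second moment
total_gb = weights_gb + adam_moments_gb             # 3x the weights

print(f"weights: {weights_gb:.0f} GB")              # 700 GB
print(f"weights + Adam states: {total_gb:.0f} GB")  # 2100 GB
```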
Learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low "intrinsic rank", which leads to our proposed Low-Rank Adaptation (LoRA) approach.
Figure 1: the LoRA reparametrization — only A and B are trained.
We propose Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank-decomposition matrices into every layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
LoRA lets us train some of the dense layers in a neural network indirectly, by optimizing rank-decomposition matrices of those layers' change during adaptation while keeping the pre-trained weights frozen, as shown in Figure 1.
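A minimal PyTorch sketch of this idea (illustrative only: the class name `LoRALinear`, the Gaussian/zero initialization, and the `alpha / r` scaling follow the paper's description, but this is not the official implementation):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen dense layer plus a trainable low-rank bypass:
    h = W0 x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A starts Gaussian, B starts at zero, so BA = 0 at initialization
        # and training begins exactly at the pre-trained model.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only `A` and `B` receive gradients; for GPT-3 the paper injects this bypass into the attention's query and value projections.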
Compared with GPT-3 175B fine-tuned with Adam, LoRA reduces the number of trainable parameters by 10,000× and the GPU memory requirement by 3×.
- Compared with full fine-tuning, LoRA has far fewer trainable parameters and higher training throughput, yet performs on par with or better than fine-tuning in model quality.
- Compared with adapters, LoRA introduces no additional inference latency.
LoRA is both storage- and compute-efficient, and it has several key advantages:
- A pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can freeze the shared model and efficiently switch tasks by replacing the matrices A and B in Figure 1, reducing the storage requirement and task-switching overhead significantly (see the task-switching sketch after this list).
- LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, we only optimize the injected, much smaller low-rank matrices.
- Our simple linear design allows us to merge the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model, by construction.
- LoRA is orthogonal to many prior methods and can be combined with many of them, such as prefix-tuning.
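What task switching can look like in practice, sketched around the hypothetical `LoRALinear` above (the helper names and file paths are illustrative):

```python
import torch
import torch.nn as nn


def lora_state_dict(model: nn.Module) -> dict:
    """Collect only the LoRA parameters (the A and B matrices)."""
    return {k: v for k, v in model.state_dict().items()
            if k.endswith(".A") or k.endswith(".B")}


def switch_task(model: nn.Module, lora_weights: dict) -> None:
    """Swap in another task's A/B matrices; the frozen weights are untouched."""
    model.load_state_dict(lora_weights, strict=False)


# e.g. torch.save(lora_state_dict(model), "task_a.lora.pt")  # a few MB, not GB
#      switch_task(model, torch.load("task_b.lora.pt"))
```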
AREN’T EXISTING SOLUTIONS GOOD ENOUGH?
Using language modeling as an example, there are two prominent strategies for efficient adaptation: adding adapter layers, or optimizing some form of the input-layer activations.
Adapter Layers Introduce Inference Latency
Obviously, adding layers to the model increases inference latency:
While one can reduce the overall latency by pruning layers or exploiting multi-task settings, there are no direct ways to bypass the extra compute in adapter layers.
As the figure above shows, for online serving with batch size 1 and short sequence lengths, the relative change in inference latency is much more pronounced. Personally, though, I think the difference in absolute latency is small.
Directly Optimizing the Prompt is Hard
Compared with prefix tuning, which is difficult to optimize, LoRA is much easier to train:
We observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically in trainable parameters, confirming similar observations in the original paper.
OUR METHOD
LOW-RANK-PARAMETRIZED UPDATE MATRICES
Figure 1
This idea is somewhat like a residual connection, and the bypass's update is used to simulate the process of full fine-tuning. Moreover, full fine-tuning can be viewed as a special case of LoRA (when r = k):
This means that when applying LoRA to all weight matrices and training all biases, we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank r to the rank of the pre-trained weight matrices.
In other words, as we increase the number of trainable parameters, training LoRA roughly converges to training the original model, while adapter-based methods converge to an MLP and prefix-based methods to a model that cannot take long input sequences.
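For reference, the forward pass the paper modifies, for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, is

$$
h = W_0 x + \Delta W\, x = W_0 x + BA\, x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k),
$$

where $W_0$ is frozen and $\Delta W = BA$ is scaled by $\alpha / r$. Since $\operatorname{rank}(\Delta W) \le r$, setting $r = k$ removes the rank restriction on the update, which is why full fine-tuning appears as a special case.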
No Additional Inference Latency
When deployed, we can merge the trainable matrices into the frozen weights; by construction, this guarantees that we introduce no additional latency during inference compared with a fine-tuned model.
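A sketch of that merge, building on the hypothetical `LoRALinear` above:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def merge_lora(layer: LoRALinear) -> nn.Linear:
    """Fold the low-rank update into the frozen weight: W = W0 + (alpha/r) * B A."""
    layer.base.weight += layer.scale * (layer.B @ layer.A)
    return layer.base   # a plain nn.Linear again: zero extra inference cost
```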
APPLYING LORA TO TRANSFORMER
Practical Benefits and Limitations
On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB. With r = 4 and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000× (from 350GB to 35MB). This allows us to train with significantly fewer GPUs and avoid I/O bottlenecks.
Another benefit is that we can switch between tasks at much lower cost when deployed, swapping only the LoRA weights rather than all the parameters. This allows many customized models to be created and swapped in and out on the fly on machines that keep the pre-trained weights in VRAM.
We also observe a 25% speedup during training on GPT-3 175B compared with full fine-tuning, since we do not need to compute gradients for the vast majority of the parameters.
LoRA also has limitations. For example, if A and B are absorbed into W to eliminate the extra inference latency, it is no longer straightforward to batch inputs from different tasks (with different A and B) in a single forward pass. When latency is not critical, one can leave the weights unmerged and dynamically select which LoRA modules to apply to the samples in a batch.
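Keeping the weights unmerged makes such per-sample routing straightforward. A sketch (the batched task-id lookup and the stacked A/B layout are my own illustration, not the paper's):

```python
import torch


def batched_lora_forward(base, x, A_stack, B_stack, task_ids, scale):
    """Apply a different (A, B) pair to each sample in the batch.

    x: (batch, d_in); A_stack: (n_tasks, r, d_in); B_stack: (n_tasks, d_out, r);
    task_ids: (batch,) long tensor choosing a LoRA module per sample.
    """
    A = A_stack[task_ids]                       # (batch, r, d_in)
    B = B_stack[task_ids]                       # (batch, d_out, r)
    delta = torch.einsum("bor,bri,bi->bo", B, A, x) * scale
    return base(x) + delta
```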
EMPIRICAL EXPERIMENTS
Fine-Tuning (FT)
Fine-tuning is a common adaptation approach. During fine-tuning, the model is initialized with the pre-trained weights and biases, and all model parameters receive gradient updates. A simple variant updates only some of the layers while freezing the others.
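A minimal sketch of that "some layers only" variant, on a toy stand-in model:

```python
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(6)])  # toy stand-in

for p in model.parameters():          # freeze everything...
    p.requires_grad = False
for p in model[-2:].parameters():     # ...then unfreeze only the last two layers
    p.requires_grad = True
```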
Bias-only or BitFit
Bias-only, or BitFit, is a baseline in which we train only the bias vectors and freeze everything else.
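A minimal BitFit sketch, again on a toy stand-in:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))  # toy stand-in

# BitFit: gradients flow only to bias vectors; all other parameters stay frozen.
for name, p in model.named_parameters():
    p.requires_grad = name.endswith("bias")
```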