Original paper: A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
Abstract
Compared with width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating an entire Transformer layer as the minimal pruning unit may degrade model performance, because all of the layer's information is discarded indiscriminately. By analyzing the correlations between layer outputs in a reproducing kernel Hilbert space, this paper reveals a "patch-like" feature relationship among the layers of large language models. Building on this, the authors propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a predefined similarity threshold, simplifying the model structure while preserving its performance. Extensive experiments on LLMs with different architectures and parameter scales show that the method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, with 35% pruning on the Vicuna-7B model, the method improves average performance on zero-shot tasks over existing methods by 1.654%. Furthermore, the paper shows the potential of combining depth-wise with width-wise pruning to improve pruning effectiveness.
Contributions:
• Analyzed the inter-layer correlations in LLMs within a reproducing kernel Hilbert space, observing an interesting patch-like correlation distribution, which provides valuable insights for the design of model compression strategies.
• Proposed the Sliding Layer Merging method, which dynamically merges layers with strong representational similarity in LLMs. This method can be seamlessly applied to various existing LLM architectures.
Related Work
Pruning Methods for LLMs
Width-wise approach
The width-wise approach reduces the network width by pruning coupled structures, such as attention heads and their associated weight connections, while preserving the number of layers.
[Voita et al., 2019] and [Michel et al., 2019] introduced pruning and attention-head-sharing techniques to reduce redundant attention heads, thereby decreasing both computational complexity and parameter requirements.
[Nova et al., 2023] and [Santacroce et al., 2023] optimized the feed-forward network by reducing the dimension of the FFN hidden layer, thereby reducing the memory footprint and computational complexity.
Depth-wise approach
The depth-wise approach reduces the network depth by completely removing certain layers.
Drawbacks: 1) it remains underexplored in terms of analyzing the correlations between Transformer layers at different depths; 2) arbitrarily removing specific layers may degrade the performance of the pruned model.
Shortened-LLM [Kim et al., 2024] selected Taylor+ and PPL indicators as importance measures for Transformer layers and directly deleted the unimportant ones.
Layer-skipping strategies [Schuster et al., 2022; Del Corro et al., 2023; Raposo et al., 2024] dynamically select which layers to skip during execution.
[Song et al., 2024; Tang et al., 2024] reduce the model's depth by eliminating redundant layers.
Motivation
(First time I've seen a paper include a dedicated section for this.)
CKA vector similarity
Centered Kernel Alignment (CKA) is a metric used to compare the internal representations of neural networks.
Its main advantages are its invariance to orthogonal transformations (e.g., changes in neuron ordering) and its robustness to isotropic scaling, achieved through a normalization term [Raghu et al., 2021].
These properties make it suitable for studying the underlying relationships between different Transformer layers within large language models.
Step 1: Computation
Calculate the Gram matrices (kernel matrices) of the two representation matrices to measure the similarity of the representations.
(Feels similar to a divergence measure? I seem to remember KL divergence also being used to measure similarity between layers — okay, I misremembered.)
Core concepts
CKA similarity: measures the similarity between two feature representations (typically high-dimensional vectors or matrices) and is mainly used to analyze representation learning in neural networks. CKA computes the similarity between centered kernels (e.g., a linear kernel or an RBF kernel), normalized to [0, 1], where 1 means the representations are identical and 0 means they are completely uncorrelated. It is symmetric: CKA(X, Y) = CKA(Y, X).
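As a concrete reference, here is a minimal linear-CKA sketch in Python/NumPy. This is my own illustration, not code from the paper; the matrix shapes and the choice of a linear kernel are assumptions:

```python
import numpy as np

def center_gram(K: np.ndarray) -> np.ndarray:
    """Double-center a Gram (kernel) matrix: H K H with H = I - 11^T/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """CKA between representations X (n x d1) and Y (n x d2), linear kernel.
    Returns a value in [0, 1]; 1 means identical up to orthogonal
    transformation and isotropic scaling."""
    K = center_gram(X @ X.T)  # centered Gram matrix of X
    L = center_gram(Y @ Y.T)  # centered Gram matrix of Y
    hsic = np.sum(K * L)                           # unnormalized HSIC estimate
    norm = np.sqrt(np.sum(K * K) * np.sum(L * L))  # normalization term
    return float(hsic / norm)

# Example: compare hidden states of two hypothetical Transformer layers
rng = np.random.default_rng(0)
h3 = rng.normal(size=(128, 768))             # layer-3 outputs for 128 tokens
h4 = h3 + 0.1 * rng.normal(size=(128, 768))  # layer-4: a small perturbation
print(linear_cka(h3, h4))  # close to 1.0 -> strongly similar representations
print(linear_cka(h3, h3))  # exactly 1.0; also CKA(X, Y) == CKA(Y, X)
```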
Application scenarios
CKA:
Commonly used in deep learning to compare features across layers, i.e., to analyze whether different layers of a neural network learn similar representations.
Can be used to evaluate whether different models have learned similar feature representations.
Suitable for non-probabilistic data, such as the hidden-layer representations of a neural network.
KL divergence:
Mainly used to compare probability distributions; common in information theory, Bayesian inference, and machine learning (e.g., variational autoencoders, VAEs).
In optimization, e.g., maximum likelihood estimation (MLE) or variational inference, KL divergence measures how far the model distribution deviates from the target distribution.
KL divergence is also used in GANs (Generative Adversarial Networks) and other probabilistic models.
Normalization
CKA: results are normalized to [0, 1] and easy to interpret.
KL divergence: theoretically unbounded above; the minimum is 0 (when P = Q), but the maximum depends on the specific distributions.
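To make the contrast concrete, a small illustrative sketch (my own example, not from the paper) showing that KL divergence is zero only when the distributions match, grows without bound, and is asymmetric, unlike the bounded, symmetric CKA:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(P || Q) for discrete distributions; 0 iff P == Q, unbounded above.
    Assumes q has no zero entries where p is nonzero."""
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))                              # 0.0: identical
print(kl_divergence(p, np.array([0.4, 0.4, 0.2])))      # small positive value
print(kl_divergence(p, np.array([0.98, 0.01, 0.01])))   # large: no upper bound
# Also asymmetric: KL(P||Q) != KL(Q||P), whereas CKA(X, Y) == CKA(Y, X).
```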
Step 1: Model initialization.
During initialization, we assign the original model M to the target compressed model M∗ and set the compression layer range to [L, H], where H is the highest layer and L is the lowest layer.
Setting the compression layer range lets us adopt appropriate protection mechanisms to ensure that the compressed model does not lose the functionality of key layers.
(In short: rank layers by the CKA results and protect the weakly correlated ones.)
Step 2: Layer merging.
We take the highest layer H as the initial upper bound of the sliding window and set the layer below it, H − 1, as the initial lower bound. By merging multiple layers within the sliding window, we obtain a temporary model M_tmp. We adopt a layer merging strategy based on inter-layer differences (as shown in Fig. 2(b), and sketched below): the differences between the parameters of the adjacent layers and the base layer are added onto the base layer, gradually integrating redundant information. This strategy not only captures the correlation between layers, but is also flexible enough to adapt to the compression needs of different models.
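A minimal sketch of the difference-based merge for one window, under my reading of Fig. 2(b). Representing each layer as a state dict of tensors, the `base_idx` parameter, and the exact merge formula are all assumptions for illustration:

```python
import copy
from typing import Dict, List
import torch

def merge_window(layers: List[Dict[str, torch.Tensor]],
                 base_idx: int = 0) -> Dict[str, torch.Tensor]:
    """Merge the layers inside a sliding window into a single layer.
    Difference-based strategy (one reading of Fig. 2(b), an assumption):
        W_merged = W_base + sum_{i != base} (W_i - W_base)
    i.e., start from the base layer and accumulate each other layer's
    parameter difference relative to the base."""
    base = layers[base_idx]
    merged = copy.deepcopy(base)
    for i, layer in enumerate(layers):
        if i == base_idx:
            continue
        for name, w in layer.items():
            merged[name] = merged[name] + (w - base[name])
    return merged
```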
Step 3: Iterative update.
To measure the impact of the merging operation on the model's output representation, we calculate the cosine similarity between the last hidden states of the original model M and the merged model M_tmp on the few-shot calibration dataset (see Fig. 2(c)).
If the representation similarity between M_tmp and the original model is greater than the threshold T, the merged layers are considered to have little impact on model performance. In this case, the lower bound of the sliding window is moved down one layer to expand the merging range.
If the representation similarity between M_tmp and the original model is less than the threshold T, the merged layers are considered to affect model performance too much, and the sliding window stops expanding. At this point, the compressed model M∗ is updated to M_tmp, the upper bound of the sliding window is set to the current lower bound, and the next round of layer merging begins (see Fig. 2(a)).
Step 4: Termination condition.
The process continues until the lowest layer L has been processed. Ultimately, the pruned model M∗ output by the algorithm reduces redundant computation and storage by retaining the merged representations of key layers.
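Putting Steps 1–4 together, a schematic sketch of the whole loop. Here `merge_layers` and `last_hidden_states` are assumed helpers (not from the paper), and committing the last window that passed the threshold is my reading of the "M∗ is updated to M_tmp" step:

```python
import torch
import torch.nn.functional as F

def rep_similarity(model, merged_model, calib_inputs) -> float:
    """Cosine similarity of last hidden states on the calibration set.
    `last_hidden_states` is an assumed helper returning a tensor."""
    h_ref = last_hidden_states(model, calib_inputs)
    h_new = last_hidden_states(merged_model, calib_inputs)
    return F.cosine_similarity(h_ref.flatten(), h_new.flatten(), dim=0).item()

def sliding_layer_merging(model, calib_inputs, T: float, lo: int, hi: int):
    """Step 1: start from the original model; compress layers in [lo, hi]."""
    compressed = model
    upper = hi
    while upper > lo:                       # Step 4: stop at the lowest layer lo
        lower, accepted = upper - 1, None
        while lower >= lo:                  # Step 2: grow the window downward
            tmp = merge_layers(compressed, lower, upper)   # assumed helper
            if rep_similarity(model, tmp, calib_inputs) >= T:
                accepted, lower = tmp, lower - 1           # Step 3: expand
            else:
                break                       # window stops expanding
        if accepted is not None:
            compressed = accepted           # commit last window that passed T
        upper = lower                       # next window starts below this one
        # NB: after a merge, layer indices shift; a real implementation must
        # re-map them (glossed over in this sketch).
    return compressed
```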
Results
Conclusion
By analyzing the correlations between layer outputs in a reproducing kernel Hilbert space, we reveal a "patch-like" relationship pattern among the layers of large language models. Based on this, we propose a depth-wise pruning method that dynamically selects and merges layer parameters. The method sets a similarity threshold and merges consecutive layers from top to bottom, achieving fast model compression while effectively preserving performance. Experimental results show that our method significantly accelerates inference in resource-constrained environments and outperforms existing pruning techniques on zero-shot tasks. Moreover, our method can be seamlessly integrated with width-wise pruning techniques, yielding pruned models with enhanced performance. We hope this study inspires further research on depth-wise pruning and promotes the development of unified frameworks combining depth-wise and width-wise pruning strategies, ultimately facilitating the efficient deployment of LLMs in resource-constrained environments.