An Efficient Depth-Wise Pruning Method for LLMs Based on Sliding Layer Merging


Original paper: A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs

Abstract

Compared with width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating an entire Transformer layer as the minimal pruning unit can degrade model performance, because all of the layer's information is discarded indiscriminately. By analyzing the correlations between layer outputs in a reproducing kernel Hilbert space, this paper reveals a "patch-like" correlation pattern among the layers of large language models. Building on this, the authors propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a predefined similarity threshold, simplifying the model structure while preserving its performance. Extensive experiments on LLMs of different architectures and parameter scales show that the method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, when pruning 35% of the Vicuna-7B model, the method improves average zero-shot task performance over existing methods by 1.654%. The paper further reveals the potential of combining depth-wise and width-wise pruning to improve pruning results.

Contributions:

• Analyzed the inter-layer correlations in LLMs within a reproducing kernel Hilbert space, observing an interesting patch-like correlation distribution that provides valuable insights for the design of model compression strategies.

• Proposed the Sliding Layer Merging method, which dynamically merges layers with strong representational similarity in LLMs. The method can be seamlessly applied to various existing LLM architectures.

Related Works

Pruning Methods for LLMs

Width-wise approach

The width-wise approach reduces the network width by pruning coupled structures, such as attention heads and their associated weight connections, while preserving the number of layers.

[Voita et al., 2019] and [Michel et al., 2019] introduced pruning and attention-head-sharing techniques to reduce redundant attention heads, thereby decreasing both computational complexity and parameter requirements.

[Nova et al., 2023] and [Santacroce et al., 2023] optimized the feed-forward network by reducing the dimension of the FFN hidden layer, thereby reducing the memory footprint and computational complexity.

Depth-wise approach

The depth-wise approach reduces the network depth by completely removing certain layers.

Drawbacks: 1) this line of work remains underexplored in analyzing the correlations between Transformer layers at different depths; 2) arbitrarily removing specific layers may degrade the performance of the pruned model.

Shortened-LLM [Kim et al., 2024] selected Taylor+ and PPL indicators as importance measures for Transformer layers and directly deleted the unimportant layers.

The layer-skipping strategy [Schuster et al., 2022; Del Corro et al., 2023; Raposo et al., 2024] dynamically selects which layers to skip during execution.

[Song et al., 2024; Tang et al., 2024] reduce the model's depth by eliminating redundant layers.

Motivation

(An aside: this is the first time I've seen a paper with a dedicated section like this.)

CKA vector similarity

Centered Kernel Alignment (CKA) is a metric used to compare the internal representations of neural networks.

Its main advantages are its invariance to orthogonal transformations (e.g., changes in neuron arrangement) and its robustness to isotropic scaling, achieved through a normalization term [Raghu et al., 2021].

This makes it suitable for studying the underlying relationships between different Transformer layers within large language models.

Step 1: Computation

Compute the Gram matrices (kernel matrices) of the two representation matrices to measure representational similarity.

With Gram matrices K = XXᵀ and L = YYᵀ, CKA is the normalized HSIC: CKA(K, L) = HSIC(K, L) / √(HSIC(K, K) · HSIC(L, L)).

(This feels similar to a divergence measure? I seem to remember KL divergence also being used to measure similarity between layers... okay, I misremembered.)

Core concept

CKA similarity: measures the similarity between two feature representations (typically high-dimensional vectors or matrices), and is mainly used for analyzing representation learning in neural networks. CKA computes the similarity between centered kernels (e.g., a linear kernel or an RBF kernel), normalized to [0, 1], where 1 means identical and 0 means completely uncorrelated. It is symmetric: CKA(X, Y) = CKA(Y, X).
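As a concrete reference, here is a minimal sketch of linear CKA in PyTorch; the function name and the choice of a linear kernel are ours, not from the paper's code:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between representations X (n, d1) and Y (n, d2); rows are samples."""
    X = X - X.mean(dim=0, keepdim=True)  # center each feature column
    Y = Y - Y.mean(dim=0, keepdim=True)
    # With linear kernels, CKA reduces to a ratio of Frobenius norms:
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.linalg.norm(X.T @ Y, ord="fro") ** 2
    self_x = torch.linalg.norm(X.T @ X, ord="fro")
    self_y = torch.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (self_x * self_y)
```

Because the score is symmetric and bounded in [0, 1], evaluating it between the outputs of every pair of layers directly yields the kind of inter-layer similarity map in which the patch-like pattern appears.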


Application scenarios

CKA:

Commonly used in deep learning to compare features across layers, i.e., to analyze whether different layers of a neural network produce similar representations.

Can be used to evaluate whether different models have learned similar feature representations.

Applies to non-probabilistic data, such as neural-network hidden-layer representations.

KL divergence:

Mainly used for comparing probability distributions; common in information theory, Bayesian inference, and machine learning (e.g., variational autoencoders, VAEs).

In optimization, such as maximum likelihood estimation (MLE) or variational inference, KL divergence measures the deviation of the model distribution from the target distribution.

KL divergence is also used in GANs (Generative Adversarial Networks) and other probabilistic models.

Normalization

CKA: results are normalized to [0, 1] and easy to interpret.

KL divergence: has no upper bound in theory; its minimum is 0 (when P = Q), but its maximum depends on the specific distributions.
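A toy contrast of the two ranges (an assumed example, not from the paper): KL(P‖Q) grows without bound as Q starves an event of probability mass, while CKA can never leave [0, 1]:

```python
import torch
import torch.nn.functional as F

p = torch.tensor([0.5, 0.5])
for eps in (1e-1, 1e-4, 1e-8):
    q = torch.tensor([1.0 - eps, eps])
    # KL(p || q); F.kl_div takes the log-probabilities of q first, then p.
    kl = F.kl_div(q.log(), p, reduction="sum")
    print(f"eps={eps:.0e}  KL={kl.item():.2f}")  # grows as eps -> 0
```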


Step 1: Model initialization.

During initialization, we assign the original model M to the target compressed model M∗ and set the compression layer range to [L, H], where H is the highest layer and L is the lowest layer.

Setting the compression layer range lets us adopt appropriate protection mechanisms, ensuring that the compressed model does not lose the functionality of key layers.

In short: rank the layers by the CKA results and protect the weakly correlated ones.
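A minimal sketch of this step, assuming a Hugging Face-style model whose decoder layers live in model.model.layers (the attribute path and function name are our assumptions):

```python
import copy

def init_compression(model, lowest: int, highest: int):
    """Step 1: start the compressed model M* as a copy of M and fix the
    compressible layer range [L, H]; layers outside it stay protected."""
    target = copy.deepcopy(model)              # M* <- M
    num_layers = len(target.model.layers)      # HF-style decoder stack (assumed)
    assert 0 <= lowest < highest < num_layers, "range [L, H] must fit the stack"
    return target
```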

Step 2: Layer merging.


We take the highest layer H as the initial upper bound of the sliding window and set its next layer, H−1, as the initial lower bound. By merging multiple layers within the sliding window, we obtain a temporary model M_tmp. We adopt a layer merging strategy based on inter-layer differences (as shown in Fig. 2(b)): the differences between the parameters of adjacent layers and the base layer are added up, gradually integrating redundant information. This strategy not only captures the correlation between layers but is also flexible enough to adapt to the compression needs of different models.
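Below is a sketch of one plausible reading of this merge rule, in which the window's base layer absorbs every other layer's parameter difference from it; the choice of base layer and the plain (unweighted) accumulation are our assumptions:

```python
import copy
import torch

@torch.no_grad()
def merge_window(layers, low: int, high: int):
    """Fuse layers[low..high] into one layer: theta_merged = theta_base
    + sum_k (theta_k - theta_base), taking the top layer of the window as base."""
    merged = copy.deepcopy(layers[high])
    base = {n: p.clone() for n, p in merged.named_parameters()}
    for k in range(low, high):                   # remaining layers in the window
        other = dict(layers[k].named_parameters())
        for n, p in merged.named_parameters():
            p.add_(other[n] - base[n])           # accumulate the difference from base
    return merged
```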

Step 3: Iterative update.


To measure the impact of the merging operation on the model's output representation, we compute the cosine similarity between the last hidden states of the original model M and the merged model M_tmp on a few-shot calibration dataset (see Fig. 2(c)).

If the representation similarity between M_tmp and the original model is greater than the preset threshold T, the merged layers are considered to have little impact on model performance. In that case, the lower bound of the sliding window is moved down one layer to expand the merging range.

Conversely, if the representation similarity between M_tmp and the original model falls below the threshold T, the merge is considered too damaging to model performance and the sliding window stops expanding. At that point, the compressed model M∗ is updated to M_tmp (i.e., the last merge that still passed the threshold), the upper bound of the sliding window is set to the current lower bound, and the next round of layer merging begins (see Fig. 2(a)).
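A sketch of the similarity check, assuming Hugging Face-style models that expose output_hidden_states (the helper name and batching are ours):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def representation_similarity(model_a, model_b, calib_batches) -> float:
    """Mean cosine similarity of the two models' last hidden states
    over a few-shot calibration set."""
    sims = []
    for batch in calib_batches:
        ha = model_a(**batch, output_hidden_states=True).hidden_states[-1]
        hb = model_b(**batch, output_hidden_states=True).hidden_states[-1]
        # compare per-sequence representations, then average over the batch
        sims.append(F.cosine_similarity(ha.flatten(1), hb.flatten(1), dim=-1).mean())
    return torch.stack(sims).mean().item()
```

The threshold T then trades compression against fidelity: a higher T merges fewer layers but stays closer to the original model's behavior.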


Step 4: Termination condition.

The process continues until the lowest layer L has been processed. Ultimately, the pruned model M∗ output by the algorithm reduces redundant computation and storage by retaining the merged representations of key layers.
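Putting the four steps together, a compact sketch of the whole loop, reusing merge_window and representation_similarity from the sketches above (we read "M∗ is updated to M_tmp" as committing the last merge that passed the threshold):

```python
import copy

def sliding_layer_merge(model, calib_batches, T: float, lowest: int, highest: int):
    target = copy.deepcopy(model)                        # Step 1: M* <- M
    high = highest
    while high > lowest:
        low, accepted = high - 1, None
        while low >= lowest:
            tmp = copy.deepcopy(target)                  # Step 2: merge the window
            tmp.model.layers[low] = merge_window(target.model.layers, low, high)
            del tmp.model.layers[low + 1 : high + 1]     # window collapses to one layer
            if representation_similarity(model, tmp, calib_batches) >= T:
                accepted = tmp                           # Step 3: safe, expand window
                low -= 1
            else:
                break                                    # too damaging, stop expanding
        if accepted is not None:
            target = accepted                            # commit M_tmp as the new M*
        high = low                                       # upper bound <- lower bound
    return target                                        # Step 4: stop at layer L
```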


Results

[Result tables from the paper: zero-shot performance across architectures and parameter scales, and post-pruning retraining recovery comparisons.]

Conclusion

By analyzing the correlations between the outputs of each layer in a reproducing kernel Hilbert space, the paper reveals a "patch-like" relationship pattern among the layers of large language models. Based on this, the authors propose a depth-wise pruning method that dynamically selects and merges layer parameters. The method sets a similarity threshold and merges consecutive layers from top to bottom, achieving fast model compression while effectively preserving performance. Experimental results show that the method significantly accelerates inference in resource-constrained environments and outperforms existing pruning techniques on zero-shot tasks. Moreover, the method integrates seamlessly with width-wise pruning, yielding pruned models with enhanced performance. The authors hope this study will inspire further research on depth-wise pruning and promote the development of a unified framework combining depth-wise and width-wise pruning strategies, ultimately facilitating the efficient deployment of LLMs in resource-constrained environments.
