Shanghai AI Lab Releases InternLM2-200K and InternEvo, a Long-Sequence LLM Pretraining and Fine-Tuning Framework

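On January 17, 2024, Shanghai AI Lab released InternLM2, the new generation of its InternLM (书生·浦语) large language model, and open-sourced nine models in two parameter sizes (7B and 20B).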


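A headline feature of InternLM2 is its support for a 200K context. Alongside the model, Shanghai AI Lab open-sourced InternEvo, a framework for long-sequence LLM pretraining and fine-tuning, and provided LMDeploy as the inference and deployment solution for InternLM2-200K. All in all, a release packed with substance.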

InternLM2-200K inference: LMDeploy

Installation:

pip install lmdeploy

LMDeploy implements dynamic NTK and supports long-context extrapolation. The following code extends InternLM2's context length to 200K:

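from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# session_len is the maximum context length; rope_scaling_factor is the dynamic NTK scaling factor
engine_config = TurbomindEngineConfig(session_len=200000,
                                      rope_scaling_factor=2.0)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
prompt = "Use a long prompt to replace this sentence"  # placeholder; supply your own long input
response = pipe(prompt, gen_config=gen_config)
print(response)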
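To give a feel for what dynamic NTK scaling does, here is a minimal sketch of the commonly used formulation (the one found in Hugging Face Transformers' dynamic NTK rotary embedding). It is not taken from LMDeploy's source, and the default values for `max_position`, `base`, and `head_dim` are illustrative assumptions rather than InternLM2's actual configuration.

```python
def dynamic_ntk_base(seq_len, max_position=32768, base=10000.0,
                     head_dim=128, scaling_factor=2.0):
    """Return the RoPE base after dynamic NTK scaling.

    All defaults are assumed values for illustration only.
    """
    # Within the trained context window the base is left untouched.
    if seq_len <= max_position:
        return base
    # Beyond it, the base grows with the current sequence length, which
    # stretches the rotary wavelengths instead of extrapolating positions.
    ratio = scaling_factor * seq_len / max_position - (scaling_factor - 1)
    return base * ratio ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    for n in (8_000, 32_768, 100_000, 200_000):
        print(n, dynamic_ntk_base(n))
```

The base is recomputed from the running sequence length, so short prompts behave exactly as in training and only long prompts pay the cost of rescaled frequencies.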

InternLM2-200K training: InternEvo


          
Paper title: InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding

Paper link: https://arxiv.org/pdf/2401.09149.pdf

Open-source GitHub repository: https://github.com/InternLM/InternEvo

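Background:

Large language models (LLMs) need substantial memory when processing long sequences, and existing training methods fall short in both efficiency and compatibility. For example, approaches such as DeepSpeed Ulysses and Megatron-LM are limited in training performance and memory usage, especially when combined with efficient self-attention optimizations such as FlashAttention.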

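Approach:

To address these problems, the paper proposes InternEvo, a parallelization framework for training Transformer-based LLMs. InternEvo decouples all sharding dimensions into a new hierarchical space, systematically analyzes the memory and communication costs of LLM training, and then generates an effective hybrid parallelism strategy. It also introduces a new selective overlap mechanism to mitigate the communication overhead that hybrid parallelism adds, and implements memory management techniques to reduce GPU memory fragmentation.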
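The idea of treating each sharding dimension independently and choosing a strategy from a cost analysis can be illustrated with a toy search. The sketch below is not InternEvo's cost model or API. It assumes mixed-precision Adam training (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states), considers only model-state memory, ignores activations, borrows the ZeRO-style nesting constraint `p_shard <= g_shard <= os_shard`, and uses "smaller shard groups first" as a crude stand-in for communication cost. All function and parameter names are hypothetical.

```python
from itertools import product

BYTES_PARAM = 2    # fp16/bf16 weights (assumed)
BYTES_GRAD = 2     # fp16/bf16 gradients (assumed)
BYTES_OPTIM = 12   # fp32 master weights + Adam momentum + variance (assumed)


def model_state_gib(num_params, p_shard, g_shard, os_shard):
    """Per-GPU model-state memory in GiB when parameters, gradients and
    optimizer states are each split across their own shard group."""
    bytes_per_gpu = num_params * (BYTES_PARAM / p_shard
                                  + BYTES_GRAD / g_shard
                                  + BYTES_OPTIM / os_shard)
    return bytes_per_gpu / 2**30


def candidate_strategies(num_params, num_gpus, budget_gib):
    """Enumerate (p_shard, g_shard, os_shard, mem_gib) tuples that fit the
    memory budget, ordered so that less aggressive sharding comes first."""
    # Power-of-two shard degrees that evenly divide the GPU count.
    degrees = [2**i for i in range(num_gpus.bit_length())
               if num_gpus % 2**i == 0]
    fits = []
    for p, g, os_ in product(degrees, repeat=3):
        if not (p <= g <= os_):  # assumed ZeRO-style nesting of shard groups
            continue
        mem = model_state_gib(num_params, p, g, os_)
        if mem <= budget_gib:
            fits.append((p, g, os_, mem))
    # Crude communication proxy: prefer the smallest shard groups that fit.
    fits.sort(key=lambda s: (s[0], s[1], s[2]))
    return fits


if __name__ == "__main__":
    # Hypothetical setup: 7B parameters, 64 GPUs, 40 GiB left for model states.
    for p, g, os_, mem in candidate_strategies(7e9, 64, 40.0)[:3]:
        print(f"P={p} G={g} OS={os_} -> {mem:.1f} GiB per GPU")
```

A real planner would also model activation memory, which dominates at long sequence lengths, and the actual communication volume of each choice. The point here is only that decoupled shard sizes turn strategy selection into a search over a small, structured space.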

Figure: InternEvo architecture and workflow

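Results:

The evaluation shows that the parallelization strategies generated by InternEvo match or exceed existing methods in model FLOPs utilization (MFU). When training models with billions of parameters at sequence lengths of up to 256K, InternEvo outperforms DeepSpeed Ulysses and Megatron-LM in MFU by up to 4.8x and 2.29x, respectively.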
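For readers unfamiliar with MFU, the sketch below shows one common way to estimate it: the FLOPs the model actually needs per second divided by the hardware's peak FLOPs. It uses the widely cited approximation of 6N FLOPs per token for the dense layers plus 12 x layers x hidden size x sequence length for attention. Conventions vary (for example in how causal masking is counted), so this is not necessarily the accounting the paper uses. The model dimensions and GPU count in the usage example are hypothetical, not figures from the article.

```python
def flops_per_token(num_params, num_layers, hidden_size, seq_len):
    """Approximate training FLOPs per token (forward + backward)."""
    dense = 6 * num_params                               # matmuls in dense layers
    attention = 12 * num_layers * hidden_size * seq_len  # QK^T and attention-times-V
    return dense + attention


def mfu(tokens_per_second, num_gpus, peak_flops_per_gpu,
        num_params, num_layers, hidden_size, seq_len):
    """Model FLOPs utilization: achieved model FLOPs/s over peak FLOPs/s."""
    achieved = tokens_per_second * flops_per_token(
        num_params, num_layers, hidden_size, seq_len)
    return achieved / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical 7B-class model: 32 layers, hidden size 4096, 256K sequence.
    model = dict(num_params=7e9, num_layers=32, hidden_size=4096,
                 seq_len=256 * 1024)
    # 312e12 is the A100 dense BF16 peak; throughput and GPU count are made up.
    print(mfu(tokens_per_second=1500.0, num_gpus=8,
              peak_flops_per_gpu=312e12, **model))
```

With these assumed dimensions, the attention term at 256K is roughly ten times the 6N term, which is why long-sequence training stresses a framework so differently from short-sequence training.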
