Running inference on a 70-billion-parameter LLM with 4 GB of VRAM?

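On January 8, 2024, DeepSeek, a subsidiary of High-Flyer Quant, posted a 48-page DeepSeek LLM technical report to arXiv, describing the model's technical details in depth. Paper: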

https://arxiv.org/abs/2401.02954

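Key technical details of DeepSeek LLM:

The training data consists of 2 trillion Chinese and English tokens (2T tokens).
The architecture largely follows LLaMA. What they share: the FFN activation is SwiGLU, normalization is RMSNorm, positional encoding is RoPE, the 7B model uses MHA (Multi-Head Attention), and the 67B model uses GQA (Grouped-Query Attention).
The main architectural difference between DeepSeek LLM and LLaMA is model depth: DeepSeek 7B has 30 layers and DeepSeek 67B has 95 layers, while LLaMA2 7B and 70B have 32 and 80 layers respectively.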
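To make two of the building blocks named above concrete, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward block. It is only an illustration of the standard definitions, not code from DeepSeek or LLaMA; the helper names (rms_norm, swiglu_ffn, matvec), the toy weights, and the eps value are all assumptions made for this example.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy sizes (hypothetical): model width 4, FFN hidden width 2.
x = [1.0, -2.0, 3.0, 0.5]
normed = rms_norm(x, weight=[1.0] * 4)

w_gate = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]
w_up = [[0.5, 0.0, -0.5, 0.1], [0.2, 0.2, 0.2, 0.2]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]

ffn_out = swiglu_ffn(normed, w_gate, w_up, w_down)
print(ffn_out)
```

With a gain of all ones, the normalized vector has a mean square very close to 1, which is the property RMSNorm is meant to provide.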

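A little over a month earlier, DeepSeek LLM 67B, a model with 67 billion parameters, was open-sourced. Across nearly 40 Chinese and English benchmarks, it outperformed the 70-billion-parameter LLaMA 2 across the board.

The model series is openly available on Hugging Face and can be used commercially free of charge without an application. It has been downloaded more than 58,000 times to date.

So how can a 67-billion-parameter model run inference on limited resources? Here is an open-source framework for that: AirLLM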


          
AirLLM: scaling large language models on low-end commodity computers
          
https://github.com/lyogavin/Anima/tree/main/air_llm

AirLLM

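AirLLM was released on November 20, 2023. It optimizes inference memory usage so that a single GPU with **4GB** of memory can run inference on a **70B** large language model. It needs no **quantization, distillation**, **pruning**, or other model compression that would degrade model quality. Model compression is also supported as an option, for a 3x speedup.
ChatGLM, QWen, Baichuan, Mistral, InternLM, safetensor-format models, AirLLMMixtral, and the top 10 models on the open llm leaderboard are all supported!
**MacOS** can now run 70B models too, which is excellent!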
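A rough back-of-the-envelope calculation helps show why a 4GB card can be enough when only one layer needs to be resident at a time. The sketch below assumes 16-bit weights (2 bytes per parameter) and assumes the parameters are spread evenly across layers; it ignores embeddings, the output head, activations, and the KV cache, so the numbers are estimates only. The layer counts are the ones quoted earlier in this article.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored as 16-bit floats
GIB = 1024 ** 3

def weight_gib(num_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate size of all weights in GiB."""
    return num_params * bytes_per_param / GIB

def per_layer_gib(num_params, num_layers, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate size of one layer, assuming parameters are evenly spread."""
    return weight_gib(num_params, bytes_per_param) / num_layers

models = [
    ("LLaMA2 70B", 70e9, 80),    # 80 layers, as quoted in the article
    ("DeepSeek 67B", 67e9, 95),  # 95 layers, as quoted in the article
]

for name, params, layers in models:
    total = weight_gib(params)
    layer = per_layer_gib(params, layers)
    print(f"{name}: ~{total:.0f} GiB total, ~{layer:.2f} GiB per layer")
```

Under these assumptions the full weights are far larger than any consumer GPU, while a single layer comes in well under 4 GiB.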


Inference

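Install directly: pip install airllm
Note: the inference process first splits the original model layer by layer and saves the pieces to disk. Make sure the huggingface cache directory has enough free disk space.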
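The split-then-stream idea described in the note can be illustrated with a toy example. This is not AirLLM's actual implementation: the file layout, the JSON format, and the function names (split_and_save, layered_forward, apply_layer) are hypothetical, and each "layer" is just a scale and a shift. The sketch only shows the pattern of writing one file per layer and then loading a single layer at a time during the forward pass.

```python
import json
import tempfile
from pathlib import Path

def split_and_save(layers, out_dir):
    """Write each layer's weights to its own file and return the file paths."""
    paths = []
    for i, layer in enumerate(layers):
        path = Path(out_dir) / f"layer_{i:03d}.json"
        path.write_text(json.dumps(layer))
        paths.append(path)
    return paths

def apply_layer(layer, x):
    """Toy layer: elementwise scale and shift."""
    return [v * layer["scale"] + layer["shift"] for v in x]

def layered_forward(paths, x):
    """Run the forward pass while holding only one layer in memory at a time."""
    for path in paths:
        layer = json.loads(path.read_text())  # load just this layer
        x = apply_layer(layer, x)
        del layer                              # release it before the next one
    return x

toy_layers = [
    {"scale": 2.0, "shift": 1.0},
    {"scale": 0.5, "shift": -1.0},
    {"scale": 3.0, "shift": 0.0},
]

with tempfile.TemporaryDirectory() as tmp:
    layer_paths = split_and_save(toy_layers, tmp)
    streamed = layered_forward(layer_paths, [1.0, 2.0])

# Reference: the same computation with every layer kept in memory.
in_memory = [1.0, 2.0]
for layer in toy_layers:
    in_memory = apply_layer(layer, in_memory)

print(streamed, in_memory)
```

The streamed result matches the in-memory result. The trade-off is extra disk reads on every forward pass in exchange for a much smaller peak memory footprint.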


          
from airllm import AutoModel

# maximum number of prompt tokens kept after truncation
MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    #'I like',
]

# tokenize the prompt into PyTorch tensors
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# generate up to 20 new tokens on the GPU
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

# decode the first (and only) generated sequence back to text
output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
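Because the layer-split copy of the model is written to the huggingface cache, it can be worth checking free disk space before the first run. The sketch below uses only the standard library. It assumes the cache root is $HF_HOME when that variable is set and ~/.cache/huggingface otherwise. The REQUIRED_GIB figure is a rough assumption for a 70B model in 16-bit precision, not a number published by AirLLM, and the helper names are hypothetical.

```python
import os
import shutil
from pathlib import Path

def hf_cache_dir():
    """Return the Hugging Face cache root: $HF_HOME if set, else ~/.cache/huggingface."""
    default = Path.home() / ".cache" / "huggingface"
    return Path(os.environ.get("HF_HOME", default))

def nearest_existing(path):
    """Walk up from path until an existing directory is found."""
    path = Path(path).expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path

def free_gib(path):
    """Free space, in GiB, on the filesystem that would hold path."""
    return shutil.disk_usage(nearest_existing(path)).free / 1024 ** 3

def has_room(path, required_gib):
    """True if the filesystem holding path has at least required_gib free."""
    return free_gib(path) >= required_gib

REQUIRED_GIB = 140  # assumption: rough size of a 70B model in 16-bit precision

cache = hf_cache_dir()
print(f"{cache}: {free_gib(cache):.1f} GiB free, "
      f"enough for the split model: {has_room(cache, REQUIRED_GIB)}")
```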