llama.cpp: Quantizing the 13-Billion-Parameter LLaMA Model, with Inference in Only 4GB of Memory

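Since the release of GPT-3 and ChatGPT, text completion and dialogue have been at the center of what many describe as an AGI revolution. These models are paid services, however, and they are so large that they are hard to run locally or on machines with little memory. Recently, Georgi Gerganov released a project called "llama.cpp" that claims to run LLaMA without a GPU. For an introduction to the project, see the post "Large language models are having their Stable Diffusion moment" by computer scientist Simon Willison (https://simonwillison.net/2023/Mar/11/llama/).

The post reads as follows: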

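Large language models are having their Stable Diffusion moment

The open release of the Stable Diffusion image generation model back in August 2022 was a key moment. I wrote at the time that [Stable Diffusion is a really big deal](https://simonwillison.net/2022/Aug/29/stable-diffusion/).

People could now generate images from text on their own hardware! More importantly, developers could mess around with the guts of what was going on.

The resulting explosion in innovation is still going on today. Most recently, [ControlNet](https://github.com/lllyasviel/ControlNet/blob/main/README.md) appears to have leapt Stable Diffusion ahead of Midjourney and DALL-E in terms of capabilities. It feels to me like that Stable Diffusion moment back in August kick-started the entire new wave of interest in generative AI, which was then pushed into overdrive by the release of ChatGPT at the end of November.

That Stable Diffusion moment is happening again right now for large language models, the technology behind ChatGPT itself. This morning I [ran a GPT-3 class language model](https://til.simonwillison.net/llms/llama-7b-m2) on my own personal laptop for the first time! AI stuff was weird already. It's about to get a whole lot weirder.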

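LLaMA

Somewhat surprisingly, language models like GPT-3, which power tools like ChatGPT, are a lot larger and more expensive to build and operate than image generation models.

The best of these models have mostly been built by private organizations such as OpenAI and have been kept tightly controlled: they are accessible via APIs and web interfaces, but not released for anyone to run on their own machines.

These models are also big. Even if you could obtain the GPT-3 model, you would not be able to run it on commodity hardware. These things usually require several A100-class GPUs, each of which retails for more than $8,000.

This technology is clearly too important to be entirely controlled by a small group of companies.

Dozens of open large language models have been released over the past few years, but none of them has quite hit the sweet spot for me in terms of the following:

Easy to run on my own hardware
Large enough to be useful, ideally equivalent in capabilities to GPT-3
Open source enough that they can be tinkered with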

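This all changed yesterday, thanks to the combination of Facebook's [LLaMA model](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/) and Georgi Gerganov's [llama.cpp](https://github.com/ggerganov/llama.cpp).

Here is the abstract from the [LLaMA paper](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/):

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

It's important to note that LLaMA isn't fully "open". You have to agree to some strict terms to access the model. It's intended as a research preview and isn't something that can be used for commercial purposes.

In a totally cyberpunk move, within a few days of the release, someone submitted this PR to the LLaMA repository linking to an unofficial BitTorrent download for the model files!

So they're in the wild now. You may not be legally able to build a commercial product on them, but the genie is out of the bottle. That furious typing sound you can hear is thousands of hackers around the world starting to dig in and figure out what life is like when you can run a GPT-3 class model on your own hardware.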

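llama.cpp

LLaMA on its own isn't much good if it's still too hard to run on a personal laptop.

Enter Georgi Gerganov.

Georgi is an open source developer based in Sofia, Bulgaria (according to his GitHub profile). He previously released whisper.cpp, a port of OpenAI's Whisper automatic speech recognition model to C++. That project made Whisper applicable to a huge range of new use cases.

He's just done the same thing with LLaMA.

Georgi's llama.cpp project had its initial release the day before yesterday. From the README:

The main goal is to run the model using 4-bit quantization on a MacBook.

4-bit quantization is a technique for reducing the size of models so they can run on less powerful hardware. It also reduces the model sizes on disk: to 4GB for the 7B model and just under 8GB for the 13B one. It totally works!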
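To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It is an illustration under stated assumptions, not llama.cpp's actual code: the block size of 32, the symmetric scale, the code range of -7 to 7, and the function names `quantize_block` and `dequantize_block` are all chosen for this example.

```python
# Illustrative 4-bit block quantization (NOT llama.cpp's implementation).
# Assumptions: blocks of 32 weights, one float scale per block, and
# signed integer codes in the range [-7, 7], which fit in 4 bits.
import random

BLOCK_SIZE = 32  # hypothetical block size chosen for this sketch


def quantize_block(weights):
    """Map a block of floats to (scale, list of small integer codes)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and the integer codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    random.seed(0)
    block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    # Rounding to the nearest code keeps the error within half a step.
    print(f"scale={scale:.6f}  worst error={worst:.6f}")
    # 32 codes at 4 bits each occupy 16 bytes, versus 64 bytes as float16.
```

The trade-off is visible in the printed error: each weight is stored in a quarter of the space of a float16 value, at the cost of a rounding error of at most half a quantization step.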

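I used it to run the 7B LLaMA model on my laptop last night, and then this morning upgraded to the 13B model, the one that Facebook claims is competitive with GPT-3.

Here are my detailed notes on how I did that. Most of the information I needed was already in the README.

As my laptop started to spit out text at me, I genuinely had a feeling that the world was about to change, again.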


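I thought it would be a few more years before I could run a GPT-3 class model on hardware that I owned. I was wrong: that future is here already.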

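Is this the worst thing that ever happened?

I'm not worried about the science fiction scenarios here. The language model running on my laptop is not an AGI that's going to break free and take over the world.

But there are a ton of very real ways in which this technology can be used for harm. Just a few:

Generating spam
Automated romance scams
Trolling and hate speech
Fake news and disinformation
Automated radicalization (I worry about this one a lot)

Not to mention that this technology makes things up just as easily as it parrots factual information, and provides no way to tell the difference.

Prior to this moment, a thin layer of defence existed, in that companies like OpenAI had a limited ability to control how people interacted with those models.

Now that we can run these on our own hardware, even those controls are gone.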

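How do we use this for good?

I think this is going to have a huge impact on society. My priority is trying to direct that impact in a positive direction.

It's easy to fall into a cynical trap of thinking there's nothing good here at all, and that everything about generative AI is either actively harmful or a waste of time.

I now use generative AI tools on a daily basis for a variety of different purposes. They've given me a material productivity boost, but more importantly they have expanded my ambitions in terms of the projects I take on.

Just last week I used ChatGPT to learn enough AppleScript to ship a new project in less than an hour!

I'm going to continue exploring and sharing genuinely positive applications of this technology. It's not going to be un-invented, so I think our priority should be figuring out the most constructive possible ways to use it.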

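What to look for next

Assuming Facebook doesn't relax the licensing terms, LLaMA will likely end up more a proof of concept that local language models are feasible on consumer hardware than a new foundation model that people use going forward.

The race is on to release the first fully open language model that gives people ChatGPT-like capabilities on their own devices.

Quoting Stable Diffusion backer Emad Mostaque: wouldn't it be nice if there was a fully open version?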

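The steps for running llama.cpp on an M1 Mac are described below.

Step 1: Download the model

The first thing to do is download the LLaMA model.

You can apply to Meta through the official form, or obtain the files directly from links shared by other users.

Either way, once the download finishes you will have a set of model folders and files:

[Image: listing of the downloaded model files]

As you can see, each model lives in its own folder. Each model has a params.json file that contains details about that model. For example:

[Image: contents of a params.json file]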

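Step 2: Install the dependencies

First, you need the Xcode command line tools to compile the C++ project.

 `xcode-select --install`

Next come the build dependencies for the C++ project (pkgconfig and cmake).

 `brew install pkgconfig cmake`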

For the Python environment, if you are using Python 3.11, you can create a virtual environment:

 `/opt/homebrew/bin/python3.11 -m venv venv`

Then activate the venv. (In a shell other than fish, just drop the .fish suffix.)

 `. venv/bin/activate.fish`

Finally, install Torch.

 `pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu`

If you are interested in using the new Metal Performance Shaders (MPS) backend for GPU-accelerated training, you can check that it is available by running the following. This is not required for running LLaMA on an M1.

  `python`
  `Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin`
  `Type "help", "copyright", "credits" or "license" for more information.`
  `>>> import torch; torch.backends.mps.is_available()`
  `True`
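The same check can be wrapped in a small script, which is handy when it has to run unattended. This is only a sketch: `mps_available` is a hypothetical helper name, and it assumes that a missing PyTorch install or a build without the MPS backend should simply count as "not available".

```python
# Sketch: report whether PyTorch's MPS backend is usable on this machine.
# `mps_available` is a hypothetical helper name, not part of PyTorch.


def mps_available() -> bool:
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed in this environment
    mps = getattr(torch.backends, "mps", None)
    if mps is None:
        return False  # this PyTorch build has no MPS backend module
    return bool(mps.is_available())


if __name__ == "__main__":
    print("MPS available:", mps_available())
```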

Step 3: Compile llama.cpp

 `git clone git@github.com:ggerganov/llama.cpp.git`

Once all the dependencies are installed, change into the cloned directory (`cd llama.cpp`) and run make:

    make
    I llama.cpp build info:
    I UNAME_S:  Darwin
    I UNAME_P:  arm
    I UNAME_M:  arm64
    I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
    I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
    I LDFLAGS:   -framework Accelerate
    I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
    I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

    cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
    c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
    c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main  -framework Accelerate
    ./main -h
    usage: ./main [options]

    options:
      -h, --help            show this help message and exit
      -s SEED, --seed SEED  RNG seed (default: -1)
      -t N, --threads N     number of threads to use during computation (default: 4)
      -p PROMPT, --prompt PROMPT
                            prompt to start generation with (default: random)
      -n N, --n_predict N   number of tokens to predict (default: 128)
      --top_k N             top-k sampling (default: 40)
      --top_p N             top-p sampling (default: 0.9)
      --temp N              temperature (default: 0.8)
      -b N, --batch_size N  batch size for prompt processing (default: 8)
      -m FNAME, --model FNAME
                            model path (default: models/llama-7B/ggml-model.bin)

    c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize  -framework Accelerate
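The options listed in the help text can also be assembled from a script, which makes it easier to compare prompts or sampling settings. The following is a sketch rather than part of llama.cpp: `build_main_command` is a hypothetical helper, and its defaults simply mirror the defaults printed by `./main -h`.

```python
# Sketch: assemble a ./main invocation from the options in the help text.
# `build_main_command` is a hypothetical helper, not part of llama.cpp.
import shlex


def build_main_command(model, prompt, threads=4, n_predict=128,
                       temp=0.8, top_k=40, top_p=0.9, seed=-1):
    """Return the argument list for ./main; defaults mirror the help text."""
    return [
        "./main",
        "-m", model,
        "-t", str(threads),
        "-n", str(n_predict),
        "--temp", str(temp),
        "--top_k", str(top_k),
        "--top_p", str(top_p),
        "-s", str(seed),
        "-p", prompt,
    ]


if __name__ == "__main__":
    cmd = build_main_command(
        "./models/7B/ggml-model-q4_0.bin",
        "The first president of the USA was ",
        threads=8,
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems when the prompt contains spaces or quotes.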

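Step 4: Convert the model

This assumes you have already placed the models under models/ in the llama.cpp repo. The trailing `1` asks the script for float16 output.

 `python convert-pth-to-ggml.py models/7B 1`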

You should then see output like this:

    {'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
    n_parts =  1
    Processing part  0
    Processing variable: tok_embeddings.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
    Processing variable: norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
      Converting to float32
    Processing variable: output.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
    Processing variable: layers.0.attention.wq.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
    Processing variable: layers.0.attention.wk.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
    Processing variable: layers.0.attention.wv.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
    Processing variable: layers.0.attention.wo.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
    Processing variable: layers.0.feed_forward.w1.weight with shape:  torch.Size([11008, 4096])  and type:  torch.float16
    Processing variable: layers.0.feed_forward.w2.weight with shape:  torch.Size([4096, 11008])  and type:  torch.float16
    Processing variable: layers.0.feed_forward.w3.weight with shape:  torch.Size([11008, 4096])  and type:  torch.float16
    Processing variable: layers.0.attention_norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
    ...
    Done. Output file: models/7B/ggml-model-f16.bin, (part  0 )
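The tensor shapes in this log are enough to estimate how many parameters the model has. The sketch below tallies them under two assumptions that the excerpt itself does not confirm: every one of the 32 layers repeats the shapes shown for layer 0, and each layer also carries an `ffn_norm.weight` vector of size 4096 in the part of the log that is elided. The function name `count_parameters` is made up for this example.

```python
# Sketch: count parameters from the tensor shapes in the conversion log.
# Assumes all layers share layer 0's shapes and each layer has an
# ffn_norm.weight of size `dim` (the log elides those lines).


def count_parameters(dim=4096, n_layers=32, vocab=32000, ffn=11008):
    per_layer = (
        4 * dim * dim      # attention wq, wk, wv, wo
        + 3 * dim * ffn    # feed_forward w1, w2, w3
        + 2 * dim          # attention_norm and ffn_norm
    )
    return (
        vocab * dim        # tok_embeddings.weight
        + vocab * dim      # output.weight
        + dim              # final norm.weight
        + n_layers * per_layer
    )


if __name__ == "__main__":
    print(f"{count_parameters():,} parameters")  # 6,738,415,616 parameters
```

A total of roughly 6.7 billion is consistent with the "7B" label on this model.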

The next step is quantization. The trailing `2` selects the q4_0 format:

 `./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2`

The output looks like this:

  `llama_model_quantize: loading model from './models/7B/ggml-model-f16.bin'`
  `llama_model_quantize: n_vocab = 32000`
  `llama_model_quantize: n_ctx = 512`
  `llama_model_quantize: n_embd = 4096`
  `llama_model_quantize: n_mult = 256`
  `llama_model_quantize: n_head = 32`
  `llama_model_quantize: n_layer = 32`
  `llama_model_quantize: f16 = 1`
  `...`
  `layers.31.attention_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB`
  `layers.31.ffn_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB`
  `llama_model_quantize: model size = 25705.02 MB`
  `llama_model_quantize: quant size = 4017.27 MB`
  `llama_model_quantize: hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.134 0.125 0.104 0.078 0.053 0.033 0.019 0.022`

  `main: quantize time = 29389.45 ms`
  `main: total time = 29389.45 ms`
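The two sizes in this log can be reproduced with a little arithmetic. The sketch below rests on an assumption that the article does not state: that the q4_0 format used here stores every block of 32 weights as one 4-byte float scale plus 16 bytes of 4-bit codes (20 bytes per 32 weights, or 5 bits per weight), while the one-dimensional norm weights stay in float32, as the `type = f32` lines suggest. The constant and function names are chosen for illustration.

```python
# Sketch: reproduce "model size" and "quant size" from the quantize log.
# ASSUMPTION (not stated in the article): q4_0 stores each block of 32
# weights as a 4-byte float scale plus 16 bytes of 4-bit codes.
MIB = 2 ** 20

N_TOTAL = 6_738_415_616          # all weights, tallied from the tensor shapes
N_NORM = (2 * 32 + 1) * 4096     # 1-D norm weights: 2 per layer plus the final one
N_QUANT = N_TOTAL - N_NORM       # 2-D weight matrices that get quantized

BLOCK = 32                       # weights per block (assumed)
BYTES_PER_BLOCK = 4 + BLOCK // 2 # float scale + 32 four-bit codes


def f32_size_mb():
    """Size if every weight were stored as a 4-byte float."""
    return N_TOTAL * 4 / MIB


def q4_size_mb():
    """Size with 2-D weights in 20-byte blocks and norms left as float32."""
    return (N_QUANT // BLOCK * BYTES_PER_BLOCK + N_NORM * 4) / MIB


if __name__ == "__main__":
    print(f"model size = {f32_size_mb():.2f} MB")  # 25705.02 MB
    print(f"quant size = {q4_size_mb():.2f} MB")   # 4017.27 MB
```

Both figures agree with the log, which suggests that the reported "model size" is counted at four bytes per weight and that the quantized file costs about 5 bits per weight rather than exactly 4.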

Step 5: Run the model

  `./main -m ./models/7B/ggml-model-q4_0.bin \`
  `-t 8 \`
  `-n 128 \`
  `-p 'The first president of the USA was '`

  `main: seed = 1678615879`
  `llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...`
  `llama_model_load: n_vocab = 32000`
  `llama_model_load: n_ctx = 512`
  `llama_model_load: n_embd = 4096`
  `llama_model_load: n_mult = 256`
  `llama_model_load: n_head = 32`
  `llama_model_load: n_layer = 32`
  `llama_model_load: n_rot = 128`
  `llama_model_load: f16 = 2`
  `llama_model_load: n_ff = 11008`
  `llama_model_load: n_parts = 1`
  `llama_model_load: ggml ctx size = 4529.34 MB`
  `llama_model_load: memory_size = 512.00 MB, n_mem = 16384`
  `llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'`
  `llama_model_load: .................................... done`
  `llama_model_load: model size = 4017.27 MB / num tensors = 291`
  `main: prompt: 'The first president of the USA was '`
  `main: number of tokens in prompt = 9`
  `1 -> ''`
  `1576 -> 'The'`
  `937 -> ' first'`
  `6673 -> ' president'`
  `310 -> ' of'`
  `278 -> ' the'`
  `8278 -> ' USA'`
  `471 -> ' was'`
  `29871 -> ' '`
  `sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000`
 
  `The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he`
  `would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.`
 
  `main: mem per token = 14434244 bytes`
  `main: load time = 1311.74 ms`
  `main: sample time = 278.96 ms`
  `main: predict time = 7375.89 ms / 54.23 ms per token`
  `main: total time = 9216.61 ms`
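The timing lines at the end of the run are easy to turn into a throughput figure. Here is a minimal sketch: `parse_timings` is a hypothetical helper, and its regular expression assumes the `main: <name> time = <ms> ms` layout shown in this log, which may differ in other versions of llama.cpp.

```python
# Sketch: derive throughput from the timing lines printed by ./main.
# `parse_timings` is a hypothetical helper; the regex assumes the
# "main: <name> time = <ms> ms" layout seen in the log above.
import re

LOG = """\
main: mem per token = 14434244 bytes
main: load time = 1311.74 ms
main: sample time = 278.96 ms
main: predict time = 7375.89 ms / 54.23 ms per token
main: total time = 9216.61 ms
"""

TIMING = re.compile(
    r"main:\s+(\w+) time\s*=\s*([\d.]+) ms(?: / ([\d.]+) ms per token)?"
)


def parse_timings(text):
    """Return ({name: milliseconds}, ms_per_token or None)."""
    times, per_token = {}, None
    for match in TIMING.finditer(text):
        times[match.group(1)] = float(match.group(2))
        if match.group(3):
            per_token = float(match.group(3))
    return times, per_token


if __name__ == "__main__":
    times, per_token = parse_timings(LOG)
    print(times)
    if per_token:
        print(f"about {1000 / per_token:.1f} tokens per second")
```

For the run shown here, 54.23 ms per token works out to roughly 18 tokens per second.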

References:

[1] https://github.com/ggerganov/llama.cpp

[2] https://dev.l1x.be/posts/2023/03/12/using-llama-with-m1-mac/

[3] https://til.simonwillison.net/llms/llama-7b-m2

[4] https://simonwillison.net/2023/Mar/11/llama/

[5] https://hub.baai.ac.cn/view/24793

[6] https://mp.weixin.qq.com/s/OjtjIVTNiXbDA1wTao4JVQ