Exploring LLM Application Development (17): Model Deployment and Inference (Frameworks and Tools: ggml, mlc-llm, ollama)

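This series aims to help engineers and enthusiasts with no background in large-model development understand, with as little pain as possible, the theory and mainstream tools of large language model (LLM) application development. It therefore starts from the basic concepts related to LLM application development and does not insist on complete rigor or exhaustiveness. Instead, it works from intuition and first principles, drawing on the author's own research, notes and understanding, to make the overall picture of LLM technology easier to grasp. Readers can use it as a starting point and dig deeper into the areas that interest them. If you find anything inaccurate or wrong, please leave a comment.

The material is systematic and extensive, so it is published in several installments.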

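Part 1: Basic Concepts

1. Categories of machine learning scenarios

2. Types of machine learning (LLM-related)

3. The rise of deep learning

4. Foundation models

Part 2: Application Challenges

1. Problem definition and basic approach

2. Basic workflow and related technologies

1) Tokenization and Embedding

2) Vector databases

3) Fine-tuning

4) Model deployment and inference

5) Prompt

6) Orchestration and integration

7) Pre-training

Part 3: Scenarios and Case Studies

Common References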

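Part 2: Application Challenges

2. Basic workflow and related technologies

4) Model deployment and inference

Tools and frameworks for the model serving layer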

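The previous article gave a simple classification of the model serving layer. Depending on how much they encapsulate, there are many frameworks and tools to choose from. Based on current trends, I have picked a few common projects in each category to share. Each has its own characteristics, and they are introduced one by one from the innermost layer outward, followed by some selection advice for reference.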

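The following introduces three inference engines aimed mainly at on-device use: ggml, mlc-llm and ollama. Their shared core idea is to lower the barrier to running models, so that personal computers, phones and similar devices can also run large models. This is also a direction with considerable commercial potential.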

ggml

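Even if you have not heard of ggml, you have almost certainly heard of llama.cpp. ChatGPT's results astonished developers, but the high cost and long cycle of training large models kept most of them away. In February of this year Meta released LLaMA, which excited many developers and small and medium-sized companies, yet its high deployment cost still kept most people out. Then, on March 11, Georgi Gerganov quietly open-sourced the llama.cpp project on GitHub. It lets developers run LLaMA models without a GPU, and it quickly became the hottest project on GitHub at the time. It has since gathered about 40,000 stars, and many developers have successfully run LLaMA on a MacBook and even on a Raspberry Pi. Georgi Gerganov followed up by founding ggml.ai in June, and the company was recently included in the second batch of AI Grant funding on the strength of its commercial potential.

ggml can be seen as what llama.cpp and whisper.cpp distilled into a shared core, and it now serves as the kernel of both projects. It is a tensor library written in C whose goal is to run large models efficiently on consumer hardware. Its main optimization is quantization: it supports 2-, 3-, 4-, 5-, 6- and 8-bit integer quantization. Beyond that, it includes targeted optimizations at the build and hardware level, for example for Apple Silicon. llama.cpp supports the current mainstream models, including the Chinese model Baichuan, and has a very active community that provides bindings for many languages such as Java, Python and Go, so developers can choose whichever fits their needs.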
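
To make the idea of integer quantization more concrete, here is a minimal sketch of block-wise 4-bit quantization. It is an illustration only and does not reproduce ggml's actual storage formats: the block size of 32, the single absmax scale per block and the helper names (`quantize_block`, `dequantize_block`, `weight_bytes`) are all assumptions made for this example.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length for this sketch, not a ggml constant


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block of floats to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    # One scale per block; an all-zero block gets scale 1.0 to avoid dividing by zero.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate floats from the block scale and the 4-bit integers."""
    return [scale * q for q in quants]


def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Rough weight storage in bytes, ignoring per-block scales and metadata."""
    return num_params * bits_per_weight / 8


if __name__ == "__main__":
    block = [0.02 * (i - 16) for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.5f}, worst round-trip error={worst:.5f}")
    for bits in (16, 8, 4):
        gigabytes = weight_bytes(13_000_000_000, bits) / 1e9
        print(f"13B parameters at {bits} bits: about {gigabytes:.1f} GB")
```

The round-trip error per weight is bounded by half the block scale, while storage shrinks roughly in proportion to the bit width. That trade is what lets a 13B model fit into the memory of a laptop.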

If you want to build an LLM application on top of ggml, a commonly used Python binding is ctransformers. For quantized models you can directly use the ones published by TheBloke (https://huggingface.co/TheBloke).

Here is a simple example:

1) Inference:

    # For GPU use: build ctransformers with cuBLAS (the leading "!" is notebook shell syntax)
    !CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

    from ctransformers import AutoModelForCausalLM

    # check ctransformers doc for more configs
    config = {'max_new_tokens': 256, 'repetition_penalty': 1.1,
              'temperature': 0.1, 'stream': True}

    llm = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-13B-chat-GGML",
        model_type="llama",
        # lib='avx2',    # for CPU use
        gpu_layers=130,  # 110 for 7b, 130 for 13b
        **config
    )

    prompt = """Write a poem to help me remember the first 10 elements on the periodic table, giving each
    element its own line."""

2.1) Direct call (non-streaming):

    llm(prompt, stream=False)

Output:

    '\n\nI. Hydrogen (H)\nII. Helium (He)\nIII. Lithium (Li)\nIV. Beryllium (Be)\nV. Boron (B)\nVI. Carbon (C)\nVII. Nitrogen (N)\nVIII. Oxygen (O)\nIX. Fluorine (F)\nX. Neon (Ne)\n\nEach element has its own unique properties and characteristics,\nFrom the number of protons in their nucleus to how they bond with other elements.\nHydrogen is lightest, helium is second, lithium is third,\nBeryllium is toxic, boron is a vital nutrient,\nCarbon is the basis of life, nitrogen is in the air we breathe,\nOxygen is what makes water wet, fluorine is a poisonous gas,\nNeon glows with an otherworldly light.'

2.2) Streaming:

    import time

    # tokenize the prompt
    tokens = llm.tokenize(prompt)

    # generate token by token and time the whole run
    start = time.time()
    NUM_TOKENS = 0
    print('-'*4 + 'Start Generation' + '-'*4)
    for token in llm.generate(tokens):
        print(llm.detokenize(token), end='', flush=True)
        NUM_TOKENS += 1
    time_generate = time.time() - start
    print('\n')
    print('-'*4 + 'End Generation' + '-'*4)
    print(f'Num of generated tokens: {NUM_TOKENS}')
    print(f'Time for complete generation: {time_generate}s')
    print(f'Tokens per second: {NUM_TOKENS/time_generate}')
    print(f'Time per token: {(time_generate/NUM_TOKENS)*1000}ms')

Output:

    ----Start Generation----

    I. Hydrogen (H)
    II. Helium (He)
    III. Lithium (Li)
    IV. Beryllium (Be)
    V. Boron (B)
    VI. Carbon (C)
    VII. Nitrogen (N)
    VIII. Oxygen (O)
    IX. Fluorine (F)
    X. Neon (Ne)

    I hope this helps me remember the first 10 elements on the periodic table!

    ----End Generation----
    Num of generated tokens: 110
    Time for complete generation: 7.801689863204956s
    Tokens per second: 14.099509456123355
    Time per token: 70.92445330186324ms

mlc-llm

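The earlier discussion of TensorRT mentioned IR (intermediate representation), which hints at one way to optimize inference: use compilation techniques to translate a model written in a language designed for developer productivity, which executes inefficiently, into a device-oriented form designed for execution efficiency. TVM does exactly this. It is a deep learning compiler whose original goal was to let models trained in various training frameworks run fast inference on different hardware platforms. mlc-llm is one attempt in this direction.

mlc-llm is an open-source project developed by a group of researchers including Tianqi Chen (author of TVM, MXNet and XGBoost, assistant professor at CMU and CTO of OctoML). It aims to provide a solution for deploying any large language model natively on all kinds of hardware, bringing large models to mobile devices (such as the iPhone), consumer computers (such as the Mac) and web browsers. Everything runs locally with no server support, accelerated by the local GPU on the phone or laptop.

mlc-llm is built on Apache TVM Unity and adds further optimizations on top of it, including: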

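1) Dynamic shape: the language model is converted into a TVM IRModule with native dynamic shape support, which avoids padding to the maximum length and reduces both computation and memory usage. To optimize for dynamic-shape inputs, loop splitting is applied first, turning one large loop into two smaller loops; automatic tensor program scheduling is then applied, namely Ansor or MetaSchedule in TVM.

2) Composable compilation optimizations: many model deployment optimizations, such as better compiled code transformation, fusion, memory planning, library offloading and manual code optimization, can be easily incorporated as TVM IRModule transformations and are exposed as Python APIs.

3) Quantization: low-bit quantization compresses the model weights, and TVM's loop-level TensorIR is used to quickly customize code generation for different compression encoding schemes.

4) Runtime: the generated libraries run in the native environment. The TVM runtime has very few dependencies and supports various GPU driver APIs and native language bindings (C, JavaScript and so on).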
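
The loop splitting mentioned under dynamic shape can be pictured with a plain Python analogy. This is only a sketch of the transformation, not TVM code: the function name `dot_split` and the tile factor of 4 are assumptions for illustration. One loop over a runtime length `n` becomes an outer loop over tiles and an inner loop with a fixed trip count, plus a short remainder loop, so inputs of any length are handled without padding to a maximum length.

```python
from typing import List


def dot_split(a: List[float], b: List[float], factor: int = 4) -> float:
    """Dot product with the loop over n split into an outer and an inner loop."""
    n = len(a)
    total = 0.0
    full_tiles = n // factor
    # Outer loop walks over full tiles. The inner loop has a fixed trip count,
    # which is the part a compiler can unroll or vectorize.
    for outer in range(full_tiles):
        base = outer * factor
        for inner in range(factor):
            total += a[base + inner] * b[base + inner]
    # Remainder loop covers the tail when n is not a multiple of factor,
    # so nothing has to be padded.
    for i in range(full_tiles * factor, n):
        total += a[i] * b[i]
    return total


if __name__ == "__main__":
    for length in (3, 8, 13):
        ones = [1.0] * length
        ramp = [float(i) for i in range(length)]
        print(length, dot_split(ramp, ones))
```

In a real compiler the inner loop would be mapped to vector instructions or GPU threads. The Python version only shows that the result is unchanged while the loop structure becomes friendlier to optimization.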

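The development workflow has three steps:

1) Define the model in Python. MLC provides a variety of predefined architectures, such as Llama (e.g. Llama2, Vicuna, OpenLlama, Wizard), GPT-NeoX (e.g. RedPajama, Dolly), RNNs (e.g. RWKV) and GPT-J (e.g. MOSS). Model developers only need to define the model in pure Python and never touch code generation or the runtime.

2) Compile the model in Python. The model is compiled by the TVM Unity compiler, and the compilation is configured in pure Python. MLC LLM quantizes the Python-based model and exports it as a model library plus quantized model weights. Quantization and optimization algorithms can be developed in pure Python to compress and accelerate LLMs for specific use cases.

3) Platform-native runtimes. MLCChat comes in a different variant for each platform: C++ for the command line, JavaScript for the web, Swift for iOS and Java for Android, all configurable through a JSON chat config. Application developers only need to be familiar with the platform-native runtime to integrate an MLC-compiled LLM into their projects.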

Basic usage:

1) Install the dependencies and compile the model. The example runs Llama 2 on a MacBook; see https://mlc.ai/mlc-llm/docs/compilation/compile_models.html.

    # clone the repository
    git clone git@github.com:mlc-ai/mlc-llm.git --recursive
    # enter the root directory of the repo
    cd mlc-llm
    # install mlc-llm
    pip install .

    # x86 Mac
    python3 -m mlc_llm.build --model Llama-2-7b-chat-hf --target metal_x86_64 --quantization q4f16_1

The compiled libs are placed under ./dist/Llama-2-7b-chat-hf-q4f16_1/, with the weights and config under ./dist/Llama-2-7b-chat-hf-q4f16_1/params.

You can also skip compilation and use a prebuilt model: put the libs in ./dist/prebuilt/lib and the weights and config in ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1.

2) Run on a runtime, using Python as the example.

Create sample_mlc_chat.py in the mlc_chat directory.

    from mlc_chat import ChatModule
    from mlc_chat.callback import StreamToStdout

    # From the mlc-llm directory, run
    # $ python sample_mlc_chat.py

    # Create a ChatModule instance
    cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
    # You can change to other models that you downloaded, for example,
    # cm = ChatModule(model="Llama-2-13b-chat-hf-q4f16_1")  # Llama2 13b model

    output = cm.generate(
        prompt="What is the meaning of life?",
        progress_callback=StreamToStdout(callback_interval=2),
    )

    # Print prefill and decode performance statistics
    print(f"Statistics: {cm.stats()}\n")

    output = cm.generate(
        prompt="How many points did you list out?",
        progress_callback=StreamToStdout(callback_interval=2),
    )

    # Reset the chat module by
    # cm.reset_chat()

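Then run it:

    python sample_mlc_chat.py

That said, according to feedback from some developers, mlc-llm is still at an early stage: it runs correctly, but output quality, performance and stability all leave something to be desired. It may take a while before it really shows up on end-user devices.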

ollama

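ollama is a fairly new project whose goal is an LLM inference service that runs locally. It currently supports macOS, with Windows and Linux support planned. The project was only released in August of this year, has been well received by the community with about 7k stars already, and is still growing at a healthy pace. Its design is fresh and minimal, and it is very easy to use: it follows a Docker-like style and is considerably simpler to operate than mlc-llm. I expect it to become widely known quickly.

Running with the default model configuration:

    # pull the model
    ollama pull llama2
    # run it
    ollama run llama2
    >>> hi
    Hello! How can I help you today?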

Running with a custom model configuration:

    ollama pull llama2

    # Create a Modelfile with the following content:
    FROM llama2

    # set the temperature to 1 [higher is more creative, lower is more coherent]
    PARAMETER temperature 1

    # set the system prompt
    SYSTEM """
    You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
    """

    # create and run the model:
    ollama create mario -f ./Modelfile
    ollama run mario
    >>> hi
    Hello! It's your friend Mario.

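Its design shows the Docker influence everywhere, and also reflects the team's ambition to become the Docker of the large-model era. Worth introducing here is its Modelfile, a model definition file similar to a Dockerfile.

Instructions at a glance:

| Instruction | Description |
| --- | --- |
| `FROM` (required) | Defines the base model to use. |
| `PARAMETER` | Sets the parameters for how Ollama runs the model. |
| `TEMPLATE` | The full prompt template to be sent to the model. |
| `SYSTEM` | Specifies the system prompt that will be set in the template. |
| `ADAPTER` | Defines the (Q)LoRA adapters to apply to the model. |
| `LICENSE` | Specifies the legal license. |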

The PARAMETER instruction defines the parameters used when running the model. Its format is:

    PARAMETER <parameter> <value>

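| Parameter | Description | Value type | Example |
| --- | --- | --- | --- |
| mirostat | Enables Mirostat sampling to control perplexity. (default: 0; 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | int | mirostat 0 |
| mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate makes adjustments slower, while a higher learning rate makes the algorithm more responsive. (default: 0.1) | float | mirostat_eta 0.1 |
| mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value gives more focused and coherent text. (default: 5.0) | float | mirostat_tau 5.0 |
| num_ctx | Sets the size of the context window used to generate the next token. (default: 2048) | int | num_ctx 4096 |
| num_gqa | The number of GQA groups in the transformer layer. Required for some models, for example 8 for llama2:70b. | int | num_gqa 1 |
| num_gpu | The number of GPUs to use. On macOS it defaults to 1 to enable Metal support; 0 disables it. | int | num_gpu 1 |
| num_thread | Sets the number of threads used for computation. By default Ollama detects this for best performance. It is recommended to set this value to the number of physical CPU cores in the system (not logical cores). | int | num_thread 8 |
| repeat_last_n | Sets how far back the model looks to prevent repetition. (default: 64; 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 |
| repeat_penalty | Sets how strongly repetition is penalized. A higher value (e.g. 1.5) penalizes repetition more strongly, while a lower value (e.g. 0.9) is more lenient. (default: 1.1) | float | repeat_penalty 1.1 |
| temperature | The temperature of the model. A higher temperature makes the answers more creative. (default: 0.8) | float | temperature 0.7 |
| stop | Sets the stop sequence to use. | string | stop "AI assistant:" |
| tfs_z | Tail free sampling reduces the impact of less probable tokens on the output. A higher value (e.g. 2.0) reduces the impact more, while a value of 1.0 disables this setting. (default: 1) | float | tfs_z 1 |
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) gives more diverse answers, while a lower value (e.g. 10) is more conservative. (default: 40) | int | top_k 40 |
| top_p | Works together with top-k. A higher value (e.g. 0.95) leads to more diverse text, while a lower value (e.g. 0.5) produces more focused and conservative text. (default: 0.9) | float | top_p 0.9 |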
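
The three most commonly tuned parameters, temperature, top_k and top_p, interact with each other, and a small numeric sketch may make the table easier to read. The function below is an illustration only, not Ollama's implementation: the name `filter_distribution`, the order in which the filters are applied and the toy logits are all assumptions made here.

```python
import math
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float = 0.8,
                        top_k: int = 40, top_p: float = 0.9) -> Dict[str, float]:
    """Return the renormalized token probabilities left after temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Temperature rescales logits before softmax: higher values flatten the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}

    # top-k keeps only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


if __name__ == "__main__":
    toy_logits = {"blue": 2.0, "grey": 1.0, "green": 0.0, "loud": -1.0}
    for temp in (0.2, 0.8, 2.0):
        print(temp, filter_distribution(toy_logits, temperature=temp, top_p=0.9))
```

With a low temperature almost all probability sits on the top token, so top_p cuts the candidate set down to one or two tokens. With a high temperature the distribution flattens and more candidates survive, which is why higher values read as more "creative".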

Modelfile example:

    FROM llama2
    # sets the temperature to 1 [higher is more creative, lower is more coherent]
    PARAMETER temperature 1
    # sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
    PARAMETER num_ctx 4096

    # sets a custom system prompt to specify the behavior of the chat assistant
    SYSTEM You are Mario from super mario bros, acting as an assistant.

Usage:

1. Save it as a file named Modelfile.

2. ollama create NAME -f <file location, e.g. ./Modelfile>

3. ollama run NAME

4. Start using the model.
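
Because a Modelfile is plain text built from the instructions listed earlier, it can also be generated from a script, for example when several parameter variants need to be compared. The helper below is a hypothetical sketch, not part of ollama: the name `render_modelfile` and its arguments are invented for illustration, and it only covers the FROM, PARAMETER and SYSTEM instructions.

```python
from typing import Dict, Optional, Union

ParamValue = Union[int, float, str]


def render_modelfile(base_model: str, parameters: Dict[str, ParamValue],
                     system_prompt: Optional[str] = None) -> str:
    """Build Modelfile text from a base model, PARAMETER values and an optional SYSTEM prompt."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        # String values such as stop sequences are quoted, as in the parameter table's example.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"PARAMETER {name} {rendered}")
    if system_prompt is not None:
        # Triple quotes allow a multi-line system prompt.
        lines.append(f'SYSTEM """\n{system_prompt}\n"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama2",
        {"temperature": 1, "num_ctx": 4096},
        "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.",
    )
    print(text)
```

The printed text can be saved as ./Modelfile and passed to `ollama create NAME -f ./Modelfile` as in the steps above. The sketch does not escape quotes inside string values, so it is only suitable for simple cases.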

It can also be started directly in server mode; the service listens on port 11434 by default.

    ./ollama serve

The API is then available:

    curl -X POST http://localhost:11434/api/generate -d '{
      "model": "llama2",
      "prompt":"Why is the sky blue?"
    }'
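
The same request can be sent from Python using only the standard library. The sketch below assumes that the endpoint streams its reply as newline-delimited JSON objects, each carrying a `response` text fragment and a final object with `done` set to true. That matches my understanding of the API at the time of writing, but treat the field names as assumptions and check them against the ollama API documentation. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON objects until one has done=true."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break


def generate(prompt: str, model: str = "llama2",
             base_url: str = "http://localhost:11434") -> str:
    """POST to /api/generate and join the streamed fragments into one string."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The HTTP response object is iterable line by line, which suits a streamed reply.
    with urllib.request.urlopen(request) as reply:
        return "".join(iter_response_text(reply))
```

Calling `generate("Why is the sky blue?")` requires `ollama serve` to be running locally with the llama2 model already pulled. The parsing step is kept in its own function so it can be exercised without a server.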

In terms of model support, it covers the common models, including a Chinese one (llama2-chinese).

In terms of ecosystem, ollama integrates seamlessly with LangChain, and together they make it easy to build a fully local LLM application.

    from langchain.llms import Ollama
    from langchain.callbacks.manager import CallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    llm = Ollama(base_url="http://localhost:11434",
                 model="llama2",
                 callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))

    llm("Tell me about the history of AI")

Output (as captured, cut off after the second item):

    Great! The history of Artificial Intelligence (AI) is a fascinating and complex topic that spans several decades. Here's a brief overview:

    1. Early Years (1950s-1960s): The term "Artificial Intelligence" was coined in 1956 by computer scientist John McCarthy. However, the concept of AI dates back to ancient Greece, where mythical creatures like Talos and Hephaestus were created to perform tasks without any human intervention. In the 1950s and 1960s, researchers began exploring ways to replicate human intelligence using computers, leading to the development of simple AI programs like ELIZA (1966) and PARRY (1972).
    2. Rule-Based Systems (1970s-1980s): As computing power increased, researchers developed rule-based systems, such as Mycin (1976), which could diagnose medical conditions based on a set of rules. This period also saw the rise of expert systems, like EDICT (1985), which mimicked human experts in specific domains.
    3.

The service also provides local embeddings.

    from langchain.embeddings import OllamaEmbeddings

    oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2")

    oembed.embed_query("Llamas are social animals and live with others as a herd.")

With these pieces, plus Chroma, you can easily build a complete LLM application locally.
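
What a vector store such as Chroma does with these embeddings can be sketched in a few lines: embed the query, compare it with each stored document vector, and return the closest ones. The code below is a simplified stand-in, not Chroma's API. The function names are hypothetical, and `toy_embed` is a deliberately crude fake embedding used only so that the example runs offline.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query: str, documents: List[str],
                embed: Callable[[str], Sequence[float]],
                k: int = 2) -> List[Tuple[float, str]]:
    """Rank documents by cosine similarity between their embedding and the query embedding."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in embedding: counts of a few letters, for offline testing only."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "aeilms"]


if __name__ == "__main__":
    docs = [
        "Llamas are social animals and live with others as a herd.",
        "The sky is blue because of Rayleigh scattering.",
    ]
    for score, doc in top_matches("Do llamas live in herds?", docs, toy_embed):
        print(f"{score:.3f}  {doc}")
```

With a running ollama server, `oembed.embed_query` from the embeddings example can be passed as the `embed` argument in place of `toy_embed`. A real vector database adds persistence and approximate nearest-neighbor indexing on top of this basic idea.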

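To be continued...

For reasons of length, the next article will continue with some mainstream inference servers and chat servers. Stay tuned.

Series index:

1) Tokenization and Embedding

Exploring LLM Application Development (3)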