A Pitfall Guide to Deploying Full-Size DeepSeek R1 with vLLM 0.7.1


Hi everyone. Today I am sharing an article by Cao Yu (曹宇): a guide to the pitfalls of deploying the full-size DeepSeek R1 with vLLM.


        
          
Zhihu: https://zhuanlan.zhihu.com/p/21064432691

      

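Today I saw the vLLM folks announce PP (pipeline parallel) support for DeepSeek R1 on WeChat Moments, so I started tinkering right away. If the huge MoE I am training ever goes live, I had better have the technical groundwork ready, right? Here are the pitfalls I ran into, hopefully in plainer language than the official docs.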


        
          
Distributed Inference and Serving: https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes  

      

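游凯超 (Kaichao You, on Zhihu) said the whole process has to be silky smooth, so the two of us ran a few verifications together. By now Step 0 and Step 3 alone should be enough to get things running; if you hit autoscaler-related problems, Step 1 should resolve them.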

Step 0 Prepare weights & Environment

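The weights are so large that a direct download is not advisable even with a fast connection. Pull a copy from HF and/or a proxy first; a direct connection will most likely time out or saturate your public IP. Today's demo is a multi-node, multi-GPU deployment on 8xH20 (x2), which corresponds to TP size 8 and PP size 2, so you need two such machines. One assumption: the two machines can reach each other over the network (IB is not required) and share storage (NAS or OSS both work). Once that is prepared, you can move on to the first step.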
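The mapping from hardware to parallel sizes can be sketched as below: TP covers the GPUs inside one node and PP covers the nodes. The memory figures in the usage note (96 GB per H20, roughly 700 GB of FP8 weights, an 0.8 headroom factor) are assumptions for illustration, not numbers from this article, and `plan_parallelism` and `weights_fit` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class ParallelPlan:
    tensor_parallel_size: int
    pipeline_parallel_size: int

    @property
    def world_size(self) -> int:
        # Total number of GPUs the deployment occupies.
        return self.tensor_parallel_size * self.pipeline_parallel_size


def plan_parallelism(num_nodes: int, gpus_per_node: int) -> ParallelPlan:
    """TP spans the GPUs inside one node, PP spans the nodes."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return ParallelPlan(gpus_per_node, num_nodes)


def weights_fit(plan: ParallelPlan, gpu_mem_gb: float, weight_gb: float,
                headroom: float = 0.8) -> bool:
    """Rough check: do the weights fit in `headroom` of total GPU memory?

    The remaining memory is left for KV cache and activations (assumed split).
    """
    return weight_gb <= plan.world_size * gpu_mem_gb * headroom


if __name__ == "__main__":
    plan = plan_parallelism(num_nodes=2, gpus_per_node=8)
    print(plan, plan.world_size)
    print(weights_fit(plan, gpu_mem_gb=96, weight_gb=700))
```

Under these assumed figures a single 8-GPU node fails the check while two nodes pass, which is consistent with the two-machine setup used here.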

Step 1 Set up Ray & Cluster

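The official docs skim over this part, but it is where I was stuck the longest. What the docs mean is: prepare two nodes and use the ray start CLI to form a Ray cluster between them, because it is needed later. There are two pitfalls. The first is that the start command seems slightly off: on my first few attempts I kept hitting Ray autoscaler errors: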


        
          
(autoscaler +1m19s) Error: No available node types can fulfill resource request {'node:33.18.26.153': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.  
(autoscaler +1m54s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.  
(autoscaler +2m29s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'node:33.18.26.153': 0.001}. Add suitable node types to this cluster to resolve this issue.  
INFO 02-02 09:39:14 ray_utils.py:212] Waiting for creating a placement group of specs for 150 seconds. specs=[{'node:33.18.26.153': 0.001, 'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.  

      

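This looks odd, because what vLLM requests from the Ray cluster is a custom resource, 'node:33.18.26.153': 0.001, which can be read as vLLM preferring the driver node. But as far as I remember, this is something you have to set yourself when starting Ray: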


        
          
https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources  

      

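Only then does such a resource exist. The underlying reason is that a machine with multiple (virtual) NICs sits on multiple network segments, and vLLM assumes the POD IP is used to address the Ray master.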
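A quick way to confirm this diagnosis is to compare the node:<ip> key that vLLM asks for with the node:<ip> keys the cluster actually advertises. The helper below is a minimal sketch; `find_node_resource_mismatch` is a hypothetical name, and it works on a plain dict so it can be fed the result of `ray.cluster_resources()` on a live cluster.

```python
def find_node_resource_mismatch(cluster_resources, host_ip):
    """Return (ok, node_keys).

    ok is True when 'node:<host_ip>' is among the cluster's resources;
    node_keys lists the per-node resource keys that do exist.
    """
    node_keys = sorted(
        key for key in cluster_resources
        # Skip Ray's internal head marker; keep only per-IP keys.
        if key.startswith("node:") and key != "node:__internal_head__"
    )
    return f"node:{host_ip}" in cluster_resources, node_keys


if __name__ == "__main__":
    # On a real cluster (assumption: Ray is already started):
    #   import ray; ray.init(address="auto")
    #   resources = ray.cluster_resources()
    resources = {"GPU": 16.0, "CPU": 384.0, "node:10.0.0.1": 1.0, "node:10.0.0.2": 1.0}
    print(find_node_resource_mismatch(resources, "33.18.26.153"))
```

If the first element is False, the IP vLLM resolved for the host is not one of the IPs Ray registered, which matches the error shown in the log above.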

Solution 1: Set VLLM_HOST_IP


        
          
# Get local IP address and set on every node before Ray start  
VLLM_HOST_IP=$(hostname -I | awk '{print $1}')  
export VLLM_HOST_IP  
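Note that the awk expression simply takes the first address printed by hostname -I, which on a multi-NIC machine is not necessarily the one Ray uses. The sketch below shows the same selection in Python with an optional subnet filter; `pick_host_ip` and the example subnet are hypothetical and only illustrate the idea.

```python
import ipaddress


def pick_host_ip(hostname_i_output, subnet=None):
    """Pick an address from the output of `hostname -I`.

    Without `subnet`, return the first address (same as awk '{print $1}').
    With `subnet` (e.g. "33.18.0.0/16"), return the first address inside it.
    """
    addrs = hostname_i_output.split()
    if not addrs:
        raise ValueError("no addresses found")
    if subnet is None:
        return addrs[0]
    net = ipaddress.ip_network(subnet)
    for addr in addrs:
        ip = ipaddress.ip_address(addr)
        # Compare only addresses of the same IP version as the subnet.
        if ip.version == net.version and ip in net:
            return addr
    raise ValueError(f"no address in {subnet}")


if __name__ == "__main__":
    output = "172.17.0.1 33.18.26.153 fe80::1"
    print(pick_host_ip(output))                  # first address
    print(pick_host_ip(output, "33.18.0.0/16"))  # address on the chosen subnet
```

Whatever address is chosen should be exported as VLLM_HOST_IP on every node before ray start, as in the shell snippet above.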

      

Solution 2: Customize the Ray startup logic


        
          
import socket

import ray

def get_actual_ip():
    """Get the actual IP address of the current machine."""  
    try:  
        # Create a socket to connect to an external server (doesn't actually connect)  
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  
        s.connect(('8.8.8.8', 80))  
        ip = s.getsockname()[0]  
        s.close()  
        return ip  
    except Exception:  
        # Fallback to hostname-based IP resolution  
        return socket.gethostbyname(socket.gethostname())  
  
def start_ray_cluster():  
    free_ports = get_free_ports()  
    port = free_ports[0]  
    node_manager_port = free_ports[1]  
    master_addr = get_master_addr()  
    rank = get_rank()  
    node_ip = get_actual_ip()  # Use the new function to get actual IP  
      
    # Define custom resource based on node IP  
    resource_spec = f'--resources=\'{{"node:{node_ip}": 1}}\''  
      
    if rank == 0:  
        cmd = f"ray start --head --port={port} --node-ip-address={master_addr} --node-manager-port {node_manager_port} --node-name={master_addr} {resource_spec}"
    else:
        cmd = f"ray start --address={master_addr}:{port} --node-manager-port {node_manager_port} --node-name={get_addr()} {resource_spec}"
      
    if ray.is_initialized():  
        print("Ray is already initialized, skipping node level init.")  
    else:  
        stop_cmd = "ray stop"  
        execute(stop_cmd, check=True)  
        print(f"Executing Ray start command: {cmd}")  
        execute(cmd, check=True)  
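The command construction is the part most prone to quoting mistakes, so it can help to isolate it in a pure function that is easy to print and inspect. This is a sketch of the same idea, not the author's code: `build_ray_start_cmd` is a hypothetical helper, and it uses json.dumps plus shlex.join to quote the --resources value instead of hand-written escapes.

```python
import json
import shlex


def build_ray_start_cmd(rank, master_addr, port, node_manager_port, node_ip, node_name):
    """Build the `ray start` command line for the head (rank 0) or a worker."""
    # Custom resource keyed by the node's real IP, as in resource_spec above.
    resources = json.dumps({f"node:{node_ip}": 1})
    common = [
        f"--node-manager-port={node_manager_port}",
        f"--node-name={node_name}",
        f"--resources={resources}",
    ]
    if rank == 0:
        args = ["ray", "start", "--head", f"--port={port}",
                f"--node-ip-address={master_addr}"] + common
    else:
        args = ["ray", "start", f"--address={master_addr}:{port}"] + common
    # shlex.join quotes each argument so the JSON survives the shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_ray_start_cmd(0, "10.0.0.1", 6379, 6380, "10.0.0.1", "10.0.0.1"))
    print(build_ray_start_cmd(1, "10.0.0.1", 6379, 6380, "10.0.0.2", "worker-1"))
```

The resulting string can be passed to execute in place of the f-string version.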

      

execute can be written like this:


        
          
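import time
import subprocess

def execute(cmd, check=False, retry=1):
    """Run a shell command and return (success, output), retrying on failure."""
    # Do not pass check to subprocess.run: it would raise before any retry.
    ret = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    state = ret.returncode == 0
    msg = ret.stdout if state else ret.stderr
    if not state and retry > 1:
        print(f"execute {cmd} got error {msg}, retry...")
        time.sleep(1)
        return execute(cmd, check, retry - 1)
    if not state and check:
        # Retries exhausted: raise, as check=True would have done.
        raise subprocess.CalledProcessError(ret.returncode, cmd, ret.stdout, ret.stderr)
    return state, msg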

      

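A few words on Ray basics: people rarely run Ray on bare metal; most deep learning resources are handed out by k8s together with plugins such as kubeflow or volcano. The environment variables carry information such as the current rank and the head node's master_addr, so you can implement these functions to suit your setup. The tricky {resource_spec} part I have already filled in for you.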
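As one possible starting point, the helpers referenced in start_ray_cluster could read their values from environment variables. The variable names here are assumptions: RANK and MASTER_ADDR follow the convention of PyTorch-style jobs, while MY_RAY_PORT and MY_RAY_NODE_MANAGER_PORT are made-up names for this sketch; your scheduler may expose different ones.

```python
import os
import socket


def get_rank(env=os.environ):
    """Index of this node in the job; 0 is the head (assumed variable: RANK)."""
    return int(env.get("RANK", "0"))


def get_master_addr(env=os.environ):
    """Address of the head node (assumed variable: MASTER_ADDR)."""
    return env.get("MASTER_ADDR", "127.0.0.1")


def get_free_ports(env=os.environ):
    """Return [ray_port, node_manager_port].

    Workers connect to master_addr:port, so every node must get the same
    values; they are therefore read from the environment rather than probed
    locally. The variable names and the second default are hypothetical.
    """
    return [
        int(env.get("MY_RAY_PORT", "6379")),
        int(env.get("MY_RAY_NODE_MANAGER_PORT", "6380")),
    ]


def get_addr():
    """Name used for --node-name on workers; the hostname is one simple choice."""
    return socket.gethostname()


if __name__ == "__main__":
    print(get_rank(), get_master_addr(), get_free_ports(), get_addr())
```

Passing a plain dict as env makes the helpers easy to check without touching the real environment.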

Step 2 Other small bugs

Two more errors came up along the way and took a little time to fix:


        
          
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 5, in <module>
    from vllm.scripts import main
  File "/usr/local/lib/python3.10/dist-packages/vllm/__init__.py", line 4, in <module>
    from vllm.engine.async_llm_engine import AsyncLLMEngine
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 15, in <module>
    from vllm.engine.llm_engine import (DecoderPromptComponents, LLMEngine,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 24, in <module>
    from vllm.engine.output_processor.interfaces import (
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/interfaces.py", line 6, in <module>
    from vllm.engine.output_processor.stop_checker import StopChecker
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/stop_checker.py", line 6, in <module>
    from vllm.transformers_utils.tokenizer import AnyTokenizer
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizer.py", line 13, in <module>
    from vllm.transformers_utils.tokenizers import (BaichuanTokenizer,
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/__init__.py", line 2, in <module>
    from vllm.transformers_utils.tokenizers.mistral import MistralTokenizer
  File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/tokenizers/mistral.py", line 9, in <module>
    from mistral_common.tokens.tokenizers.mistral import ChatCompletionRequest
  File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/mistral.py", line 32, in <module>
    from mistral_common.tokens.tokenizers.multimodal import (
  File "/usr/local/lib/python3.10/dist-packages/mistral_common/tokens/tokenizers/multimodal.py", line 6, in <module>
    import cv2
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 175, in bootstrap
    if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
    py_module = importlib.import_module(module_name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.10/dist-packages/cv2/typing/__init__.py", line 171, in <module>
    LayerId = cv2.dnn.DictValue
AttributeError: module 'cv2.dnn' has no attribute 'DictValue'

      

This is a legacy OpenCV problem; pinning the opencv version solves it:


        
          
pip install opencv-python-headless==4.5.4.58  
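The article does not say what put the environment into this state. One common cause (an assumption here) is that more than one of the OpenCV wheels is installed, since they all provide the same cv2 module and can overwrite each other. The sketch below lists which of them are present; `installed_opencv_versions` and `has_conflict` are hypothetical helper names.

```python
from importlib import metadata

# The four PyPI distributions that all install the `cv2` module.
OPENCV_DISTS = (
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "opencv-contrib-python-headless",
)


def installed_opencv_versions(dists=OPENCV_DISTS):
    """Return {distribution name: version} for the OpenCV wheels that are installed."""
    found = {}
    for name in dists:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this variant is not installed
    return found


def has_conflict(found):
    """More than one variant installed means they share (and may clobber) cv2."""
    return len(found) > 1


if __name__ == "__main__":
    versions = installed_opencv_versions()
    print(versions, "conflict" if has_conflict(versions) else "ok")
```

If several variants show up, uninstalling all of them before installing the pinned version keeps the cv2 package consistent.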

      

The other is a TypeError raised after the model loads:


        
          
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 472, in forward
[rank0]:     kv_c, k_pe = self.kv_a_proj_with_mqa(hidden_states)[0].split(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 246, in forward
[rank0]:     output = self.quant_method.apply(self, x, bias)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 357, in apply
[rank0]:     return apply_w8a8_block_fp8_linear(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 61, in apply_w8a8_block_fp8_linear
[rank0]:     output = w8a8_block_fp8_matmul(q_input,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 470, in w8a8_block_fp8_matmul
[rank0]:     configs = get_w8a8_block_fp8_configs(N, K, block_size[0], block_size[1])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 407, in get_w8a8_block_fp8_configs
[rank0]:     device_name = current_platform.get_device_name().replace(" ", "_")
[rank0]: TypeError: a bytes-like object is required, not 'str'

      

Fixed by upgrading pynvml:


        
          
pip install pynvml -U  
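The error message itself hints at the mechanism: the device name arrived as bytes, and calling replace with str arguments on a bytes object raises exactly this TypeError. That the older pynvml is what returns bytes is an inference from the fix, not something the article states. The sketch reproduces the failure in isolation; `normalize_device_name` is a hypothetical helper, not vLLM code.

```python
def reproduce_type_error():
    """Call bytes.replace with str arguments and return the error message."""
    try:
        b"NVIDIA H20".replace(" ", "_")
    except TypeError as exc:
        return str(exc)
    return None


def normalize_device_name(name):
    """Accept bytes or str and return the name with spaces replaced by underscores."""
    if isinstance(name, bytes):
        name = name.decode("utf-8")
    return name.replace(" ", "_")


if __name__ == "__main__":
    print(reproduce_type_error())
    print(normalize_device_name(b"NVIDIA H20"))
    print(normalize_device_name("NVIDIA H20"))
```

Upgrading the library, as done above, is the simpler fix because it needs no change to vLLM itself.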

      

Step 3 Run the model

This step turns out to be the easiest:


        
          
vllm serve /your/path/to_checkpoint_deepseek-r1/ --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --host 0.0.0.0  
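Loading a checkpoint of this size takes a while, so a client script may want to wait until the server answers before sending requests. The sketch below polls the OpenAI-compatible /v1/models endpoint with the standard library; the function name, the port 8000 URL, and the timeout values are assumptions for illustration.

```python
import http.client
import time
import urllib.request


def wait_until_ready(url, timeout_s=1800.0, interval_s=5.0, request_timeout_s=5.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses.

    Returns True once the server responds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8000/v1/models", timeout_s=10.0)
    print("server ready" if ready else "server not ready yet")
```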

      

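With PP in place, those without IB can also try raising the sequence length and batch size a bit. Try it with the Reasoning Output feature contributed by gaoce, either on the same machine or from another machine after changing localhost: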


        
          
from openai import OpenAI  
  
# Modify OpenAI's API key and API base to use vLLM's API server.  
openai_api_key = "EMPTY"  
openai_api_base = "http://localhost:8000/v1"  
  
client = OpenAI(  
    api_key=openai_api_key,  
    base_url=openai_api_base,  
)  
  
models = client.models.list()  
model = models.data[0].id  
  
# Round 1  
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]  
response = client.chat.completions.create(model=model, messages=messages)  
  
reasoning_content = response.choices[0].message.reasoning_content  
content = response.choices[0].message.content  
  
print("reasoning_content:", reasoning_content)
print("content:", content)  

      

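Right, it has not hung; your wallet just is not thick enough. Switch to the server log and you can see what is happening with this prompt: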


        
          
INFO 02-02 14:18:52 metrics.py:453] Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.  
INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:02 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:07 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:12 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:17 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:22 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-02 14:19:27 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
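To turn log lines like these into a rough waiting time, the generation throughput can be extracted and divided into the expected output length. This is a small sketch; the regular expression is written against the log format shown above, and the 2000-token response length in the usage is an arbitrary assumption.

```python
import re

# Matches the "Avg generation throughput" field of the metrics log lines.
_GEN_RE = re.compile(r"Avg generation throughput: ([0-9.]+) tokens/s")


def generation_throughputs(log_text):
    """Return the generation throughput (tokens/s) found on each log line."""
    return [float(value) for value in _GEN_RE.findall(log_text)]


def estimate_seconds(num_tokens, tokens_per_s):
    """Time needed to generate `num_tokens` at a steady `tokens_per_s`."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return num_tokens / tokens_per_s


if __name__ == "__main__":
    sample = (
        "INFO 02-02 14:18:57 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 20.7 tokens/s, Running: 1 reqs\n"
        "INFO 02-02 14:19:02 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 20.5 tokens/s, Running: 1 reqs\n"
    )
    rates = generation_throughputs(sample)
    average = sum(rates) / len(rates)
    print(rates, estimate_seconds(2000, average))
```

At around 20 tokens/s for a single request, a long chain of reasoning takes noticeable wall-clock time, which is why the client appears to stall.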

      

Wait a moment and it will tell you that 9.8 is greater.

Good luck with your tinkering, and thanks to the vLLM community for their work.


        
          
https://github.com/vllm-project/vllm/pull/12679  

      

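Kaichao is truly awesome, doing hands-on customer support over the Spring Festival, haha. RL folks, whether you majored in the humanities or the sciences, go study infra first.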
