Reference: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels
Environment
CUDA: 12.2
GPU memory: 40 GB
Python package manager: conda
LLM: Qwen3-8B
Install vLLM
1) Create a conda environment
# Create a conda virtual environment named vllm, with Python 3.10
conda create -n vllm python=3.10
2) Activate the vllm environment
conda activate vllm
3) Install vLLM
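# Install the latest nightly pre-release wheel from the vLLM wheel index
# (--pre allows pre-release versions; the extra index hosts the nightly builds)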
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
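After the installation finishes, a quick sanity check is worthwhile. A minimal sketch, assuming the vllm environment is active and PyTorch (pulled in as a vLLM dependency) can see the GPU:

# Run inside the activated vllm environment to verify the install
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))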
Start the API server
Reference: https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html#
vllm serve Qwen/Qwen3-8B
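Once the server reports it is running, you can confirm it is reachable by listing the models it serves. A minimal sketch using the openai client against the default port 8000 (install the client first with pip install openai if needed):

from openai import OpenAI

# vLLM does not require a real API key by default; "EMPTY" is conventional
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # should print Qwen/Qwen3-8B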
Chat
curl
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "You are now Liu Bei and I am Guan Yu; please stay in character for this conversation. Brother, when shall we restore the Han dynasty?"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
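These sampling values (temperature 0.6, top_p 0.95, top_k 20) follow the settings Qwen's documentation recommends for Qwen3 in thinking mode.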
python
from openai import OpenAI

# Set the OpenAI API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "You are now Liu Bei and I am Guan Yu; please stay in character for this conversation. Brother, when shall we restore the Han dynasty?"},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    },
)
print("Chat response:", chat_response)