Reference: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels
Environment
CUDA: 12.2
GPU memory: 40 GB
Python package manager: conda
LLM: Qwen3-8B
Install vLLM
1) Create a conda environment
# Create a conda virtual environment named vllm, with Python 3.10
conda create -n vllm python=3.10
2) Activate the vllm environment
conda activate vllm
3) Install vLLM
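# Install the latest nightly pre-release wheel from the vLLM wheel index
# (--pre allows pre-release versions; the extra index hosts the nightly builds)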
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
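After the installation finishes, a quick sanity check is worthwhile. A minimal sketch, assuming the vllm environment is active and PyTorch (pulled in as a vLLM dependency) can see the GPU:

# Run inside the activated vllm environment to verify the install
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))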
Start the API server
Reference: https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html#
vllm serve Qwen/Qwen3-8B
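Once the server reports it is running, you can confirm it is reachable by listing the models it serves. A minimal sketch using the openai client against the default port 8000 (install the client first with pip install openai if needed):

from openai import OpenAI

# vLLM does not require a real API key by default; "EMPTY" is conventional
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # should print Qwen/Qwen3-8B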
Chat
curl
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "You are now Liu Bei and I am Guan Yu; please stay in character for this conversation. Brother, when shall we restore the Han dynasty?"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
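These sampling values (temperature 0.6, top_p 0.95, top_k 20) follow the settings Qwen's documentation recommends for Qwen3 in thinking mode.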
python
from openai import OpenAI

# Set the OpenAI API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "You are now Liu Bei and I am Guan Yu; please stay in character for this conversation. Brother, when shall we restore the Han dynasty?"},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    },
)
print("Chat response:", chat_response)