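Before starting, here is a supplement to the QwQ-32B performance test from the previous post: load-test results for QwQ-32B-AWQ deployed on two RTX 4090 cards. The inference service was started with: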
VLLM_LOG_LEVEL=DEBUG PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
nohup vllm serve "/models/qwen/QwQ-32B-AWQ" \
--host 0.0.0.0 --port 7899 --served-model-name "QwQ-32B" \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.82 \
--max-model-len 16584 \
--max-num-batched-tokens 32768 \
--block-size 16 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--trust-remote-code \
--dtype auto > qwq-32b-awq.log 2>&1 &
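The test script was also updated slightly: max_tokens now accepts several comma-separated values. Every combination of max_tokens and concurrency is then covered in a single run, which saves effort. The load-test command:
# Run on the same server: for max_tokens of 100, 1024 and 16384, test concurrency levels 1, 10, 20, 30, 40 and 50: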
nohup python3 -u simple-bench-to-api.py --url http://10.9.30.34:7899/v1 \
--model QwQ-32B \
--concurrencys 1,10,20,30,40,50 \
--prompt "Introduce the history of China" \
--max_tokens 100,1024,16384 \
--duration_seconds 30 \
> benth-qwq32b-2-4090.log 2>&1 &
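The comma-separated options can be expanded into a test matrix roughly as follows. This is a minimal sketch and not the actual code of simple-bench-to-api.py; the helper names `parse_int_list` and `build_matrix` are hypothetical.

```python
from itertools import product


def parse_int_list(value: str) -> list:
    """Parse a comma-separated string such as "100,1024,16384" into integers."""
    return [int(item) for item in value.split(",") if item.strip()]


def build_matrix(max_tokens: str, concurrencys: str) -> list:
    """Return every (max_tokens, concurrency) pair, grouped by max_tokens."""
    return list(product(parse_int_list(max_tokens), parse_int_list(concurrencys)))


if __name__ == "__main__":
    # Values taken from the load-test command above.
    for tokens, concurrency in build_matrix("100,1024,16384", "1,10,20,30,40,50"):
        print(f"max_tokens={tokens} concurrency={concurrency}")
```

With three max_tokens values and six concurrency levels, one invocation covers 18 combinations.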
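Test results:
[Table: summary of load-test results for max_tokens=100]
[Table: summary of load-test results for max_tokens=1024]
[Table: summary of load-test results for max_tokens=16384]
With this deployment, total throughput actually drops once concurrency exceeds 40, which suggests that 40 concurrent requests is a good cost-performance balance point.
The load-test script simple-bench-to-api.py and its usage can be found in my previous post, "Concurrency Performance of Small DeepSeek-R1 Models Deployed on a Single 4090".
Now for the main topic.
Main text
I first came across lm-evaluation-harness in the s1 paper co-authored by Fei-Fei Li (https://arxiv.org/html/2501.19393v3), and recently found time to try it.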
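lm-evaluation-harness is a project for evaluating generative language models. It provides a unified framework for testing language models on a large number of evaluation tasks.
Recent updates:
- March 2025: added support for steering Hugging Face (HF) models.
- February 2025: added support for SGLang.
- September 2024: prototyping began on letting users create and evaluate tasks with text + image multimodal input and text output; the hf-multimodal and vllm-vlm model types and the mmmu task were added as prototype features.
- July 2024: API model support was updated and refactored, adding batched and asynchronous requests and making it easier to customize and use; the Open LLM Leaderboard tasks were added.
References:
- Source code: https://github.com/EleutherAI/lm-evaluation-harness
- Local dataset configuration docs: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md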
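My goal is to evaluate the capabilities of large models that are already deployed privately on-premises and, given network conditions in China, to run everything locally with no external network access. By default, however, this project pulls models and data from HuggingFace and launches the model itself for evaluation. It also has a few small bugs. So it needs some light adaptation.
Before starting, prepare the dataset used for evaluation. This post uses hellaswag as the example; it is also a dataset commonly used in the official lm-evaluation-harness examples. hellaswag is a commonsense reasoning dataset designed to test whether a model understands a context well enough to choose a plausible continuation.
Download the dataset to the local directory /data/ai/datasets/Rowan/hellaswag: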
# ls -l /data/ai/datasets/Rowan/hellaswag/
-rw-r--r-- 1 root root 261 Mar 5 06:46 checksums.sha
-rw-r--r-- 1 root root 2526 Mar 5 03:30 dataset_infos.json
-rw-r--r-- 1 root root 1174 Mar 5 03:30 .gitattributes
-rw-r--r-- 1 root root 4689 Mar 5 06:46 hellaswag.py
-rw-r--r-- 1 root root 11752147 Mar 5 06:38 hellaswag_test.jsonl
-rw-r--r-- 1 root root 47496131 Mar 5 06:33 hellaswag_train.jsonl
-rw-r--r-- 1 root root 12246618 Mar 5 06:38 hellaswag_val.jsonl
-rw-r--r-- 1 root root 6845 Mar 5 03:30 README.md
-rw-r--r-- 1 root root 253 Mar 5 06:47 test_hellaswag.py
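Note that the three jsonl files may need to be downloaded manually. After downloading the dataset with huggingface-cli, these three files were empty locally, even though the originals do have content.
Sample dataset contents: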
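A quick way to catch this before running an evaluation is to scan the dataset directory for zero-byte jsonl files. This is a minimal sketch; the function name `find_empty_files` is hypothetical, and the path is the one used in this post.

```python
from pathlib import Path


def find_empty_files(dataset_dir: str, pattern: str = "*.jsonl") -> list:
    """Return the names of files matching `pattern` that are zero bytes long."""
    root = Path(dataset_dir)
    return sorted(p.name for p in root.glob(pattern) if p.stat().st_size == 0)


if __name__ == "__main__":
    empty = find_empty_files("/data/ai/datasets/Rowan/hellaswag")
    print("empty files:", empty or "none")
```

Any file name it reports should be downloaded again by hand.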
# head -1 hellaswag_test.jsonl
{"ind": 14, "activity_label": "Wakeboarding", "ctx_a": "A man is being pulled on a water ski as he floats in the water casually.", "ctx_b": "he", "ctx": "A man is being pulled on a water ski as he floats in the water casually. he", "split": "test", "split_type": "indomain", "endings": ["mounts the water ski and tears through the water at fast speeds.", "goes over several speeds, trying to stay upright.", "struggles a little bit as he talks about it.", "is seated in a boat with three other people."], "source_id": "activitynet~v_-5KAycAQlC4"}
# head -1 hellaswag_train.jsonl
{"ind": 4, "activity_label": "Removing ice from car", "ctx_a": "Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles.", "ctx_b": "then", "ctx": "Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. then", "split": "train", "split_type": "indomain", "label": 3, "endings": [", the man adds wax to the windshield and cuts it.", ", a person board a ski lift, while two men supporting the head of the person wearing winter clothes snow as the we girls sled.", ", the man puts on a christmas coat, knitted with netting.", ", the man continues removing the snow on his car."], "source_id": "activitynet~v_-1IBHYS3L-Y"}
# head -1 hellaswag_val.jsonl
{"ind": 24, "activity_label": "Roof shingle removal", "ctx_a": "A man is sitting on a roof.", "ctx_b": "he", "ctx": "A man is sitting on a roof. he", "split": "val", "split_type": "indomain", "label": 3, "endings": ["is using wrap to wrap a pair of skis.", "is ripping level tiles off.", "is holding a rubik's cube.", "starts pulling up roofing on a roof."], "source_id": "activitynet~v_-JhWjGDPHMY"}
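To make the record layout concrete, the sketch below parses one line and pairs the context with each candidate ending. It is a simplification for illustration only and does not reproduce the preprocessing that lm-evaluation-harness applies; the helper names are hypothetical. Note that the test split carries no label field, as the samples above show.

```python
import json
from typing import Optional


def load_record(line: str) -> dict:
    """Parse one JSONL line into a dict."""
    return json.loads(line)


def candidate_texts(record: dict) -> list:
    """Join the context with each of the candidate endings."""
    return [f'{record["ctx"]} {ending}' for ending in record["endings"]]


def gold_index(record: dict) -> Optional[int]:
    """Return the label as an int, or None when the split carries no label."""
    label = record.get("label")
    return None if label is None else int(label)


if __name__ == "__main__":
    # Shortened copy of the validation sample shown above.
    line = json.dumps({
        "ind": 24,
        "ctx": "A man is sitting on a roof. he",
        "label": 3,
        "endings": [
            "is using wrap to wrap a pair of skis.",
            "is ripping level tiles off.",
            "is holding a rubik's cube.",
            "starts pulling up roofing on a roof.",
        ],
    })
    record = load_record(line)
    for i, text in enumerate(candidate_texts(record)):
        marker = "*" if i == gold_index(record) else " "
        print(marker, text)
```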
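Download the configuration files of the models under test to local disk, because the tool needs the tokenizer and other configuration from the model: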
# ll /data/ai/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
-rw-r--r-- 1 root root 680 Feb 12 07:08 config.json
-rw-r--r-- 1 root root 73 Feb 12 07:08 configuration.json
-rw-r--r-- 1 root root 181 Feb 12 07:08 generation_config.json
-rw-r--r-- 1 root root 28090 Feb 12 07:45 model.safetensors.index.json
-rw-r--r-- 1 root root 3071 Feb 12 07:46 tokenizer_config.json
-rw-r--r-- 1 root root 7031660 Feb 12 07:46 tokenizer.json
# ll /data/ai/models/qwen/QwQ-32B-AWQ/
-rw-r--r-- 1 root root 707 Mar 6 14:33 added_tokens.json
-rw-r--r-- 1 root root 864 Mar 6 14:34 config.json
-rw-r--r-- 1 root root 73 Mar 6 14:34 configuration.json
-rw-r--r-- 1 root root 243 Mar 6 14:34 generation_config.json
-rw-r--r-- 1 root root 1671853 Mar 6 14:34 merges.txt
-rw-r--r-- 1 root root 136515 Mar 6 14:43 model.safetensors.index.json
-rw-r--r-- 1 root root 613 Mar 6 14:43 special_tokens_map.json
-rw-r--r-- 1 root root 8081 Mar 6 14:43 tokenizer_config.json
-rw-r--r-- 1 root root 7031645 Mar 6 14:43 tokenizer.json
-rw-r--r-- 1 root root 2776833 Mar 6 14:44 vocab.json
Note that large weight files such as safetensors, and other unrelated files, are not needed.
Clone the code into a clean directory:
git clone https://github.com/EleutherAI/lm-evaluation-harness
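Then make the following three changes:
1. Fix the API key not taking effect
In the project root lm-evaluation-harness, edit lm_eval/models/api_models.py so that the header returns the API key directly:
def header(self) -> dict:
    """Override this property to return the headers for the API request."""
    # return {"Authorization": f"Bearer {self.api_key}"}
    # Replace SERVER_API_KEY with the API key configured on the server
    return {"Authorization": "Bearer SERVER_API_KEY"}
Of course, this is the lazy fix. A general solution would pass the key in as a parameter and thread it all the way through to this point.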
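A slightly less hard-coded variant is to read the key from an environment variable. The sketch below is standalone and is not the harness's own mechanism; the variable name `LM_EVAL_API_KEY` and the function `build_auth_header` are assumptions made up for this illustration.

```python
import os


def build_auth_header(env_var: str = "LM_EVAL_API_KEY") -> dict:
    """Build the Authorization header from an environment variable.

    `LM_EVAL_API_KEY` is a hypothetical name chosen for this sketch.
    Returns an empty dict when the variable is unset, so no header is sent.
    """
    api_key = os.environ.get(env_var, "")
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}


if __name__ == "__main__":
    os.environ["LM_EVAL_API_KEY"] = "sk-example"
    print(build_auth_header())
```

This keeps the key out of the source tree, so the same image can be reused against servers with different keys.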
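2. Ollama compatibility
The vLLM inference API response contains "logprobs", while the ollama inference API response does not have this field. So the code needs a compatibility change: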
--- a/lm_eval/models/openai_completions.py
+++ b/lm_eval/models/openai_completions.py
@@ -76,14 +76,20 @@ class LocalCompletionsAPI(TemplateAPI):
                 sorted(out["choices"], key=itemgetter("index")), ctxlens
             ):
                 assert ctxlen > 0, "Context length must be greater than 0"
-                logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1])
-                tokens_logprobs = choice["logprobs"]["token_logprobs"][ctxlen:-1]
-                top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1]
+                logprobs = 0.0
                 is_greedy = True
-                for tok, top in zip(tokens_logprobs, top_logprobs):
-                    if tok != max(top.values()):
-                        is_greedy = False
-                        break
+
+                # Check whether the "logprobs" field is present
+                if "logprobs" in choice:
+                    # If present, process it as usual
+                    logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1])
+                    tokens_logprobs = choice["logprobs"]["token_logprobs"][ctxlen:-1]
+                    top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1]
+                    for tok, top in zip(tokens_logprobs, top_logprobs):
+                        if tok != max(top.values()):
+                            is_greedy = False
+                            break
                 res.append((logprobs, is_greedy))
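The patched logic can be exercised outside the harness. The sketch below restates it as a standalone function (`score_choice` is a hypothetical name) and feeds it two made-up responses, one with logprobs and one without. It also makes the consequence of the fallback visible: when the field is missing, every choice gets a score of 0.0, so the candidates can no longer be ranked against each other.

```python
def score_choice(choice: dict, ctxlen: int) -> tuple:
    """Return (sum of continuation logprobs, is_greedy) for one API choice."""
    logprobs = 0.0
    is_greedy = True
    if "logprobs" in choice:
        # Keep only the continuation tokens: skip the context, drop the last token.
        token_logprobs = choice["logprobs"]["token_logprobs"][ctxlen:-1]
        top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1]
        logprobs = sum(token_logprobs)
        for tok, top in zip(token_logprobs, top_logprobs):
            # A token is greedy only if it is the most likely one at its position.
            if tok != max(top.values()):
                is_greedy = False
                break
    return logprobs, is_greedy


if __name__ == "__main__":
    # Made-up responses for illustration only.
    with_logprobs = {
        "logprobs": {
            "token_logprobs": [None, -0.5, -1.0, -2.0],
            "top_logprobs": [None, {"a": -0.5}, {"b": -0.2, "c": -1.0}, {"d": -2.0}],
        }
    }
    without_logprobs = {"text": "some completion"}
    print(score_choice(with_logprobs, ctxlen=1))
    print(score_choice(without_logprobs, ctxlen=1))
```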
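3. Point the dataset configuration at a local directory
Take the hellaswag path downloaded earlier, /data/ai/datasets/Rowan/hellaswag, as the example. The /data/ai/datasets directory will later be mounted into the container as /datasets, so the dataset directory inside the container becomes /datasets/Rowan/hellaswag. Change the task configuration file accordingly: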
--- a/lm_eval/tasks/hellaswag/hellaswag.yaml
+++ b/lm_eval/tasks/hellaswag/hellaswag.yaml
@@ -1,7 +1,7 @@
 tag:
   - multiple_choice
 task: hellaswag
-dataset_path: hellaswag
+dataset_path: /datasets/Rowan/hellaswag
 dataset_name: null
 output_type: multiple_choice
 training_split: train
@@ -21,4 +21,8 @@ metric_list:
 metadata:
   version: 1.0
 dataset_kwargs:
+  data_dir: /datasets/Rowan/hellaswag
   trust_remote_code: true
Next, build a Docker image for the evaluation tool. The Dockerfile:
# The base image ships Python 3.11.10
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel
# Copy the lm-evaluation project code into the image
COPY lm-evaluation-harness /home/lm-evaluation-harness
# Set the working directory
WORKDIR /home/lm-evaluation-harness
# Use the Aliyun mirror as the pip index
ENV PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/
# Install basic tools
RUN --mount=type=cache,target=/var/cache/apt \
    apt update && \
    apt install -y apt-utils cmake build-essential ninja-build && \
    apt install -y vim net-tools telnet curl wget netcat lsof git
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# pip3 install packaging ninja cpufeature numpy
# Everything else is already in the base image
RUN --mount=type=cache,target=/root/.cache/pip ls -l /root/.cache/pip && \
    pip install -e . && \
    pip install tenacity openai
# Clear the parent image's ENTRYPOINT
ENTRYPOINT []
CMD []
Save this as lm-evaluation.dockerfile, next to the lm-evaluation-harness source directory. Then run the following build command:
nohup docker build -f lm-evaluation.dockerfile -t lm-evaluation:20250306_cu121 . > build.log 2>&1 &
tail -100f build.log
Start a docker container from the built image:
# docker run --name lm-evaluation -itd \
  -v /data/ai/models:/models \
  -v /data/ai/datasets:/datasets \
  -v /data/ai/workspace/lm-evaluation:/workspace \
  lm-evaluation:20250306_cu121 bash
# docker exec -it lm-evaluation bash
root@f21a5d52da6c:/home/lm-evaluation-harness# pip list | grep -E 'lm|tenacity|openai'
lm_eval 0.4.8 /home/lm-evaluation-harness
openai 1.65.5
tenacity 9.0.0
DeepSeek-R1-Distill-Qwen-7B
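Inside the running container, run the following command to evaluate a model inference service that was started beforehand. In this example, DeepSeek-R1-Distill-Qwen-7B is served by vLLM on a single RTX 4090.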
LOGLEVEL=DEBUG PYTHONPATH=. python -m lm_eval --model local-completions \
--model_args model=DeepSeek-R1-Distill-Qwen-7B,\
base_url=http://10.96.2.221:7869/v1/completions,\
tokenizer=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,\
config=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/config.json,\
use_fast_tokenizer=True,\
num_concurrent=1,max_retries=3,tokenized_requests=False \
--tasks hellaswag \
--batch_size 1
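Where:
- --model local-completions makes lm_eval take the code path that calls a local API.
- --tasks hellaswag specifies the task list; each task corresponds to a subdirectory under lm_eval/tasks. Here we run only the hellaswag task.
- --batch_size 1 is enough; larger values can easily cause OOM on the server side.
Note that the value of --model_args must not contain any spaces. Its sub-items mean:
- base_url: the address of the inference service. Here, http://10.96.2.221:7869 is the address of the local vLLM inference service deployed beforehand.
- model: the model name, as listed in the response of the service's /v1/models endpoint.
- tokenizer: points to the model root directory.
- config: points to the model's config.json file.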
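Because a stray space in --model_args is easy to introduce when the value is wrapped across lines, it can help to assemble the string programmatically. This is a minimal sketch; `build_model_args` is a hypothetical helper and not part of lm-evaluation-harness.

```python
def build_model_args(options: dict) -> str:
    """Join key=value pairs with commas and reject any whitespace."""
    joined = ",".join(f"{key}={value}" for key, value in options.items())
    if any(ch.isspace() for ch in joined):
        raise ValueError(f"model_args must not contain whitespace: {joined!r}")
    return joined


if __name__ == "__main__":
    # Values taken from the evaluation command above.
    model_dir = "/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    print(build_model_args({
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "base_url": "http://10.96.2.221:7869/v1/completions",
        "tokenizer": model_dir,
        "config": f"{model_dir}/config.json",
        "use_fast_tokenizer": True,
        "num_concurrent": 1,
        "max_retries": 3,
        "tokenized_requests": False,
    }))
```

The printed string can be passed directly as the value of --model_args.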
Resource usage during evaluation:
|=========================================+======================+======================|
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:61:00.0 Off | Off |
| 31% 57C P2 317W / 450W | 23454MiB / 24564MiB | 93% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Result of a run that completed normally:
Requesting API: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40168/40168 [28:05<00:00, 23.83it/s]
2025-03-05:09:14:56,529 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
local-completions (model=DeepSeek-R1-Distill-Qwen-7B,base_url=http://10.96.2.221:7869/v1/completions,api_key=sk-REDACTED,tokenizer=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B,config=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/config.json,use_fast_tokenizer=True,num_concurrent=1,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.4592|± |0.0050|
| | |none | 0|acc_norm|↑ |0.5945|± |0.0049|
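The progress bar shows 40168 requests: the hellaswag validation split has 10042 questions with 4 candidate endings each, and each ending is scored as one loglikelihood request. DeepSeek-R1-Distill-Qwen-7B does not score high here, with an acc_norm of only 0.5945, roughly 59 points.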
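The numbers in the result table can be sanity-checked with a little arithmetic. The sketch below assumes 10042 evaluated questions with 4 endings each (the count that the harness logs print when building requests) and uses the plain binomial formula sqrt(p(1-p)/n) as an approximation of the reported Stderr; the harness's own estimator may differ slightly.

```python
import math


def request_count(num_questions: int, num_endings: int = 4) -> int:
    """One loglikelihood request per candidate ending."""
    return num_questions * num_endings


def binomial_stderr(accuracy: float, n: int) -> float:
    """Approximate standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    n = 10042
    print(request_count(n))                 # 40168
    print(f"{binomial_stderr(0.4592, n):.4f}")  # 0.0050
    print(f"{binomial_stderr(0.5945, n):.4f}")  # 0.0049
```

Both values agree with the Stderr column above, so the table is internally consistent.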
DeepSeek-R1-Distill-Qwen-32B
Test command and log:
root@a56244238f05:/home/lm-evaluation-harness# LOGLEVEL=DEBUG PYTHONPATH=. python -m lm_eval --model local-completions \
--model_args model=deepseek-r1:32b,\
base_url=http://10.96.0.188:11434/v1/completions,\
tokenizer=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B,\
config=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/config.json,\
use_fast_tokenizer=True,\
num_concurrent=1,max_retries=3,tokenized_requests=False \
--tasks hellaswag \
--batch_size 1
...
2025-03-05:10:09:41,506 DEBUG [lm_eval.evaluator:488] Task: hellaswag; number of requests on this rank: 40168
2025-03-05:10:09:41,511 INFO [lm_eval.evaluator:517] Running loglikelihood requests
Requesting API: 24%|████████████████████████████████████▋ | 9818/40168 [23:21<1:10:49, 7.14it/s]
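Here http://10.96.0.188:11434 is the address of the local ollama service. In this example, deepseek-r1:32b is the official ollama GGUF quantized build of DeepSeek-R1-Distill-Qwen-32B. Its name in the ollama registry is deepseek-r1:32b, and it is INT4-quantized.
Resource usage during evaluation: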
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 30% 57C P2 276W / 450W | 23052MiB / 24564MiB | 66% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Result of a run that completed normally:
2025-03-05:10:09:41,511 INFO [lm_eval.evaluator:517] Running loglikelihood requests
Requesting API: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40168/40168 [1:19:39<00:00, 8.40it/s]
2025-03-05:11:29:43,939 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
local-completions (model=deepseek-r1:32b,base_url=http://10.96.0.188:11434/v1/completions,tokenizer=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B,config=/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/config.json,use_fast_tokenizer=True,num_concurrent=1,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.2504|± |0.0043|
| | |none | 0|acc_norm|↑ |0.2504|± |0.0043|
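This score is even lower than that of DeepSeek-R1-Distill-Qwen-7B, possibly because the ollama GGUF INT4 quantization has a large impact. Note, however, that 0.25 is also the chance level for a four-way multiple-choice task.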
QwQ:32B
ollama int4
Model: the official ollama qwq:32b model, presumably INT4-quantized; its size on disk is 19 GB.
root@ollama-59b75d4cc4-8jgpx:/# ollama list
NAME ID SIZE MODIFIED
qwq:32b 38ee5094e51e 19 GB 20 hours ago
Command:
root@a56244238f05:/home/lm-evaluation-harness# cat eval.sh
#!/bin/bash
LOGLEVEL=DEBUG PYTHONPATH=. python -m lm_eval --model local-completions \
--model_args model=qwq:32b,\
base_url=http://10.96.0.188:11434/v1/completions,\
tokenizer=/models/qwen/QwQ-32B,\
config=/models/qwen/QwQ-32B/config.json,\
use_fast_tokenizer=True,\
num_concurrent=1,max_retries=3,tokenized_requests=False \
--tasks hellaswag \
--batch_size 1
root@a56244238f05:/home/lm-evaluation-harness# nohup ./eval.sh > lm_eval.log 2>&1 &
Evaluation result:
root@a56244238f05:/home/lm-evaluation-harness# tail -100f lm_eval.log
...
Selected Tasks: ['hellaswag']
100%|██████████| 10042/10042 [00:02<00:00, 3652.75it/s]
2025-03-06:22:51:01,405 DEBUG [lm_eval.evaluator:488] Task: hellaswag; number of requests on this rank: 40168
2025-03-06:22:51:01,410 INFO [lm_eval.evaluator:517] Running loglikelihood requests
Requesting API: 100%|██████████| 40168/40168 [1:14:14<00:00, 9.02it/s]
2025-03-07:00:05:38,708 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
local-completions (model=qwq:32b,base_url=http://10.96.0.188:11434/v1/completions,tokenizer=/models/qwen/QwQ-32B,config=/models/qwen/QwQ-32B/config.json,use_fast_tokenizer=True,num_concurrent=1,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.2504|± |0.0043|
| | |none | 0|acc_norm|↑ |0.2504|± |0.0043|
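Even qwq:32b scores only 25 points, which does not look right; it should not be this low. The result is identical to the deepseek-r1:32b run (0.2504, with acc equal to acc_norm), and 0.25 is chance level for four candidate endings. A likely explanation is that the ollama response carries no logprobs, so the compatibility patch assigns every ending a log-likelihood of 0.0 and the evaluation cannot tell the endings apart. In other words, this evaluation method is probably not applicable to the ollama endpoint, and the number may say little about the INT4 model itself.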
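One way a score of exactly chance level can arise: if the server returns no logprobs, every ending ends up with the same score of 0.0. The sketch below assumes the prediction is the index of the highest score and that ties go to the first index (as with NumPy's argmax); the gold labels are made up for illustration.

```python
def pick_choice(loglikelihoods: list) -> int:
    """Index of the highest score; on ties the lowest index wins."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def accuracy(all_scores: list, gold_labels: list) -> float:
    """Share of questions where the picked ending matches the gold label."""
    hits = sum(pick_choice(s) == g for s, g in zip(all_scores, gold_labels))
    return hits / len(gold_labels)


if __name__ == "__main__":
    gold = [0, 1, 2, 3, 3, 0, 2, 1]          # made-up labels, 2 of 8 are 0
    tied = [[0.0, 0.0, 0.0, 0.0] for _ in gold]  # no logprobs: every score is 0.0
    print(accuracy(tied, gold))               # 0.25
```

With all scores tied, the prediction is always ending 0, so the accuracy is just the share of questions whose correct ending is 0. Dividing 0.0 by the ending length leaves it at 0.0, which would also explain acc and acc_norm being identical.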
vllm awq
Run qwen/QwQ-32B-AWQ with vLLM; the model's size on disk is also 19 GB. vLLM launch command:
VLLM_LOG_LEVEL=DEBUG PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve "/models/qwen/QwQ-32B-AWQ" \
--host 0.0.0.0 --port 7899 --served-model-name "QwQ-32B" \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--max-model-len 1024 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--dtype bfloat16
Note that one of these parameters is problematic: max-model-len is too small, so a request with max_tokens=1024 fails with:
Unexpected error: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 1024 tokens. However, you requested 1040 tokens (16 in the messages, 1024 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
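The error message spells out the rule being enforced: prompt tokens plus requested completion tokens must not exceed the model's maximum context length. A minimal sketch of that check, using the numbers from the message (the function names are hypothetical):

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """True when the prompt plus the requested completion fits the context window."""
    return prompt_tokens + max_tokens <= max_model_len


def max_completion_tokens(prompt_tokens: int, max_model_len: int) -> int:
    """Largest max_tokens value that still fits for a given prompt length."""
    return max(0, max_model_len - prompt_tokens)


if __name__ == "__main__":
    # 16 prompt tokens + 1024 completion tokens = 1040 > 1024, so the request is rejected.
    print(fits_context(16, 1024, 1024))     # False
    print(max_completion_tokens(16, 1024))  # 1008
    # With --max-model-len 16584 the same request fits.
    print(fits_context(16, 1024, 16584))    # True
```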
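Yet the lm-evaluation-harness evaluation below still went through, presumably because its hellaswag requests are short and score the tokens of the prompt rather than generating long completions. Evaluation command: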
root@a56244238f05:/home/lm-evaluation-harness# LOGLEVEL=DEBUG PYTHONPATH=. nohup python -m lm_eval --model local-completions \
--model_args model=QwQ-32B,\
base_url=http://10.9.30.34:7899/v1/completions,\
tokenizer=/models/qwen/QwQ-32B-AWQ,\
config=/models/qwen/QwQ-32B-AWQ/config.json,\
use_fast_tokenizer=True,\
num_concurrent=1,max_retries=3,tokenized_requests=False \
--tasks hellaswag \
--batch_size 1 > lm_eval.log 2>&1 &
Evaluation log:
root@a56244238f05:/home/lm-evaluation-harness# tail -f lm_eval.log
...
Selected Tasks: ['hellaswag']
100%|██████████| 10042/10042 [00:02<00:00, 3669.27it/s]
2025-03-09:03:54:00,108 DEBUG [lm_eval.evaluator:488] Task: hellaswag; number of requests on this rank: 40168
2025-03-09:03:54:00,113 INFO [lm_eval.evaluator:517] Running loglikelihood requests
Requesting API: 100%|██████████| 40168/40168 [56:18<00:00, 11.89it/s]
2025-03-09:04:50:42,168 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
local-completions (model=QwQ-32B,base_url=http://10.9.30.34:7899/v1/completions,tokenizer=/models/qwen/QwQ-32B-AWQ,config=/models/qwen/QwQ-32B-AWQ/config.json,use_fast_tokenizer=True,num_concurrent=1,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.6423|± |0.0048|
| | |none | 0|acc_norm|↑ |0.8302|± |0.0037|
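This score is indeed much higher than the ollama 32B results, reaching 83 points (acc_norm 0.8302). That is more like what a SOTA model should look like. It may also indicate that the capability loss of vLLM + AWQ is much smaller than that of ollama + GGUF INT4.
Resource usage: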
|=========================================+======================+======================|
| 3 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 78% 65C P2 422W / 450W | 22856MiB / 24564MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
vllm awq, two cards
Next, test the same vLLM deployment of QwQ-32B-AWQ with the resources increased to two RTX 4090 cards:
VLLM_LOG_LEVEL=DEBUG PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
nohup vllm serve "/models/qwen/QwQ-32B-AWQ" \
--host 0.0.0.0 --port 7899 --served-model-name "QwQ-32B" \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.82 \
--max-model-len 16584 \
--max-num-batched-tokens 32768 \
--block-size 16 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--trust-remote-code \
--dtype auto > qwq-32b-awq.log 2>&1 &
Evaluation result:
root@a56244238f05:/home/lm-evaluation-harness# tail -100f lm_eval.log
...
Selected Tasks: ['hellaswag']
100%|██████████| 10042/10042 [00:02<00:00, 3713.42it/s]
2025-03-09:14:20:42,409 DEBUG [lm_eval.evaluator:488] Task: hellaswag; number of requests on this rank: 40168
2025-03-09:14:20:42,414 INFO [lm_eval.evaluator:517] Running loglikelihood requests
Requesting API: 100%|██████████| 40168/40168 [1:17:16<00:00, 8.66it/s]
2025-03-09:15:38:21,991 INFO [lm_eval.loggers.evaluation_tracker:272] Output path not provided, skipping saving results aggregated
local-completions (model=QwQ-32B,base_url=http://10.9.30.34:7899/v1/completions,tokenizer=/models/qwen/QwQ-32B-AWQ,config=/models/qwen/QwQ-32B-AWQ/config.json,use_fast_tokenizer=True,num_concurrent=1,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.6413|± |0.0048|
| | |none | 0|acc_norm|↑ |0.8296|± |0.0038|
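Judging purely by the reasoning score, the result is about the same as on a single card. After all, the evaluation measures capability, not performance: as long as the model runs, the amount of hardware does not affect generation quality.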
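"About the same" can be checked against the reported standard errors. The sketch below computes a rough z-score for the acc_norm difference between the single-card run (0.8302 ± 0.0037) and the two-card run (0.8296 ± 0.0038). It treats the two runs as independent, which is a simplifying assumption since both use the same questions.

```python
import math


def z_score(a: float, se_a: float, b: float, se_b: float) -> float:
    """Difference between two scores in units of their combined standard error."""
    return (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    z = z_score(0.8302, 0.0037, 0.8296, 0.0038)
    print(f"z = {z:.2f}")  # z = 0.11
```

A value this far below 1 means the 0.0006 gap is well inside the measurement noise.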
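The main takeaway of this post is probably this: a 32B model deployed with ollama may lose much more reasoning ability than the same-size AWQ quantization, so choose carefully. Keep in mind, though, that both ollama scores sit exactly at chance level, so they may reflect the missing logprobs in the ollama response rather than the quality of the model.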
