QVQ-72B Arrives Right on Schedule! After QwQ, Tongyi Qianwen Open-Sources a Visual Reasoning Model

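Hi everyone, I'm Liu Cong NLP.

Yes, that's right: well done, Qwen!

After QwQ, the Qwen team has open-sourced a visual reasoning model, QVQ, and it's a 72B one.

Merry Christmas, and right on schedule!
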
HF: https://huggingface.co/Qwen/QVQ-72B-Preview

Why 72B? As you might guess, QVQ was obtained by further training the Qwen2-VL-72B model that was open-sourced a while ago.


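There is also a 7B model, so why no QVQ-7B? My guess is that it has too few parameters for o1-style reasoning to work well. QwQ also started at 32B, so parameter count matters a lot.

On the leaderboards, QVQ breaks 70 on MMMU and is considerably better overall than Qwen2-VL-72B. It is also compared against closed-source models, and QVQ still holds its own.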


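That said, QVQ-72B still has some problems:

It may mix languages in its output, most visibly Chinese and English in the same response.
The model easily falls into circular reasoning, which makes responses very long and can even prevent it from returning a final answer.
Safety may be an issue. My guess is that this version has not had much safety work done yet, possibly none at all.
QVQ cannot fully replace Qwen2-VL-72B: as the reasoning steps accumulate, the model may gradually lose focus on the image content and start to hallucinate.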
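
The first two problems in the list above can be checked for roughly on the output side. Below is a minimal sketch, not part of any Qwen tooling: the function names, the 8-word window, and the thresholds are all hypothetical choices for illustration. It flags a response as mixed-language when Chinese characters make up a middling share of its letters, and as looping when the same run of words keeps coming back.

```python
from collections import Counter


def cjk_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)


def looks_mixed(text: str, low: float = 0.1, high: float = 0.9) -> bool:
    """True when neither Chinese nor non-Chinese letters clearly dominate.

    The low/high cut-offs are assumptions, not tuned values.
    """
    return low < cjk_ratio(text) < high


def max_ngram_repeats(text: str, n: int = 8) -> int:
    """Highest number of times any n-word window occurs in the text."""
    words = text.split()
    if len(words) < n:
        return 0
    windows = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(Counter(windows).values())


def looks_looping(text: str, n: int = 8, threshold: int = 3) -> bool:
    """True when some n-word window repeats at least `threshold` times."""
    return max_ngram_repeats(text, n) >= threshold


if __name__ == "__main__":
    looped = "Let me check the sum again carefully now. " * 4
    print(looks_looping(looped))
    print(looks_mixed("The answer 是 十二"))
```

Because the loop check splits on whitespace, it only works for space-delimited text. Chinese output would need character-level windows instead.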

I'm still downloading the model. Once I've tested it, I'll write an evaluation article!

Usage is the same as for Qwen2-VL-72B. The Hugging Face code is as follows:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  
  
# Load the model from the Hugging Face repo
model = Qwen2VLForConditionalGeneration.from_pretrained(  
    "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"  
)  
  
# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")  
  
# Note: the system prompt here differs from the one used previously
messages = [  
    {  
        "role": "system",  
        "content": [  
            {"type": "text", "text": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."}  
        ],  
    },  
    {  
        "role": "user",  
        "content": [  
            {  
                "type": "image",  
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png",  
            },  
            {"type": "text", "text": "What value should be filled in the blank space?"},  
        ],  
    }  
]  
  
# Prepare the inputs
text = processor.apply_chat_template(  
    messages, tokenize=False, add_generation_prompt=True  
)  
image_inputs, video_inputs = process_vision_info(messages)  
inputs = processor(  
    text=[text],  
    images=image_inputs,  
    videos=video_inputs,  
    padding=True,  
    return_tensors="pt",  
)  
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"
  
# Run inference
generated_ids = model.generate(**inputs, max_new_tokens=8192)  
generated_ids_trimmed = [  
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)  
]  
output_text = processor.batch_decode(  
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False  
)  
print(output_text)
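
If you want to ask several questions about different images, the `messages` structure in the code above can be wrapped in a small helper. This is only a sketch: `build_messages` and `QVQ_SYSTEM_PROMPT` are hypothetical names introduced here, not part of transformers or qwen_vl_utils. The system prompt text and the message layout are copied from the example above.

```python
# System prompt copied verbatim from the example above.
QVQ_SYSTEM_PROMPT = (
    "You are a helpful and harmless assistant. You are Qwen developed by "
    "Alibaba. You should think step-by-step."
)


def build_messages(image: str, question: str,
                   system_prompt: str = QVQ_SYSTEM_PROMPT) -> list:
    """Build a one-image, one-question chat in the layout used above.

    `image` is passed through unchanged as the "image" field, for example
    the URL from the original example.
    """
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        },
    ]


if __name__ == "__main__":
    messages = build_messages(
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png",
        "What value should be filled in the blank space?",
    )
    print(messages[1]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the hand-written `messages`.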

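You're welcome to follow the WeChat official account 「NLP工作站」 and join the discussion group. Let's be friends, learn together, and improve together!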