The concept of "Agent" was at one point so overused that some people said: when someone tells you they are working on an Agent, they have told you nothing at all. The term is too broad to carry any concrete meaning, much like "Cloud" when cloud computing first appeared.
I have long felt the word should simply be taken literally: an agent is, by definition, a proxy acting on someone's behalf. When you use an LLM through conversation, you are using it directly. But when the task at hand is complex, requiring long prompts, gathering of external information, and many rounds of interaction, with the same boilerplate prompts repeated throughout, using the LLM directly is no longer sensible. You need a tool that handles the templated prompts, collects and assembles the external information, and outputs or executes the final result. You then use the LLM indirectly through this tool, and that tool is called an Agent: first, it acts on behalf of something; second, that something is a person. Since it can stand in for human work, it naturally looks intelligent, hence "intelligent agent". In essence it is the automation of LLM usage. Compared with driving the LLM yourself, the Agent is an errand runner through which you use the LLM indirectly, saving you a great deal of trouble. Seen from this angle, the concept becomes simple and clear.
As for the much-touted "perceive-plan-act": when a web page you build shows the current temperature by calling a weather service, or simply displays the current time, would you call that perception? Of course not. Perception is defined relative to the LLM. Because the model was trained on static historical data, it does not know what day it is or where it is. So for the model, fetching the current time, location, or news counts as perception. Planning and judgment are capabilities the model already has; at bottom they mean moving in the direction of higher probability, which is the model's strength. Action, in turn, is the interface to the outside world, replacing what you used to do by hand.
- Perception - automatically calling tools and services to gather real-time auxiliary information about the world
- Planning - handing that information, together with your intent, to the LLM to analyze, plan, and execute step by step, with further perception and action iterations in between
- Action - putting the result or decision into effect, or automatically performing the intermediate interactions with the outside world on your behalf
These three activities can be interleaved and iterated until a result is obtained or failure is declared. Put bluntly, is this not just a wrapper around LLM usage? Whichever genius coined the lofty phrase "perceive-plan-act" made it sound impressive (and baffling); I find it entirely unnecessary.
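To make the point concrete, the whole "perceive-plan-act" cycle fits in a few lines of Python. This is a deliberately naive sketch; `perceive`, `plan`, and `act` are hypothetical stubs, not part of any real framework:

```python
def run_agent(goal: str, max_steps: int = 20) -> str:
    context = [goal]
    for _ in range(max_steps):
        facts = perceive()               # fetch real-time info the LLM cannot know
        decision = plan(context, facts)  # let the LLM decide the next action
        if decision["done"]:
            return decision["answer"]
        context.append(act(decision))    # execute the action, feed the result back
    return "Terminated: reached max steps"

# Stubs so the sketch runs end to end:
def perceive() -> dict:
    return {"time": "2025-04-10 08:00", "location": "unknown"}

def plan(context: list, facts: dict) -> dict:
    # A real agent would call an LLM here; this stub finishes immediately.
    return {"done": True, "answer": f"answer based on {len(context)} message(s)"}

def act(decision: dict) -> str:
    return "tool output"

print(run_agent("predict tomorrow's temperature"))
```

The three phases are just function calls inside one loop, which is the "iterate until result or failure" described above.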
So why did a mere wrapper around LLM usage spawn such a popular new concept? Beyond the sheer popularity of LLMs, I think there are deeper reasons. Let's analyze a currently popular Agent, OpenManus, which should give us a better understanding.
What follows is a detailed debugging walkthrough of OpenManus. If details are not your thing, skip straight to the summary for the conclusions.
Installing dependencies
Since the last wave of Manus marketing, OpenManus has become the hottest open-source project of late. Let's pull it down and take a look. First clone the project:
# Clone the repository and enter the project root
git clone https://github.com/mannaandpoem/OpenManus.git && cd OpenManus
# Create an isolated Python environment with conda and activate it:
conda create -n open_manus python=3.12
conda activate open_manus
# Install dependencies inside the environment:
pip install -r requirements.txt
The latest commit at the time I fetched the code was:
commit 896a5d8d8d6c808c9911ecca234600f6acbb93f4 (HEAD -> main, origin/main, origin/HEAD)
Author: Jiayi Zhang <84363704+didiforgithub@users.noreply.github.com>
Date: 2025-04-01 20:42:31 +0800
Add Sponsors to ReadMe.
Everything below is based on this version. I later checked: there was no update until April 10, and that one only added a single line to the LLM prompt, "If you want to stop the interaction at any point, use the 'terminate' tool/function call.", telling the model it may end the iteration at any time. It does not affect the tests in this article, so the April 10 version works just as well. (Nothing changed between the 1st and the 10th because the team was running a hackathon built on OpenManus.)
The browser tool depends on playwright, which must be installed beforehand. Then run
playwright install
to install playwright's default browser.
Configuring the LLM
With the environment ready, copy the bundled config template:
cp config/config.example.toml config/config.toml
Edit the llm section of config/config.toml to point at the official deepseek API:
[llm]
model = "deepseek-chat"
base_url = "https://api.deepseek.com/v1/"
api_key = "an API key created at https://platform.deepseek.com/api_keys"
max_tokens = 8192
temperature = 0.0
Starting the process
Run python main.py and you can then type questions interactively on the command line:
...
INFO [browser_use] BrowserUse logging setup complete with level info
INFO [root] Anonymized telemetry enabled. See https://docs.browser-use.com/development/telemetry for more information.
Enter your prompt: 预测下明天的温度
The question "预测下明天的温度" ("predict tomorrow's temperature") looks simple, even unnecessary; it serves only to illustrate the mechanics. Any more complex question, say predicting tomorrow's stock movement, works the same way.
Configuring the debugger
Set up a vscode debug environment: create a .vscode/launch.json file with the following content:
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python Debugger: Current File",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"python": "/Users/<your-username>/miniconda3/envs/open_manus/bin/python",
"console": "integratedTerminal"
}
]
}
Select the project's main.py and go to Run -> Start Debugging. You can now step through OpenManus and see what it does and how.
After a few casual debug runs I checked token consumption: over a million tokens were already gone, and they go fast. The agent amplifies token consumption by roughly 10x. So I looked into calling a locally deployed LLM instead, which allows unlimited debugging without token anxiety and makes it easy to inspect the request logs for analysis.
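If you do stay on a paid API, a small helper makes the amplification visible. This is a hypothetical sketch; it assumes only the `usage` field (prompt_tokens/completion_tokens) that OpenAI-compatible ChatCompletion responses carry:

```python
class TokenMeter:
    """Accumulate token usage across the many LLM calls of one agent run."""

    def __init__(self):
        self.prompt = 0
        self.completion = 0

    def add(self, usage) -> None:
        # `usage` is a CompletionUsage-like object
        self.prompt += usage.prompt_tokens
        self.completion += usage.completion_tokens

    @property
    def total(self) -> int:
        return self.prompt + self.completion


class Usage:  # stand-in for openai's CompletionUsage
    def __init__(self, prompt_tokens, completion_tokens):
        self.prompt_tokens = prompt_tokens
        self.completion_tokens = completion_tokens


meter = TokenMeter()
# numbers from the two requests logged later in this walkthrough
meter.add(Usage(1768, 408))
meter.add(Usage(2305, 685))
print(meter.total)  # 5166
```

Calling `meter.add(response.usage)` after each `client.chat.completions.create(...)` shows how quickly the prompt grows from step to step.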
Running model inference locally
Use a prebuilt docker image to start model inference with one command:
docker run --name vllm-qwq -itd \
--gpus '"device=0,1"' -p 7869:7869 \
--shm-size=12gb --entrypoint vllm \
-v /data/ai/models:/models \
vllm/vllm-openai:v0.8.3 serve \
/models/qwen/QwQ-32B-AWQ --port 7869 \
--served-model-name QwQ-32B-AWQ \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.82 \
--max-model-len 16584 \
--max-num-batched-tokens 32768 \
--enable-chunked-prefill \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--uvicorn-log-level debug \
--dtype auto \
--api-key sk-xxx
A few points to note:
- The image vllm/vllm-openai:v0.8.3 came out only a few days ago; it is the first version to support the tool_choice='required' option, which OpenManus uses
- The model's path on the host is /data/ai/models/qwen/QwQ-32B-AWQ; /data/ai/models is mounted as the container's /models directory, so the model path inside the container is /models/qwen/QwQ-32B-AWQ
- --tensor-parallel-size 2: QwQ-32B-AWQ runs reasonably smoothly on two 4090s, and max-model-len can then be raised a bit
- --enable-auto-tool-choice must be enabled for tool calls; it tells vLLM to turn on the model's autonomous tool-calling ability, allowing it to generate tool-call requests when it judges them appropriate
- --tool-call-parser hermes is the tool-call (i.e. function-call) parsing format that QwQ supports
Once it is up, test it:
curl -X POST https://your-host-ip-or-domain:7869/v1/chat/completions \
-H "Authorization: Bearer sk-xxx" \
-H "Content-Type: application/json" -d '{
"model": "QwQ-32B-AWQ",
"messages": [{"role": "user",
"content": "特沃兹道是谁"}]}'
If that works, it can serve as the backend model for OpenManus.
Configuring the LLM
Edit the llm section of config/config.toml to point at the privately deployed QwQ model:
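Beyond the plain chat test, it is worth smoke-testing tool calling itself, since that is the part OpenManus depends on. A stdlib-only sketch; the `get_weather` tool, the host, and the key are placeholders for this test, not anything OpenManus defines:

```python
import json
import urllib.request


def build_chat_request(model: str, question: str) -> dict:
    """Build an OpenAI-compatible chat body with one (hypothetical) tool attached."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def post_chat(base_url: str, api_key: str, body: dict) -> bytes:
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


if __name__ == "__main__":
    body = build_chat_request("QwQ-32B-AWQ", "What is the weather in Beijing tomorrow?")
    print(post_chat("https://your-host-ip-or-domain:7869/v1", "sk-xxx", body))
```

If --enable-auto-tool-choice and --tool-call-parser hermes are working, the response should contain a `tool_calls` entry rather than plain text.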
[llm]
model = "QwQ-32B-AWQ"
base_url = "https://your-host-ip-or-domain:7869/v1/"
api_key = "sk-xxx"
Inspecting inference logs
docker logs -f vllm-qwq shows the inference logs; the part to watch is the prompt content after each Received request.
Start debugging and enter the question to observe the execution:
...
INFO [browser_use] BrowserUse logging setup complete with level info
INFO [root] Anonymized telemetry enabled. See https://docs.browser-use.com/development/telemetry for more information.
Enter your prompt: 预测下明天的温度
The main loop at the entry point: app/agent/base.py
async def run(self, request: Optional[str] = None) -> str:
    ...
    if request:
        self.update_memory("user", request)

    results: List[str] = []
    async with self.state_context(AgentState.RUNNING):
        while (
            self.current_step < self.max_steps and self.state != AgentState.FINISHED
        ):
            self.current_step += 1
            logger.info(f"Executing step {self.current_step}/{self.max_steps}")
            step_result = await self.step()

            # Check for stuck state
            if self.is_stuck():
                self.handle_stuck_state()

            results.append(f"Step {self.current_step}: {step_result}")

        if self.current_step >= self.max_steps:
            self.current_step = 0
            self.state = AgentState.IDLE
            results.append(f"Terminated: Reached max steps ({self.max_steps})")

    await SANDBOX_CLIENT.cleanup()
    return "\n".join(results) if results else "No steps executed"
Two things to note here:
- The request is stored in memory, which appears to be a simple in-process data structure; I did not see any persistence.
- If the task completes within max_steps (default 20), the run ends there; if max_steps is reached without completion, it terminates automatically, avoiding overly long runs or infinite loops.
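The `is_stuck()` guard in the loop above is also worth a look. The sketch below is a hypothetical version of such a check, not OpenManus's actual implementation: treat the agent as stuck when its newest message has already appeared several times before.

```python
def is_stuck(messages: list[str], threshold: int = 2) -> bool:
    """Stuck if the newest message already occurred `threshold` times before."""
    if not messages:
        return False
    return messages[:-1].count(messages[-1]) >= threshold


print(is_stuck(["search", "go_back", "search", "go_back", "search"]))  # True
print(is_stuck(["search", "go_back"]))                                 # False
```

Detecting a repetition loop lets the agent inject a "change your strategy" prompt instead of burning the remaining steps on the same action.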
The single-step body is concise: app/agent/react.py
async def step(self) -> str:
    """Execute a single step: think and act."""
    should_act = await self.think()
    if not should_act:
        return "Thinking complete - no action needed"
    return await self.act()
Each step first thinks, then acts; after acting, the current context determines what to do next.
The think entry point: app/agent/manus.py
async def think(self) -> bool:
    """Process current state and decide next actions with appropriate context."""
    original_prompt = self.next_step_prompt
    recent_messages = self.memory.messages[-3:] if self.memory.messages else []
    browser_in_use = any(
        tc.function.name == BrowserUseTool().name
        for msg in recent_messages
        if msg.tool_calls
        for tc in msg.tool_calls
    )

    if browser_in_use:
        self.next_step_prompt = (
            await self.browser_context_helper.format_next_step_prompt()
        )

    result = await super().think()

    # Restore original prompt
    self.next_step_prompt = original_prompt

    return result
The main think method: app/agent/toolcall.py
async def think(self) -> bool:
    """Process current state and decide next actions using tools"""
    if self.next_step_prompt:
        user_msg = Message.user_message(self.next_step_prompt)
        self.messages += [user_msg]

    try:
        # Get response with tool options
        response = await self.llm.ask_tool(
            messages=self.messages,
            system_msgs=(
                [Message.system_message(self.system_prompt)]
                if self.system_prompt
                else None
            ),
            tools=self.available_tools.to_params(),
            tool_choice=self.tool_choices,
        )
    except ValueError:
        ...
It asks the LLM, providing the context and the tool list, to pick a tool: app/llm.py
async def ask_tool(
    self,
    messages: List[Union[dict, Message]],
    system_msgs: Optional[List[Union[dict, Message]]] = None,
    timeout: int = 300,
    tools: Optional[List[dict]] = None,
    tool_choice: TOOL_CHOICE_TYPE = ToolChoice.AUTO,  # type: ignore
    temperature: Optional[float] = None,
    **kwargs,
) -> ChatCompletionMessage | None:
    ...
    params["stream"] = False  # Always use non-streaming for tool requests
    response: ChatCompletion = await self.client.chat.completions.create(
        **params
    )
    ...
step 1
response: ChatCompletion = await self.client.chat.completions.create
After this line executes, the inference server logs the received request:
INFO 04-09 11:20:22 [logger.py:39] Received request chatcmpl-fb6eda78-8a01-9643-837e-9d21b5520bab: prompt: '。。。见下文', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8192, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 04-09 11:20:22 [engine.py:310] Added request chatcmpl-fb6eda78-8a01-9643-837e-9d21b5520bab.
The main part, the prompt, reads as follows after reformatting:
<|im_start|>system
You are OpenManus, an all-capable AI assistant, aimed at solving any task presented by the user. You have various tools at your disposal that you can call upon to efficiently complete complex requests. Whether it\'s programming, information retrieval, file processing, or web browsing, you can handle it all.The initial directory is: /Users/你的路径/OpenManus/workspace
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "python_execute", "description": "Executes Python code string. Note: Only print outputs are visible, function return values are not captured. Use print statements to see results.", "parameters": {"type": "object", "properties": {"code": {"type": "string", "description": "The Python code to execute."}}, "required": ["code"]}}}
{"type": "function", "function": {"name": "browser_use", "description": "A powerful browser automation tool that allows interaction with web pages through various actions.\
* This tool provides commands for controlling a browser session, navigating web pages, and extracting information\
* It maintains state across calls, keeping the browser session alive until explicitly closed\
* Use this when you need to browse websites, fill forms, click buttons, extract content, or perform web searches\
* Each action requires specific parameters as defined in the tool\'s dependencies\
\
Key capabilities include:\
* Navigation: Go to specific URLs, go back, search the web, or refresh pages\
* Interaction: Click elements, input text, select from dropdowns, send keyboard commands\
* Scrolling: Scroll up/down by pixel amount or scroll to specific text\
* Content extraction: Extract and analyze content from web pages based on specific goals\
* Tab management: Switch between tabs, open new tabs, or close tabs\
\
Note: When using element indices, refer to the numbered elements shown in the current browser state.\
", "parameters": {"type": "object", "properties": {"action": {"type": "string", "enum": ["go_to_url", "click_element", "input_text", "scroll_down", "scroll_up", "scroll_to_text", "send_keys", "get_dropdown_options", "select_dropdown_option", "go_back", "web_search", "wait", "extract_content", "switch_tab", "open_tab", "close_tab"], "description": "The browser action to perform"}, "url": {"type": "string", "description": "URL for \'go_to_url\' or \'open_tab\' actions"}, "index": {"type": "integer", "description": "Element index for \'click_element\', \'input_text\', \'get_dropdown_options\', or \'select_dropdown_option\' actions"}, "text": {"type": "string", "description": "Text for \'input_text\', \'scroll_to_text\', or \'select_dropdown_option\' actions"}, "scroll_amount": {"type": "integer", "description": "Pixels to scroll (positive for down, negative for up) for \'scroll_down\' or \'scroll_up\' actions"}, "tab_id": {"type": "integer", "description": "Tab ID for \'switch_tab\' action"}, "query": {"type": "string", "description": "Search query for \'web_search\' action"}, "goal": {"type": "string", "description": "Extraction goal for \'extract_content\' action"}, "keys": {"type": "string", "description": "Keys to send for \'send_keys\' action"}, "seconds": {"type": "integer", "description": "Seconds to wait for \'wait\' action"}}, "required": ["action"], "dependencies": {"go_to_url": ["url"], "click_element": ["index"], "input_text": ["index", "text"], "switch_tab": ["tab_id"], "open_tab": ["url"], "scroll_down": ["scroll_amount"], "scroll_up": ["scroll_amount"], "scroll_to_text": ["text"], "send_keys": ["keys"], "get_dropdown_options": ["index"], "select_dropdown_option": ["index", "text"], "go_back": [], "web_search": ["query"], "wait": ["seconds"], "extract_content": ["goal"]}}}}
{"type": "function", "function": {"name": "str_replace_editor", "description": "Custom editing tool for viewing, creating and editing files\
* State is persistent across command calls and discussions with the user\
* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep\
* The `create` command cannot be used if the specified `path` already exists as a file\
* If a `command` generates a long output, it will be truncated and marked with `<response clipped>`\
* The `undo_edit` command will revert the last edit made to the file at `path`\
\
Notes for using the `str_replace` command:\
* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!\
* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique\
* The `new_str` parameter should contain the edited lines that should replace the `old_str`\
", "parameters": {"type": "object", "properties": {"command": {"description": "The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.", "enum": ["view", "create", "str_replace", "insert", "undo_edit"], "type": "string"}, "path": {"description": "Absolute path to file or directory.", "type": "string"}, "file_text": {"description": "Required parameter of `create` command, with the content of the file to be created.", "type": "string"}, "old_str": {"description": "Required parameter of `str_replace` command containing the string in `path` to replace.", "type": "string"}, "new_str": {"description": "Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.", "type": "string"}, "insert_line": {"description": "Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.", "type": "integer"}, "view_range": {"description": "Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.", "items": {"type": "integer"}, "type": "array"}}, "required": ["command", "path"]}}}
{"type": "function", "function": {"name": "terminate", "description": "Terminate the interaction when the request is met OR if the assistant cannot proceed further with the task.\
When you have finished all the tasks, call this tool to end the work.", "parameters": {"type": "object", "properties": {"status": {"type": "string", "description": "The finish status of the interaction.", "enum": ["success", "failure"]}}, "required": ["status"]}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
<|im_end|>
<|im_start|>user
预测下明天的温度
<|im_end|>
<|im_start|>user
Based on user needs, proactively select the most appropriate tool or combination of tools. For complex tasks, you can break down the problem and use different tools step by step to solve it. After using each tool, clearly explain the execution results and suggest the next steps.
<|im_end|>
<|im_start|>assistant
<think>
Here <|im_start|> and <|im_end|> are Qwen's template delimiters, separating the prompts of different roles. The system-role prompt tells the LLM: you are OpenManus, an all-capable AI assistant with various tools at your disposal to help solve the user's problem. Then comes the tool list, placed inside the <tools> and </tools> tags. Each tool is a JSON object containing its name, purpose, and detailed parameter usage. For all its popularity, OpenManus currently ships just these four tools:
- python_execute - execute a Python script
- browser_use - drive a browser
- str_replace_editor - view and edit text files
- terminate - call this tool to end the task
Each tool entry of course includes its supported parameters and values in detail.
The model is then told that when it needs a function call, it should return a JSON object wrapped in <tool_call> and </tool_call> tags, in this format:
<tool_call>
{"name": <函数名称>, "arguments": <参数JSON对象>}
</tool_call>
Next come the user roles. The first user message holds the user's original question. The second tells the model: based on the user's needs, proactively select the most appropriate tool or combination of tools; for complex tasks, break down the problem and use different tools step by step; after each tool use, clearly explain the execution results and suggest the next steps.
This is the first request sent to the LLM.
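Since the prompt instructs the model to wrap each call in `<tool_call>` tags, turning the raw completion text back into structured calls is a small parsing job. A minimal sketch of such a parser (vLLM's actual hermes parser is more thorough):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract {"name": ..., "arguments": ...} objects from model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]


reply = """<think>I should search the web first.</think>
<tool_call>
{"name": "browser_use", "arguments": {"action": "web_search", "query": "tomorrow's weather forecast"}}
</tool_call>"""

print(parse_tool_calls(reply))
```

On the server side this parsed structure is what becomes the `tool_calls` field of the ChatCompletion response shown below.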
The model's response to this request, captured at a breakpoint on the line response: ChatCompletion = await self.client.chat.completions.create, is a structured object produced by the openai API SDK wrapper; copied here:
ChatCompletion(
id='chatcmpl-fb6eda78-8a01-9643-837e-9d21b5520bab',
choices=[。。。],
created=1744197622,
model='QwQ-32B-AWQ',
object='chat.completion',
service_tier=None,
system_fingerprint=None,
usage=CompletionUsage(completion_tokens=408, prompt_tokens=1768, total_tokens=2176, completion_tokens_details=None, prompt_tokens_details=None),
prompt_logprobs=None)
choices contains exactly one entry:
Choice(
finish_reason='tool_calls', index=0, logprobs=None, message=
ChatCompletionMessage(
content='。。。',
refusal=None, role='assistant', annotations=None, audio=None, function_call=None,
tool_calls=[ChatCompletionMessageToolCall(
id='chatcmpl-tool-e4f374dd1a974d14ba6f06cde0a81e02',
function=Function(arguments='{"action": "web_search", "query": "tomorrow\'s weather forecast"}', name='browser_use'), type='function')],
reasoning_content=None), stop_reason=None
)
So the tool_call the model returned is:
- tool name: browser_use
- action: web_search
- parameters:
  - query: tomorrow's weather forecast
The content field holds the reasoning model's thinking process:
OK, the user wants me to predict tomorrow's temperature. First I need to work out how to get weather data. The user probably expects me to look it up with the browser, since that is the most common approach. I should call the browser tool to visit a weather site, or use a search engine to find tomorrow's temperature.\n\nNext I should use the browser_use function with the web_search action to query "tomorrow's weather". A more specific location might be needed, but the user gave none, so I may have to assume the user's location or let the search engine geolocate by IP. As an AI I should probably just run the search.\n\nAfter the search, the browser tool returns results, and I need to parse the returned page content and extract the temperature data, perhaps with a further extract_content action whose goal is the temperature information, or read it straight from the result snippets.\n\nIt may take several steps: first web_search, then follow a returned link to a weather site or the meteorological bureau, then find the temperature on the page, e.g. scroll_to_text to the temperature section, or extract the content directly.\n\nGiven the tools' parameter constraints, I may need web_search first, then open the returned link in a new tab, navigate to the page, and extract the content. That is several steps, to be tried incrementally.\n\nSo: first call browser_use with the web_search action and the query "tomorrow's weather forecast", then wait for results. Further handling may be needed afterwards, such as checking the returned links, opening a detail page, and extracting the temperature, but the first call may already return enough, e.g. a temperature in a search snippet.\n\nTherefore step one is web_search with query set to "tomorrow's weather forecast"; run it and see.\n</think>\n\n
Calling the tool
The ChatCompletionMessage parsed from the model's response is returned by ask_tool and assigned:
self.tool_calls = tool_calls = (
    response.tool_calls if response and response.tool_calls else []
)
The console then logs:
2025-04-09 22:37:57.998 | INFO | app.agent.toolcall:think:82 - 🛠️ Manus selected 1 tools to use
2025-04-09 22:38:08.564 | INFO | app.agent.toolcall:think:86 - 🧰 Tools being prepared: ['browser_use']
2025-04-09 22:38:09.142 | INFO | app.agent.toolcall:think:89 - 🔧 Tool arguments: {"action": "web_search", "query": "tomorrow's weather forecast"}
think for step 1 ends, returning true to indicate an action should be taken, and execution enters the act phase. The act entry point: app/agent/toolcall.py
async def act(self) -> str:
    """Execute tool calls and handle their results"""
    ...
    results = []
    for command in self.tool_calls:
        # Reset base64_image for each tool call
        self._current_base64_image = None

        result = await self.execute_tool(command)

        if self.max_observe:
            result = result[: self.max_observe]

        logger.info(
            f"🎯 Tool '{command.function.name}' completed its mission! Result: {result}"
        )

        # Add tool response to memory
        tool_msg = Message.tool_message(
            content=result,
            tool_call_id=command.id,
            name=command.function.name,
            base64_image=self._current_base64_image,
        )
        self.memory.add_message(tool_msg)
        results.append(result)

    return "\n\n".join(results)
The line result = await self.execute_tool(command) leads into the act body:
async def execute_tool(self, command: ToolCall) -> str:
    """Execute a single tool call with robust error handling"""
    ...
        result = await self.available_tools.execute(name=name, tool_input=args)
        ...
        # Format result for display (standard case)
        observation = (
            f"Observed output of cmd `{name}` executed:\n{str(result)}"
            if result
            else f"Cmd `{name}` completed with no output"
        )
        return observation
    except json.JSONDecodeError:
        ...
The line result = await self.available_tools.execute(name=name, tool_input=args) enters the dispatch to the concrete tool: app/tool/tool_collection.py
async def execute(
    self, *, name: str, tool_input: Dict[str, Any] = None
) -> ToolResult:
    tool = self.tool_map.get(name)
    if not tool:
        return ToolFailure(error=f"Tool {name} is invalid")
    try:
        result = await tool(**tool_input)
        return result
    except ToolError as e:
        return ToolFailure(error=e.message)
Here you can inspect tool_map, which holds the same four tools previously advertised to the LLM.
The tool object for 'browser_use' is fetched from the map and executed. The line result = await tool(**tool_input) calls playwright to open a browser; the execution result (the contents of the result object) is:
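The dispatch pattern boils down to a name-to-coroutine map. A runnable miniature, with stand-in tool bodies rather than the real implementations:

```python
import asyncio


async def python_execute(code: str) -> str:
    return f"executed: {code!r}"  # stand-in; the real tool runs the code


async def terminate(status: str) -> str:
    return f"terminated with {status}"


tool_map = {
    "python_execute": python_execute,
    "terminate": terminate,
    # "browser_use" and "str_replace_editor" omitted in this sketch
}


async def execute(name: str, tool_input: dict) -> str:
    tool = tool_map.get(name)
    if not tool:
        return f"Tool {name} is invalid"
    return await tool(**tool_input)


print(asyncio.run(execute("terminate", {"status": "success"})))
```

The model only ever names a tool and its arguments; this lookup plus `await tool(**tool_input)` is what turns that JSON into a real side effect.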
SearchResponse(output="Search results for 'tomorrow's weather forecast':
1. Result 1
URL: https://support.google.com/websearch/answer/13687874
Description: /
Content: Weather data - Google Search Help Skip to main content Weather data ...
Metadata:
- Total results: 1
- Language: en
- Country: us", error=None, base64_image=None, system=None, query="tomorrow's weather forecast", results=[SearchResult(position=1, url='https://support.google.com/websearch/answer/13687874', title='Result 1', description='/', source='google', raw_content="Weather data - 。。。")], metadata=SearchMetadata(total_results=1, language='en', country='us'))
The result is then converted to a string and truncated to the configured maximum length; the console prints:
2025-04-10 07:45:08.477 | INFO | app.agent.toolcall:act:150 - 🎯 Tool 'browser_use' completed its mission! Result: Observed output of ...(truncated content)
It is saved to memory:
self.memory.add_message(tool_msg)
With this step's act complete and its result returned, this round of the main loop ends and the next step begins, as the console shows:
2025-04-10 07:48:13.338 | INFO | app.agent.base:run:140 - Executing step 2/20
step 2
Step 2 runs the same think entry code as every step:
async def think(self) -> bool:
    """Process current state and decide next actions with appropriate context."""
    original_prompt = self.next_step_prompt
    recent_messages = self.memory.messages[-3:] if self.memory.messages else []
    browser_in_use = any(
        tc.function.name == BrowserUseTool().name
        for msg in recent_messages
        if msg.tool_calls
        for tc in msg.tool_calls
    )

    if browser_in_use:
        self.next_step_prompt = (
            await self.browser_context_helper.format_next_step_prompt()
        )

    result = await super().think()

    # Restore original prompt
    self.next_step_prompt = original_prompt

    return result
The difference from step 1: browser_in_use is found among the last 3 messages, so the prompt is refreshed via self.browser_context_helper.format_next_step_prompt(), and execution then proceeds through the same path:
- self.llm.ask_tool in app/agent/toolcall.py asks the LLM which tool to call
- response: ChatCompletion = await self.client.chat.completions.create in app/llm.py returns the model's tool-call request
This time the LLM is called with the updated prompt.
response = await self.llm.ask_tool(
    messages=self.messages,
At this point self.messages holds the accumulated conversation. Let's look directly at what the updated prompt looks like. This is the second request to the LLM, taken from the inference server's request log and reformatted:
<|im_start|>system
(content identical to the first request in step 1)
<|im_end|>
<|im_start|>user
预测下明天的温度<|im_end|>
<|im_start|>user
(identical to the first request)
<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "browser_use", "arguments": {"action": "web_search", "query": "tomorrow\'s weather forecast"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
Observed output of cmd `browser_use` executed:
Search results for \'tomorrow\'s weather forecast\':
1. Result 1
URL: https://support.google.com/websearch/answer/13687874
Description: /
Content: Weather data - Google Search Help Skip to...
Metadata:
- Total results: 1
- Language: en
- Country: us
</tool_response><|im_end|>
<|im_start|>user
Current browser screenshot:<|im_end|>
<|im_start|>user
What should I do next to achieve my goal?
When you see [Current state starts here], focus on the following:
- Current URL and page title
URL: https://support.google.com/websearch/answer/13687874
Title: 天气数据 - Google 搜索帮助
- Available tabs
1 tab(s) available
- Interactive elements and their indices
- Content above or below the viewport (if indicated)
- Any action results or errors
For browser interactions:
- To navigate: browser_use with action="go_to_url", url="..."
- To click: browser_use with action="click_element", index=N
- To type: browser_use with action="input_text", index=N, text="..."
- To extract: browser_use with action="extract_content", goal="..."
- To scroll: browser_use with action="scroll_down" or "scroll_up"
Consider both what\'s visible and what might be beyond the current viewport.
Be methodical - remember your progress and what you\'ve learned so far.
<|im_end|>
<|im_start|>assistant
<think>
After the system prompt and the two user prompts retained from the first request, the newly appended messages are:
- an assistant role, i.e. the model itself: the tool_call it returned last time, echoed back verbatim inside the <tool_call> </tool_call> tags
- a user role recording the result of the previous tool call, wrapped between <tool_response> and </tool_response>
- a user role intended to carry "Current browser screenshot", the current browser screenshot; presumably because this model is not multimodal, no actual image content is attached
- a user role telling the model the browser's current state and asking what to do next; to operate the browser, it can call browser_use with the various actions and parameters listed
The model then responded to this new request:
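The bookkeeping in these bullets can be sketched as a function that rebuilds the ever-growing transcript each round. The structure mirrors the rendered Qwen template above; the helper itself is hypothetical:

```python
def make_history(question: str,
                 rounds: list[tuple[str, str]],
                 next_step_prompt: str) -> list[dict]:
    """Rebuild the message list sent each step: question, then one
    assistant tool_call + user tool_response pair per completed round,
    then the 'what next?' prompt."""
    messages = [{"role": "user", "content": question}]
    for tool_call_json, tool_output in rounds:
        messages.append({
            "role": "assistant",
            "content": f"<tool_call>\n{tool_call_json}\n</tool_call>",
        })
        messages.append({
            "role": "user",
            "content": f"<tool_response>\n{tool_output}\n</tool_response>",
        })
    messages.append({"role": "user", "content": next_step_prompt})
    return messages


history = make_history(
    "预测下明天的温度",
    [('{"name": "browser_use", "arguments": {"action": "web_search"}}',
      "Observed output of cmd `browser_use` executed: ...")],
    "What should I do next to achieve my goal?",
)
print(len(history))  # 4 messages after one completed round
```

Because the whole history is re-sent every step, the prompt grows round by round, which is exactly the 10x token amplification observed earlier.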
ChatCompletion(id='chatcmpl-0f1bebaa-d28d-924f-a493-a633df672c60', choices=[Choice(...), stop_reason=None)], created=1744267355, model='QwQ-32B-AWQ', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=685, prompt_tokens=2305, total_tokens=2990, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Again there is a single Choice:
finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='...', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-4bd9220fcfa5481eb9939f6d2ccc2b04', function=Function(arguments='{"action": "go_back"}', name='browser_use'), type='function')], reasoning_content=None
So it returned tool_calls again, still calling the browser_use tool, this time with action go_back: the model wants to step back and look.
The content field again holds the reasoning model's thinking:
Okay, let\'s see. The user wants to predict tomorrow\'s temperature. I first did a web search using the browser tool, but the result led to a Google Support page explaining their weather data sources, not the actual forecast. So the user is probably expecting the actual temperature, not just the data sources.
Hmm, the current page isn\'t showing the forecast. Maybe I need to go back and try a different search. The current URL is the support page, so perhaps the initial search query wasn\'t specific enough. Let me check the search query again. The user\'s query was "tomorrow\'s weather forecast", which should work, but maybe the first result was just the help page. Maybe I should try clicking on another result, but the metadata shows only one result was returned. Wait, that\'s odd. Maybe the search didn\'t return the actual forecast page because the query was too generic?
Alternatively, maybe I should navigate directly to a weather website. For example, going to weather.com or accuweather.com. Since the browser tool can go to a URL, I can try that. Let me try going to "https://www.google.com/search?q=tomorrow\'s+weather" to see if that gives a more direct result. Or perhaps use a specific location? Wait, the user didn\'t specify a location. Oh right, the user might be in a different area. Since the initial search didn\'t work, maybe I need to adjust the query to include a location. But without knowing the user\'s location, that\'s tricky. Alternatively, maybe the Google Weather widget on the main page shows the forecast. Let me try going to Google\'s weather search directly.
Alternatively, maybe I should extract content from the current page to see if there\'s any hidden info. But the current page is about data sources. The user probably needs to go back and try a different search term. Let me use the "go_back" action to return to the search results and maybe try a different query. Wait, the current page is the first result. Maybe the actual forecast is on another tab or another result. But the metadata shows only one result. Hmm, perhaps the search was in Chinese? The title shows "天气数据 - Google 搜索帮助", which is Chinese. Maybe the user\'s search was in Chinese? The original query was in English, but the result is in Chinese. Maybe the browser\'s language settings are causing this. Alternatively, maybe I should try a more specific search term in English. Let me try modifying the search query to include the location. For example, "tomorrow\'s weather forecast in [location]". But without knowing the location, this is hard. Alternatively, maybe the user is in the default country (US as per metadata), so I can assume a location like New York. Let me try that.
Alternatively, maybe I should use the extract_content action on the current page to see if there\'s any temperature data. But looking at the content provided, it\'s just about data sources. So that won\'t help. The best approach is probably to go back to the search results and try a different search. Let me first go back using the browser\'s go_back action. Then, perform a more specific search. Let me try that step by step.
</think>
Calling the tool
The client (OpenManus) logs:
2025-04-10 15:07:14.800 | INFO | app.agent.toolcall:think:82 - 🛠️ Manus selected 1 tools to use
2025-04-10 15:07:24.133 | INFO | app.agent.toolcall:think:86 - 🧰 Tools being prepared: ['browser_use']
2025-04-10 15:07:25.041 | INFO | app.agent.toolcall:think:89 - 🔧 Tool arguments: {"action": "go_back"}
Because the model returned tool_calls, tool execution continues:
result = await self.execute_tool(command)
...
result = await self.available_tools.execute(name=name, tool_input=args)
The command object at this point is:
ChatCompletionMessageToolCall(id='chatcmpl-tool-4bd9220fcfa5481eb9939f6d2ccc2b04', function=Function(arguments='{"action": "go_back"}', name='browser_use'), type='function')
The args object:
{'action': 'go_back'}
The model's go_back instruction executes, after which the OpenManus console prints:
2025-04-10 15:30:07.112 | INFO | app.agent.toolcall:think:82 - 🛠️ Manus selected 1 tools to use
2025-04-10 15:30:09.474 | INFO | app.agent.toolcall:think:86 - 🧰 Tools being prepared: ['browser_use']
2025-04-10 15:30:10.220 | INFO | app.agent.toolcall:think:89 - 🔧 Tool arguments: {"action": "go_back"}
2025-04-10 15:32:53.330 | INFO | app.agent.toolcall:execute_tool:180 - 🔧 Activating tool: 'browser_use'...
2025-04-10 15:32:54.812 | INFO | app.agent.toolcall:act:150 - 🎯 Tool 'browser_use' completed its mission! Result: Observed output of cmd `browser_use` executed:
Navigated back
2025-04-10 15:32:54.812 | INFO | app.agent.base:run:140 - Executing step 3/20
On to step 3.
step 3
The prompt for the third request to the LLM (... marks elided history messages):
<|im_start|>system
<|im_start|>user
预测下明天的温度<|im_end|>
<|im_start|>user
。。。
<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "browser_use", "arguments": {"action": "web_search", "query": "tomorrow\'s weather forecast"}}
</tool_call><|im_end|>
<|im_start|>user
。。。
<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "browser_use", "arguments": {"action": "go_back"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
Observed output of cmd `browser_use` executed:
Navigated back
</tool_response><|im_end|>
<|im_start|>user
Current browser screenshot:<|im_end|>
<|im_start|>user
What should I do next to achieve my goal?
。。。
Consider both what\'s visible and what might be beyond the current viewport.
Be methodical - remember your progress and what you\'ve learned so far.
<|im_end|>
<|im_start|>assistant
<think>
The prompt is prefixed with the full conversation history, including past browser states, followed by the most recent tool call and the latest browser state.
Calling the tool
Following the tool_calls instruction returned by the model, 'browser_use' {'action': 'go_to_url', 'url': 'https://www.weather.com'}, the browser_use tool is executed again. In other words, the model judged that the default page reached by the earlier web_search for "tomorrow's weather forecast" was no good, and asked to navigate directly to www.weather.com.
2025-04-10 15:34:15.268 | INFO | app.agent.toolcall:think:82 - 🛠️ Manus selected 1 tools to use
2025-04-10 15:34:15.268 | INFO | app.agent.toolcall:think:86 - 🧰 Tools being prepared: ['browser_use']
2025-04-10 15:34:15.268 | INFO | app.agent.toolcall:think:89 - 🔧 Tool arguments: {"action": "go_to_url", "url": "https://www.weather.com"}
2025-04-10 16:12:33.943 | INFO | app.agent.toolcall:execute_tool:180 - 🔧 Activating tool: 'browser_use'...
2025-04-10 16:14:40.577 | INFO | app.agent.toolcall:act:150 - 🎯 Tool 'browser_use' completed its mission! Result: Observed output of cmd `browser_use` executed:
Navigated to https://www.weather.com
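The dispatch step above boils down to a little glue code: parse the function name and JSON arguments out of the model's tool_calls, look the function up in a registry, and invoke it. A minimal sketch (this `TOOLS` registry and `browser_use` stub are hypothetical stand-ins for OpenManus's real ToolCollection and playwright-backed browser tool):

```python
import json

# Hypothetical stub: the real tool drives a playwright browser; this one only
# echoes the request so the dispatch logic can be shown in isolation.
def browser_use(action: str, **kwargs) -> str:
    return f"Observed output of cmd `browser_use` executed: {action} {kwargs}"

# Hypothetical registry standing in for OpenManus's ToolCollection.
TOOLS = {"browser_use": browser_use}

def execute_tool_call(name: str, arguments: str) -> str:
    """Dispatch one tool_call returned by the model: parse JSON args, call the function."""
    return TOOLS[name](**json.loads(arguments))

# The exact tool_call from the log above:
result = execute_tool_call(
    "browser_use",
    '{"action": "go_to_url", "url": "https://www.weather.com"}',
)
print(result)
```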
2025-04-10 16:14:40.579 | INFO | app.agent.base:run:140 - Executing step 4/20
step 4
After navigating to the new address, the loop enters its 4th iteration: the navigation result and the browser state are handed back to the LLM. The final Prompt this time:
。。。 (history from the previous rounds)
<|im_start|>assistant
<tool_call>
{"name": "browser_use", "arguments": {"action": "go_to_url", "url": "https://www.weather.com"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
Observed output of cmd `browser_use` executed:
Navigated to https://www.weather.com
</tool_response><|im_end|>
<|im_start|>user
Current browser screenshot:<|im_end|>
<|im_start|>user
What should I do next to achieve my goal?
When you see [Current state starts here], focus on the following:
- Current URL and page title
URL: https://weather.com/zh-CN/weather/today/l/CHXX0008:1:CH?Goto=Redirected
Title: 东城区天气预报和情况 - The Weather Channel | Weather.com
- Available tabs
1 tab(s) available
- Interactive elements and their indices
- Content above or below the viewport (if indicated)
- Any action results or errors
For browser interactions:
- To navigate: browser_use with action="go_to_url", url="..."
- To click: browser_use with action="click_element", index=N
- To type: browser_use with action="input_text", index=N, text="..."
- To extract: browser_use with action="extract_content", goal="..."
- To scroll: browser_use with action="scroll_down" or "scroll_up"
Consider both what\'s visible and what might be beyond the current viewport.
Be methodical - remember your progress and what you\'ve learned so far.
<|im_end|>
<|im_start|>assistant
<think>
The LLM's response:
ChatCompletionMessage(content='...', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-c5adc064c18540058dd98b728556dd65', function=Function(arguments='{"action": "extract_content", "goal": "find tomorrow\'s temperature forecast"}', name='browser_use'), type='function')], reasoning_content=None)
Reasoning trace:
Okay, the user wants to predict tomorrow\'s temperature. Let me see what steps I\'ve taken so far. First, I did a web search for "tomorrow\'s weather forecast," which led to a Google Support page, but that didn\'t have the actual forecast. Then I went back and navigated to weather.com. Now I\'m on the Weather Channel page for Dongcheng District.
Hmm, the current page title mentions a weather forecast. I need to check if the temperature for tomorrow is visible here. Since the user is probably in China (given the URL\'s location code CHXX0008), the page might display the local forecast.
The next step is to extract the content related to tomorrow\'s temperature. The browser tool\'s extract_content action can be used with a goal parameter specifying the temperature. Let me call that function. I\'ll set the goal to "find tomorrow\'s temperature forecast" to get the relevant information from the page.
Wait, maybe I should check if scrolling is needed? The content might be below the viewport. But since extract_content can handle that by analyzing the page\'s content, perhaps it\'s better to try extracting first. If that doesn\'t work, I can scroll. Let me try the extract_content action now.
</think>
The LLM analyzed the situation and decided it should call the browser tool's extract_content action to pull content from the page, so it replied with a ToolCall for browser_use with arguments {"action": "extract_content", "goal": "find tomorrow's temperature forecast"}.
2025-04-10 16:26:25.660 | INFO | app.agent.toolcall:think:82 - 🛠️ Manus selected 1 tools to use
2025-04-10 16:26:25.660 | INFO | app.agent.toolcall:think:86 - 🧰 Tools being prepared: ['browser_use']
2025-04-10 16:26:25.660 | INFO | app.agent.toolcall:think:89 - 🔧 Tool arguments: {"action": "extract_content", "goal": "find tomorrow's temperature forecast"}
2025-04-10 16:26:33.056 | INFO | app.agent.toolcall:execute_tool:180 - 🔧 Activating tool: 'browser_use'...
This act phase needs to extract information from a web page, so during execution it goes straight back to the LLM and asks it to do the extraction. The code reaches this point in app/tool/browser_use_tool.py:
# Use LLM to extract content with required function calling
response = await self.llm.ask_tool(
messages,
tools=[extraction_function],
tool_choice="required",
)
The Prompt the LLM received:
<|im_start|>system
Your task is to extract the content of the page. You will be given a page and a goal, and you should extract all relevant information around this goal from the page. If the goal is vague, summarize the page. Respond in json format.
Extraction goal: find tomorrow\'s temperature forecast
Page content:
东城区天气预报和情况 - The Weather Channel | Weather.com
AdvertisementAdvertisement[跳至主要内容](#MainContent)[辅助功能帮助](https://www.essentialaccessibility.com/the-weather-channel?utm_source=theweatherchannelhomepage&utm_medium=iconlarge&utm_term=eachannelpage&utm_content=header&utm_campaign=theweatherchannel)[The Weather Channel](/zh-CN)键入至少三个字符以开始自动完成。如果没有搜索查询,将显示最近搜索的位置。将自动选择第一个选项。使用向上和向下箭头更改选择。使用 escape 键清除。搜索城市或邮编Search
recents
-------
全部清除您没有最近使用过的位置Globe
CN°CArrow down
* °F
* °C
* 混合
攝氏 / 毫米 / 公里 / km/h / 毫巴美洲
Add
[安提瓜和巴布达 | English](/en-AG/weather/today/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[阿根廷 | Español](/es-AR/tiempo/hoy/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[巴哈马 | English](/en-BS/weather/today/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[巴巴多斯 | English](/en-BB/weather/today/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[伯利兹 | English](/en-BZ/weather/today/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[玻利维亚 | Español](/es-BO/tiempo/hoy/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[巴西 | Português](/pt-BR/clima/hoje/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[加拿大 | English](/en-CA/weather/today/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[加拿大 | Français](/fr-CA/temps/aujour/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[智利 | Español](/es-CL/tiempo/hoy/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[哥伦比亚 | Español](/es-CO/tiempo/hoy/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[哥斯达黎加 | Español](/es-CR/tiempo/hoy/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[多米尼加 | English](/en-DM/weather/today/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[多米尼加共和国 | Español](/es-DO/tiempo/hoy/l/b03e06bc789c91465cb39e738b85c9e48d4ee723896c6bcec3a9ef6a4ca37e86)[厄瓜多尔 | Español](/es-EC/t
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "extract_content", "description": "Extract specific information from a webpage based on a goal", "parameters": {"type": "object", "properties": {"extracted_content": {"type": "object", "description": "The content extracted from the page according to the goal", "properties": {"text": {"type": "string", "description": "Text content extracted from the page"}, "metadata": {"type": "object", "description": "Additional metadata about the extracted content", "properties": {"source": {"type": "string", "description": "Source of the extracted content"}}}}}}, "required": ["extracted_content"]}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>assistant
<think>
The LLM's reply:
ChatCompletion(id='chatcmpl-03be5fae-b64a-9158-b006-e1174de3e05c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-c88a4a9bf6cb41c791b76a8ebbd0ec69', function=Function(arguments='{"extracted_content": {"text": "明天的温度预报没有在提供的页面内容中明确给出。页面主要显示了东城区天气预报和情况的标题,以及一些广告和国际位置的链接,但没有具体的温度数据或预报信息。", "metadata": {"source": "东城区天气预报和情况 - The Weather Channel | Weather.com"}}}', name='extract_content'), type='function')], reasoning_content=None), stop_reason=None)], created=1744273652, model='QwQ-32B-AWQ', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=97, prompt_tokens=1525, total_tokens=1622, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
This is because OpenManus never passed the complete page content to the LLM. The actual page looked like this:
(image.png: screenshot of the real weather.com page, with the temperature in the main content area)
The main body, the part with the temperature, got lost in transit, so of course the model could not find it. At this point OpenManus's showpiece has failed. Doing this page-extraction task to a higher standard will probably require combining screenshots with a multimodal vision model.
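The failure mode is easy to reproduce: if the page text is cut to a fixed character budget before it reaches the model, anything past the cutoff (here, the temperature readout buried under kilobytes of navigation links) simply vanishes. A minimal sketch of the problem (the 2000-character limit is an illustrative assumption, not OpenManus's actual behavior):

```python
def build_extraction_prompt(page_text: str, goal: str, max_chars: int = 2000) -> str:
    """Naive truncation: keep only the first max_chars of the page text."""
    return f"Extraction goal: {goal}\nPage content:\n{page_text[:max_chars]}"

# Simulated page: a long navigation header followed by the actual forecast.
page = "nav-link " * 1000 + "明天最高气温 33°C"
prompt = build_extraction_prompt(page, "find tomorrow's temperature forecast")

# The forecast sits past the cutoff, so the model never sees it.
print("33°C" in prompt)  # False
```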
The final result of this tool execution:
"Observed output of cmd `browser_use` executed:
Extracted from page:
{'text': '明天的温度预报没有在提供的页面内容中明确给出。页面主要显示了东城区天气预报和情况的标题,以及一些广告和国际位置的链接,但没有具体的温度数据或预报信息。', 'metadata': {'source': '东城区天气预报和情况 - The Weather Channel | Weather.com'}}"
The OpenManus console log that follows:
2025-04-10 16:47:06.109 | INFO | app.llm:update_token_count:250 - Token usage: Input=1525, Completion=97, Cumulative Input=11028, Cumulative Completion=1913, Total=1622, Cumulative Total=12941
2025-04-10 16:50:06.830 | INFO | app.agent.toolcall:act:150 - 🎯 Tool 'browser_use' completed its mission! Result: Observed output of cmd `browser_use` executed:
Extracted from page:
{'text': '明天的温度预报没有在提供的页面内容中明确给出。页面主要显示了东城区天气预报和情况的标题,以及一些广告和国际位置的链接,但没有具体的温度数据或预报信息。', 'metadata': {'source': '东城区天气预报和情况 - The Weather Channel | Weather.com'}}
2025-04-10 16:50:15.303 | INFO | app.agent.base:run:140 - Executing step 5/20
step 5
This result, together with the full history, is handed back to the LLM once more. The LLM returns:
ChatCompletion(id='chatcmpl-cb34dc04-1002-957c-835d-cea32a553c32', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='。。。', function=Function(arguments='{"action": "scroll_down", "scroll_amount": 500}', name='browser_use'), type='function')], reasoning_content=None), stop_reason=None)], created=1744275119, model='QwQ-32B-AWQ', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=222, prompt_tokens=3246, total_tokens=3468, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Reasoning trace:
Okay, let\'s see. The user wants to predict tomorrow\'s temperature. I tried searching using the browser, but the first result was a Google help page which didn\'t have the actual forecast. Then I went back and navigated to weather.com. The extracted content from that page said there was no explicit temperature forecast for tomorrow. Hmm, maybe the current page only shows today\'s weather. I should check if there\'s a tab or link to the 3-day or weekly forecast. Let me look for elements on the page. Since the current URL is the "today" page, perhaps clicking on a "tomorrow" or "5-day" option would help. Alternatively, maybe scrolling down reveals more details. Let me try scrolling down to see if the temperature info is further down the page. If that doesn\'t work, maybe I need to click a specific element like a tab for the next day\'s forecast. I\'ll try scrolling first.\n</think>\n\n', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-960b329d2c3345e9bbd6c358d4fbd594
Misled by the bad extraction, the LLM starts guessing blindly, asking to scroll the page and so on. At this point the task can be called a failure, and we end the debugging session.
Prompt Engineering
Looking at how OpenManus interacts with the LLM, there is essentially no difference from the Prompt engineering of two years ago, when OpenAI's ChatGPT 3.5 first came out. A customer-service project we built for a French client back then worked exactly this way: we told ChatGPT 3.5 via the Prompt which functions (tools) were available (OpenAI's API called this feature Function Calling), let the model analyze the natural language, decide which function to call, extract and process the parameters, and return them in the required structured format; we then parsed the function name and arguments out of the structured reply, called the function to get the result data, and handed that back to the model to phrase the answer. So what has changed in these two years? For one, the models have grown far more capable; for another, RAG, workflows, and the like have all appeared, and Agents have become popular. Yet to this day, it is fair to say that every Agent, RAG, and workflow system is, at heart, Prompt engineering, just with added support for external information storage and retrieval and for operating on external systems. The process has also become more standardized, giving rise to MCP and, most recently, the A2A protocol.
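The loop described here (declare tools, let the model pick a call, execute it, feed the result back) can be sketched end to end. In this minimal sketch, `fake_llm` is a stand-in for a real chat-completions call, and the `get_weather` tool and the model's choices are illustrative assumptions:

```python
import json

# Tool schema in OpenAI Function Calling style.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather forecast for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def get_weather(city: str) -> str:
    # Stand-in for a real weather-service call.
    return f"{city}: sunny, high 33°C tomorrow"

def fake_llm(messages, tools):
    """Stand-in for the chat API: first picks a tool, then phrases the answer."""
    if messages[-1]["role"] == "user":
        return {"tool_calls": [{"name": "get_weather",
                                "arguments": json.dumps({"city": "Hangzhou"})}]}
    # Second round: the tool result is now in context, so answer in prose.
    return {"content": "Tomorrow in Hangzhou will be sunny with a high of 33°C."}

# Round 1: the model decides which function to call and with what arguments.
messages = [{"role": "user", "content": "预测下明天的温度"}]
reply = fake_llm(messages, [WEATHER_TOOL])
call = reply["tool_calls"][0]
result = get_weather(**json.loads(call["arguments"]))

# Round 2: feed the tool result back so the model can phrase the final answer.
messages += [{"role": "tool", "content": result}]
final = fake_llm(messages, [WEATHER_TOOL])
print(final["content"])
```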
The Agent Concept
Earlier I argued for reading Agent literally: an agent is a proxy, a wrapper around the manual process of using an LLM so that you can use it directly. Back to the question: why did merely wrapping LLM usage give rise to such a popular new concept?
Because the times have changed. Large models have pushed software into a new era.
Before large models, software was built in a strictly literal fashion: every function call, from the outermost invocation down to calls between internal modules, had explicit, well-specified parameters and value ranges. Deterministic input, deterministic output. Such software simply could not handle natural language, with its endless, unenumerable variations and combinations.
- From the outside: even when a piece of software was functionally complete, if you handed it a description of a task, a human had to extract the intent (which function or feature to invoke) and the parameters from that description and feed them in through the software's UI, command line, or API before the software could do the job and return a result.
- From the inside: when different systems had to be integrated into a more cohesive product, every cross-system call required engineers to read each system's and module's functional specs and input requirements and write the glue code by hand.
With large models, both processes can in principle be automated, with no human in the loop.
- From the outside: it is as if an immensely powerful, extremely elastic UI suddenly appeared. You can type natural language directly, and the model translates it into the well-formed input the underlying system expects. Every piece of software can now take natural language as input.
- From the inside: much of the old system-integration work may simply disappear. The model acts as a super-elastic adhesive that accommodates all kinds of mismatches. What used to be a requirement and a development cycle may become a single Prompt, and the result is far more flexible, covering more cases than hand-written code ever could.
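The "super-elastic glue" point can be made concrete: what used to be a hand-written field-mapping adapter between two systems becomes a prompt plus a JSON parse. A minimal sketch, with `llm_translate` as a hypothetical stand-in for a real chat-model call:

```python
import json

def llm_translate(record: dict, target_schema: str) -> str:
    """Stand-in for an LLM call: 'translate this record into the target schema'.
    A real version would send the record and schema as a prompt to a chat model;
    here the model's output is simulated so the shape of the idea is visible."""
    return json.dumps({"customer_name": record["name"],
                       "total_cents": int(record["amount_yuan"] * 100)})

# System A emits this record; System B expects {"customer_name", "total_cents"}.
order = {"name": "张三", "amount_yuan": 12.5}
adapted = json.loads(
    llm_translate(order, '{"customer_name": str, "total_cents": int}'))
print(adapted)  # {'customer_name': '张三', 'total_cents': 1250}
```

The flexibility claim in the text follows from this shape: when System A renames a field or adds a new variant, the prompt-based adapter can often absorb the change with no code edit at all.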
Beyond profoundly changing how software is built and used, large models also extend the boundary of what software can do. Because they solve the language-enumeration problem once and for all, grasping the semantic essence of information in whatever form it takes, many things that used to be hard to automate, things that demanded enormous engineering effort or manual operation, can now in principle be done well without human hands. In any given niche, full automation without human operation is what we call intelligence. There are countless such niches, and each can be upgraded from traditional automation into AI-era intelligence. It is as if a catalyst had been dropped into the whole software ecosystem, giving traditional software wings and breaking through its old barriers. Everything before large models can be classed as Software 1.0; with large models, we enter the new era of Software 2.0.
This is the meaning behind Agent. An Agent is the new kind of artifact that emerges once large models transform the software world. Any new software that harnesses model capabilities and extends the boundary of intelligence can be called an Agent. The transformation deserves a new name, or at least deserves specializing an existing word for a new domain, just as Cloud was once specialized in computing into the cloud paradigm. If anything, Agent has a bit less flair and imagination than Cloud did; the Chinese rendering 智能体 ("intelligent body") is actually the more apt term.
Traditional Software 1.0 + AI large models = Software 2.0 = Agent
That is my definition of Agent. Whether it matches industry consensus, or is off the mark, corrections are welcome.
Agent Maturity
As the large-model catalyst accelerates the intelligent evolution of the software ecosystem, a great deal of engineering is still needed, and as model capabilities keep growing, nearly everything will be reshuffled. In the debug session above, the first time OpenManus queried the model, the reasoning trace contained this line: "I should use the browser_use function and run a web_search for 'tomorrow's weather.' A more specific location would help, but the user didn't provide one, so I may have to assume the user's location, or let the search engine geolocate by IP. As an AI, I should probably just run the search." Clearly OpenManus still lacks a local geolocation tool; with one, the model would fetch the user's location first. That is exactly where an agent's intelligence shows.
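Filling that gap is cheap in this architecture, since a tool is just a schema plus a function. A hypothetical `get_location` tool (the schema, field names, and geolocation backend below are all assumptions for illustration, not OpenManus code):

```python
# Hypothetical location tool; in OpenManus it would subclass the project's
# base tool class and be registered in the agent's tool collection.
GET_LOCATION_SCHEMA = {
    "type": "function",
    "function": {
        "name": "get_location",
        "description": "Get the user's approximate city via IP geolocation",
        "parameters": {"type": "object", "properties": {}},
    },
}

def get_location() -> dict:
    # Stand-in for a real IP-geolocation lookup against an external service.
    return {"city": "Hangzhou", "region": "Yuhang", "source": "ip-geolocation"}

# With this tool declared, the model can resolve "tomorrow's weather" to a
# concrete place before searching, instead of guessing or assuming.
loc = get_location()
query = f"{loc['city']} tomorrow weather forecast"
print(query)  # Hangzhou tomorrow weather forecast
```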
I tried the same question with Tencent Yuanbao; the browser immediately popped up a request for location access:
and it then gave an accurate answer:
According to the latest forecast, tomorrow (Friday, April 11, 2025) in Yuhang District:
High: 32-33°C (cloudy and hot during the day)
Low: 16-18°C (cool in the early morning)
Important notes:
Severe convective weather: tomorrow night turns overcast with showers or thunderstorms, possibly with force 8-10 thunderstorm winds, short bursts of heavy rain, intense lightning, and localized hail; close doors and windows in advance and limit time outdoors.
Cliff-edge cooldown: from the 12th, temperatures plunge 7-10°C, with weekend highs of only about 22°C; add layers promptly.
Guard against heat during the day, watch for severe convection at night, and have warm clothing ready for the weekend.
Clearly, open-source OpenManus, and every other agent being built today, still has a long way to go...
Future Outlook
Just now, at the Google Cloud Next conference, Google open-sourced the Agent2Agent (A2A) protocol, which for the first time enables agent interoperability across frameworks, platforms, and vendors through a standardized communication interface. Together with the already-popular MCP protocol, the future internet will become an "intelligence net" woven from countless agents, able to interoperate with the physical world, like the Skynet of science fiction. From then on, and starting now, the real danger is not the naive fantasy of a Skynet-style AI awakening to control humanity, but the test posed to humanity's political civilization: what kind of political system can turn the productivity leap AI brings into broadly shared prosperity, rather than merely producing waves of unemployment and a class of "terminal owners"; what kind of political system can represent the direction of human civilization.