A Detailed Guide to Memory in LlamaIndex

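Memory is a core component of an agent: it stores and retrieves historical information. In LlamaIndex, you can typically customize memory by using an existing BaseMemory class or by creating a custom one.

Call memory.put() to store information and memory.get() to retrieve it.

Memory is divided into short-term memory and long-term memory, described below.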

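I. Short-Term Memory

By default, the Memory class stores the last X messages that fit within a token limit. You can customize this behavior by passing the token_limit and chat_history_token_ratio arguments to the Memory class.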

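1. token_limit

token_limit (default: 30000): the maximum number of short-term and long-term tokens to store.

Purpose: keeps the total number of tokens held in memory from exceeding this limit, so the context does not grow long enough to degrade performance or exceed what the model can handle.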

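2. chat_history_token_ratio

chat_history_token_ratio (default: 0.7): the ratio of tokens in the short-term chat history to the total token limit. If the chat history exceeds this ratio, the oldest messages are flushed to long-term memory (if long-term memory is enabled).

Calculation: short_term_limit = token_limit * chat_history_token_ratio.

For example, with token_limit=50 and chat_history_token_ratio=0.7, short-term memory holds at most 35 tokens, and the remainder is available for long-term memory.

Purpose: controls the token allocation for the short-term (most recent) conversation. Once short-term history exceeds its share, old messages are moved out of short-term memory and into long-term memory.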
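The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration rather than LlamaIndex's own code: the helper name split_token_budget is hypothetical, and rounding to the nearest whole token is an assumption made for this example.

```python
def split_token_budget(token_limit: int, chat_history_token_ratio: float) -> tuple[int, int]:
    """Return (short_term_limit, remainder) for a given total budget.

    Hypothetical helper for illustration only; rounding to a whole token
    is an assumption of this sketch, not documented library behavior.
    """
    if not 0.0 < chat_history_token_ratio <= 1.0:
        raise ValueError("chat_history_token_ratio must be in (0, 1]")
    short_term_limit = round(token_limit * chat_history_token_ratio)
    return short_term_limit, token_limit - short_term_limit


# The worked example from the text: 50 tokens with a 0.7 ratio.
print(split_token_budget(50, 0.7))     # (35, 15)
# The defaults listed above: 30000 tokens with a 0.7 ratio.
print(split_token_budget(30000, 0.7))  # (21000, 9000)
```

The second value is only the space left over for long-term content; it is not a separate setting.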

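3. token_flush_size

token_flush_size (default: 3000): the number of tokens flushed from short-term memory into long-term memory each time short-term messages exceed short_term_limit (= token_limit * chat_history_token_ratio).

Purpose: migrates old messages in batches of a fixed size, which keeps flushes regular instead of moving messages out one at a time. If long-term memory is not enabled, flushed messages are archived and removed from short-term memory.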

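Example code

from llama_index.core.memory import Memory

# With these values the short-term limit is 40000 * 0.7 = 28000 tokens.
memory = Memory.from_defaults(
    session_id="my_session",
    token_limit=40000,
    chat_history_token_ratio=0.7,
    token_flush_size=3000,
)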

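Code analysis

During the conversation:

New messages are appended to the short-term memory queue (FIFO), and the current short-term token count is computed.

If the count exceeds token_limit * chat_history_token_ratio (28000 tokens in this example), a flush is triggered.

When a flush is triggered:

The oldest messages (about token_flush_size tokens in total) are moved to long-term memory.

Once long-term memory has processed them, short-term memory frees the same amount of token space.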
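The queue-and-flush behavior described above can be modeled with a small, self-contained simulation. This is a simplified sketch, not the library's implementation: the ShortTermQueue class and the count_tokens helper are hypothetical, tokens are approximated as whitespace-separated words, and the flushed list merely stands in for long-term memory.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


class ShortTermQueue:
    """Toy FIFO model of short-term memory with batch flushing."""

    def __init__(self, token_limit: int, ratio: float, token_flush_size: int):
        self.short_term_limit = round(token_limit * ratio)
        self.token_flush_size = token_flush_size
        self.messages: deque[str] = deque()
        self.flushed: list[str] = []  # stands in for long-term memory

    def tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def put(self, message: str) -> None:
        self.messages.append(message)
        if self.tokens() <= self.short_term_limit:
            return
        # Over the short-term limit: move the oldest messages out until
        # at least token_flush_size tokens have been flushed.
        moved = 0
        while self.messages and moved < self.token_flush_size:
            oldest = self.messages.popleft()
            moved += count_tokens(oldest)
            self.flushed.append(oldest)


queue = ShortTermQueue(token_limit=20, ratio=0.5, token_flush_size=4)
for i in range(6):
    queue.put(f"message number {i}")  # 3 tokens each

print(list(queue.messages))  # the two newest messages remain
print(queue.flushed)         # the four oldest messages were flushed
```

Because whole messages are moved, each flush in this model can overshoot token_flush_size slightly, which is why the text says "about" token_flush_size tokens.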

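Memory.get():

Pulls short-term and long-term content and merges them into the final context, keeping the total within token_limit.

The long-term memory blocks (such as FactExtraction, Vector, and so on) are managed by priority and truncated when necessary.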
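To make the priority-based truncation concrete, here is a minimal sketch of the idea. It is not the library's algorithm: the Block dataclass and fit_blocks function are hypothetical, token counts are supplied by hand, and dropping a whole block at a time is a simplifying assumption. The sketch treats priority 0 as "never dropped" and larger numbers as "dropped first".

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a long-term block's rendered content."""
    name: str
    priority: int  # 0 = always kept; larger numbers are dropped first
    tokens: int


def fit_blocks(short_term_tokens: int, blocks: list[Block], token_limit: int) -> list[str]:
    """Return the names of the blocks that stay in the final context."""
    kept = sorted(blocks, key=lambda b: b.priority)
    total = short_term_tokens + sum(b.tokens for b in kept)
    # Drop whole blocks, largest priority number first, until the context
    # fits. Priority-0 blocks are never dropped in this sketch.
    while total > token_limit and kept and kept[-1].priority > 0:
        dropped = kept.pop()
        total -= dropped.tokens
    return [b.name for b in kept]


blocks = [
    Block("core_info", priority=0, tokens=20),
    Block("extracted_info", priority=1, tokens=100),
    Block("vector_memory", priority=2, tokens=200),
]
# 300 + 320 = 620 tokens exceeds 500, so the priority-2 block is dropped.
print(fit_blocks(short_term_tokens=300, blocks=blocks, token_limit=500))
```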

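Parameter summary

| Parameter | Meaning | Purpose |
| --- | --- | --- |
| token_limit | Upper limit on the total tokens held in Memory | Caps the combined capacity of short-term and long-term memory |
| chat_history_token_ratio | Share of the total limit given to short-term memory | Controls the token split between short-term and long-term memory |
| token_flush_size | Number of tokens moved to long-term memory per flush | Clears short-term memory in batches by migrating old messages |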


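II. Long-Term Memory

1. Definition

In the LlamaIndex Memory system, short-term memory stores the most recent conversation content (governed by chat_history_token_ratio and token_limit), while long-term memory takes the old messages that were flushed out and stores them in structured memory modules so that useful information is retained over time.

Long-term memory is made up of multiple Memory Blocks, each responsible for handling a different kind of information. When memory is retrieved, short-term and long-term memory are merged.

There are currently three predefined memory blocks:

StaticMemoryBlock, FactExtractionMemoryBlock, and VectorMemoryBlock.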

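2. The three Memory Block types

StaticMemoryBlock

Stores static, unchanging information such as a basic user profile or system settings. This information is always kept and has the highest priority (priority=0).

FactExtractionMemoryBlock

Uses an LLM to automatically extract key facts from flushed conversation (for example, "the user is 29 years old" or "the user likes cats") and keeps them as a structured list. It holds at most max_facts entries; when the limit is exceeded, the facts are automatically condensed.

VectorMemoryBlock

Stores batches of flushed messages in a vector database (such as Qdrant or Chroma). Relevant history can later be retrieved by embedding similarity to the current conversation, which provides contextual continuity.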
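The embedding-based retrieval idea behind VectorMemoryBlock can be illustrated with a toy example. This sketch does not use LlamaIndex or a real vector store: the cosine and top_k functions are hypothetical helpers, and the three-dimensional vectors are hand-written stand-ins for real embeddings. The k argument plays the same conceptual role as a top-k retrieval setting.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Flushed message batches with made-up embeddings.
store = [
    ("User asked about the capital of France.", [0.9, 0.1, 0.0]),
    ("User said they like cats.", [0.0, 0.2, 0.9]),
    ("User asked about French cuisine.", [0.7, 0.6, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of a question about Paris

print(top_k(query_vector, store, k=2))
```

With these vectors, the two France-related batches rank ahead of the unrelated one, which is the kind of semantic recall the block is meant to provide.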


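3. How long-term memory works

When the token count of short-term memory exceeds the threshold (token_limit × chat_history_token_ratio), an automatic flush is triggered, and the oldest messages (about token_flush_size tokens) are pushed to the Memory Blocks for processing.
Each Memory Block handles these messages according to its own logic (extracting facts, storing vectors, or holding static content).
When memory.get() is called, the system combines the short-term messages with the long-term memory from each block; if the total exceeds the token limit, blocks are truncated in priority order.
An automatic flush pushes data into FactExtractionMemoryBlock and VectorMemoryBlock on its own, so you do not need to manage the internals.
StaticMemoryBlock must be populated manually; flushes do not write data into it.
Long-term memory is optional in Memory, and each of the three blocks can be included or left out as needed.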

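4. Why use long-term memory?

Persistence across multi-turn conversations: retains user information and context so the bot does not forget what was said earlier.
Structured presentation of information: FactExtractionMemoryBlock produces clear, usable facts that are easy to draw on during reasoning.
Efficient recall of history: VectorMemoryBlock supports semantic retrieval rather than a purely linear scan.
Well suited to customer-service bots, personalized conversations, and complex multi-turn reasoning.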

5. Example code

import asyncio

import chromadb
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import (
    FactExtractionMemoryBlock,
    InsertMethod,
    Memory,
    StaticMemoryBlock,
    VectorMemoryBlock,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore

# Both OpenAI clients read the API key from the OPENAI_API_KEY environment variable.
llm = OpenAI(model="gpt-4.1-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# In-process Chroma instance; its data is discarded when the program exits.
client = chromadb.EphemeralClient()
vector_store = ChromaVectorStore(
    chroma_collection=client.get_or_create_collection("test_collection")
)

blocks = [
    StaticMemoryBlock(
        name="core_info",
        static_content="My name is Logan, and I live in Saskatoon. I work at LlamaIndex.",
        priority=0,
    ),
    FactExtractionMemoryBlock(
        name="extracted_info",
        llm=llm,
        max_facts=50,
        priority=1,
    ),
    VectorMemoryBlock(
        name="vector_memory",
        # required: pass in a vector store like qdrant, chroma, weaviate, milvus, etc.
        vector_store=vector_store,
        priority=2,
        embed_model=embed_model,
        # The top-k message batches to retrieve
        # similarity_top_k=2,
        # optional: How many previous messages to include in the retrieval query
        # retrieval_context_window=5
        # optional: pass optional node-postprocessors for things like similarity threshold, etc.
        # node_postprocessors=[...],
    ),
]


async def main():
    # A small token_limit so that flushing to long-term memory starts quickly.
    memory = Memory.from_defaults(
        session_id="my_session",
        token_limit=500,
        memory_blocks=blocks,
        insert_method=InsertMethod.SYSTEM,
    )

    await memory.aput_messages(
        [ChatMessage(role="system", content="You are a technical expert.")]
    )
    # Simulate a long conversation
    for i in range(100):
        await memory.aput_messages(
            [
                ChatMessage(role="user", content="Hello, world!"),
                ChatMessage(role="assistant", content="Hello, world to you too!"),
                ChatMessage(role="user", content="What is the capital of France?"),
                ChatMessage(role="assistant", content="The capital of France is Paris. It is known for its art, fashion, and culture."),
            ]
        )

    # aget() returns the merged short-term + long-term context.
    current_chat_history = await memory.aget()
    for msg in current_chat_history:
        print(msg)

    print(f"memory.aget().len={len(current_chat_history)}")
    print("=" * 40)
    # aget_all() returns every stored message, including flushed ones.
    all_messages = await memory.aget_all()
    print(f"memory.aget_all().len={len(all_messages)}")


if __name__ == "__main__":
    asyncio.run(main())

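6. How do you keep short-term memory from being lost?

By default, the Memory class uses an in-memory SQLite database, so its contents are gone when the process exits. You can plug in any remote database by changing the database URI. You can also customize the table name, or pass in an async engine directly, which is useful for managing your own connection pool. The code below uses PostgreSQL as the storage backend, so short-term memory survives restarts.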

  
from llama_index.core.memory import Memory  
  
memory = Memory.from_defaults(  
    session_id="my_session",  
    token_limit=40000,  
    async_database_uri="postgresql+asyncpg://postgres:mark90@localhost:5432/postgres",  
    # Optional: specify a table name  
    # table_name="memory_table",  
    # Optional: pass in an async engine directly  
    # this is useful for managing your own connection pool  
    # async_engine=engine,  
)