[RAG] How to Do RAG over Tables? TableRAG: A Framework for Understanding Large-Scale Tables

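Earlier posts in this series covered RAG methods for dense-document scenarios. This post looks at how RAG works when the data consists of large tables.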

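Existing LLM-based methods usually require the entire table as input, which causes problems such as positional bias and context-length limits, especially with large tables. To address these problems, the paper proposes the TableRAG framework, which combines query expansion with schema retrieval and cell retrieval to pinpoint the key information before passing it to the LLM. This allows more efficient data encoding and more precise retrieval, significantly shortening the prompt and reducing information loss.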

Figure: Comparison of table prompting techniques for LLMs.

(a) Read Table

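The language model reads the entire table. This is the most direct approach; the shaded region in the figure marks the data given to the model, which covers every row and column. For large tables it is usually infeasible, because the input exceeds the model's token limit.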

(b) Read Schema

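The language model reads only the table schema, that is, the column names and data types. Because nothing about the cell contents is included, information about the table's contents is lost.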

(c) Row-Column Retrieval

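Rows and columns are encoded and then selected by their similarity to the question. Only the intersection of the selected rows and columns is presented to the language model. For large tables, encoding every row and column is still infeasible.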

(d) Schema-Cell Retrieval (Ours)

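Column names and cells are encoded and then retrieved by their relevance to queries that the language model generates from the question. Only the retrieved schema and cells, that is, the retrieved column names and cell values, are given to the language model. This makes both encoding and inference more efficient.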
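To make the efficiency argument concrete, here is a minimal sketch of a cell database. It assumes that each distinct column-value pair is stored only once, which this article does not state explicitly; the function name `build_cell_database` and the sample rows are hypothetical.

```python
from collections import Counter

def build_cell_database(columns, rows):
    """Count each distinct (column, value) pair, most frequent first."""
    counts = Counter()
    for row in rows:
        for col, value in zip(columns, row):
            counts[(col, str(value))] += 1
    return counts.most_common()

columns = ["product", "price"]
rows = [
    ["wallet", 30],
    ["wallet", 45],
    ["belt", 30],
    ["wallet", 30],
]

cell_db = build_cell_database(columns, rows)
total_cells = len(rows) * len(columns)
print(len(cell_db), "distinct pairs out of", total_cells, "cells")
```

Under this assumption the number of items to encode grows with the number of distinct values rather than with the number of rows, which matters most for columns with many repeated values.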

(e) Retrieval Performance on ArcadeQA

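This panel shows the retrieval results of the different methods on the ArcadeQA dataset. TableRAG outperforms the other methods in both column and cell retrieval, which improves the subsequent table reasoning.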

Method

Figure: TableRAG example.

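The core idea is to combine schema retrieval and cell retrieval to gather the information needed to solve the problem with a program-aided LLM. In practice, there is no need to give the whole table to the LLM. The key information usually lies in the specific column names, data types, and cell values that relate directly to the question. Consider the question "What is the average price of wallets?" To answer it, a program may only need to extract the rows related to "wallet" and then compute the mean of the price column. Knowing the relevant column names and how "wallet" is represented in the table is enough to write that program. In this way, TableRAG works around the context-length limit that large tables would otherwise hit.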
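The wallet example can be sketched in pandas as follows. The column names `product` and `price` and the sample rows are made up for illustration; a real table would use whatever names the retrieval step returns.

```python
import pandas as pd

# Hypothetical table; only the column names and the spelling of "wallet"
# need to be known in order to write the two lines below.
df = pd.DataFrame({
    "product": ["leather wallet", "belt", "Canvas Wallet", "scarf"],
    "price": [40.0, 25.0, 20.0, 15.0],
})

# Keep the rows whose product mentions "wallet", ignoring case.
is_wallet = df["product"].str.contains("wallet", case=False)

# Average the price column over those rows only.
average_price = df.loc[is_wallet, "price"].mean()
print(average_price)
```

The program never needs the other rows in the prompt; it only needs to know which column holds the product text and which holds the price.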

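Figure: TableRAG workflow. The table is used to build a schema database and a cell database. The LLM then expands the question into multiple schema queries and cell queries. These queries are used in turn to retrieve schemas and column-cell pairs. The top K candidates for each query are combined and placed in the prompt of the LLM solver, which answers the question.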
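The "top K per query, then combine" step might look like the sketch below. The `encode` function is a toy character-trigram encoder standing in for a real pretrained text encoder, and `retrieve` is a hypothetical helper; neither is taken from the paper's code.

```python
import math
from collections import Counter

def encode(text):
    """Toy stand-in for a real encoder: a bag of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(queries, candidates, k):
    """Take the top-k candidates for each query and merge them without duplicates."""
    encoded = [(candidate, encode(candidate)) for candidate in candidates]
    merged = []
    for query in queries:
        query_vec = encode(query)
        ranked = sorted(encoded, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        for candidate, _ in ranked[:k]:
            if candidate not in merged:
                merged.append(candidate)
    return merged

column_names = ["order_no", "item_total", "quantity", "description", "order_status"]
print(retrieve(["order status", "quantity"], column_names, k=1))
```

The same routine can be pointed at column names for schema retrieval or at column-cell pairs for cell retrieval; only the candidate list changes.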

TableRAG Core Components

Tabular Query Expansion

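To operate on a table effectively, the key is to identify precisely which column names and cell values the query needs. Unlike earlier methods, TableRAG does not use the question itself as a single query; it generates separate queries for the schema and for cell values. For example, for the question "What is the average price for wallets?", the model is prompted to produce candidate queries for column names (such as "product" and "price") and for relevant cell values (such as "wallet"). These queries are then used to retrieve the relevant schema and cell values from the table.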
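A minimal sketch of this expansion step is shown below. The prompt wording, the JSON reply format, and the `fake_llm` stub are all assumptions made for illustration; the paper's actual expansion prompts are not reproduced in this article.

```python
import json

def build_expansion_prompt(question, target):
    """Build an illustrative prompt; target is 'column names' or 'cell values'."""
    return (
        f"Given the question below, list {target} that might be needed "
        f"to answer it. Reply with a JSON array of strings.\n"
        f"Question: {question}"
    )

def expand_queries(question, llm):
    """Ask the model separately for schema queries and for cell queries."""
    schema_queries = json.loads(llm(build_expansion_prompt(question, "column names")))
    cell_queries = json.loads(llm(build_expansion_prompt(question, "cell values")))
    return schema_queries, cell_queries

def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns canned replies."""
    if "column names" in prompt:
        return '["product", "price"]'
    return '["wallet"]'

schema_queries, cell_queries = expand_queries(
    "What is the average price for wallets?", fake_llm
)
print(schema_queries, cell_queries)
```

Passing the model in as a callable keeps the sketch independent of any particular LLM client; a real system would also need to handle replies that are not valid JSON.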

Schema Retrieval

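Once the queries are generated, schema retrieval uses a pretrained encoder f_enc to fetch the relevant column names. The encoder matches each query against the encoded column names to determine relevance. The retrieved schema data includes the column name, the data type, and example values. For columns identified as numeric or datetime, the minimum and maximum values are shown as examples; for categorical columns, the three most frequent categories are shown. The retrieved schema thus gives a structured overview of the table's format and content, which supports more targeted data extraction.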
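The schema entries described above could be assembled roughly as follows. This is a sketch under the rules stated in the paragraph (min/max for numeric or datetime columns, the three most frequent values otherwise); the helper name `describe_column` and the sample data are hypothetical, and the key names simply mirror those visible in the prompt listing.

```python
import pandas as pd

def describe_column(series):
    """Summarize one column: min/max for numeric or datetime, top-3 values otherwise."""
    entry = {"column_name": series.name, "dtype": str(series.dtype)}
    if (pd.api.types.is_numeric_dtype(series)
            or pd.api.types.is_datetime64_any_dtype(series)):
        entry["min"] = series.min()
        entry["max"] = series.max()
    else:
        # value_counts sorts by frequency, so the first three are the most common.
        entry["cell_examples"] = series.value_counts().index[:3].tolist()
    return entry

df = pd.DataFrame({
    "quantity": [1, 4, 2, 1],
    "order_status": ["Delivered", "Returned", "Delivered", "Cancelled"],
})

for name in df.columns:
    print(describe_column(df[name]))
```

Each entry stays small regardless of how many rows the table has, which is what keeps the schema part of the prompt short.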

The corresponding prompt is shown below:


              
                
========================================= Prompt =========================================
You are working with a pandas dataframe regarding "amazon seller order status prediction
orders data" in Python. The name of the dataframe is `df`. Your task is to use
`python_repl_ast` to answer the question: "What is the average price for leather wallets?"
Tool description:
- `python_repl_ast`: A Python interactive shell. Use this to execute python commands. Input
should be a valid single line python command.
Since you cannot view the table directly, here are some schemas and cell values retrieved
from the table.
Schema Retrieval Results:
{"column_name": "item_total", "dtype": "object", "cell_examples": ['$449.00', '$399.00', '$549.00']}
{"column_name": "quantity", "dtype": "int64", "min": 1, "max": 4}
{"column_name": "order_no", "dtype": "object", "cell_examples": ['402-4845680-8041921', '405-9763961-5211537', '404-3964908-7850720']}
...
Cell Retrieval Queries: leather wallets, average price, order status, prediction, amazon
seller
Cell Retrieval Results:
{"column_name": "order_status", "cell_value": "Delivered to buyer"}
{"column_name": "description", "cell_value": "Pure Leather Camel Color Gent's Wallet with Coin Compartment and Card Holders | Men's Ultra Slim Money Organiser (1 pc)"}
...
Strictly follow the given format to respond:
Thought: you should always think about what to do
Action: the single line Python command to execute
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times)
Thought: before giving the final answer, you should think about the observations
Final Answer: the final answer to the original input question (Answer1, Answer2, ...)
Notes:
- Do not use markdown or any other formatting in your responses.
- Ensure the last line is only "Final Answer: Answer1, Answer2, ..." form, no other form.
- Directly output the Final Answer rather than outputting by Python.
- Ensure to have a concluding thought that verifies the table, observations and the
question before giving the final answer.
Now, given a table regarding "amazon seller order status prediction orders data", please
use `python_repl_ast` with the column names and cell values above to answer the question:
"What is the average price for leather wallets?"
Begin!
======================================= Completion =======================================
Thought: I need to remove the comma in the item_total values before converting them to
float.
Action: df['item_total'] = df['item_total'].str.replace('$','').str.replace(',','').astype(float)
Observation: success!
Thought: Now, I can filter out the rows with leather wallets and calculate the average
price.
Action: df[df['description'].str.contains('wallet', case=False)]['item_total'].mean()
Observation: 442.7916666666667
Thought: before giving the final answer, I should consider the observations and ensure that
the calculations are accurate.
Final Answer: The average price for leather wallets is $442.79.
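The Thought/Action/Observation format above implies that some driver code reads each model response and decides whether to run a command or stop. The sketch below shows one way such a parser might look; `parse_step` is a hypothetical helper and is not the PyReAct implementation used in the paper.

```python
import re

def parse_step(response):
    """Return ('final', answer) or ('action', command) from one model response."""
    final = re.search(r"^Final Answer:\s*(.+)$", response, flags=re.MULTILINE)
    if final:
        return "final", final.group(1).strip()
    action = re.search(r"^Action:\s*(.+)$", response, flags=re.MULTILINE)
    if action:
        return "action", action.group(1).strip()
    raise ValueError("response contains neither an Action nor a Final Answer")

step = "Thought: filter the wallet rows.\nAction: df['item_total'].mean()"
print(parse_step(step))

done = "Thought: the observations are consistent.\nFinal Answer: 442.79"
print(parse_step(done))
```

Because the prompt requires a single-line command, the parser only captures the rest of the `Action:` line. A real solver loop would execute that command in a sandboxed shell and append the result as the next `Observation:` line.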

Experimental Results

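Datasets: To verify that TableRAG scales to large tables, the authors built two new benchmark datasets, ArcadeQA and BirdQA, derived from the Arcade and BIRD-SQL datasets respectively. They also generated synthetic data from the TabFact dataset, expanding the tables to larger sizes.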

TableRAG is compared against four other methods: ReadTable, ReadSchema, RandRowSampling, and RowColRetrieval. All methods are implemented on the same PyReAct solver.


TableRAG's retrieval design significantly reduces computational cost and token usage while maintaining high performance.

References

TableRAG: Million-Token Table Understanding with Language Models, https://arxiv.org/abs/2410.04739v1