LLM LangChain Series (Part 5) | Analyzing Unstructured Data with a LangChain Agent

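Imagine you run a bakery and have sent out a confectionery intelligence team to collect data on your competitors. They report back on the competition and bring many great ideas that you would like to apply to your business. However, the data is unstructured! How do you analyze it to understand what is most in demand and to plan the best next steps for your business? In Part 1, we use `PydanticOutputParser` to parse the data and add the structure we need. In Part 2, we create a LangChain Agent to analyze the data.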


To explore this use case, a toy dataset [1] was created. Here is one sample from the dataset:

At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the “SeasonalJoy” subscription platform and adding a special touch to our cookies with the “FloralStamp” cookie stamper could keep our offerings fresh and exciting for customers.

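Part 1: Extracting Structured Information from Unstructured Data

Method 1: create_extraction_chain

Define the structure of the data to extract, then use LangChain to create an extraction chain.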


          
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# Schema
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}

Define the test samples:


          
# Inputs
in1 = """Sweet Delights Bakery introduced lavender-infused vanilla cupcakes with a honey buttercream frosting, using the "Frosting-Spreader-3000". This innovation could inspire our next cupcake creation"""
in2 = """Whisked Away Cupcakes introduced a dessert subscription service, ensuring regular customers receive fresh batches of various sweets. Exploring a similar subscription model using the "SweetSubs" program could boost customer loyalty."""
in3 = """At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the "SeasonalJoy" subscription platform and adding a special touch to our cookies with the "FloralStamp" cookie stamper could keep our offerings fresh and exciting for customers."""

inputs = [in1, in2, in3]

Create the chain:


          
# Create the chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

Run the chain:


          
# Use `text` rather than `input` to avoid shadowing the built-in input()
for text in inputs:
    print(chain.run(text))

The output is now structured as Python lists:


          
[{'company': 'Sweet Delights Bakery', 'offering': 'lavender-infused vanilla cupcakes', 'advantage': 'inspiring next cupcake creation', 'products_and_services': 'Frosting-Spreader-3000'}]
[{'company': 'Whisked Away Cupcakes', 'offering': 'dessert subscription service', 'advantage': 'ensuring regular customers receive fresh batches of various sweets', 'products_and_services': '', 'additional_details': ''}, {'company': '', 'offering': 'subscription model using the "SweetSubs" program', 'advantage': 'boost customer loyalty', 'products_and_services': '', 'additional_details': ''}]
[{'company': 'Velvet Frosting Cupcakes', 'offering': 'rotating seasonal menu', 'advantage': 'fresh and exciting offerings', 'products_and_services': 'SeasonalJoy subscription platform, FloralStamp cookie stamper'}]
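
The records above are not uniform: the first and third omit `additional_details`, and the second input produced two records that use empty strings for missing fields. Below is a minimal sketch of one way to normalize such output before loading it into a table. The helper `normalize_records` is hypothetical and not part of LangChain; it pads each record to the schema keys and maps empty strings to `None`.

```python
# Hypothetical post-processing helper; not part of LangChain.
FIELDS = ["company", "offering", "advantage",
          "products_and_services", "additional_details"]

def normalize_records(records, fields=FIELDS):
    """Pad every record to the same keys and map empty strings to None."""
    normalized = []
    for record in records:
        row = {}
        for field in fields:
            value = record.get(field)
            # A missing key and an empty or whitespace-only string are treated alike.
            if isinstance(value, str) and value.strip():
                row[field] = value.strip()
            else:
                row[field] = None
        normalized.append(row)
    return normalized

# First record from the output above (it has no 'additional_details' key).
sample = [{'company': 'Sweet Delights Bakery',
           'offering': 'lavender-infused vanilla cupcakes',
           'advantage': 'inspiring next cupcake creation',
           'products_and_services': 'Frosting-Spreader-3000'}]

rows = normalize_records(sample)
print(rows[0]["additional_details"])  # None
```

With every record carrying the same keys, the rows can be stacked into a table without special-casing the missing columns.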

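Next, import the CSV containing the competitive intelligence, run each row through the extraction chain to parse and structure it, and merge the parsed fields back into the original dataset. The Python code below does exactly that: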


          
import numpy as np
import pandas as pd
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# Load in the data.csv (semicolon separated) file
df = pd.read_csv("data.csv", sep=';')

# Define Schema based on your data
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}

# Create extraction chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# ----------
# Add the data to a data frame
# ----------

# Extract information and create a DataFrame from the list of dictionaries
# (only the first record returned for each row is kept)
extracted_data = df['INTEL'].apply(lambda x: chain.run(x)[0]).apply(pd.Series)

# Replace empty strings with NaN
extracted_data.replace('', np.nan, inplace=True)

# Concatenate the extracted_data DataFrame with the original df
df = pd.concat([df, extracted_data], axis=1)

# Display the data frame
df.head()

This run took about 15 seconds, yet it still did not find all the information we asked for. Let's try a different approach.
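
One way to make "did not find all the information" measurable is to compute, for each field, the share of rows in which a value was extracted. The sketch below uses only the standard library; `field_coverage` is a hypothetical helper, and the rows are the first record of each of the three sample outputs shown earlier, abbreviated to three fields.

```python
def field_coverage(rows, fields):
    """Return, per field, the fraction of rows with a non-empty value."""
    total = len(rows)
    coverage = {}
    for field in fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        coverage[field] = filled / total if total else 0.0
    return coverage

# First record of each sample output, abbreviated to three fields.
rows = [
    {"company": "Sweet Delights Bakery",
     "products_and_services": "Frosting-Spreader-3000"},
    {"company": "Whisked Away Cupcakes",
     "products_and_services": "", "additional_details": ""},
    {"company": "Velvet Frosting Cupcakes",
     "products_and_services": "SeasonalJoy subscription platform, FloralStamp cookie stamper"},
]

coverage = field_coverage(
    rows, ["company", "products_and_services", "additional_details"])
for field, share in coverage.items():
    print(f"{field}: {share:.0%}")
```

On a pandas DataFrame such as `extracted_data`, the expression `extracted_data.notna().mean()` gives the same per-column figure once empty strings have been replaced with NaN.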

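Method 2: Pydantic

In the code below, Pydantic is used to define data models that represent the structure of the competitive intelligence. Pydantic is a data validation and parsing library for Python that lets you define simple or complex data structures using Python data types. Here we use two Pydantic models (`Competitor` and `Company`) to define the structure of the competitive intelligence data.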


          
import pandas as pd
from typing import Sequence
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel

# Load data from CSV
df = pd.read_csv("data.csv", sep=';')

# Pydantic models for competitive intelligence
class Competitor(BaseModel):
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str

class Company(BaseModel):
    """Identifying information about all competitive intelligence in a text."""
    company: Sequence[Competitor]

# Set up a Pydantic parser and prompt template
parser = PydanticOutputParser(pydantic_object=Company)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Function to process each row and extract information
def process_row(row):
    _input = prompt.format_prompt(query=row['INTEL'])
    model = OpenAI(temperature=0)
    output = model(_input.to_string())
    result = parser.parse(output)

    # Convert Pydantic result to a dictionary
    # (model_dump() is the Pydantic v2 API; on Pydantic v1 use result.dict())
    competitor_data = result.model_dump()

    # Flatten the nested structure for DataFrame creation
    flat_data = {'INTEL': [], 'company': [], 'offering': [], 'advantage': [], 'products_and_services': [], 'additional_details': []}

    for entry in competitor_data['company']:
        flat_data['INTEL'].append(row['INTEL'])
        flat_data['company'].append(entry['company'])
        flat_data['offering'].append(entry['offering'])
        flat_data['advantage'].append(entry['advantage'])
        flat_data['products_and_services'].append(entry['products_and_services'])
        flat_data['additional_details'].append(entry['additional_details'])

    # Create a DataFrame from the flattened data
    df_cake = pd.DataFrame(flat_data)

    return df_cake

# Apply the function to each row and concatenate the results
intel_df = pd.concat(df.apply(process_row, axis=1).tolist(), ignore_index=True)

# Display the resulting DataFrame
intel_df.head()

That was fast! Unlike **create\_extraction\_chain**, this approach found details for every entry.
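
Conceptually, the parser step asks the model for JSON in a fixed shape and then validates the reply against the model classes. The sketch below illustrates that idea with only the standard library. It is not LangChain's or Pydantic's implementation: `parse_company`, the dataclass stand-in for `Competitor`, and the sample reply are all hypothetical.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Competitor:
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str

def parse_company(raw_reply):
    """Parse a JSON reply shaped like {"company": [{...}, ...]} into Competitor objects."""
    payload = json.loads(raw_reply)
    required = [f.name for f in fields(Competitor)]
    competitors = []
    for entry in payload["company"]:
        missing = [name for name in required if name not in entry]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        for name in required:
            if not isinstance(entry[name], str):
                raise ValueError(f"field {name!r} must be a string")
        competitors.append(Competitor(**{name: entry[name] for name in required}))
    return competitors

# Hypothetical model reply, made up for illustration.
reply = json.dumps({"company": [{
    "company": "Sweet Delights Bakery",
    "offering": "lavender-infused vanilla cupcakes",
    "advantage": "could inspire our next cupcake creation",
    "products_and_services": "Frosting-Spreader-3000",
    "additional_details": "honey buttercream frosting",
}]})

parsed = parse_company(reply)
print(parsed[0].company)  # Sweet Delights Bakery
```

Because every field is required, a reply that omits one fails loudly instead of silently yielding an incomplete row, which may be part of why the Pydantic route returned complete entries here.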

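Part 1 summary:

PydanticOutputParser turned out to be faster and more reliable. Each run takes about 1 second and about 400 tokens, whereas a create\_extraction\_chain run takes about 2.5 seconds and about 250 tokens.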
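
Using the approximate per-run figures above, a rough total for a whole dataset can be estimated by linear scaling. The row count of 100 below is an arbitrary assumption (not the size of the actual dataset), and the sketch assumes rows are processed one after another.

```python
# Approximate per-run figures quoted above.
PER_RUN = {
    "PydanticOutputParser": {"seconds": 1.0, "tokens": 400},
    "create_extraction_chain": {"seconds": 2.5, "tokens": 250},
}

def estimate(method, n_rows):
    """Scale the per-run figures linearly to n_rows sequential runs."""
    per_run = PER_RUN[method]
    return {"seconds": per_run["seconds"] * n_rows,
            "tokens": per_run["tokens"] * n_rows}

N_ROWS = 100  # assumed for illustration only
for method in PER_RUN:
    print(method, estimate(method, N_ROWS))
```

Under these assumptions the Pydantic route finishes in 100 seconds against 250 seconds, but consumes 40,000 tokens against 25,000, so the speed gain comes with a higher token cost.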


We have managed to extract structured data from unstructured text! **Part 2 focuses on analyzing this structured data with a LangChain Agent.**

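Part 2: Analyzing the Structured Data with a LangChain Agent

What is a LangChain Agent?

In LangChain, an Agent is a system that uses a language model to choose a sequence of actions to take. In a Chain, the sequence of actions is hard-coded. An Agent instead uses the language model as a "reasoning engine" that decides which actions to take and in what order.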
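
The difference can be illustrated with a toy loop in which a stand-in function plays the role of the language model. This is not LangChain's implementation: `fake_llm`, the tool names, and the sample rows are all made up for illustration.

```python
# Toy data standing in for the structured table (made up).
ROWS = [
    {"company": "A", "advantage": "efficiency"},
    {"company": "B", "advantage": "unique flavors"},
    {"company": "C", "advantage": "efficiency"},
]

# Tools the toy agent may call; each takes one string argument.
def count_rows(_arg):
    return str(len(ROWS))

def count_advantage(arg):
    return str(sum(1 for row in ROWS if row["advantage"] == arg))

TOOLS = {"count_rows": count_rows, "count_advantage": count_advantage}

def fake_llm(question, observations):
    """Stand-in reasoning engine: returns (tool, arg) or ("finish", answer)."""
    if not observations:
        if "efficiency" in question:
            return ("count_advantage", "efficiency")
        return ("count_rows", "")
    return ("finish", observations[-1])

def run_agent(question, max_steps=5):
    observations = []
    for _ in range(max_steps):
        action, arg = fake_llm(question, observations)
        if action == "finish":
            return arg
        # The "model", not the program, chose which tool runs at this step.
        observations.append(TOOLS[action](arg))
    raise RuntimeError("agent did not finish")

print(run_agent("How many competitors focus on efficiency?"))  # 2
print(run_agent("How many rows are there?"))  # 3
```

A Chain would correspond to calling `count_rows` and `count_advantage` in a fixed order regardless of the question, whereas here the order of calls depends on what the reasoning function returns.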


Now let's use LangChain's CSV Agent to analyze our structured data:

Step 1: Create the Agent

**First, load the required libraries:**


          
from langchain.agents.agent_types import AgentType
from langchain_community.llms import OpenAI
from langchain_experimental.agents.agent_toolkits import create_csv_agent

**Create the Agent**


          
agent = create_csv_agent(
    OpenAI(temperature=0),
    "data/intel.csv",
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

We can now test our Agent with a few questions.

Step 2: Ask the Agent Questions

When you ask a LangChain Agent a question, you can watch it reason about its actions.

**Ask a general question**

agent.run("What insights can I get from this data?")

‘This dataframe contains information about different companies and their products/services, as well as additional details and potential opportunities for improvement.’

**Ask about competitor advantages**

agent.run("What are 3 specific areas of focus that you can obtain through analyzing the advantages offered by the competition?")

‘Three specific areas of focus that can be obtained through analyzing the advantages offered by the competition are: streamlining production processes, incorporating unique and distinctive flavors, and using sustainable and high-quality ingredients.’

**Ask about the main competitor themes**

agent.run("What are some key themes that the competitors represented in the data are focusing on providing? Be specific with examples, and talk about the advantages of these")

‘The key themes that the competitors are focusing on providing are efficiency, unique flavors, and high-quality ingredients. For example, Coco candy co is using the 77Tyrbo Choco machine to coat their candy gummies, which streamlines the process and saves time. Cinnamon Bliss Bakery adds a secret touch of cinnamon in their chocolate brownies with the CinnaMagic ingredient, which adds a distinctive flavor. Choco Haven factory uses organic and locally sourced ingredients, including the EcoCocoa brand, to elevate the quality of their chocolates.’

References:

[1] https://github.com/ingridstevens/AI-projects/blob/main/unstructured\_data/data.csv

[2] https://medium.com/@ingridwickstevens/extract-structured-data-from-unstructured-text-using-llms-71502addf52b

[3] https://medium.com/@ingridwickstevens/analyze-structured-data-extracted-from-unstructured-text-using-llm-agents-4ea4eaf3ae78

[4] https://github.com/ingridstevens/AI-projects/blob/main/unstructured\_data/unstructured\_extraction\_chain.ipynb

[5] https://github.com/ingridstevens/AI-projects/blob/main/unstructured\_data/unstructured\_pydantic.ipynb

[6] https://github.com/ingridstevens/AI-projects/blob/main/unstructured\_data/data.csv