EDC: A Three-Stage LLM-Based Framework for Automatic Knowledge Graph Construction (Extract, Define, Canonicalize)

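Advances in large language models (LLMs) have prompted a series of recent works that apply them to knowledge graph construction (KGC), for example via zero-shot or few-shot prompting. Despite success on small, domain-specific datasets, these models struggle to scale to the text commonly found in many real-world applications. A major problem is that, in prior methods, the KG schema must be included in the LLM prompt in order to generate valid triplets; larger and more complex schemas easily exceed the LLM's context window length.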

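To address this, the paper proposes a three-stage framework, Extract-Define-Canonicalize (EDC): open information extraction, followed by schema definition and post-hoc canonicalization. EDC is flexible in that it applies both when a predefined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, a trained component retrieves schema elements relevant to the input text, which improves the LLM's extraction performance in a manner similar to retrieval-augmented generation.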

Open Information Extraction

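In this stage, LLMs perform open information extraction: through few-shot prompting, they identify and extract entity-relation triplets ([Subject, Relation, Object]) from the input text without relying on any specific predefined schema. The goal is an open knowledge graph made up of triplets freely extracted from the text.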

Prompt template:


          
Given a piece of text, extract relational triplets in the form of [Subject, Relation, Object] from it.
Here are some examples:
Example 1:
Text: The 17068.8 millimeter long ALCO RS-3 has a diesel-electric transmission.
Triplets: [['ALCO RS-3', 'powerType', 'Diesel-electric transmission'], ['ALCO RS-3', 'length', '17068.8 (millimetres)']] ...
Now please extract triplets from the following text: Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. He was a member of the Apollo 14 crew.
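
As a rough illustration of this stage, the sketch below assembles a few-shot prompt in the shape of the template above and parses a bracketed triplet list out of a model reply. The function names `build_extraction_prompt` and `parse_triplets` are hypothetical, the LLM call itself is left out, and the parsing rules are an assumption of this sketch rather than the framework's actual code.

```python
import ast

INSTRUCTION = (
    "Given a piece of text, extract relational triplets in the form of "
    "[Subject, Relation, Object] from it."
)


def build_extraction_prompt(examples, text):
    """Assemble the few-shot prompt from (text, triplets) example pairs."""
    parts = [INSTRUCTION, "Here are some examples:"]
    for i, (example_text, example_triplets) in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Text: {example_text}")
        # A Python list of lists prints in the same bracketed form as the template.
        parts.append(f"Triplets: {example_triplets}")
    parts.append(f"Now please extract triplets from the following text: {text}")
    return "\n".join(parts)


def parse_triplets(reply):
    """Parse a reply shaped like [['s', 'r', 'o'], ...]; skip malformed items."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = ast.literal_eval(reply[start:end + 1])
    except (ValueError, SyntaxError):
        return []  # the reply was not a valid Python-style list
    if not isinstance(parsed, list):
        return []
    return [
        tuple(item)
        for item in parsed
        if isinstance(item, (list, tuple))
        and len(item) == 3
        and all(isinstance(part, str) for part in item)
    ]
```

The prompt string would be sent to whichever LLM is in use, and `parse_triplets` applied to its reply; items that are not three-string lists are silently dropped, which is a design choice of this sketch.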

Schema Definition

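In the second stage, the framework prompts the LLM to provide a natural-language definition for each relation extracted into the open knowledge graph. These definitions support canonicalization of the triplets, so that semantically equivalent entity and relation types share one representation. For example, for the relation "participatedIn", the definition might be "The subject entity took part in the event or mission specified by the object entity."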

Prompt template:


          
Given a piece of text and a list of relational triplets extracted from it, write a definition for each relation present.
Example 1:
Text: The 17068.8 millimeter long ALCO RS-3 has a diesel-electric transmission.
Triplets: [['ALCO RS-3', 'powerType', 'Diesel-electric transmission'], ['ALCO RS-3', 'length', '17068.8 (millimetres)']]
Definitions:
powerType: The subject entity uses the type of power or energy source specified by the object entity.
...
Now write a definition for each relation present in the triplets extracted from the following text:
Text: Alan Shepard was an American who was born on Nov 18, 1923 in New Hampshire, was selected by NASA in 1959, was a member of the Apollo 14 crew and died in California
Triplets: [['Alan Shepard', 'bornOn', 'Nov 18, 1923'], ['Alan Shepard', 'participatedIn', 'Apollo 14']]
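
One way to consume the reply to this prompt is sketched below: it reads "relation: definition" lines, as in the example above, into a dictionary keyed by relation name. The helper name `parse_definitions` is hypothetical, and the assumption that the model answers in exactly this line format is mine, not something the source guarantees.

```python
def parse_definitions(reply, triplets):
    """Map each extracted relation to its definition from 'relation: definition' lines."""
    relations = {relation for _, relation, _ in triplets}
    definitions = {}
    current = None  # relation whose definition is currently being read
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        name, separator, rest = line.partition(":")
        if separator and name.strip() in relations:
            current = name.strip()
            definitions[current] = rest.strip()
        elif current is not None:
            # A line without a known relation prefix continues the previous definition.
            definitions[current] += " " + line
    return definitions
```

Relations that the reply does not define are simply absent from the result, so a caller can compare the dictionary keys with the extracted relations and re-prompt for the missing ones if needed.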

Schema Canonicalization

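The third stage refines the open knowledge graph into a canonical form, eliminating redundancy and ambiguity. The strategy depends on whether a predefined target schema exists: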

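Target Alignment: if a predefined target schema exists, each element is canonicalized by identifying the most relevant components in the target schema and considering them as replacement candidates. To prevent over-generalization, the LLM assesses the feasibility of each potential transformation.
Self Canonicalization: if there is no predefined target schema, the framework constructs the schema automatically and canonicalizes the triplets by merging semantically similar schema components, yielding a canonical knowledge graph.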

Prompt template:


          
Given a piece of text, a relational triplet extracted from it, and the definition of the relation in it, choose the most appropriate relation to replace it in this context if there is any.
Text: Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. He was a member of the Apollo 14 crew.
Triplets: ['Alan Shepard', 'participatedIn', 'Apollo 14']
Definition of 'participatedIn': The subject entity took part in the event or mission specified by the object entity.
Choices:
A. 'mission': The subject entity participated in the event or operation specified by the object entity.
B. 'season': The subject entity participated in the season specified by the object entity.
C. 'league': The subject entity participates or competes in the league specified by the object entity.
D. 'activeYearsStartYear': The subject entity started their active career in the year specified by the object entity.
E. 'foundingYear': The subject entity was founded in the year specified by the object entity.
F. None of the above
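
The two strategies can be sketched in one function, shown below under several assumptions. Candidate relations are ranked by a toy word-count cosine similarity between definitions, standing in for whatever embedding model is actually used; `ask_llm` is a hypothetical callable that takes the prompt and returns the chosen letter; and the handling of "None of the above" (drop the triplet under target alignment, add the relation to the schema under self canonicalization) is this sketch's reading of the approach, not code from the framework.

```python
import math
import re
from collections import Counter

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def similarity(a, b):
    """Cosine similarity of word counts: a toy stand-in for embedding similarity."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(count * vb[word] for word, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def top_candidates(definition, schema, k=5):
    """Return up to k schema relations whose definitions are closest to `definition`."""
    ranked = sorted(schema, key=lambda name: similarity(definition, schema[name]), reverse=True)
    return ranked[:k]


def build_choice_prompt(text, triplet, definition, candidates, schema):
    """Render the multiple-choice prompt, ending with a 'None of the above' option."""
    lines = [
        "Given a piece of text, a relational triplet extracted from it, and the "
        "definition of the relation in it, choose the most appropriate relation "
        "to replace it in this context if there is any.",
        f"Text: {text}",
        f"Triplets: {list(triplet)}",
        f"Definition of '{triplet[1]}': {definition}",
        "Choices:",
    ]
    lines += [f"{LETTERS[i]}. '{name}': {schema[name]}" for i, name in enumerate(candidates)]
    lines.append(f"{LETTERS[len(candidates)]}. None of the above")
    return "\n".join(lines)


def canonicalize(text, triplet, definition, schema, ask_llm, target_alignment, k=5):
    """Return the canonical triplet, or None when target alignment finds no match."""
    subject, relation, obj = triplet
    candidates = top_candidates(definition, schema, k)
    choice = -1
    if candidates:  # an empty schema leaves nothing to choose from
        reply = ask_llm(build_choice_prompt(text, triplet, definition, candidates, schema))
        letter = reply.strip()[:1].upper()
        choice = LETTERS.find(letter) if letter else -1
    if 0 <= choice < len(candidates):
        return (subject, candidates[choice], obj)  # the LLM accepted a replacement
    if target_alignment:
        return None  # assumed: no equivalent in the fixed target schema, so drop it
    schema[relation] = definition  # assumed: self canonicalization grows the schema
    return triplet
```

With the choices listed in the template above as the schema and a reply of "A", the triplet would become ('Alan Shepard', 'mission', 'Apollo 14'). Under self canonicalization the same function can start from an empty dictionary, which fills up as relations without an equivalent are encountered.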

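In addition, to further improve performance, EDC includes an optional iterative Refinement stage that uses the data generated by EDC to improve the quality of the extracted triplets. This includes using a trained schema retriever to fetch schema elements relevant to the input text, similar to retrieval-augmented generation, which improves the LLM's extraction performance.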

On three KGC benchmarks, EDC is shown to extract high-quality triplets without any parameter tuning, and to handle significantly larger schemas than prior work.

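Performance of EDC and its refined version EDC+R on the WebNLG, REBEL, and Wiki-NRE datasets, compared with their respective baselines in the target alignment setting (F1 scores under the "Partial" criterion). EDC+R performs only one refinement iteration, because improvements were found to diminish significantly after that.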



          
Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction
https://arxiv.org/pdf/2404.03868.pdf
https://github.com/clear-nus/edc