[Information Extraction & LLM] Techniques for Controlling Structured LLM Output, with a Hands-on Resume Information Extraction Example


Preface

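When a large language model (LLM) is used for information extraction, it is important to make its output controllable and stable (for example, consistently well-formed JSON), because downstream development depends on the extracted data. Common approaches include: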

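  • Fine-tuning: fine-tune the LLM so that it outputs results in a stable format (JSON, etc.)
  • Few-shot prompting: include a few examples in the prompt and ask the LLM to produce output in the same format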

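Even so, output can still be unstable in practice. The usual remedy is to validate the output: when JSON is requested, check that the output parses as valid JSON, and when validation fails, re-send the request to the LLM, often several times, until a well-structured result comes back.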
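The validate-and-retry approach described above can be sketched as follows. This is a minimal illustration, not a production implementation: `call_llm` is a hypothetical callable standing in for whatever client sends a prompt to the model, and `fake_llm` is a stand-in that fails once so the retry path is exercised.

```python
import json
from typing import Callable, Optional


def extract_json_with_retry(
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns raw model text
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Request the model repeatedly until its output parses as a JSON object."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        if isinstance(parsed, dict):
            return parsed
    return None  # every attempt failed validation


# Stand-in for a real model: the first reply is truncated, the second is valid.
_replies = iter(['{"fullName": "张三"', '{"fullName": "张三"}'])


def fake_llm(prompt: str) -> str:
    return next(_replies)


result = extract_json_with_retry(fake_llm, "Extract the name as JSON.")
print(result)
```

Each failed attempt costs a full model call, which is the overhead that constrained generation tries to avoid.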

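This post introduces outlines, a tool for controlling output, and shows how to use it through a small demo from the recruitment domain: extracting information from a resume and structuring it as JSON.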

I. Practice

Straight to the demo code and usage.

Installation:


        
          
pip install outlines  

      

  1. Resume text (input)

        
          
张三  
男 | 年龄:25岁 | 籍贯:北京 | 共产党员 | 18794434244  
求职意向:算法工程师 | 期望城市:北京  
个人优势  
擅长领域:深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发  
专业技能:熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office  
教育经历  
北京工业大学 硕士 软件工程 2012-2015  
担任职务:党支部书记;主修课程:深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据  
河南工业大学 本科 计算机科学与技术 2008-2012  
担任职务:班长、党支部书记;主修课程:Java、数据库、操作系统、计算机网络、数据结构  
实习经历  
大模型自然语言处理科技(北京)有限公司 算法工程师 2023.12-2024.03  
● 负责数据采集、清洗并标注2D、3D驾驶数据,确保数据质量和多样性  
● 负责自动驾驶感知模块的算法开发与优化,利用行车影像数据进行模型迭代优化  
项目经历  
图像识别 总负责人 2022.09-至今  
● 设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物  
总览可视化系统,实现对城市建筑物变化动态监测  
信息管理系统(SSM框架) 项目设计师 2024.02-2024.03  
负责高校党务信息管理系统的整体架构设计,确保系统功能模块化、高效稳定  
技术栈:Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git  
● 使用 MySQL 数据库设计,MyBatis 框架实现数据持久层的开发,提高 JDBC 开发效率  
● 使用 HTML、CSS 和 JavaScript 技术, 结合 Element 组件库,快速构建响应式前端网页界面  

      

  2. Schema definition

This step defines the entity types to extract from the resume, expressed as a schema structure that outlines can accept.


        
          
{  
    "title": "Resume",  
    "type": "object",  
    "properties": {  
        "fullName": {  
            "title": "全名",  
            "maxLength": 50,  
            "type": "string"  
        },  
        "contact": {  
            "title": "联系方式",  
            "type": "object",  
            "properties": {  
                "phone": {  
                    "title": "电话号码",  
                    "type": "string"  
                }  
            },  
            "required": ["phone"]  
        },  
        "education": {  
            "title": "教育背景",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "degree": {  
                        "title": "学位",  
                        "type": "string"  
                    },  
                    "institution": {  
                        "title": "学校",  
                        "type": "string"  
                    },  
                    "fieldOfStudy": {  
                        "title": "专业",  
                        "type": "string"  
                    },  
                    "graduationYear": {  
                        "title": "毕业年份",  
                        "type": "integer"  
                    }  
                },  
                "required": ["degree", "institution", "fieldOfStudy", "graduationYear"]  
            }  
        },  
        "experience": {  
            "title": "工作经验",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "jobTitle": {  
                        "title": "职位",  
                        "type": "string"  
                    },  
                    "company": {  
                        "title": "公司",  
                        "type": "string"  
                    },  
                    "duration": {  
                        "title": "任职时间",  
                        "type": "string"  
                    },  
                    "responsibilities": {  
                        "title": "职责",  
                        "type": "array",  
                        "items": {  
                            "type": "string"  
                        }  
                    }  
                },  
                "required": ["jobTitle", "company", "duration", "responsibilities"]  
            }  
        },  
        "skills": {  
            "title": "技能",  
            "type": "array",  
            "items": {  
                "type": "string"  
            }  
        }  
    },  
    "required": ["fullName", "contact", "education", "experience", "skills"]  
}  
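Before handing a schema like this to outlines, it can help to confirm that it parses and to list the fields it asks for. The sketch below uses only the standard library. `list_fields` is a hypothetical helper, not part of outlines, and it handles only the `object` and `array` nesting used in this schema. The schema text is a shortened version of the one above, for illustration.

```python
import json

# Shortened version of the resume schema, for illustration only.
schema_text = '''{
    "title": "Resume",
    "type": "object",
    "properties": {
        "fullName": {"type": "string", "maxLength": 50},
        "contact": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"]
        },
        "education": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "degree": {"type": "string"},
                    "graduationYear": {"type": "integer"}
                },
                "required": ["degree", "graduationYear"]
            }
        },
        "skills": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["fullName", "contact", "education", "skills"]
}'''


def list_fields(node: dict, prefix: str = "") -> list:
    """Return (path, type) pairs for every leaf field in the schema."""
    node_type = node.get("type")
    if node_type == "object":
        fields = []
        for name, child in node.get("properties", {}).items():
            path = f"{prefix}.{name}" if prefix else name
            fields.extend(list_fields(child, path))
        return fields
    if node_type == "array":
        # Mark array elements with "[]" and describe the item schema.
        return list_fields(node.get("items", {}), prefix + "[]")
    return [(prefix, node_type)]


schema = json.loads(schema_text)  # raises json.JSONDecodeError if malformed
fields = list_fields(schema)
for path, field_type in fields:
    print(f"{path}: {field_type}")
```

A malformed schema fails at `json.loads`, before any model is loaded.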

      

  3. Complete code for LLM information extraction

        
          
import outlines  
  
schema = '''{  
    "title": "Resume",  
    "type": "object",  
    "properties": {  
        "fullName": {  
            "title": "全名",  
            "maxLength": 50,  
            "type": "string"  
        },  
        "contact": {  
            "title": "联系方式",  
            "type": "object",  
            "properties": {  
                "phone": {  
                    "title": "电话号码",  
                    "type": "string"  
                }  
            },  
            "required": ["phone"]  
        },  
        "education": {  
            "title": "教育背景",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "degree": {  
                        "title": "学位",  
                        "type": "string"  
                    },  
                    "institution": {  
                        "title": "学校",  
                        "type": "string"  
                    },  
                    "fieldOfStudy": {  
                        "title": "专业",  
                        "type": "string"  
                    },  
                    "graduationYear": {  
                        "title": "毕业年份",  
                        "type": "integer"  
                    }  
                },  
                "required": ["degree", "institution", "fieldOfStudy", "graduationYear"]  
            }  
        },  
        "experience": {  
            "title": "工作经验",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "jobTitle": {  
                        "title": "职位",  
                        "type": "string"  
                    },  
                    "company": {  
                        "title": "公司",  
                        "type": "string"  
                    },  
                    "duration": {  
                        "title": "任职时间",  
                        "type": "string"  
                    },  
                    "responsibilities": {  
                        "title": "职责",  
                        "type": "array",  
                        "items": {  
                            "type": "string"  
                        }  
                    }  
                },  
                "required": ["jobTitle", "company", "duration", "responsibilities"]  
            }  
        },  
        "skills": {  
            "title": "技能",  
            "type": "array",  
            "items": {  
                "type": "string"  
            }  
        }  
    },  
    "required": ["fullName", "contact", "education", "experience", "skills"]  
}'''  
  
model = outlines.models.transformers(  
    "path/to/your/model")  # local path or Hugging Face model ID  
generator = outlines.generate.json(model, schema)  
  
resume_text = '''张三  
男 | 年龄:25岁 | 籍贯:北京 | 共产党员 | 18794434244  
求职意向:算法工程师 | 期望城市:北京  
个人优势  
擅长领域:深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发  
专业技能:熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office  
教育经历  
北京工业大学 硕士 软件工程 2012-2015  
担任职务:党支部书记;主修课程:深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据  
河南工业大学 本科 计算机科学与技术 2008-2012  
担任职务:班长、党支部书记;主修课程:Java、数据库、操作系统、计算机网络、数据结构  
实习经历  
大模型自然语言处理科技(北京)有限公司 算法工程师 2023.12-2024.03  
● 负责数据采集、清洗并标注2D、3D驾驶数据,确保数据质量和多样性  
● 负责自动驾驶感知模块的算法开发与优化,利用行车影像数据进行模型迭代优化  
项目经历  
图像识别 总负责人 2022.09-至今  
● 设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物  
总览可视化系统,实现对城市建筑物变化动态监测  
信息管理系统(SSM框架) 项目设计师 2024.02-2024.03  
负责高校党务信息管理系统的整体架构设计,确保系统功能模块化、高效稳定  
技术栈:Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git  
● 使用 MySQL 数据库设计,MyBatis 框架实现数据持久层的开发,提高 JDBC 开发效率  
● 使用 HTML、CSS 和 JavaScript 技术, 结合 Element 组件库,快速构建响应式前端网页界面  
'''  
# Generate output constrained to the schema, using the resume text as the prompt  
resume = generator(resume_text)  
  
print(repr(resume))  
  

      

  4. Output: JSON-structured result

        
          
{  
  "fullName": "张三",  
  "contact": {  
    "phone": "18794434244"  
  },  
  "education": [  
    {  
      "degree": "硕士",  
      "institution": "北京工业大学",  
      "fieldOfStudy": "软件工程",  
      "graduationYear": 2015  
    },  
    {  
      "degree": "学士",  
      "institution": "河南工业大学",  
      "fieldOfStudy": "计算机科学与技术",  
      "graduationYear": 2012  
    }  
  ],  
  "experience": [  
    {  
      "jobTitle": "算法工程师",  
      "company": "大模型自然语言处理科技(北京)有限公司",  
      "duration": "2023.12-2024.03",  
      "responsibilities": [  
        "设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物",  
        "负责数据采集、清洗并标注2D、3D驾驶数据,确保数据质量和多样性",  
        "负责自动驾驶感知模块的算法开发与优化,利用行车影像数据进行模型迭代优化"  
      ]  
    },  
    {  
      "jobTitle": "图像识别",  
      "company": "图像识别",  
      "duration": "2022.09-至今",  
      "responsibilities": [  
        "设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物",  
        "实现对城市建筑物变化动态监测的建筑物",  
        "传感器数据采集停车场停车车位判断算法优化"  
      ]  
    },  
    {  
      "jobTitle": "信息管理系统(SSM框架)",  
      "company": "信息管理系统(SSM框架)",  
      "duration": "-non",  
      "responsibilities": [  
        "负责高校党务信息管理系统的整体架构设计,确保系统功能模块化、高效稳定",  
        "使用 MySQL 数据库设计,MyBatis 框架实现数据持久层的开发,提高 JDBC 开发效率",  
        "使用 HTML、CSS 和 JavaScript 技术, 结合 Element 组件库,快速构建响应式前端网页界面"  
      ]  
    }  
  ],  
  "skills": [  
    "Python",  
    "PyTorch",  
    "Java",  
    "MySQL",  
    "SpringMVC",  
    "SpringBoot",  
    "Office"  
  ]  
}  
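Constraining the output to a schema controls its shape, not whether each value comes from the resume. One inexpensive check, sketched here with the standard library only, is to test whether each extracted string occurs verbatim in the input text. `find_ungrounded` and `iter_strings` are hypothetical helpers, and exact substring matching is an assumption: it will also flag legitimate paraphrases or normalized values.

```python
def iter_strings(value, path=""):
    """Yield (path, string) for every string nested in dicts and lists."""
    if isinstance(value, str):
        yield path, value
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from iter_strings(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_strings(child, f"{path}[{index}]")


def find_ungrounded(extracted, source_text):
    """Return (path, string) pairs whose string does not appear in source_text."""
    return [(p, s) for p, s in iter_strings(extracted) if s not in source_text]


# A small excerpt of the demo input and output above.
source_text = "信息管理系统(SSM框架) 项目设计师 2024.02-2024.03"
extracted = {
    "experience": [
        {"jobTitle": "信息管理系统(SSM框架)", "duration": "-non"},
    ]
}

suspicious = find_ungrounded(extracted, source_text)
print(suspicious)
```

Flagged values can then be reviewed by hand or sent back to the model for another attempt.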

      

II. Other format-control examples

Multiple choices


        
          
import outlines  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
  
prompt = """You are a sentiment-labelling assistant.  
Is the following review positive or negative?  
  
Review: This restaurant is just awesome!  
"""  
  
generator = outlines.generate.choice(model, ["Positive", "Negative"])  
answer = generator(prompt)  

      

Type constraint


        
          
import outlines  
  
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")  
  
prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "  
answer = outlines.generate.format(model, int)(prompt)  
print(answer)  
# 3  
  
prompt = "sqrt(2)="  
generator = outlines.generate.format(model, float)  
answer = generator(prompt, max_tokens=10)  
print(answer)  
# 1.41421356  

      

Efficient regex-structured generation


        
          
import outlines  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
  
prompt = "What is the IP address of the Google DNS servers? "  
  
generator = outlines.generate.text(model)  
unstructured = generator(prompt, max_tokens=30)  
  
generator = outlines.generate.regex(  
    model,  
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",  
)  
structured = generator(prompt, max_tokens=30)  
  
print(unstructured)  
# What is the IP address of the Google DNS servers?  
#  
# Passive DNS servers are at DNS servers that are private.  
# In other words, both IP servers are private. The database  
# does not contain Chelsea Manning  
  
print(structured)  
# What is the IP address of the Google DNS servers?  
# 2.2.6.1  
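The pattern constrains only the shape of the answer. As a quick offline check, the same regular expression can be tried with Python's `re` module before it is passed to the generator. The candidate strings below are illustrative.

```python
import re

# Same IPv4-shaped pattern as in the outlines example above.
IPV4_PATTERN = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

candidates = ["8.8.8.8", "2.2.6.1", "256.1.1.1", "1.2.3"]

# fullmatch requires the whole string to match, as constrained generation does.
matches = {text: re.fullmatch(IPV4_PATTERN, text) is not None for text in candidates}
print(matches)
```

Both `8.8.8.8` and the generated `2.2.6.1` match, so a well-formed answer still has to be checked for correctness separately.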

      

Efficient JSON generation following a Pydantic model


        
          
from enum import Enum  
from pydantic import BaseModel, constr  
  
import outlines  
import torch  
  
  
class Weapon(str, Enum):  
    sword = "sword"  
    axe = "axe"  
    mace = "mace"  
    spear = "spear"  
    bow = "bow"  
    crossbow = "crossbow"  
  
  
class Armor(str, Enum):  
    leather = "leather"  
    chainmail = "chainmail"  
    plate = "plate"  
  
  
class Character(BaseModel):  
    name: constr(max_length=10)  
    age: int  
    armor: Armor  
    weapon: Weapon  
    strength: int  
  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
  
# Construct structured sequence generator  
generator = outlines.generate.json(model, Character)  
  
# Draw a sample  
rng = torch.Generator(device="cuda")  
rng.manual_seed(789001)  
  
character = generator("Give me a character description", rng=rng)  
  
print(repr(character))  
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)  
  
character = generator("Give me an interesting character description", rng=rng)  
  
print(repr(character))  
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)  

      

Efficient JSON generation following a JSON Schema


        
          
import outlines  
  
schema = '''{  
    "title": "Character",  
    "type": "object",  
    "properties": {  
        "name": {  
            "title": "Name",  
            "maxLength": 10,  
            "type": "string"  
        },  
        "age": {  
            "title": "Age",  
            "type": "integer"  
        },  
        "armor": {"$ref": "#/definitions/Armor"},  
        "weapon": {"$ref": "#/definitions/Weapon"},  
        "strength": {  
            "title": "Strength",  
            "type": "integer"  
        }  
    },  
    "required": ["name", "age", "armor", "weapon", "strength"],  
    "definitions": {  
        "Armor": {  
            "title": "Armor",  
            "description": "An enumeration.",  
            "enum": ["leather", "chainmail", "plate"],  
            "type": "string"  
        },  
        "Weapon": {  
            "title": "Weapon",  
            "description": "An enumeration.",  
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],  
            "type": "string"  
        }  
    }  
}'''  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
generator = outlines.generate.json(model, schema)  
character = generator("Give me a character description")  

      

Using context-free grammars to guide generation


        
          
import outlines  
  
arithmetic_grammar = """  
    ?start: expression  
  
    ?expression: term (("+" | "-") term)*  
  
    ?term: factor (("*" | "/") factor)*  
  
    ?factor: NUMBER  
           | "-" factor  
           | "(" expression ")"  
  
    %import common.NUMBER  
"""  
  
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")  
generator = outlines.generate.cfg(model, arithmetic_grammar)  
sequence = generator("Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:")  
  
print(sequence)  
# (8-2)  

      

Open functions


        
          
import outlines  
  
  
def add(a: int, b: int):  
    return a + b  
  
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")  
generator = outlines.generate.json(model, add)  
result = generator("Return json with two integers named a and b respectively. a is odd and b even.")  
  
print(add(**result))  
# 3  
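When a function is passed to `outlines.generate.json`, the keys of the generated result correspond to the function's parameters, which is why `add(**result)` works. The sketch below is not how outlines implements this. It only uses the standard-library `inspect` module to show the parameter names and annotations that such a signature carries, with a hand-written dict standing in for the model output.

```python
import inspect


def add(a: int, b: int):
    return a + b


# Parameter names and type annotations carried by the signature.
params = {
    name: param.annotation.__name__
    for name, param in inspect.signature(add).parameters.items()
}
print(params)

# Hand-written stand-in for a generated result (not real model output).
result = {"a": 1, "b": 2}
assert set(result) == set(params)  # keys line up with the parameters
total = add(**result)
print(total)
```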

      

Prompting


        
          
import outlines  
  
examples = [  
    ("The food was disgusting", "Negative"),  
    ("We had a fantastic night", "Positive"),  
    ("Recommended", "Positive"),  
    ("The waiter was rude", "Negative")  
]  
  
@outlines.prompt  
def labelling(to_label, examples):  
    """You are a sentiment-labelling assistant.  
  
    {% for example in examples %}  
    {{ example[0] }} // {{ example[1] }}  
    {% endfor %}  
    {{ to_label }} //  
    """  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
prompt = labelling("Just awesome", examples)  
answer = outlines.generate.text(model)(prompt, max_tokens=100)  

      

Summary

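This post introduced outlines, a tool for controlling the structure of LLM output, and verified its effectiveness through a hands-on resume information extraction demo. It also briefly recorded code for several other kinds of format control.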

References

https://github.com/outlines-dev/outlines

Related previous posts

[A first taste of prompt "incantation" design: one-shot fine-tuning of chatglm-6b for information extraction](http://mp.weixin.qq.com/s?__biz=Mzg4NjI0NDg0Ng==&mid=2247484023&idx=1&sn=7dbbf3c41e78ec000f20a964f494a98c&chksm=cf9dd6f6f8ea5fe07fcc6844d66003ebe74f802b8aa287352387b080f26c7c75d35bb805ea89&scene=21#wechat_redirect)  



[A trick for controlling LLM generation: a custom LogitsProcessor in practice](http://mp.weixin.qq.com/s?__biz=Mzg4NjI0NDg0Ng==&mid=2247484215&idx=1&sn=20b51b6c19cca9a67a0f1475e6183bca&chksm=cf9dd7b6f8ea5ea0c9d47bacff09fd6e7498b606dac557de9a19b1fde29fb2c30aa6962dcf6d&scene=21#wechat_redirect)  