[Information Extraction & LLM] Techniques for Controlling Structured LLM Output, with a Resume Information Extraction Example

Preface

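When a large language model (LLM) is used for information extraction, it matters a great deal that its output is controllable and stable (for example, consistently valid JSON), because downstream code depends on the extracted data. Two common approaches are:

Fine-tuning: fine-tune the model so that it emits results in a stable format (JSON, etc.).
Few-shot prompting: include a few examples in the prompt and ask the model to produce output in the same format.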

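Even so, the output can still be unstable in practice. The usual remedy is to validate the output: when JSON is requested, for example, check that the response parses as valid JSON. If validation fails, the request is typically repeated, several times if necessary, until the model returns output in the required structure.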
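The validate-and-retry loop described above can be sketched as follows. This is a minimal illustration, not a production implementation: `generate_json_with_retry` and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM call so that the example runs without a model.

```python
import json


def generate_json_with_retry(call_model, prompt, max_attempts=3):
    """Call the model until its reply parses as a JSON object, or give up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply = call_model(prompt)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError as error:
            last_error = error  # malformed JSON: ask again
            continue
        if isinstance(parsed, dict):
            return parsed, attempt
        last_error = ValueError("reply is valid JSON but not an object")
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts: {last_error}")


# Hypothetical stand-in for a real LLM call: the first reply is truncated,
# the second is valid JSON.
_replies = iter(['Sure! Here is the JSON: {"name": ', '{"name": "张三"}'])


def fake_model(prompt):
    return next(_replies)


result, attempts = generate_json_with_retry(fake_model, "Extract the name as JSON.")
print(result, attempts)
```

Each retry costs a full model call, which is part of the motivation for constraining the output during generation instead.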

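Below, I introduce outlines, a tool for controlling model output, and show how to use it through a small demo from the recruiting domain: extracting information from a resume and structuring it as JSON.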

1. Practice

Straight to the demo code and its usage.

Installation:


        
          
pip install outlines

Resume text (input)


        
          
张三  
男 | 年龄：25岁 | 籍贯：北京 | 共产党员 | 18794434244  
求职意向：算法工程师 | 期望城市：北京  
个人优势  
擅长领域：深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发  
专业技能：熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office  
教育经历  
北京工业大学 硕士 软件工程 2012-2015  
担任职务：党支部书记；主修课程：深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据  
河南工业大学 本科 计算机科学与技术 2008-2012  
担任职务：班长、党支部书记；主修课程：Java、数据库、操作系统、计算机网络、数据结构  
实习经历  
大模型自然语言处理科技（北京）有限公司 算法工程师 2023.12-2024.03  
● 负责数据采集、清洗并标注2D、3D驾驶数据，确保数据质量和多样性  
● 负责自动驾驶感知模块的算法开发与优化，利用行车影像数据进行模型迭代优化  
项目经历  
图像识别 总负责人 2022.09-至今  
● 设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物  
总览可视化系统，实现对城市建筑物变化动态监测  
信息管理系统（SSM框架） 项目设计师 2024.02-2024.03  
负责高校党务信息管理系统的整体架构设计，确保系统功能模块化、高效稳定  
技术栈：Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git  
● 使用 MySQL 数据库设计，MyBatis 框架实现数据持久层的开发，提高 JDBC 开发效率  
● 使用 HTML、CSS 和 JavaScript 技术， 结合 Element 组件库，快速构建响应式前端网页界面

Schema definition

This step defines the entity types to extract from the resume, expressed as a JSON Schema that outlines can accept.


        
          
{  
    "title": "Resume",  
    "type": "object",  
    "properties": {  
        "fullName": {  
            "title": "全名",  
            "maxLength": 50,  
            "type": "string"  
        },  
        "contact": {  
            "title": "联系方式",  
            "type": "object",  
            "properties": {  
                "phone": {  
                    "title": "电话号码",  
                    "type": "string"  
                }  
            },  
            "required": ["phone"]  
        },  
        "education": {  
            "title": "教育背景",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "degree": {  
                        "title": "学位",  
                        "type": "string"  
                    },  
                    "institution": {  
                        "title": "学校",  
                        "type": "string"  
                    },  
                    "fieldOfStudy": {  
                        "title": "专业",  
                        "type": "string"  
                    },  
                    "graduationYear": {  
                        "title": "毕业年份",  
                        "type": "integer"  
                    }  
                },  
                "required": ["degree", "institution", "fieldOfStudy", "graduationYear"]  
            }  
        },  
        "experience": {  
            "title": "工作经验",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "jobTitle": {  
                        "title": "职位",  
                        "type": "string"  
                    },  
                    "company": {  
                        "title": "公司",  
                        "type": "string"  
                    },  
                    "duration": {  
                        "title": "任职时间",  
                        "type": "string"  
                    },  
                    "responsibilities": {  
                        "title": "职责",  
                        "type": "array",  
                        "items": {  
                            "type": "string"  
                        }  
                    }  
                },  
                "required": ["jobTitle", "company", "duration", "responsibilities"]  
            }  
        },  
        "skills": {  
            "title": "技能",  
            "type": "array",  
            "items": {  
                "type": "string"  
            }  
        }  
    },  
    "required": ["fullName", "contact", "education", "experience", "skills"]  
}
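Because the schema is plain JSON, it can also be parsed and used to sanity-check records independently of the model. The following is a minimal sketch, not a full JSON Schema validator: `check` is a hypothetical helper that handles only the keywords used in the schema above (`type`, `properties`, `required`, `items`, `maxLength`), and it runs on a trimmed copy of that schema. In practice a complete validator library would be the better choice.

```python
import json


def check(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance conforms."""
    errors = []
    expected = schema.get("type")
    python_types = {"object": dict, "array": list, "string": str, "integer": int}
    if expected in python_types:
        ok = isinstance(instance, python_types[expected])
        if expected == "integer" and isinstance(instance, bool):
            ok = False  # bool is a subclass of int in Python
        if not ok:
            return [f"{path}: expected {expected}"]
    if expected == "object":
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}.{key}: missing required field")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(check(instance[key], subschema, f"{path}.{key}"))
    elif expected == "array":
        for index, item in enumerate(instance):
            errors.extend(check(item, schema.get("items", {}), f"{path}[{index}]"))
    elif expected == "string":
        if "maxLength" in schema and len(instance) > schema["maxLength"]:
            errors.append(f"{path}: longer than {schema['maxLength']} characters")
    return errors


# Trimmed copy of the resume schema, for illustration only.
schema = json.loads('''{
    "type": "object",
    "properties": {
        "fullName": {"type": "string", "maxLength": 50},
        "contact": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"]
        },
        "education": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"graduationYear": {"type": "integer"}},
                "required": ["graduationYear"]
            }
        }
    },
    "required": ["fullName", "contact", "education"]
}''')

good = {"fullName": "张三", "contact": {"phone": "18794434244"},
        "education": [{"graduationYear": 2015}]}
bad = {"fullName": "张三", "contact": {},
       "education": [{"graduationYear": "2015"}]}

good_errors = check(good, schema)
bad_errors = check(bad, schema)
print(good_errors)
print(bad_errors)
```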

Complete code for LLM information extraction


        
          
import outlines  
  
schema = '''{  
    "title": "Resume",  
    "type": "object",  
    "properties": {  
        "fullName": {  
            "title": "全名",  
            "maxLength": 50,  
            "type": "string"  
        },  
        "contact": {  
            "title": "联系方式",  
            "type": "object",  
            "properties": {  
                "phone": {  
                    "title": "电话号码",  
                    "type": "string"  
                }  
            },  
            "required": ["phone"]  
        },  
        "education": {  
            "title": "教育背景",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "degree": {  
                        "title": "学位",  
                        "type": "string"  
                    },  
                    "institution": {  
                        "title": "学校",  
                        "type": "string"  
                    },  
                    "fieldOfStudy": {  
                        "title": "专业",  
                        "type": "string"  
                    },  
                    "graduationYear": {  
                        "title": "毕业年份",  
                        "type": "integer"  
                    }  
                },  
                "required": ["degree", "institution", "fieldOfStudy", "graduationYear"]  
            }  
        },  
        "experience": {  
            "title": "工作经验",  
            "type": "array",  
            "items": {  
                "type": "object",  
                "properties": {  
                    "jobTitle": {  
                        "title": "职位",  
                        "type": "string"  
                    },  
                    "company": {  
                        "title": "公司",  
                        "type": "string"  
                    },  
                    "duration": {  
                        "title": "任职时间",  
                        "type": "string"  
                    },  
                    "responsibilities": {  
                        "title": "职责",  
                        "type": "array",  
                        "items": {  
                            "type": "string"  
                        }  
                    }  
                },  
                "required": ["jobTitle", "company", "duration", "responsibilities"]  
            }  
        },  
        "skills": {  
            "title": "技能",  
            "type": "array",  
            "items": {  
                "type": "string"  
            }  
        }  
    },  
    "required": ["fullName", "contact", "education", "experience", "skills"]  
}'''  
  
# Replace the placeholder with a local model path or a Hugging Face model ID.  
model = outlines.models.transformers(  
    "path/to/your/model")  
generator = outlines.generate.json(model, schema)  
  
resume_text = '''张三  
男 | 年龄：25岁 | 籍贯：北京 | 共产党员 | 18794434244  
求职意向：算法工程师 | 期望城市：北京  
个人优势  
擅长领域：深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发  
专业技能：熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office  
教育经历  
北京工业大学 硕士 软件工程 2012-2015  
担任职务：党支部书记；主修课程：深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据  
河南工业大学 本科 计算机科学与技术 2008-2012  
担任职务：班长、党支部书记；主修课程：Java、数据库、操作系统、计算机网络、数据结构  
实习经历  
大模型自然语言处理科技（北京）有限公司 算法工程师 2023.12-2024.03  
● 负责数据采集、清洗并标注2D、3D驾驶数据，确保数据质量和多样性  
● 负责自动驾驶感知模块的算法开发与优化，利用行车影像数据进行模型迭代优化  
项目经历  
图像识别 总负责人 2022.09-至今  
● 设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物  
总览可视化系统，实现对城市建筑物变化动态监测  
信息管理系统（SSM框架） 项目设计师 2024.02-2024.03  
负责高校党务信息管理系统的整体架构设计，确保系统功能模块化、高效稳定  
技术栈：Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git  
● 使用 MySQL 数据库设计，MyBatis 框架实现数据持久层的开发，提高 JDBC 开发效率  
● 使用 HTML、CSS 和 JavaScript 技术， 结合 Element 组件库，快速构建响应式前端网页界面  
'''  
# Run schema-constrained generation on the resume text.  
resume = generator(resume_text)  
  
print(repr(resume))

Structured JSON output


        
          
{  
  "fullName": "张三",  
  "contact": {  
    "phone": "18794434244"  
  },  
  "education": [  
    {  
      "degree": "硕士",  
      "institution": "北京工业大学",  
      "fieldOfStudy": "软件工程",  
      "graduationYear": 2015  
    },  
    {  
      "degree": "学士",  
      "institution": "河南工业大学",  
      "fieldOfStudy": "计算机科学与技术",  
      "graduationYear": 2012  
    }  
  ],  
  "experience": [  
    {  
      "jobTitle": "算法工程师",  
      "company": "大模型自然语言处理科技（北京）有限公司",  
      "duration": "2023.12-2024.03",  
      "responsibilities": [  
        "设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物",  
        "负责数据采集、清洗并标注2D、3D驾驶数据，确保数据质量和多样性",  
        "负责自动驾驶感知模块的算法开发与优化，利用行车影像数据进行模型迭代优化"  
      ]  
    },  
    {  
      "jobTitle": "图像识别",  
      "company": "图像识别",  
      "duration": "2022.09-至今",  
      "responsibilities": [  
        "设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物",  
        "实现对城市建筑物变化动态监测的建筑物",  
        "传感器数据采集停车场停车车位判断算法优化"  
      ]  
    },  
    {  
      "jobTitle": "信息管理系统（SSM框架）",  
      "company": "信息管理系统（SSM框架）",  
      "duration": "-non",  
      "responsibilities": [  
        "负责高校党务信息管理系统的整体架构设计，确保系统功能模块化、高效稳定",  
        "使用 MySQL 数据库设计，MyBatis 框架实现数据持久层的开发，提高 JDBC 开发效率",  
        "使用 HTML、CSS 和 JavaScript 技术， 结合 Element 组件库，快速构建响应式前端网页界面"  
      ]  
    }  
  ],  
  "skills": [  
    "Python",  
    "PyTorch",  
    "Java",  
    "MySQL",  
    "SpringMVC",  
    "SpringBoot",  
    "Office"  
  ]  
}
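Constrained decoding fixes the format of the output, not its content. In the result above, for instance, the third experience entry has the duration "-non", and one responsibility under the second entry ("传感器数据采集停车场停车车位判断算法优化") does not appear anywhere in the resume. One cheap check, sketched here under the assumption that extracted values should be copied verbatim from the source, is to flag string values that are not substrings of the input. `iter_strings` and `find_ungrounded` are hypothetical helper names, and the sample data is a shortened excerpt.

```python
def iter_strings(node, path=""):
    """Yield (path, value) for every string leaf in a parsed JSON value."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")


def find_ungrounded(extracted, source_text):
    """Return string values that do not occur in the source, ignoring whitespace."""
    normalized_source = "".join(source_text.split())
    return [
        (path, value)
        for path, value in iter_strings(extracted)
        if "".join(value.split()) not in normalized_source
    ]


source_text = "信息管理系统（SSM框架） 项目设计师 2024.02-2024.03"
extracted = {"experience": [{"jobTitle": "项目设计师", "duration": "-non"}]}

flagged = find_ungrounded(extracted, source_text)
print(flagged)
# [('experience[0].duration', '-non')]
```

A verbatim check like this also flags legitimate normalizations (the output above gives "学士" where the resume says "本科"), so flagged values are candidates for review rather than confirmed errors.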

2. Other format-control examples

Multiple choices


        
          
import outlines  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
  
prompt = """You are a sentiment-labelling assistant.  
Is the following review positive or negative?  
  
Review: This restaurant is just awesome!  
"""  
  
generator = outlines.generate.choice(model, ["Positive", "Negative"])  
answer = generator(prompt)

Type constraint


        
          
import outlines  
  
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")  
  
prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "  
answer = outlines.generate.format(model, int)(prompt)  
print(answer)  
# 3  
  
prompt = "sqrt(2)="  
generator = outlines.generate.format(model, float)  
answer = generator(prompt, max_tokens=10)  
print(answer)  
# 1.41421356

Efficient regex-structured generation


        
          
import outlines  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
  
prompt = "What is the IP address of the Google DNS servers? "  
  
generator = outlines.generate.text(model)  
unstructured = generator(prompt, max_tokens=30)  
  
generator = outlines.generate.regex(  
    model,  
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",  
)  
structured = generator(prompt, max_tokens=30)  
  
print(unstructured)  
# What is the IP address of the Google DNS servers?  
#  
# Passive DNS servers are at DNS servers that are private.  
# In other words, both IP servers are private. The database  
# does not contain Chelsea Manning  
  
print(structured)  
# What is the IP address of the Google DNS servers?  
# 2.2.6.1
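The regex guarantees only the shape of the answer. The structured output above, 2.2.6.1, is a well-formed IPv4 address, but it is not the address of Google's public DNS (8.8.8.8). The short sketch below uses the standard `re` module to show which strings the same pattern accepts; the candidate list is made up for illustration.

```python
import re

# Same pattern as in the outlines example above.
IPV4 = re.compile(
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"
)

candidates = ["2.2.6.1", "8.8.8.8", "256.1.1.1", "1.2.3"]
matches = {text: IPV4.fullmatch(text) is not None for text in candidates}
print(matches)
# {'2.2.6.1': True, '8.8.8.8': True, '256.1.1.1': False, '1.2.3': False}
```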

Efficient JSON generation following a Pydantic model


        
          
from enum import Enum  
from pydantic import BaseModel, constr  
  
import outlines  
import torch  
  
  
class Weapon(str, Enum):  
    sword = "sword"  
    axe = "axe"  
    mace = "mace"  
    spear = "spear"  
    bow = "bow"  
    crossbow = "crossbow"  
  
  
class Armor(str, Enum):  
    leather = "leather"  
    chainmail = "chainmail"  
    plate = "plate"  
  
  
class Character(BaseModel):  
    name: constr(max_length=10)  
    age: int  
    armor: Armor  
    weapon: Weapon  
    strength: int  
  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
  
# Construct structured sequence generator  
generator = outlines.generate.json(model, Character)  
  
# Draw a sample  
rng = torch.Generator(device="cuda")  
rng.manual_seed(789001)  
  
character = generator("Give me a character description", rng=rng)  
  
print(repr(character))  
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)  
  
character = generator("Give me an interesting character description", rng=rng)  
  
print(repr(character))  
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)

Efficient JSON generation following a JSON Schema


        
          
import outlines  
  
schema = '''{  
    "title": "Character",  
    "type": "object",  
    "properties": {  
        "name": {  
            "title": "Name",  
            "maxLength": 10,  
            "type": "string"  
        },  
        "age": {  
            "title": "Age",  
            "type": "integer"  
        },  
        "armor": {"$ref": "#/definitions/Armor"},  
        "weapon": {"$ref": "#/definitions/Weapon"},  
        "strength": {  
            "title": "Strength",  
            "type": "integer"  
        }  
    },  
    "required": ["name", "age", "armor", "weapon", "strength"],  
    "definitions": {  
        "Armor": {  
            "title": "Armor",  
            "description": "An enumeration.",  
            "enum": ["leather", "chainmail", "plate"],  
            "type": "string"  
        },  
        "Weapon": {  
            "title": "Weapon",  
            "description": "An enumeration.",  
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],  
            "type": "string"  
        }  
    }  
}'''  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
generator = outlines.generate.json(model, schema)  
character = generator("Give me a character description")

Using context-free grammars to guide generation


        
          
import outlines  
  
arithmetic_grammar = """  
    ?start: expression  
  
    ?expression: term (("+" | "-") term)*  
  
    ?term: factor (("*" | "/") factor)*  
  
    ?factor: NUMBER  
           | "-" factor  
           | "(" expression ")"  
  
    %import common.NUMBER  
"""  
  
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")  
generator = outlines.generate.cfg(model, arithmetic_grammar)  
sequence = generator("Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:")  
  
print(sequence)  
# (8-2)

Open functions


        
          
import outlines  
  
  
def add(a: int, b: int):  
    return a + b  
  
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")  
generator = outlines.generate.json(model, add)  
result = generator("Return json with two integers named a and b respectively. a is odd and b even.")  
  
print(add(**result))  
# 3
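The idea behind this example is that a function's type-annotated signature carries enough information to describe its arguments as a JSON object. How outlines performs that conversion internally is not covered in this article; the sketch below only illustrates the idea with the standard `inspect` module. `signature_to_schema` and `JSON_TYPES` are hypothetical names, and only a few primitive annotations are handled.

```python
import inspect

# Mapping from Python annotations to JSON Schema type names (illustrative subset).
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def signature_to_schema(func):
    """Build a JSON-Schema-like dict from a function's annotated parameters."""
    properties = {}
    required = []
    for name, parameter in inspect.signature(func).parameters.items():
        properties[name] = {"type": JSON_TYPES[parameter.annotation]}
        if parameter.default is inspect.Parameter.empty:
            required.append(name)  # no default value, so the field is mandatory
    return {"type": "object", "properties": properties, "required": required}


def add(a: int, b: int):
    return a + b


schema = signature_to_schema(add)
print(schema)
```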

Prompting


        
          
import outlines  
  
examples = [  
    ("The food was disgusting", "Negative"),  
    ("We had a fantastic night", "Positive"),  
    ("Recommended", "Positive"),  
    ("The waiter was rude", "Negative")  
]  
  
@outlines.prompt  
def labelling(to_label, examples):  
    """You are a sentiment-labelling assistant.  
  
    {% for example in examples %}  
    {{ example[0] }} // {{ example[1] }}  
    {% endfor %}  
    {{ to_label }} //  
    """  
  
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")  
prompt = labelling("Just awesome", examples)  
answer = outlines.generate.text(model)(prompt, max_tokens=100)
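To see roughly what the few-shot prompt looks like once the template is filled in, the same layout can be produced with plain Python string formatting. This is a sketch of the resulting text, not of the template engine: `build_labelling_prompt` is a hypothetical helper, and the exact whitespace produced by the template above may differ.

```python
def build_labelling_prompt(to_label, examples):
    """Assemble a few-shot sentiment prompt: instruction, examples, then the query."""
    lines = ["You are a sentiment-labelling assistant.", ""]
    lines += [f"{text} // {label}" for text, label in examples]
    lines.append(f"{to_label} //")  # left open for the model to complete
    return "\n".join(lines)


examples = [
    ("The food was disgusting", "Negative"),
    ("We had a fantastic night", "Positive"),
    ("Recommended", "Positive"),
    ("The waiter was rude", "Negative"),
]

prompt = build_labelling_prompt("Just awesome", examples)
print(prompt)
```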

Summary

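This article introduced outlines, a tool for controlling the structure of LLM output, and demonstrated it with a resume information extraction demo. The demo output is valid JSON that conforms to the schema, although some field values are wrong (for example, the `duration` value "-non"), so the extracted content still needs to be checked. The article also recorded code for several other kinds of format control.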

References

https://github.com/outlines-dev/outlines

Related posts

[A first taste of prompt design: one-shot fine-tuning of chatglm-6b for information extraction](http://mp.weixin.qq.com/s?__biz=Mzg4NjI0NDg0Ng==&mid=2247484023&idx=1&sn=7dbbf3c41e78ec000f20a964f494a98c&chksm=cf9dd6f6f8ea5fe07fcc6844d66003ebe74f802b8aa287352387b080f26c7c75d35bb805ea89&scene=21#wechat_redirect)  

[Tricks for controlling the LLM generation process: a custom LogitsProcessor in practice](http://mp.weixin.qq.com/s?__biz=Mzg4NjI0NDg0Ng==&mid=2247484215&idx=1&sn=20b51b6c19cca9a67a0f1475e6183bca&chksm=cf9dd7b6f8ea5ea0c9d47bacff09fd6e7498b606dac557de9a19b1fde29fb2c30aa6962dcf6d&scene=21#wechat_redirect)