- Introduction
- Overview
- The data synthesis pipeline
- Sample synthesized result
- Code walkthrough
- Summary
Today's paper, from Salesforce AI Research, presents Webscale-RL, a new data synthesis pipeline designed to address the scarcity of reinforcement learning (RL) data for large language models (LLMs).
Webscale-RL systematically converts trillion-token-scale pretraining documents into millions of diverse, verifiable question-answer pairs, allowing RL training to approach pretraining scale.
The authors report that RL training on the Webscale-RL dataset significantly outperforms baselines such as continual pretraining across a range of benchmarks, with up to 100x better data efficiency, offering a path toward more capable and more efficient LLMs. The pipeline ensures the dataset's scale, diversity, and fidelity through data filtering, domain classification, multi-persona assignment, and quality checking.
Paper: https://arxiv.org/abs/2510.06499
GitHub (truly open source!):
https://github.com/SalesforceAIResearch/PretrainRL-pipeline/
The official implementation defaults to the OpenAI API; with minor changes you can point it at any other OpenAI-compatible endpoint or a custom LLM API.
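A minimal sketch of swapping in an OpenAI-compatible endpoint, assuming the OpenAI Python SDK (v1+); the `base_url` and model name below are placeholders, not values from the repo:

```python
from openai import OpenAI

# Point the standard OpenAI client at any OpenAI-compatible server
# (e.g., a local vLLM deployment). base_url and model are placeholders.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical endpoint
    api_key="EMPTY",                      # many local servers ignore the key
)

resp = client.chat.completions.create(
    model="my-local-model",               # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```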
The Webscale-RL pipeline is an automated, scalable data engine whose core goal is to systematically convert large pretraining corpora into verifiable question-answer (QA) pairs, expanding both the scale and the diversity of RL data. The pipeline uses a generative model (an LLM) to turn narrative pretraining documents into QA pairs suitable for RL training.
The synthesis process consists of four main stages, designed to ensure the final dataset is large, diverse, and reliable (i.e., verifiable).
Figure 2: Overview of the Webscale-RL data pipeline, which systematically converts large-scale pretraining data into RL data while preserving the scale and diversity of web data. The pipeline maintains a domain-specific demonstration repository that supplies few-shot examples for high-quality generation, and assigns multiple personas to each document to encourage questions reflecting different perspectives. Generated QA pairs go through correctness verification and leakage-prevention checks to ensure the reliability of the RL dataset.
Different stages use different LLMs: GPT-4.1 for data filtering and QA generation, and GPT-4.1-mini for domain classification and the final quality check (sketched as a configuration below).
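A minimal configuration sketch of this stage-to-model mapping; the dataclass and field names are our own illustration, not the repo's actual config schema:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    # Stage-to-model mapping as described in the paper; field names are ours.
    filter_model: str = "gpt-4.1"            # stage 1: data filtering
    identifier_model: str = "gpt-4.1-mini"   # stage 2: domain + persona
    generator_model: str = "gpt-4.1"         # stage 3: QA generation
    checker_model: str = "gpt-4.1-mini"      # stage 4: quality check
```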
Stage 1: Data Filtering
Goal: remove input documents that are unlikely to yield verifiable, high-quality questions.
Method: a two-stage filter: 1) heuristic filtering, which discards obviously low-quality documents up front (see the sketch after this list); 2) fine-grained LLM-based filtering, in which the LLM identifies and removes two kinds of content:
- Non-informative pages: content that is mostly boilerplate, e.g., navigation, headers, or footers from a site's HTML.
- Non-self-contained fragments: text lacking enough context to verify an answer.
Result: retained documents are both informative and convertible into verifiable RL data.
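The heuristic stage is not spelled out in detail; here is a minimal sketch of what such a pre-filter might look like, with rules and thresholds that are our own illustration rather than the repo's:

```python
def heuristic_filter(text: str) -> bool:
    """Cheap pre-filter that drops obviously low-quality documents.
    These rules and thresholds are illustrative, not taken from the repo."""
    if len(text.split()) < 50:                  # too short to be informative
        return False
    lines = [l for l in text.splitlines() if l.strip()]
    short = sum(1 for l in lines if len(l.split()) <= 2)
    return short / max(len(lines), 1) < 0.5     # mostly menu/nav fragments
```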
Stage 2: Domain Classification and Persona Assignment
- Domain classification: an LLM-based classifier assigns each document to a specific domain (e.g., commerce, medicine, social science). The domain label is later used to retrieve relevant few-shot examples during QA generation.
- Persona assignment: the key step for increasing diversity. Each document is assigned multiple personas (up to 3), i.e., assumed target audiences, to encourage question generation from different angles.
- For example, an article on healthcare might be assigned personas such as "medical expert", "patient", or "health journalist".
- Generating questions about the same document from different perspectives captures a broader slice of the source information, enriching the diversity of the RL dataset.
Stage 3: Verifiable QA Pair Generation
Core step: an LLM-based QA generator (GPT-4.1) produces verifiable QA pairs conditioned on: 1) the source document; 2) the domain label; 3) the selected persona.
Generation strategy:
- Few-shot guidance: few-shot examples are drawn from a domain-specific demonstration repository to keep generated questions high in quality and varied in type.
- Persona-grounded generation: the generator is instructed to extract QA pairs from the assigned persona's point of view.
- Question self-containedness: the generator must add the necessary context so that each question is self-contained, since the model cannot access the source document during RL training.
- Answer brevity: only short, verifiable answers are requested (e.g., a number, date, name, or phrase) rather than long explanations or detailed reasoning steps; this significantly reduces generation complexity and allows a more cost-effective LLM to be used.
- Faithfulness: both question and answer must come entirely from the source material; no information absent from the material may be generated.
Stage 4: Quality Check and Information-Leakage Control
- Goal: ensure the reliability of the RL dataset; even though most generated QA pairs are high quality, they must still be checked for errors and hallucinations.
- Verification: an LLM-based checker (GPT-4.1-mini) runs a multi-stage check:
- Correctness verification: is the answer supported by the source document (groundedness)? This effectively reduces invalid reward signals during RL training.
- Leakage prevention: the question must not explicitly give away the answer (i.e., the answer cannot simply be read off the prompt). This ensures the final dataset genuinely tests the model's knowledge or reasoning rather than its ability to restate or recall information from the question.
- Final handling: any QA pair that fails these criteria is filtered out.
- Decontamination: after the four stages, a decontamination step removes content that overlaps with evaluation benchmarks (a sketch of one common approach follows this list).
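The paper does not spell out the decontamination method here; a common approach is n-gram overlap against benchmark questions. A minimal sketch under that assumption (the 8-gram threshold is a common convention, not taken from the paper):

```python
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(question: str, benchmark_ngrams: set, n: int = 8) -> bool:
    """Flag a QA pair whose question shares any n-gram with a benchmark."""
    return bool(ngrams(question, n) & benchmark_ngrams)
```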
Through this pipeline, large-scale pretraining data (e.g., DCLM, Wikipedia, MegaMath) is systematically converted into a large-scale, diverse, verifiable, RL-ready dataset.
  
Here is a sample synthesized record (pretty-printed; the long pretrain_text value is shown with literal line breaks for readability):

{
  "pretrain_text": "Skip Navigation

People and Climate

Let's Talk About It

In this activity, students observe that sunlight shines most directly on the center of the globe, nearest to the equator. Conversely, areas furthest from the equator receive the least sunlight and heat. Students learn about the major climate types on Earth, and how climate affects people's lifestyles.

Discuss how light from the flashlight shined on the globe. Tell students that the central part of Earth (near the equator) receives light at a more direct angle from the sun than do other regions of the globe. Emphasize that in addition to temperature and sunlight, rainfall (or lack thereof) is an important component of weather and climate.

Ask students if they have ever wondered why there are seasons of the year. Tell them that seasons are caused by Earth's tilt as it revolves around the sun. When the Northern Hemisphere is tilted toward the sun, that half of Earth experiences summer, and the Southern Hemisphere has winter.

Funded by the following grant(s)

National Institute of Environmental Health Sciences, NIH
My Health My World: National Dissemination
Grant Number: 5R25ES009259
The Environment as a Context for Opportunities in Schools
Grant Number: 5R25ES010698, R25ES06932
Houston Endowment Inc.",
  "question": "When learning about Earth's climate and weather, it is important to understand why we have seasons each year. What is the reason for the changing seasons?",
  "answer": "Seasons are caused by Earth's tilt as it revolves around the sun.",
  "domain": "Natural Science",
  "persona": "students"
}
Stage 1 (Data Filtering) prompt template:
  
FILTER_FORMAT = """
The output MUST strictly adhere to the following JSON format, and other text MUST NOT be included:
{
"thought": "Describe your reasoning for identifying whether the data is qualified or not.",
"qualified": "Y for qualified, N for not qualified.",
}
"""

FILTER_TEMPLATE = """
You are a helpful data analyst. You will be given a material which can come from very diverse sources and may not be well-structured. Our final goal is to generate question and answer pair from the material. In this stage, your task is to identify whether the material is qualified for the following criteria:
- The material is informative and self-contained for the user.
- The content has sufficient depth and clarity.
- It's possible to extract question and corresponding answer from the material.
Based on the above instructions, identify whether the material is qualified or not.

Material:
{material}

{format_inst}
"""
Documents that pass the stage 1 filter proceed to stage 2.
Stage 2 (Domain Classification and Persona Assignment) prompt template:
  
IDENTIFIER_FORMAT = """
The output MUST strictly adhere to the following JSON format, and other text MUST NOT be included:
{
  "thought": "Describe your reasoning for identifying the domain and persona of the material.",
  "domain": "The domain of the material.",
  "persona": "The persona that the material is intended for. Separate with comma if there are multiple personas. Max 3 personas.",
}
"""

ALL_DOMAINS = [
    "Math",
    "Technology & Engineering",
    "Coding",
    "Social Science",
    "Natural Science",
    "Travel & Lifestyle",
    "Commerce & Economics",
    "Medicine & Health",
    "Education",
    "Other",
]

IDENTIFIER_TEMPLATE = """
You are a helpful data analyst. You will be given a material from a website which can come from very diverse sources and may not be well-structured. Our final goal is to generate question and answer pair from the material. In this stage, your task is to identify the domain and persona of the material.

Here are the instructions for the domain and persona:
- The domain is the main topic of the material. You should choose from the following domains: {all_domains}. If you find that the material is not related to any of the domains or the domain is not clear, you should choose "Other". If there are multiple domains that the material is related to, you should choose the most relevant domain.
- The persona is the intended audience of the material. If the material is intended for multiple personas, you should list several personas (up to 3) that will be interested in the material.

Based on the above instructions, identify the domain and persona of the material.

Material:
{material}

{format_inst}
"""
So the output of stage 2 contains both the domain and the persona information.
Stage 3 (Verifiable QA Pair Generation) prompt template:
  
GENERATOR_FORMAT = """
The output MUST strictly adhere to the following JSON format, and other text MUST NOT be included:
{
"thought": "Describe your reasoning for generating question and answer pair.",
"question": "The question generated from the material.",
"answer": "The answer generated from the material.",
}
"""

GENERATOR_TEMPLATE = """
You will be given a material from a website which can come from very diverse sources and may not be well-structured. Our final goal is to generate question and answer pair from the material. In this stage, your task is to generate a question and answer pair from the material.

Here are the instructions for the question and answer generation:
- You will act as a given persona. You should generate a question and answer pair from your perspective.
- Both the question and answer should be totally from the material. Do not generate any information that is not in the material.
- You should generate such a question that its corresponding answer is relatively short and can be easily and clearly verified.
- The generated question will be asked without providing the original material. Therefore, you should add a necessary brief introduction of the background before the question. NEVER ask a question with "according to the material".
- When adding introduction to the question, you should NEVER explicitly include the answer in the question, which will be viewed as info leakage and is strictly forbidden.

Here are some examples of QA pairs extracted from the material:
{few_shot_example}

Based on the above instructions and examples, generate the question and answer pair from the material according to your persona.

[Material]
{material}


[Your persona]
{persona}

{format_inst}
"""
The few_shot_example used in the template above is built as follows: preset few-shot demonstrations are selected according to the document's domain. The call site:
  
few_shot_example = self.get_few_shot_example(self.generator_cfg.num_fewshot, domain)
The core logic:
  
    def format_examples(self, examples: List[Dict]):
        """
        Formats a list of example dictionaries into a structured string.
        """
        example_str = "\n[BEGIN OF EXAMPLES]\n"
        for i, example in enumerate(examples):
            example_str += f"Example {i}:\n"
            example_str += self.few_shot_example_template.format(
                original_material=example["original_material"],
                persona=example["persona"],
                question=example["question"],
                answer=example["answer"],
            )
            example_str += "\n"
        example_str += "[END OF EXAMPLES]\n\n"
        return example_str
where few_shot_example_template is:
  
GENERATOR_FEW_SHOT_TEMPLATE = """
Material:
{original_material}


Persona:
{persona}


Extracted Question:
{question}


Extracted Answer:
{answer}
"""
As you can see, each few-shot example simply fills in the {original_material}, {persona}, {question}, and {answer} slots.
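For illustration, a hedged usage sketch that treats format_examples as a free function (in the repo it is a method on the generator class and reads the template from self):

```python
examples = [{
    "original_material": "...",  # elided source document
    "persona": "military personnel",
    "question": ("In Army operations, what are the core competencies "
                 "of the fires warfighting function?"),
    "answer": "Air defense artillery and field artillery.",
}]
few_shot_example = format_examples(examples)  # yields a string like the one below
```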
The final few_shot_example string looks like this:
  
'\n[BEGIN OF EXAMPLES]
Example 0:

Material:
{source material text omitted}


Persona:
military personnel


Extracted Question:
In Army operations, what are the core competencies of the fires warfighting function?


Extracted Answer:
Air defense artillery and field artillery.

Example 1:

Material:
{source material text omitted}


Persona:
Mandopop enthusiasts


Extracted Question:
In Khalil Fong's 2009 cover album 'Timeless', which Mandarin classics and their original artists did he cover?


Extracted Answer:
He covered Faye Wong's 'Red Bean' and A-mei's 'Remember'.

[END OF EXAMPLES]\n\n'
Stage 4 (Quality Check and Information-Leakage Control) prompt template:
  
CHECKER_FORMAT = """
The output MUST strictly adhere to the following JSON format, and other text MUST NOT be included:
{
"thought": "Describe your reasoning for checking the question and answer pair.",
"has_context": "Y for has context, N for no context",
"answer_correctness": "Y for correct, N for incorrect",
"info_leakage": "Y for has info leakage, N for no info leakage",
}
"""
  
CHECKER_TEMPLATE = """
You are a data labeler. You will be given a material and a question and answer pair generated from the material. Your task is to check whether the question and answer pair is correct according to the material and whether there is info leakage from question to answer.

Here are the instructions for checking:
- The necessary context for the question should be added to the question, e.g., a question with "according to the material" should be removed.
- For the answer correctness, you should check whether the answer is correct according to the original material.
- The information leakage indicates that the question explicitly provides information about the answer and then the answer can be directly obtained from the question.

Based on the above instructions, check the QA pair extracted from the original material in terms of the answer correctness and info leakage.

[Original Material]:
{material}

[Extracted Question]:
{question}

[Extracted Answer]:
{answer}

{format_inst}
"""
To summarize: Webscale-RL introduces a scalable data pipeline that systematically converts trillion-token-scale pretraining documents into 1.2 million diverse, verifiable RL QA pairs, addressing the bottleneck of scarce, low-diversity RL data. Experiments show that models trained on Webscale-RL significantly outperform baselines and achieve up to 100x better data efficiency, offering a viable path to scaling up RL.
That said, the Webscale-RL method and its dataset have the following limitations:
1. Insufficient high-quality coverage in certain domains
- Unbalanced domain coverage: the current Webscale-RL dataset lacks high-quality data in some domains, such as coding.
- Impact: gains on coding benchmarks are correspondingly small, which likely reflects the low proportion of coding data in the pretraining corpus.
2. Efficiency bottleneck in RL training
- Costly reward model: the current RL training uses a generative reward model.
- Feedback mechanism and cost: the reward model gives binary feedback based on whether the model's generated answer matches the ground-truth answer (see the sketch after this list). While this reward works well for model performance and training stability, it introduces substantial extra inference cost.
- Scalability bottleneck: this inference cost is one of the obstacles to scaling Webscale-RL further to larger models and datasets.
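A minimal sketch of such a generative binary reward under stated assumptions: judge_llm is a hypothetical LLM call that answers Y/N on whether two answers match, and the prompt wording is ours, not the paper's:

```python
def binary_reward(generated: str, ground_truth: str, judge_llm) -> float:
    """Generative reward model: 1.0 if an LLM judge says the generated
    answer matches the ground truth, else 0.0. Prompt wording is ours."""
    verdict = judge_llm(
        f"Do these two answers convey the same fact? Reply Y or N.\n"
        f"Answer A: {generated}\nAnswer B: {ground_truth}"
    )
    return 1.0 if verdict.strip().upper().startswith("Y") else 0.0
```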
Future directions:
Based on these limitations, the paper proposes:
- Rebalancing the domain distribution: future work can rebalance the domain mix of the pretraining sources according to the target application (e.g., incorporating repository-scale coding data) to strengthen coding ability.
- Exploring more efficient reward models: cheaper reward models would allow RL training to scale further to larger models and datasets.
