Motivation: A growing body of work shows that online RL is more effective than offline methods such as DPO (see the comparison figure in the blog post linked below). PPO, however, has high GPU-memory requirements and is fairly sensitive to hyperparameters. Researchers at Cohere proposed RLOO, a new RLHF algorithm. RLOO simplifies RLHF training by treating the entire model completion as a single action and optimizing it with REINFORCE: it needs no value model, is less sensitive to hyperparameter tuning than PPO, and outperforms PPO.
Paper: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
https://arxiv.org/abs/2402.14740
https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo
1. How it works
Comparison with PPO
- Similarities:
  - The policy model generates completions, and per-token log probabilities are obtained under both the policy model and the reference model.
  - A per-token KL penalty is computed as the difference between the log probabilities under the current policy and the reference policy.
  - The reward model assigns a single score to the entire completion.
The toy example below (from the HF blog post linked above) walks through these steps:
from torch import Tensor
response = Tensor([4., 5., 6.])
per_token_logprobs = Tensor([-12.3, -8.3, -2.3])
reference_per_token_logprobs = Tensor([-11.3, -8.4, -2.0])
kl = per_token_logprobs - reference_per_token_logprobs
score_from_rm = 1.0
print(f"{kl=}") # kl=tensor([-1.0000, 0.1000, -0.3000])
per_token_reward = kl.clone()
per_token_reward[-1] += score_from_rm # assume last token is the EOS token
print(f"{per\_token\_reward=}") # per\_token\_reward=tensor([-1.0000, 0.1000, 0.7000])
print(f"{score\_from\_rm=}") # score\_from\_rm=1.0
print("#### Modeling each token as an action")
for action, reward in zip(response, per_token_reward):
print(f"{action=}, {reward=}")
# action=tensor(4.), reward=tensor(-1.)
# action=tensor(5.), reward=tensor(0.1000)
# action=tensor(6.), reward=tensor(0.7000)
print("#### Modeling the entire response as an action")
entire_generation_reward = per_token_reward.sum()
print(f"action='entire completion', reward={entire\_generation\_reward}")
# action='entire completion', reward=-0.2000 (-1 + 0.1 + 0.7)
- Differences:
  - RLOO treats the entire completion as a single action, whereas standard PPO treats each completion token as a separate action. Continuing the toy example:
baseline = Tensor([0.2, 0.3, 0.4]) # dummy baseline
print("#### Modeling each token as an action")
advantage = per_token_reward - baseline
per_token_reinforce_loss = per_token_logprobs * advantage
print(f"{advantage=}") # advantage=tensor([-1.2000, -0.2000, 0.3000])
print(f"{per\_token\_reinforce\_loss=}") # per\_token\_reinforce\_loss=tensor([14.7600, 1.6600, -0.6900])
print(f"{per\_token\_reinforce\_loss.mean()=}") # per\_token\_reinforce\_loss.mean()=tensor(5.2433)
print("#### Modeling the entire response as an action")
advantage = entire_generation_reward - baseline.sum()
reinforce_loss = per_token_logprobs.sum() * advantage
print(f"{advantage=}") # advantage=tensor(-1.1000)
print(f"{reinforce\_loss=}") # reinforce\_loss=tensor(25.1900)
  - In PPO, advantages are computed with GAE from a learned value model; RLOO instead uses the plain REINFORCE loss, with the rewards of the other completions sampled for the same prompt in the batch serving as a leave-one-out baseline. The snippet below computes this baseline and the resulting advantages; a formula summarizing the estimator follows after it.
import torch
local_batch_size = 3
rloo_k = 4
rlhf_reward = torch.tensor([
1, 2, 3, # first rlhf reward for three prompts
2, 3, 4, # second rlhf reward for three prompts
5, 6, 7, # third rlhf reward for three prompts
8, 9, 10, # fourth rlhf reward for three prompts
]).float() # here we have 3 prompts which have 4 completions each
# slow impl: for each completion, subtract the mean reward of the other k-1
# completions sampled for the same prompt (leave-one-out baseline)
advantages = torch.zeros_like(rlhf_reward)
for i in range(0, len(advantages), local_batch_size):
    other_response_rlhf_rewards = []
    for j in range(0, len(advantages), local_batch_size):
        if i != j:
            other_response_rlhf_rewards.append(rlhf_reward[j : j + local_batch_size])
    advantages[i : i + local_batch_size] = rlhf_reward[i : i + local_batch_size] - torch.stack(
        other_response_rlhf_rewards
    ).mean(0)
assert abs(1 - (2 + 5 + 8) / 3 - advantages[0].item()) < 1e-6
assert abs(6 - (3 + 2 + 9) / 3 - advantages[7].item()) < 1e-6
# vectorized impl
rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
vec_advantages = rlhf_reward - baseline
torch.testing.assert_close(vec_advantages.flatten(), advantages)
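For reference, the leave-one-out advantage computed above, together with the RLOO policy-gradient estimator stated in the paper, can be written as follows (k completions $y_1, \dots, y_k$ sampled for the same prompt $x$, reward $R$, policy $\pi_\theta$):

$$
A(x, y_i) = R(x, y_i) - \frac{1}{k-1}\sum_{j \neq i} R(x, y_j), \qquad
\nabla_\theta J(\theta) \approx \frac{1}{k}\sum_{i=1}^{k} A(x, y_i)\, \nabla_\theta \log \pi_\theta(y_i \mid x)
$$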
2. RLOO support in TRL
https://huggingface.co/docs/trl/main/en/rloo_trainer
https://huggingface.co/docs/trl/main/en/ppov2_trainer
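A minimal usage sketch, roughly following the RLOOTrainer example in the TRL docs linked above; class and argument names may differ slightly across TRL versions, and the model name and dataset below are placeholders:
# Sketch of RLOO training with TRL (adapted from the RLOOTrainer docs above).
# NOTE: argument names can vary between TRL versions; model/dataset are placeholders.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

base_model = "EleutherAI/pythia-1b-deduped"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

policy = AutoModelForCausalLM.from_pretrained(base_model)          # model being trained
ref_policy = AutoModelForCausalLM.from_pretrained(base_model)      # frozen reference for the KL term
reward_model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

train_dataset = ...  # placeholder: a dataset of tokenized prompts (column "input_ids")

trainer = RLOOTrainer(
    config=RLOOConfig(
        output_dir="rloo_model",
        rloo_k=4,  # number of completions sampled per prompt for the leave-one-out baseline
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
    ),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()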
3. Conclusions
- RLOO needs roughly 50-70% less memory and runs 2-3x faster than PPO.
- It only needs to keep three models in memory (policy, reference policy, reward model), whereas PPO needs four (those three plus a value model).
- RLOO is less sensitive to hyperparameter tuning than PPO.
- RLOO (k=4) achieves a 62.2% win rate on the Anthropic-HH dataset, beating RAFT (59.3%), PPO (56.7%), and DPO (50.0%).
- Higher k improves RLOO's performance (61.9% at k=4 vs. 61.3% at k=2).
- Memory comparison: see the chart in the HF blog post linked above.