Motivation: A growing body of work shows that online RL is more effective than offline methods such as DPO (see the comparison figure in the blog post linked below). PPO, however, has high GPU-memory requirements and is fairly sensitive to hyperparameters. Researchers at Cohere proposed RLOO, a new RLHF algorithm. RLOO simplifies RLHF training by treating the entire model completion as a single action and optimizing it with REINFORCE: it needs no value model, is less sensitive to hyperparameter tuning than PPO, and outperforms PPO.
Paper: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
https://arxiv.org/abs/2402.14740
https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo
1. How it works
Comparison with PPO
- Similarities:
  - The policy model generates completions, and per-token log probabilities are obtained under both the policy model and the reference model.
  - A per-token KL penalty is computed as the difference between the log probabilities under the current policy and the reference policy.
  - The reward model assigns a single score to the entire completion.
The toy example below (from the HF blog post linked above) walks through these steps:
from torch import Tensor
response = Tensor([4., 5., 6.])
per_token_logprobs = Tensor([-12.3, -8.3, -2.3])
reference_per_token_logprobs = Tensor([-11.3, -8.4, -2.0])
kl = per_token_logprobs - reference_per_token_logprobs
score_from_rm = 1.0
print(f"{kl=}") # kl=tensor([-1.0000, 0.1000, -0.3000])
per_token_reward = kl.clone()
per_token_reward[-1] += score_from_rm # assume last token is the EOS token
print(f"{per\_token\_reward=}") # per\_token\_reward=tensor([-1.0000, 0.1000, 0.7000])
print(f"{score\_from\_rm=}") # score\_from\_rm=1.0
print("#### Modeling each token as an action")
for action, reward in zip(response, per_token_reward):
print(f"{action=}, {reward=}")
# action=tensor(4.), reward=tensor(-1.)
# action=tensor(5.), reward=tensor(0.1000)
# action=tensor(6.), reward=tensor(0.7000)
print("#### Modeling the entire response as an action")
entire_generation_reward = per_token_reward.sum()
print(f"action='entire completion', reward={entire\_generation\_reward}")
# action='entire completion', reward=-0.2000 (-1 + 0.1 + 0.7)
- Differences:
  - RLOO treats the entire completion as a single action, whereas standard PPO treats each completion token as a separate action. Continuing the toy example:
baseline = Tensor([0.2, 0.3, 0.4]) # dummy baseline
print("#### Modeling each token as an action")
advantage = per_token_reward - baseline
per_token_reinforce_loss = per_token_logprobs * advantage
print(f"{advantage=}") # advantage=tensor([-1.2000, -0.2000, 0.3000])
print(f"{per\_token\_reinforce\_loss=}") # per\_token\_reinforce\_loss=tensor([14.7600, 1.6600, -0.6900])
print(f"{per\_token\_reinforce\_loss.mean()=}") # per\_token\_reinforce\_loss.mean()=tensor(5.2433)
print("#### Modeling the entire response as an action")
advantage = entire_generation_reward - baseline.sum()
reinforce_loss = per_token_logprobs.sum() * advantage
print(f"{advantage=}") # advantage=tensor(-1.1000)
print(f"{reinforce\_loss=}") # reinforce\_loss=tensor(25.1900)
  - In PPO, advantages are computed with GAE from a learned value model; RLOO instead uses the plain REINFORCE loss, with the rewards of the other completions sampled for the same prompt in the batch serving as a leave-one-out baseline. The snippet below computes this baseline and the resulting advantages; a formula summarizing the estimator follows after it.
import torch
local_batch_size = 3
rloo_k = 4
rlhf_reward = torch.tensor([
1, 2, 3, # first rlhf reward for three prompts
2, 3, 4, # second rlhf reward for three prompts
5, 6, 7, # third rlhf reward for three prompts
8, 9, 10, # fourth rlhf reward for three prompts
]).float() # here we have 3 prompts which have 4 completions each
# slow impl: for each completion, subtract the mean reward of the other k-1
# completions sampled for the same prompt (leave-one-out baseline)
advantages = torch.zeros_like(rlhf_reward)
for i in range(0, len(advantages), local_batch_size):
    other_response_rlhf_rewards = []
    for j in range(0, len(advantages), local_batch_size):
        if i != j:
            other_response_rlhf_rewards.append(rlhf_reward[j : j + local_batch_size])
    advantages[i : i + local_batch_size] = rlhf_reward[i : i + local_batch_size] - torch.stack(
        other_response_rlhf_rewards
    ).mean(0)
assert abs(1 - (2 + 5 + 8) / 3 - advantages[0].item()) < 1e-6
assert abs(6 - (3 + 2 + 9) / 3 - advantages[7].item()) < 1e-6
# vectorized impl
rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
vec_advantages = rlhf_reward - baseline
torch.testing.assert_close(vec_advantages.flatten(), advantages)
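For reference, the leave-one-out advantage computed above, together with the RLOO policy-gradient estimator stated in the paper, can be written as follows (k completions $y_1, \dots, y_k$ sampled for the same prompt $x$, reward $R$, policy $\pi_\theta$):

$$
A(x, y_i) = R(x, y_i) - \frac{1}{k-1}\sum_{j \neq i} R(x, y_j), \qquad
\nabla_\theta J(\theta) \approx \frac{1}{k}\sum_{i=1}^{k} A(x, y_i)\, \nabla_\theta \log \pi_\theta(y_i \mid x)
$$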
2. RLOO support in TRL
https://huggingface.co/docs/trl/main/en/rloo_trainer
https://huggingface.co/docs/trl/main/en/ppov2_trainer
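A minimal usage sketch, roughly following the RLOOTrainer example in the TRL docs linked above; class and argument names may differ slightly across TRL versions, and the model name and dataset below are placeholders:
# Sketch of RLOO training with TRL (adapted from the RLOOTrainer docs above).
# NOTE: argument names can vary between TRL versions; model/dataset are placeholders.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

base_model = "EleutherAI/pythia-1b-deduped"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

policy = AutoModelForCausalLM.from_pretrained(base_model)          # model being trained
ref_policy = AutoModelForCausalLM.from_pretrained(base_model)      # frozen reference for the KL term
reward_model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

train_dataset = ...  # placeholder: a dataset of tokenized prompts (column "input_ids")

trainer = RLOOTrainer(
    config=RLOOConfig(
        output_dir="rloo_model",
        rloo_k=4,  # number of completions sampled per prompt for the leave-one-out baseline
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
    ),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()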
3. Conclusions
- RLOO needs roughly 50-70% less memory and runs 2-3x faster than PPO.
- It only needs to keep three models in memory (policy, reference policy, reward model), whereas PPO needs four (those three plus a value model).
- RLOO is less sensitive to hyperparameter tuning than PPO.
- RLOO (k=4) achieves a 62.2% win rate on the Anthropic-HH dataset, beating RAFT (59.3%), PPO (56.7%), and DPO (50.0%).
- Higher k improves RLOO's performance (61.9% at k=4 vs. 61.3% at k=2).
- Memory comparison: see the chart in the HF blog post linked above.