Cohere open-sources RLOO: a new online RLHF algorithm with stronger performance, much lower GPU memory use, and fewer hyperparameter headaches, now available on Hugging Face!


Motivation: A growing body of research shows that online RL is more effective than offline methods such as DPO, as the figure below illustrates. PPO, however, both demands a lot of GPU memory and is fairly sensitive to hyperparameters. Scientists at Cohere propose RLOO, a new RLHF algorithm. RLOO treats the entire model completion as a single action and uses REINFORCE to simplify RLHF training: it needs no value model, is less sensitive to hyperparameter tuning than PPO, and outperforms PPO.

[Figure: online RL methods outperform offline methods such as DPO]
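
To make this concrete, the RLOO policy-gradient estimator can be written roughly as follows (a restatement of the paper's formulation; R(y, x) denotes the sequence-level RLHF reward, i.e., the reward-model score combined with the KL penalty, and y_1, ..., y_k are k completions sampled for the prompt x, each using the average reward of the other k-1 as its baseline):

\[
\frac{1}{k} \sum_{i=1}^{k} \Big[ R(y_i, x) - \frac{1}{k-1} \sum_{j \neq i} R(y_j, x) \Big] \, \nabla_{\theta} \log \pi_{\theta}(y_i \mid x)
\]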

Paper: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

https://arxiv.org/abs/2402.14740
https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo

1. How RLOO works

Comparison with PPO

  • Similarities
    • The policy model generates completions, and we obtain the per-token log-probabilities under both the policy model and the reference model
    • A per-token KL penalty is computed as the difference between the log-probabilities under the current policy and the reference policy
    • The reward model gives a single score for the entire completion

        
          
from torch import Tensor

response = Tensor([4., 5., 6.])  # toy token ids for a 3-token completion
per_token_logprobs = Tensor([-12.3, -8.3, -2.3])  # under the current policy
reference_per_token_logprobs = Tensor([-11.3, -8.4, -2.0])  # under the reference policy
kl = per_token_logprobs - reference_per_token_logprobs
score_from_rm = 1.0  # reward-model score for the whole completion
print(f"{kl=}")  # kl=tensor([-1.0000,  0.1000, -0.3000])
per_token_reward = kl.clone()
per_token_reward[-1] += score_from_rm  # assume last token is the EOS token
print(f"{per_token_reward=}")  # per_token_reward=tensor([-1.0000,  0.1000,  0.7000])
print(f"{score_from_rm=}")  # score_from_rm=1.0
print("#### Modeling each token as an action")
for action, reward in zip(response, per_token_reward):
    print(f"{action=}, {reward=}")
# action=tensor(4.), reward=tensor(-1.)
# action=tensor(5.), reward=tensor(0.1000)
# action=tensor(6.), reward=tensor(0.7000)
print("#### Modeling the entire response as an action")
entire_generation_reward = per_token_reward.sum()
print(f"action='entire completion', reward={entire_generation_reward}")
# action='entire completion', reward=-0.2000 (-1 + 0.1 + 0.7)

  • Differences
    • RLOO treats the entire model completion as a single action, whereas regular PPO treats each completion token as a separate action, as the two losses below illustrate

        
          
baseline = Tensor([0.2, 0.3, 0.4])  # dummy baseline
print("#### Modeling each token as an action")
advantage = per_token_reward - baseline
per_token_reinforce_loss = per_token_logprobs * advantage
print(f"{advantage=}")  # advantage=tensor([-1.2000, -0.2000,  0.3000])
print(f"{per_token_reinforce_loss=}")  # per_token_reinforce_loss=tensor([14.7600,  1.6600, -0.6900])
print(f"{per_token_reinforce_loss.mean()=}")  # per_token_reinforce_loss.mean()=tensor(5.2433)

print("#### Modeling the entire response as an action")
advantage = entire_generation_reward - baseline.sum()
reinforce_loss = per_token_logprobs.sum() * advantage
print(f"{advantage=}")  # advantage=tensor(-1.1000)
print(f"{reinforce_loss=}")  # reinforce_loss=tensor(25.1900)

  • In PPO, advantages are computed from a value model via GAE; RLOO instead uses the plain REINFORCE loss, and for each sampled completion the baseline is the mean reward of the other completions for the same prompt in the batch (leave-one-out)

        
          
import torch

local_batch_size = 3
rloo_k = 4

rlhf_reward = torch.tensor([
    1, 2, 3,   # first completion's rlhf reward for the three prompts
    2, 3, 4,   # second completion's rlhf reward for the three prompts
    5, 6, 7,   # third completion's rlhf reward for the three prompts
    8, 9, 10,  # fourth completion's rlhf reward for the three prompts
]).float()  # here we have 3 prompts which have 4 completions each

# slow impl: for each completion, the baseline is the mean reward of the
# other k - 1 completions of the same prompt
advantages = torch.zeros_like(rlhf_reward)
for i in range(0, len(advantages), local_batch_size):
    other_response_rlhf_rewards = []
    for j in range(0, len(advantages), local_batch_size):
        if i != j:
            other_response_rlhf_rewards.append(rlhf_reward[j : j + local_batch_size])
    advantages[i : i + local_batch_size] = rlhf_reward[i : i + local_batch_size] - torch.stack(
        other_response_rlhf_rewards
    ).mean(0)
assert abs(1 - (2 + 5 + 8) / 3 - advantages[0].item()) < 1e-6
assert abs(6 - (3 + 2 + 9) / 3 - advantages[7].item()) < 1e-6

# vectorized impl
rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)  # leave-one-out mean
vec_advantages = rlhf_reward - baseline
torch.testing.assert_close(vec_advantages.flatten(), advantages)
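
Tying the pieces together, here is a minimal sketch (not the exact TRL implementation) of how the leave-one-out advantage and the sequence-level log-probability combine into the RLOO policy loss. The tensor values are random placeholders; the leading minus sign turns REINFORCE's "maximize advantage-weighted log-probability" into a loss to minimize.

import torch

rloo_k, local_batch_size = 4, 3
# placeholder values: per-completion sum of token log-probs under the policy,
# and the per-completion rlhf reward (reward-model score plus KL penalty)
seq_logprobs = torch.randn(rloo_k * local_batch_size, requires_grad=True)
rlhf_reward = torch.randn(rloo_k * local_batch_size)

grouped = rlhf_reward.reshape(rloo_k, local_batch_size)
baseline = (grouped.sum(0) - grouped) / (rloo_k - 1)  # leave-one-out mean of the other k-1 completions
advantages = (grouped - baseline).flatten()

# REINFORCE with the entire completion as one action:
# maximize E[advantage * log pi(y|x)], i.e. minimize its negative
rloo_loss = -(advantages.detach() * seq_logprobs).mean()
rloo_loss.backward()  # gradients flow only through seq_logprobs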

      

2. RLOO support in TRL

https://huggingface.co/docs/trl/main/en/rloo_trainer
https://huggingface.co/docs/trl/main/en/ppov2_trainer
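
For completeness, here is a hypothetical end-to-end sketch of RLOOTrainer usage, loosely following the docs linked above. The model names, the toy prompt dataset, and the hyperparameters are placeholders, and the constructor argument names follow the trl 0.9-era examples (newer releases may rename some of them, e.g. tokenizer vs. processing_class), so treat this as a sketch rather than drop-in code.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RLOOConfig, RLOOTrainer

base_model = "EleutherAI/pythia-1b-deduped"  # placeholder checkpoints
tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(base_model)
ref_policy = AutoModelForCausalLM.from_pretrained(base_model)
reward_model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

# RLOOTrainer consumes tokenized prompts ("input_ids"); a toy prompt set here
prompts = ["The movie was", "My favorite recipe is"]
train_dataset = Dataset.from_dict({"input_ids": [tokenizer(p)["input_ids"] for p in prompts]})

config = RLOOConfig(
    output_dir="rloo_example",
    rloo_k=4,  # number of online completions per prompt for the leave-one-out baseline
    per_device_train_batch_size=1,
)
trainer = RLOOTrainer(
    config=config,
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()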

      

3. Conclusions

  • RLOO needs roughly 50-70% less memory and runs 2-3x faster than PPO.
  • It only keeps three models in memory (policy, reference policy, reward model), whereas PPO needs four (the value model as well).
  • RLOO is less sensitive to hyperparameter tuning than PPO.
  • RLOO (k=4) achieves a 62.2% win rate on the Anthropic-HH dataset, beating RAFT (59.3%), PPO (56.7%), and DPO (50.0%).
  • Higher k in RLOO gives better performance (61.9% at k=4 vs. 61.3% at k=2). Memory comparison:

[Figure: GPU memory usage comparison between RLOO and PPO]
