torch.nn.utils.clip_grad_norm_ and torch.nn.utils.clip_grad_value_ only address the exploding-gradient problem; they do nothing for vanishing gradients.
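As a minimal sketch of where clipping fits in practice (model, data, and max_norm=1.0 are illustrative choices, not from the original): it is applied after backward(), when gradients exist, and before optimizer.step().

```python
import torch
import torch.nn as nn

# Toy model and data, chosen only for illustration.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Clipping goes here: both utilities modify .grad in place.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

clip_grad_norm_ returns the total norm measured *before* clipping, which is often logged to monitor training stability.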
torch.nn.utils.clip_grad_norm_
# Imports and the type alias below are added here so the snippet is self-contained;
# they match the aliases used in the PyTorch source.
from typing import Iterable, Union

import torch
from torch import inf

_tensor_or_tensors = Union[torch.Tensor, Iterable[torch.Tensor]]


def clip_grad_norm_(
        parameters: _tensor_or_tensors, max_norm: float, norm_type: float = 2.0,
        error_if_nonfinite: bool = False) -> torch.Tensor:
    r"""Clips gradient norm of an iterable of parameters.

    The norm is computed over all gradients together, as if they were
    concatenated into a single vector. Gradients are modified in-place.

    Args:
        parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a
            single Tensor that will have gradients normalized
        max_norm (float or int): max norm of the gradients
        norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for
            infinity norm.
        error_if_nonfinite (bool): if True, an error is thrown if the total
            norm of the gradients from :attr:`parameters` is ``nan``,
            ``inf``, or ``-inf``. Default: False (will switch to True in the future)

    Returns:
        Total norm of the parameter gradients (viewed as a single vector).
    """
    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]
    grads = [p.grad for p in parameters if p.grad is not None]
    max_norm = float(max_norm)
    norm_type = float(norm_type)
    if len(grads) == 0:
        return torch.tensor(0.)
    device = grads[0].device
    if norm_type == inf:
        norms = [g.detach().abs().max().to(device) for g in grads]
        total_norm = norms[0] if len(norms) == 1 else torch.max(torch.stack(norms))
    else:
        total_norm = torch.norm(torch.stack([torch.norm(g.detach(), norm_type).to(device) for g in grads]), norm_type)
    if error_if_nonfinite and torch.logical_or(total_norm.isnan(), total_norm.isinf()):
        raise RuntimeError(
            f'The total norm of order {norm_type} for gradients from '
            '`parameters` is non-finite, so it cannot be clipped. To disable '
            'this error and scale the gradients by the non-finite norm anyway, '
            'set `error_if_nonfinite=False`')
    clip_coef = max_norm / (total_norm + 1e-6)
    # Note: multiplying by the clamped coef is redundant when the coef is clamped to 1, but doing so
    # avoids a `if clip_coef < 1:` conditional which can require a CPU <=> device synchronization
    # when the gradients do not reside in CPU memory.
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    for g in grads:
        g.detach().mul_(clip_coef_clamped.to(g.device))
    return total_norm
clip_grad_norm_ multiplies each parameter's grad in place by clip_coef_clamped, which is clip_coef clamped to at most 1 (i.e., to the interval (0, 1]). Since the scaling factor never exceeds 1, gradients can only be shrunk, never amplified, so this utility only addresses exploding gradients.
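To see the clamping behavior concretely (values chosen for illustration): when total_norm is already below max_norm, clip_coef exceeds 1 and is clamped to exactly 1, so the gradient passes through unchanged.

```python
import torch

# A gradient with L2 norm 0.5, below max_norm=1.0.
p = torch.zeros(2, requires_grad=True)
p.grad = torch.tensor([0.3, 0.4])  # norm = sqrt(0.09 + 0.16) = 0.5

total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
# clip_coef = 1.0 / (0.5 + 1e-6) ≈ 2, clamped to 1.0 → grad is unchanged.
print(total_norm)  # ≈ 0.5
print(p.grad)      # unchanged
```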
clip_coef is computed as: clip_coef = max_norm / (total_norm + 1e-6), where the 1e-6 guards against division by zero.
max_norm is an input argument: the norm the gradients are expected to be clipped down to. total_norm is the global norm: the norm of each individual grad is computed, these norms are treated as one concatenated vector, and the norm of that vector is taken. The code is:
if norm_type == inf:
    norms = [g.detach().abs().max().to(device) for g in grads]
    total_norm = norms[0] if len(norms) == 1 else torch.max(torch.stack(norms))
else:
    total_norm = torch.norm(torch.stack([torch.norm(g.detach(), norm_type).to(device) for g in grads]), norm_type)
Therefore: the smaller clip_coef is, the more aggressively the gradients are clipped; the smaller max_norm is, the more aggressive the clipping; and the larger total_norm is, the more aggressive the clipping.
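A worked example of the global norm and clip_coef, with gradient values chosen so the numbers come out cleanly: two gradients with L2 norms 3 and 4 give total_norm = sqrt(3² + 4²) = 5.

```python
import torch

p1 = torch.zeros(3, requires_grad=True)
p2 = torch.zeros(4, requires_grad=True)
p1.grad = torch.tensor([3.0, 0.0, 0.0])       # L2 norm 3
p2.grad = torch.tensor([0.0, 4.0, 0.0, 0.0])  # L2 norm 4

total_norm = torch.nn.utils.clip_grad_norm_([p1, p2], max_norm=1.0)
print(total_norm)  # ≈ 5.0
# clip_coef = 1.0 / (5 + 1e-6) ≈ 0.2, so every grad element is scaled by ≈ 0.2:
print(p1.grad)     # ≈ [0.6, 0.0, 0.0]
print(p2.grad)     # ≈ [0.0, 0.8, 0.0, 0.0]
```

After clipping, the global norm of the remaining gradients is ≈ max_norm = 1.0, while each gradient's direction is preserved.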
If different calls to torch.nn.utils.clip_grad_norm_ are given different parameters, their total_norm values will differ, so even with all other arguments identical, the clipped gradients will not match across calls.
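The following sketch (gradient values are illustrative) demonstrates this: the same raw gradient ends up scaled differently depending on which other parameters share the call, because they share one global total_norm.

```python
import torch

def make_grads():
    # Recreate identical raw gradients for both scenarios.
    a = torch.zeros(2, requires_grad=True)
    b = torch.zeros(2, requires_grad=True)
    a.grad = torch.tensor([3.0, 4.0])   # L2 norm 5
    b.grad = torch.tensor([0.0, 12.0])  # L2 norm 12
    return a, b

a, b = make_grads()
# Clipping over both parameters: total_norm = sqrt(5^2 + 12^2) = 13.
n_all = torch.nn.utils.clip_grad_norm_([a, b], max_norm=1.0)

a2, _ = make_grads()
# Clipping over `a` alone: total_norm = 5, so a different scale is applied.
n_a = torch.nn.utils.clip_grad_norm_([a2], max_norm=1.0)

print(n_all.item(), n_a.item())  # ≈ 13.0 vs ≈ 5.0
print(a.grad, a2.grad)           # same raw grad, different clipped results
```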
torch.nn.utils.clip_grad_value_
torch.nn.utils.clip_grad_value_ clamps each element of a parameter's grad to the range [-clip_value, clip_value].
def clip_grad_value_(parameters: _tensor_or_tensors, clip_value: float) -> None:
    r"""Clips gradient of an iterable of parameters at specified value.

    Gradients are modified in-place.

    Args:
        parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a
            single Tensor that will have gradients normalized
        clip_value (float or int): maximum allowed value of the gradients.
            The gradients are clipped in the range
            :math:`\left[\text{-clip\_value}, \text{clip\_value}\right]`
    """
    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]
    clip_value = float(clip_value)
    for p in filter(lambda p: p.grad is not None, parameters):
        p.grad.data.clamp_(min=-clip_value, max=clip_value)
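A short sketch of clip_grad_value_ in action (gradient values chosen for illustration): each element is clamped independently, with no global norm involved.

```python
import torch

p = torch.zeros(3, requires_grad=True)
p.grad = torch.tensor([-2.0, 0.5, 3.0])

torch.nn.utils.clip_grad_value_([p], clip_value=1.0)
print(p.grad)  # tensor([-1.0000,  0.5000,  1.0000])
```

Because elements are clamped independently, value clipping can change the gradient's direction, unlike norm clipping, which rescales the whole gradient uniformly.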
References
- https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html
- https://blog.csdn.net/zhaohongfei_358/article/details/122820992
