Fully Sharded Data Parallelism (FSDP)

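In this blog post, we explore Fully Sharded Data Parallelism (FSDP), a technique for training large neural network models efficiently across multiple devices. We look at FSDP from a bird's-eye view and shed light on its underlying mechanisms.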

Why FSDP?

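When choosing a distributed training approach, it is important to understand the strengths and weaknesses of each strategy so that you can pick the one that matches the use case at hand.

Large language models (LLMs) in particular have so many parameters that their GPU memory requirements are substantial, and FSDP stands out as a high-performance solution here because it addresses those requirements directly. By spreading the model across multiple GPUs, FSDP offers a practical trade-off: it accepts additional GPU communication in exchange for lower memory usage per GPU.

On the other hand, for computer vision models that typically fit on a single GPU, Distributed Data Parallel (DDP) usually proves more efficient: each GPU runs the whole model, which avoids the additional GPU communication overhead associated with FSDP.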

Why Can’t We Run the Model Sequentially?

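Suppose we want to run a large model that does not fit on a single GPU. The "naive" way to run it is to split the model "vertically" by assigning different layers to different GPUs. Each GPU handles a specific group of layers, and the whole model runs sequentially. Problem solved, right? Unfortunately not: this naive approach has clear limitations, and we will show how FSDP overcomes them.

Consider a scenario with n GPUs and a model of size S, where the model does not fit on one GPU but does fit in the combined memory of all GPUs. After splitting the model "vertically", each GPU holds a model slice of size S/n. We denote the time a single GPU needs to run the forward pass on its slice as T_forward and the time for the backward pass as T_backward.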

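With this approach the forward pass completes in T_forward*n, which matches the best case for FSDP. Moreover, only activations are passed between GPUs, not model weights, gradients, or optimizer state.

At first glance this looks like an effective way to train a model on multiple GPUs, and given how straightforward it is, one might question the need for anything more complex: it seems we do not need to split the dataset across GPUs or perform the heavy GPU communication operations that FSDP uses.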

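The problem appears when the first GPU finishes its forward pass and wants to run the forward pass for the next batch. We cannot start the forward pass for the next batch until the first GPU's weights have been updated, so the first GPU must wait for all downstream GPUs to finish their forward passes and propagate their gradients backward before it can update its weights, which leads to a lot of idle time. Specifically, before it can run backpropagation for the first batch, the first GPU waits for T_forward*(n-1) + T_backward*(n-1).

We have partitioned the model across GPUs, yet most of the time each GPU simply sits idle, waiting for data to propagate through the rest of the model!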
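To make the idle time concrete, here is a minimal sketch that plugs numbers into the expressions above. The function name and the example values (4 GPUs, T_forward = 1, T_backward = 2) are ours, chosen only for illustration, and the sketch assumes every slice takes the same time and that passing activations between GPUs is free.

```python
def naive_pipeline_timing(n_gpus: int, t_forward: float, t_backward: float) -> dict:
    """Timing of one batch through a model split layer-wise over n_gpus."""
    # The first GPU waits while the other n-1 GPUs run forward, then backward.
    first_gpu_idle = t_forward * (n_gpus - 1) + t_backward * (n_gpus - 1)
    # One full step: every slice runs forward once and backward once, in sequence.
    step_time = n_gpus * (t_forward + t_backward)
    # Each GPU only computes during its own forward and backward slice.
    utilization = (t_forward + t_backward) / step_time
    return {
        "first_gpu_idle": first_gpu_idle,
        "step_time": step_time,
        "utilization": utilization,
    }


# Hypothetical example values, in arbitrary time units.
timing = naive_pipeline_timing(n_gpus=4, t_forward=1.0, t_backward=2.0)
print(timing)
```

With these numbers each GPU computes for only a quarter of the step, and under the same assumptions the fraction shrinks as 1/n when more GPUs are added.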

FSDP solves this problem: it makes it possible to use all GPUs fully on a large model without large amounts of idle GPU time.

Laying the Groundwork

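Setting up the FSDP process involves two separate actions:

Here, "vertical" and "horizontal" splitting both refer to how the model is divided.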

Model Partition (Vertical Assignment)

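As in the naive solution above, vertical splitting organizes the model's layers into "units". For example, in a model with 9 convolutional layers, each unit is responsible for a specific range of layers: the first unit might manage layers 1 to 3, the second unit layers 4 to 6, and the last unit layers 7 to 9.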
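The grouping of layers into units can be sketched as follows. The helper name `assign_units` is hypothetical, and spreading leftover layers over the first units is just one reasonable choice for this illustration.

```python
def assign_units(num_layers: int, num_units: int) -> list[list[int]]:
    """Group consecutive layers (numbered from 1) into units of near-equal size."""
    base, extra = divmod(num_layers, num_units)
    units, start = [], 1
    for u in range(num_units):
        # The first `extra` units take one additional layer each.
        size = base + (1 if u < extra else 0)
        units.append(list(range(start, start + size)))
        start += size
    return units


units = assign_units(num_layers=9, num_units=3)
print(units)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```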

Sharding (Horizontal Splitting)

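"Horizontal splitting" means splitting the model parameters within each layer and storing the pieces on individual GPUs; this process is commonly called sharding. For example, with 3 fully connected layers, instead of storing each layer whole on one GPU, each GPU holds one third of every fully connected layer's entities (that is, its parameters, gradients, and optimizer state).

Throughout training, the GPUs cooperate by sharing the shards that are needed at each step. This means we temporarily store redundant parameters and incur communication overhead between GPUs, but in return all GPUs stay busy at all times.

All GPUs run every unit in parallel, one unit at a time, during both the forward and backward steps, gathering the necessary shards of model parameters and other entities from the other GPUs.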
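As a rough illustration of horizontal splitting, the sketch below flattens each layer into a list of numbers and gives every GPU one slice of every layer. The layer names, the values, and the zero-padding rule are assumptions for this toy example; PyTorch's actual flattening and padding logic is more involved.

```python
def shard_layer(flat_params: list[float], world_size: int) -> list[list[float]]:
    """Split one layer's flattened parameters into world_size equal chunks."""
    chunk = -(-len(flat_params) // world_size)  # ceiling division
    # Pad with zeros so the length divides evenly (an assumption of this sketch).
    padded = flat_params + [0.0] * (chunk * world_size - len(flat_params))
    return [padded[r * chunk:(r + 1) * chunk] for r in range(world_size)]


# Three toy fully connected layers, each flattened to a list of weights.
layers = {
    "fc1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "fc2": [7.0, 8.0, 9.0],
    "fc3": [10.0, 11.0, 12.0, 13.0],
}
world_size = 3

# gpu_shards[r] holds GPU r's slice of every layer.
gpu_shards = [
    {name: shard_layer(params, world_size)[r] for name, params in layers.items()}
    for r in range(world_size)
]
for r, shard in enumerate(gpu_shards):
    print(f"GPU {r}: {shard}")
```

Every GPU ends up with a piece of fc1, fc2, and fc3, rather than one whole layer each as in the vertical split.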

Sharded Entities

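PyTorch's FSDP offers several configuration settings, called "sharding strategies", that control how model shards are distributed and managed. This post takes a closer look at the FULL_SHARD sharding strategy, which is the most memory-efficient strategy but also the most communication-intensive.

Under the FULL_SHARD strategy, the following key entities are sharded: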

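Model Parameters (MP) : These include the core components of the model, such as weights, biases, and buffers, as well as any additional parameters specific to the model architecture.

Gradients (GRD) : These are the gradients computed during backpropagation, which are used to update the model's weights.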

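Optimizer State (OS) : This is all the data the optimizer needs to perform its update steps during training. For example, with the Adam optimizer this entity holds the running momentum and variance estimates of the gradients. The data is typically kept in 32-bit floating-point format (FP32) to preserve precision during optimization.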
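To show how the three entities relate, here is a minimal sketch of one Adam update for a single scalar parameter. It follows the standard Adam update rule with commonly used default hyperparameters; the function name and example values are ours, and this is not how PyTorch organizes its optimizer internally.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # variance: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for step t (1-based)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# MP is `param`, GRD is `grad`, and OS is the pair (m, v).
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param, m, v)
```

In this sketch the optimizer state holds two extra numbers per parameter, which is why it is worth sharding along with the parameters and gradients.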

Step-by-Step FSDP Breakdown

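In the example above, we assume three GPUs are available, and again that the model does not fit on one GPU but does fit in the combined memory of all GPUs. The neural network in our toy example has 9 layers, and each unit is assigned 3 layers. Once the mechanics of this toy example are clear, it extends directly to the more common case where the model does not fit on all the GPUs at once.

We define the following terms: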

• shard : a single chunk of the split entities that is attached to a specific GPU (contains a small portion of every entity across the entire model)
• Activation ( ACT ): the activation calculated in each GPU separately during forward pass
• unit : a part of the model with its assigned layers created by vertical model partition
• MEM_total : the memory size of all the parameters to be stored, i.e., MP + GRD + OS
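As a back-of-the-envelope sketch of MEM_total, the snippet below estimates the memory of the sharded entities and the share each GPU keeps when they are split evenly. The byte counts are assumptions, not measurements: FP32 parameters and gradients (4 bytes each) and two FP32 Adam moments (8 bytes) per parameter. Activations and the temporarily gathered parameters are ignored.

```python
def per_gpu_memory(num_params: int, world_size: int,
                   bytes_mp: int = 4, bytes_grd: int = 4, bytes_os: int = 8) -> dict:
    """Estimate sharded-entity memory in bytes, ignoring activations."""
    mp = num_params * bytes_mp    # model parameters (assumed FP32)
    grd = num_params * bytes_grd  # gradients (assumed FP32)
    os_ = num_params * bytes_os   # optimizer state (assumed: two FP32 Adam moments)
    mem_total = mp + grd + os_
    return {"MEM_total": mem_total, "per_gpu": mem_total / world_size}


# Hypothetical 3-billion-parameter model on 3 GPUs.
est = per_gpu_memory(num_params=3_000_000_000, world_size=3)
print(f"MEM_total = {est['MEM_total'] / 1e9:.0f} GB, per GPU = {est['per_gpu'] / 1e9:.0f} GB")
```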

Setup

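The following initial steps lay the groundwork for FSDP:

• Split dataset : Split the dataset into three subsets and assign each subset to a specific GPU for independent processing.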

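• Assign units : Assign specific layers to each unit; the unit is responsible for managing them during training.
• Shard the model : Divide each entity (MP, OS, GRD) into 3 shards and assign one to each GPU, so that each GPU only needs to keep MEM_total/3 in its memory.

💡 As noted above, the model entities are sharded "horizontally", which means each shard contains model parameters from every layer, as the figure below shows: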

[Figure: each GPU's shard holds a slice of the model entities from every layer]
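The dataset split from the Setup list can be sketched with a simple strided assignment of sample indices to GPUs. The function name is hypothetical; in PyTorch, `torch.utils.data.DistributedSampler` plays a similar role and additionally handles shuffling and uneven dataset sizes.

```python
def split_dataset(num_samples: int, world_size: int) -> list[list[int]]:
    """Give rank r every world_size-th sample index, starting at r."""
    return [list(range(rank, num_samples, world_size)) for rank in range(world_size)]


subsets = split_dataset(num_samples=10, world_size=3)
for rank, indices in enumerate(subsets):
    print(f"GPU {rank} processes samples {indices}")
```

Each sample lands on exactly one GPU, so the GPUs work on disjoint parts of the data while sharing the same model.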

Forward Pass

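Note that with the "FULL_SHARD" strategy in PyTorch's FSDP, the gradients and optimizer state are shown as they are before the first backward and optimizer steps. At this stage these entities have not been computed yet; they are placeholders, and they will hold real values only after being updated in the backward pass and the optimizer step.

The forward pass consists of the following steps: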

• Broadcast model parameters : All GPUs gather the model parameters of the first unit (MP 1) from one another (an all-gather operation) so they can run the first forward step.

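💡 In the figures, opaque colors indicate shards that are "owned" by a GPU and persist throughout training. Transparent fills, by contrast, indicate shard entities that are not attached to the GPU; they are discarded after use, during resharding.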

[Figure: all GPUs gather the shards of MP 1]

• Forward pass unit 1: Each GPU runs the forward pass of unit 1 on its own batch using the complete MP 1 it gathered from all the other GPUs. Since each GPU has a different input batch, the ACT each one calculates will differ even though they all currently hold the same model parameters, MP 1. In some FSDP configurations, the next unit's MP (in this case MP 2) can be loaded in parallel with the current forward pass, which further accelerates training. However, this also increases GPU memory usage, since the GPU must hold the MP of two different units at the same time.

• Save activations : After we calculate the ACT they will be retained in each GPU for later use in gradient computation during the backward pass.
• Reshard MP 1 : Delete only the broadcasted (low opacity) MP 1 from each GPU in order to free up GPU memory—note that each GPU still holds on to the shard that was assigned to it.

• Repeat for all the other units : Repeat the process for units 2 and 3: broadcast, run the forward pass, and reshard the MP while holding on to the ACT, until the forward pass of unit 3 is done. Doing so gives us the ACT for the entire model.

• Compute loss : For each GPU, compute the loss of its respective batch using the loss function.
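The forward steps above can be imitated in a few lines of plain Python, with lists standing in for GPUs. This is only a toy sketch under our own assumptions: each layer is a single scalar weight, each unit has exactly as many layers as there are GPUs, only the activations at unit boundaries are saved, and the loss is a squared error. None of the names below belong to the PyTorch API.

```python
WORLD_SIZE = 3

# Toy model: 3 units x 3 layers, each layer is one scalar weight (y = w * x).
full_model = [[1.0, 2.0, 0.5], [1.5, 1.0, 2.0], [0.5, 2.0, 1.0]]

# Horizontal sharding: GPU r permanently owns element r of every unit.
owned = [[unit[r] for unit in full_model] for r in range(WORLD_SIZE)]


def all_gather(unit_idx: int) -> list[float]:
    """Collect the shards of one unit from all GPUs into a full copy."""
    return [owned[r][unit_idx] for r in range(WORLD_SIZE)]


def forward(inputs: list[float]) -> list[list[float]]:
    """Run the forward pass unit by unit; every GPU uses its own input."""
    acts = [[x] for x in inputs]         # saved activations (ACT) per GPU
    for unit_idx in range(len(full_model)):
        gathered = all_gather(unit_idx)  # temporary full MP of this unit
        for gpu in range(WORLD_SIZE):
            x = acts[gpu][-1]
            for w in gathered:
                x = w * x
            acts[gpu].append(x)          # keep ACT for the backward pass
        del gathered                     # reshard: drop the temporary copy
    return acts


acts = forward(inputs=[1.0, 2.0, 3.0])
targets = [1.0, 1.0, 1.0]
losses = [(a[-1] - t) ** 2 for a, t in zip(acts, targets)]
print(acts)
print(losses)
```

All three simulated GPUs apply the same gathered weights, yet they produce different activations and losses because each starts from a different input.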

Backward Pass

The backward pass consists of the following steps:

• Broadcast model parameters : Gather MP for the current unit. We already have the MP at hand for the backward pass on unit 3, since we just broadcast them to all GPUs for the forward pass. This step can therefore be skipped for unit 3, but it is required for the backward pass of units 2 and 1.
• Propagate backward : Initiate backward propagation and update GRD using the ACT and MP on all GPUs for unit 3. As noted at the start of the Forward Pass section, up to this point the gradients have not been calculated and are only placeholders that do not contain any actual information. In this step, each GPU calculates them individually from its own batch.

• Accumulate gradients : Take the GRD calculated on each GPU for unit 3, sum them to get the accumulated GRD, then distribute the accumulated GRD across the GPUs. Afterwards, we reshard GRD 3 by removing the broadcast GRD and replacing the existing GRD shard on each GPU with its part of the accumulated one (a reduce-scatter operation on GRD).

• Reshard MP and ACT : Remove the broadcasted MP and ACT from all GPUs to free up GPU memory.

• Repeat for all the other units : Repeat the previous steps for units 2 and 1: broadcast, execute the backward pass to collect GRD, and discard the ACT, until backpropagation is complete.
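The gradient accumulation step is a reduce-scatter, which can be sketched as below with lists standing in for GPUs. The gradient values are made up, and the sketch sums the gradients as described in the steps above; real implementations may additionally divide by the number of GPUs to obtain an average.

```python
def reduce_scatter(per_gpu_grads: list[list[float]]) -> list[list[float]]:
    """Sum full-size gradients across GPUs, then give GPU r only chunk r of the sum."""
    world_size = len(per_gpu_grads)
    length = len(per_gpu_grads[0])
    assert length % world_size == 0, "sketch assumes evenly divisible gradients"
    # Reduce: element-wise sum over all GPUs.
    summed = [sum(g[i] for g in per_gpu_grads) for i in range(length)]
    # Scatter: each GPU keeps only the chunk that matches its own shard.
    chunk = length // world_size
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


# Each GPU computed a full gradient for one unit from its own batch (made-up values).
unit_grads = [
    [0.5, 1.0, -1.0, 2.0, 0.0, 1.5],   # GPU 0
    [1.5, 1.0, 0.5, -2.0, 1.0, 0.5],   # GPU 1
    [1.0, -0.5, 0.5, 1.0, 2.0, 1.0],   # GPU 2
]
grad_shards = reduce_scatter(unit_grads)
print(grad_shards)  # [[3.0, 1.5], [0.0, 1.0], [3.0, 3.0]]
```

After this operation each GPU holds only its own slice of the accumulated gradient, so the full-size per-GPU gradients can be freed.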

Optimizer Step

• Apply optimizer step : Run the optimizer step, update all MP and optimizer states. This constitutes a complete training step for the entire model on a single batch, achieving our goal of updating the model parameters while operating GPUs in parallel.

• Next batch : This brings us back to the initial state but with updated MP, GRD, and OS. Now, we can repeat all the steps for forward and backward propagation, as well as the optimization step, using the next batch as input until the training is complete.
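Because every GPU already holds matching shards of MP, GRD, and OS, the optimizer step itself needs no communication: each GPU updates only the shard it owns. The sketch below illustrates this with plain SGD swapped in for simplicity (so there is no optimizer state to update); the function name, values, and learning rate are hypothetical.

```python
def sharded_sgd_step(owned_mp: list[list[float]],
                     owned_grd: list[list[float]],
                     lr: float) -> list[list[float]]:
    """Each GPU updates only its own parameter shard with its own gradient shard."""
    return [
        [w - lr * g for w, g in zip(mp_shard, grd_shard)]
        for mp_shard, grd_shard in zip(owned_mp, owned_grd)
    ]


owned_mp = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # one MP shard per GPU
owned_grd = [[2.0, 4.0], [0.0, -2.0], [1.0, 1.0]]  # matching reduced GRD shards
owned_mp = sharded_sgd_step(owned_mp, owned_grd, lr=0.5)
print(owned_mp)  # [[0.0, 0.0], [3.0, 5.0], [4.5, 5.5]]
```

Concatenating the updated shards gives the same result as applying the update to the full, unsharded parameter vector.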

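That concludes our description of the FSDP training process. Overall, we have seen that it involves somewhat complex interactions among multiple GPUs, but results in minimal GPU idle time. In this way, FSDP makes full use of the available compute resources and allows large models to be trained efficiently in server-side training environments.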

References

• https://blog.clika.io/fsdp-1/