DeepSeek Open Source Week (Day 2): A Review of the DeepSeek MoE Architecture and Hands-On DeepEP Benchmarks


Introduction to DeepEP

On day two of Open Source Week, DeepSeek released DeepEP, an expert-parallel (EP) communication library for accelerating MoE model training and inference. It provides highly optimized all-to-all communication, supports both intra-node (NVLink) and inter-node (RDMA) transports, ships high-throughput kernels (suited to training and the inference prefill phase) as well as low-latency kernels (suited to the inference decode phase), supports native FP8 dispatch, and offers flexible GPU resource control to make computation-communication overlap easy.

GitHub: https://github.com/deepseek-ai/DeepEP

A Review of the DeepSeek MoE Architecture

DeepSeek first released and open-sourced the DeepSeekMoE architecture for MoE language models in January 2024. Through fine-grained expert segmentation and shared expert isolation, DeepSeekMoE achieves significantly higher expert specialization and better performance than mainstream MoE architectures. The architecture is illustrated below (figure from the paper: https://arxiv.org/pdf/2401.06066):

(Figure: DeepSeekMoE architecture diagram from the paper)

Panel (a) shows an MoE layer with the conventional top-2 routing strategy.

Panel (b) illustrates the fine-grained expert segmentation strategy.

Panel (c) shows the integration of the shared expert isolation strategy, which completes the full DeepSeekMoE architecture.

Note: the number of expert parameters and the computational cost are kept identical across the three architectures.
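
To make fine-grained segmentation and shared expert isolation concrete, here is a minimal, illustrative PyTorch sketch of a DeepSeekMoE-style layer. It is not the official implementation; class and argument names are invented for illustration, the gating and load-balancing details are simplified, and the default sizes follow the DeepSeek-V2 configuration described later in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One fine-grained FFN expert (deliberately small intermediate size)."""

    def __init__(self, hidden: int, inter: int):
        super().__init__()
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class DeepSeekMoESketch(nn.Module):
    """Shared experts always run; routed experts are selected by top-k gating."""

    def __init__(self, hidden=5120, inter=1536, n_shared=2, n_routed=160, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList(Expert(hidden, inter) for _ in range(n_shared))
        self.routed = nn.ModuleList(Expert(hidden, inter) for _ in range(n_routed))
        self.gate = nn.Linear(hidden, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: [num_tokens, hidden]
        out = sum(expert(x) for expert in self.shared)      # shared experts see every token
        scores = self.gate(x).softmax(dim=-1)               # softmax routing scores (V2 style)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)  # top-k routed experts per token
        for slot in range(self.top_k):                      # simple loop form, not optimized
            idx, w = topk_idx[:, slot], topk_w[:, slot:slot + 1]
            for expert_id in idx.unique().tolist():
                mask = idx == expert_id
                out[mask] = out[mask] + w[mask] * self.routed[expert_id](x[mask])
        return x + out                                      # residual connection
```

With the defaults above (2 shared experts, 160 routed experts, top-6 routing), every token runs through 8 small experts in total, which is the essence of combining fine-grained segmentation with shared expert isolation.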

DeepSeek trained DeepSeekMoE 16B (2.8B activated parameters) on 2T tokens, using only about 40% of the compute of DeepSeek 7B and LLaMA 2 7B while matching their benchmark performance.


The 16B-parameter MoE model has been open-sourced.

HuggingFace:

https://huggingface.co/deepseek-ai/deepseek-moe-16b-base

https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat

In the same paper, DeepSeek also explored scaling DeepSeekMoE up to 145B parameters (22.2B activated), demonstrating performance comparable to DeepSeek 67B while using only 28.5% (possibly even just 18.2%) of the compute.


The DeepSeekMoE release did not attract much attention. The release that truly shook the industry was DeepSeek V2 in May 2024, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It has 236B total parameters, activates 21B parameters per token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE. DeepSeek V2 set off an LLM API price war, which the author documented at the time; see 《写在云厂商 LLM API 价格调整后》 and 《盘点国内外大模型推理服务 API 价格》.

Compared with DeepSeek 67B, DeepSeek-V2 delivers significantly stronger performance while saving 42.5% of the training cost, reducing the KV cache by 93.3%, and boosting maximum generation throughput by 5.76x.


The DeepSeek V2 model has 60 layers and a hidden dimension of 5120. All FFNs except the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, and each expert has an intermediate hidden dimension of 1536. Among the routed experts, 6 are activated per token.


Because DeepSeek-V2 activates fewer parameters per token and needs fewer FLOPs than DeepSeek 67B, training DeepSeek-V2 is, in theory, more economical. Although training an MoE model introduces extra communication overhead, operator and communication optimizations allow DeepSeek-V2 training to reach a relatively high Model FLOPs Utilization (MFU). In DeepSeek's actual training on an H800 cluster, each trillion training tokens took 300.6K GPU hours for DeepSeek 67B but only 172.8K GPU hours for DeepSeek-V2; in other words, the sparse DeepSeek-V2 saves 42.5% of the training cost relative to the dense DeepSeek 67B.
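
As a quick sanity check, the 42.5% figure follows directly from the reported GPU-hour numbers:

```python
# GPU hours per trillion training tokens, as quoted above (H800 cluster)
dense_67b_gpu_hours = 300.6e3   # DeepSeek 67B
sparse_v2_gpu_hours = 172.8e3   # DeepSeek-V2

saving = 1 - sparse_v2_gpu_hours / dense_67b_gpu_hours
print(f"Training cost saving: {saving:.1%}")   # -> 42.5%
```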

To deploy DeepSeek-V2 efficiently, its parameters are first converted to FP8 precision. In addition, KV-cache quantization compresses each element of the KV cache to an average of 6 bits. Thanks to MLA and these optimizations, the deployed DeepSeek-V2 needs far less KV cache than DeepSeek 67B and can therefore serve much larger batch sizes. Evaluated on the prompt and generation length distribution of the production DeepSeek 67B service, DeepSeek-V2 exceeds 50K generated tokens per second on a single node with 8x H800 GPUs, 5.76x the maximum generation throughput of DeepSeek 67B, and its prompt prefill throughput exceeds 100K tokens per second.


DeepSeek V2 achieved genuine cost reduction on merit (1 yuan per million input tokens, 2 yuan per million output tokens), rather than fighting the price war with subsidies like other vendors.

On September 5, 2024, DeepSeek officially released DeepSeek-V2.5, which retains the general conversational ability of the original Chat model and the strong code-handling ability of the Coder model while aligning better with human preferences. It also improves substantially on writing, instruction following, and other tasks. DeepSeek-V2.5 uses exactly the same architecture as V2, and the API remains compatible. The lineage from V2 to V2.5 is shown below:

(Figure: model lineage from DeepSeek V2 / DeepSeek-Coder-V2 to DeepSeek V2.5)

DeepSeek V2 paper: https://arxiv.org/pdf/2405.04434

HuggingFace weights (including V2, V2-Lite, and V2.5):

https://huggingface.co/collections/deepseek-ai/deepseek-v2-669a1c8b8f2dbc203fbd7746

On December 26, 2024, DeepSeek launched and open-sourced the new DeepSeek-V3 series. With 671B total parameters and 37B activated parameters, it surpasses other open-source models such as Qwen2.5-72B and Llama-3.1-405B on many benchmarks and is on par with top closed-source models like GPT-4o and Claude-3.5-Sonnet.


DeepSeek V3 API pricing is 0.5 yuan per million input tokens (cache hit) / 2 yuan per million input tokens (cache miss), and 8 yuan per million output tokens.


Compared with DeepSeek V2.5, V3 keeps the same basic MLA/MoE architecture, but its hyperparameters change as follows (a sketch of the new routing path follows the list):

  • first_k_dense_replace increased from 1 to 3;
  • hidden_size increased from 5120 to 7168;
  • intermediate_size increased from 12288 to 18432;
  • moe_intermediate_size increased from 1536 to 2048;
  • n_routed_experts increased from 160 to 256;
  • n_shared_experts decreased from 2 to 1;
  • norm_topk_prob changed from false to true;
  • num_hidden_layers changed from 60 to 61;
  • num_nextn_predict_layers set to 1;
  • quantization_config added, enabling FP8 quantization;
  • num_experts_per_tok increased from 6 to 8;
  • routed_scaling_factor changed from 16.0 to 2.5;
  • scoring_func changed from softmax to sigmoid;
  • topk_group increased from 3 to 4 (related to num_experts_per_tok);
  • topk_method changed from group_limited_greedy to noaux_tc;
  • vocab_size increased from 102400 to 129280.
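
The scoring_func, topk_method, topk_group, and norm_topk_prob entries together describe V3's routing path: sigmoid scores, group-limited top-k expert selection with an auxiliary-loss-free bias used only for selection, and normalized, rescaled top-k weights. Below is a rough, unofficial PyTorch sketch of that selection logic; function and argument names are invented for illustration, and the group-scoring detail is simplified relative to the real modeling code.

```python
import torch


def v3_style_route(logits, n_groups=8, topk_group=4, top_k=8,
                   routed_scaling_factor=2.5, select_bias=None):
    """Hypothetical sketch of sigmoid + group-limited top-k routing (DeepSeek-V3 style).

    logits:      [num_tokens, n_routed_experts] raw gate outputs.
    select_bias: optional per-expert bias used only for expert *selection*
                 (the aux-loss-free balancing idea); weights still use raw scores.
    """
    scores = logits.sigmoid()                               # scoring_func = sigmoid
    sel = scores if select_bias is None else scores + select_bias

    t, e = sel.shape
    groups = sel.view(t, n_groups, e // n_groups)           # split experts into groups
    group_scores = groups.topk(2, dim=-1).values.sum(-1)    # score each group (top-2 sum)
    top_groups = group_scores.topk(topk_group, dim=-1).indices

    mask = torch.zeros(t, n_groups, dtype=torch.bool, device=sel.device)
    mask.scatter_(1, top_groups, True)                      # keep only topk_group groups
    sel = sel.masked_fill(~mask.repeat_interleave(e // n_groups, dim=1), float("-inf"))

    topk_idx = sel.topk(top_k, dim=-1).indices              # top-8 experts overall
    topk_w = scores.gather(1, topk_idx)
    topk_w = topk_w / topk_w.sum(-1, keepdim=True)          # norm_topk_prob = true
    return topk_idx, topk_w * routed_scaling_factor
```

For example, `v3_style_route(torch.randn(4, 256))` returns 8 expert indices and their (rescaled) weights for each of 4 tokens.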

DeepSeek V3 and the later R1 share exactly the same architecture and hyperparameters, so they are not covered separately here.

This brief review of DeepSeek's main models shows that DeepSeek has consistently pushed for extreme efficiency and quality, using innovative architectures and strong engineering to keep raising the ceiling.

Hands-On DeepEP Benchmarks

As the previous section showed, the MoE architecture is sparse: not every expert participates in computation; a router decides which experts are activated for each token. In an implementation, communication therefore revolves around two steps, "dispatch" (scattering tokens to their experts) and "combine" (gathering expert outputs back). Because fine-grained experts keep each expert's compute relatively small, the communication part becomes the main optimization target, and DeepEP is the acceleration library for exactly this MoE communication.
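
Conceptually, dispatch and combine are all-to-all exchanges across expert-parallel ranks: dispatch scatters each token to the rank that owns its selected expert, and combine gathers the expert outputs back. Here is a toy torch.distributed sketch of that pattern; this is not DeepEP's API, only the exchange DeepEP accelerates with its NVLink/RDMA kernels (top-1 routing only for brevity, NCCL backend assumed, function names invented).

```python
import torch
import torch.distributed as dist


def naive_dispatch_combine(x, topk_idx, num_local_experts, expert_fn):
    """Toy expert-parallel round trip (top-1 routing only).

    x:        [num_tokens, hidden] local tokens
    topk_idx: [num_tokens] chosen expert id per token
    Experts are assumed evenly sharded: expert e lives on rank e // num_local_experts.
    """
    world = dist.get_world_size()
    dst = topk_idx // num_local_experts                 # destination rank of each token
    order = dst.argsort()                               # group tokens by destination rank
    send_buf = x[order].contiguous()
    send_counts = torch.bincount(dst, minlength=world)

    # exchange token counts so every rank knows how much it will receive
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # dispatch: all-to-all of the token payloads
    recv_buf = x.new_empty(int(recv_counts.sum()), x.size(1))
    dist.all_to_all_single(recv_buf, send_buf,
                           recv_counts.tolist(), send_counts.tolist())

    # local expert compute (expert_fn maps received tokens to outputs)
    out_buf = expert_fn(recv_buf)

    # combine: reverse all-to-all, then undo the destination sort
    back_buf = x.new_empty(send_buf.shape)
    dist.all_to_all_single(back_buf, out_buf,
                           send_counts.tolist(), recv_counts.tolist())
    combined = torch.empty_like(back_buf)
    combined[order] = back_buf
    return combined
```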

As in the previous article in this series, 《DeepSeek 开源周(一):FlashMLA 在 H100 上的性能实测》, the benchmarks in this article must be run on H100 or H800 machines, and specifically the NVLink (SXM) variants, not the PCIe variants.

The author's environment is as follows (two H100 servers with identical specs):

| Component | Details |
| --- | --- |
| GPU | NVIDIA H100 80GB HBM3 ×8; driver 535.161.08; CUDA 12.6 |
| NIC | Mellanox ConnectX-7 (400Gb) ×8 (IB); driver MLNX_OFED_LINUX-23.10-0.5.5.0 |
| OS | Ubuntu-Server 22.04.3 LTS amd64; kernel 5.15.0-91-generic |
| Python | 3.12.7 |
| PyTorch | 2.6.0 |

Before starting, make sure the GPU driver is installed correctly and the nvidia-fabricmanager service is running; ideally run a few CUDA samples to verify.

One more important point: the following steps require root access on the physical machine. If you only have access to a container, you will not be able to complete the kernel module update below. The physical machine also needs to be rebooted once, so if important training jobs are running on it, save and migrate them first. If you are not sure whether you have a physical machine or a container, a simple test is whether you can start another container inside the environment: if you can, it is a physical machine; otherwise it is a container.

Install the required dependencies

apt install build-essential git cmake ninja-build devscripts

Build and install GDRCopy

GDRCopy is a low-latency GPU memory copy library developed by NVIDIA, built on GPUDirect RDMA.

The following steps require root privileges and must be performed on the physical machine:

git clone https://github.com/NVIDIA/gdrcopy
cd gdrcopy
make -j8
make prefix=/opt/gdrcopy install
cd packages
CUDA=/usr/local/cuda ./build-deb-packages.sh
apt install ./*.deb
cd /var/lib/dkms/gdrdrv/
ln -sf 2.5 2.5-1
cd -
apt install ./*.deb
gdrcopy_copybw

The output looks like this:


          
# gdrcopy_copybw
          
GPU id:0; name: NVIDIA H100 80GB HBM3; Bus id: 0000:18:00
          
GPU id:1; name: NVIDIA H100 80GB HBM3; Bus id: 0000:2a:00
          
GPU id:2; name: NVIDIA H100 80GB HBM3; Bus id: 0000:3a:00
          
GPU id:3; name: NVIDIA H100 80GB HBM3; Bus id: 0000:5d:00
          
GPU id:4; name: NVIDIA H100 80GB HBM3; Bus id: 0000:9a:00
          
GPU id:5; name: NVIDIA H100 80GB HBM3; Bus id: 0000:ab:00
          
GPU id:6; name: NVIDIA H100 80GB HBM3; Bus id: 0000:ba:00
          
GPU id:7; name: NVIDIA H100 80GB HBM3; Bus id: 0000:db:00
          
selecting device 0
          
testing size: 131072
          
rounded size: 131072
          
gpu alloc fn: cuMemAlloc
          
device ptr: 7fc901e00000
          
map_d_ptr: 0x7fc92a85c000
          
info.va: 7fc901e00000
          
info.mapped_size: 131072
          
info.page_size: 65536
          
info.mapped: 1
          
info.wc_mapping: 1
          
page offset: 0
          
user-space pointer:0x7fc92a85c000
          
writing test, size=131072 offset=0 num_iters=10000
          
write BW: 14752.2MB/s
          
reading test, size=131072 offset=0 num_iters=100
          
read BW: 410.663MB/s
          
unmapping buffer
          
unpinning buffer
          
closing gdrdrv
      

Build and install NVSHMEM

NVIDIA NVSHMEM is a programming interface that implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. It provides an easy-to-use interface for allocating memory that is symmetrically distributed across the GPUs. Besides the CPU-side API, NVSHMEM also provides a CUDA kernel-side API that lets CUDA threads access any location in the symmetrically distributed memory.

The NVSHMEM source must be downloaded from NVIDIA's official site after registration:

https://developer.nvidia.com/nvshmem-archive

You need to register with an email address and log in to get the download link.


Download version 3.1.7. After downloading, extract it and apply the patch shipped with DeepEP:

git clone https://github.com/deepseek-ai/DeepEP
tar xvf nvshmem_src_3.1.7-1.txz
cd nvshmem_src/
git apply ../DeepEP/third-party/nvshmem.patch

The following step requires rebooting the physical machine and is high-risk:

# Enable IBGDA
echo "options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords=\"PeerMappingOverride=1;\"" > /etc/modprobe.d/nvidia.conf
update-initramfs -u
reboot

After the reboot, build NVSHMEM:

cd nvshmem_src/
MPI_HOME=/usr/mpi/gcc/openmpi-4.1.7a1 \
CUDA_HOME=/usr/local/cuda \
GDRCOPY_HOME=/opt/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/opt/nvshmem
cd build/
make -j8
make install

Set the environment variables:

export NVSHMEM_DIR=/opt/nvshmem
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH"
export PATH="${NVSHMEM_DIR}/bin:$PATH"

Verify that this step succeeded:


          
# nvshmem-info -a
          
NVSHMEM v3.1.7
          

          
Build Information:
          
  CUDA API                     12060
          
  CUDA Driver                  12020
          
  Build Timestamp              Feb 25 2025 14:32:50
          
  Build Variables
          
	NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
          
	NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
          
	NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
          
	NVSHMEM_IBGDA_SUPPORT=ON NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
          
	NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON
          
	NVSHMEM_MPI_SUPPORT=ON NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
          
	NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
          
	NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
          
	NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF NVSHMEM_USE_GDRCOPY=ON
          
	NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/usr/local/gdrdrv
          
	LIBFABRIC_HOME=/usr/local/libfabric MPI_HOME=/usr/local/ompi
          
	NCCL_HOME=/usr/local/nccl NVSHMEM_PREFIX=/usr/local/nvshmem PMIX_HOME=/usr
          
	SHMEM_HOME=/usr/local/ompi UCX_HOME=/usr/local/ucx
          

          
Standard options:
          
  NVSHMEM_VERSION              false (type: bool, default: false)
          
	Print library version at startup
          
  NVSHMEM_INFO                 false (type: bool, default: false)
          
	Print environment variable options at startup
          
  NVSHMEM_DISABLE_NVLS         false (type: bool, default: false)
          
	Disable NVLS SHARP resources for collectives, even if available for platform
          
  NVSHMEM_SYMMETRIC_SIZE       1073741824 (type: size, default: 1073741824)
          
	Specifies the size (in bytes) of the symmetric heap memory per PE. The
          
	size is implementation-defined and must be at least as large as the integer
          
	ceiling of the product of the numeric prefix and the scaling factor. The
          
	character suffixes for the scaling factor are as follows:
          

          
	  *  k or K multiplies by 2^10 (kibibytes)
          
	  *  m or M multiplies by 2^20 (mebibytes)
          
	  *  g or G multiplies by 2^30 (gibibytes)
          
	  *  t or T multiplies by 2^40 (tebibytes)
          

          
	For example, string '20m' is equivalent to the integer value 20971520, or 20
          
	mebibytes. Similarly the string '3.1M' is equivalent to the integer value
          
	3250586. Only one multiplier is recognized and any characters following the
          
	multiplier are ignored, so '20kk' will not produce the same result as '20m'.
          
	Usage of string '.5m' will yield the same result as the string '0.5m'.
          
	An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM
          
	library shall report by either returning a nonzero value from
          
	nvshmem_init_thread or causing program termination.
          
  NVSHMEM_DEBUG                "" (type: string, default: "")
          
	Set to enable debugging messages.
          
	Optional values: VERSION, WARN, INFO, ABORT, TRACE
          

          
Bootstrap options:
          
  NVSHMEM_BOOTSTRAP            "PMI" (type: string, default: "PMI")
          
	Name of the default bootstrap that should be used to initialize NVSHMEM.
          
	Allowed values: PMI, MPI, SHMEM, plugin, UID
          
  NVSHMEM_BOOTSTRAP_PMI        "PMI" (type: string, default: "PMI")
          
	Name of the PMI bootstrap that should be used to initialize NVSHMEM.
          
	Allowed values: PMI, PMI-2, PMIX
          
  NVSHMEM_BOOTSTRAP_PLUGIN     "" (type: string, default: "")
          
	Absolute path to or name of the bootstrap plugin file to load when
          
	NVSHMEM_BOOTSTRAP=plugin is specified
          
  NVSHMEM_BOOTSTRAP_MPI_PLUGIN "nvshmem_bootstrap_mpi.so" (type: string, default: "nvshmem_bootstrap_mpi.so")
          
	Absolute path to or name of the MPI bootstrap plugin file.
          
	NVSHMEM will search for the plugin based on linux linker priorities. See man
          
	dlopen
          
  NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN "nvshmem_bootstrap_shmem.so" (type: string, default: "nvshmem_bootstrap_shmem.so")
          
	Absolute path to or name of the SHMEM bootstrap plugin file.
          
	NVSHMEM will search for the plugin based on linux linker priorities. See man
          
	dlopen
          
  NVSHMEM_BOOTSTRAP_PMI_PLUGIN "nvshmem_bootstrap_pmi.so" (type: string, default: "nvshmem_bootstrap_pmi.so")
          
	Absolute path to or name of the PMI bootstrap plugin file.
          
	NVSHMEM will search for the plugin based on linux linker priorities. See man
          
	dlopen
          
  NVSHMEM_BOOTSTRAP_PMI2_PLUGIN "nvshmem_bootstrap_pmi2.so" (type: string, default: "nvshmem_bootstrap_pmi2.so")
          
	Absolute path to or name of the PMI-2 bootstrap plugin file.
          
	NVSHMEM will search for the plugin based on linux linker priorities. See man
          
	dlopen
          
  NVSHMEM_BOOTSTRAP_PMIX_PLUGIN "nvshmem_bootstrap_pmix.so" (type: string, default: "nvshmem_bootstrap_pmix.so")
          
	Absolute path to or name of the PMIx bootstrap plugin file.
          
	NVSHMEM will search for the plugin based on linux linker priorities. See man
          
	dlopen
          
  NVSHMEM_BOOTSTRAP_UID_PLUGIN "nvshmem_bootstrap_uid.so" (type: string, default: "nvshmem_bootstrap_uid.so")
          
	Absolute path to or name of the UID bootstrap plugin file.
          
	NVSHMEM will search for the plugin based on linux linker priorities. See man
          
	dlopen
          

          
Additional options:
          
  NVSHMEM_CUDA_PATH            "" (type: string, default: "")
          
	Path to directory containing libcuda.so (for use when not in default location)
          
  NVSHMEM_DEBUG_ATTACH_DELAY   0 (type: int, default: 0)
          
	Delay (in seconds) during the first call to NVSHMEM_INIT to allow for attaching
          
	a debuggger (Default 0)
          
  NVSHMEM_DEBUG_FILE           "" (type: string, default: "")
          
	Debugging output filename, may contain %h for hostname and %p for pid
          
  NVSHMEM_MAX_TEAMS            32 (type: long, default: 32)
          
	Maximum number of simultaneous teams allowed
          
  NVSHMEM_MAX_P2P_GPUS         128 (type: int, default: 128)
          
	Maximum number of P2P GPUs
          
  NVSHMEM_MAX_MEMORY_PER_GPU   137438953472 (type: size, default: 137438953472)
          
	Maximum memory per GPU
          
  NVSHMEM_DISABLE_CUDA_VMM     false (type: bool, default: false)
          
	Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled
          
	on x86 and disabled on P9. CUDA VMM feature in NVSHMEM requires CUDA RT version
          
	and CUDA Driver version to be greater than or equal to 11.3.
          
  NVSHMEM_DISABLE_P2P          false (type: bool, default: false)
          
	Disable P2P connectivity of GPUs even when available
          
  NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE false (type: bool, default: false)
          
	When doing Multi-Process Per GPU (MPG) run, full API support is available only
          
	if sum of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE of processes running on a GPU is <=
          
	100%. Through this variable, user can request NVSHMEM runtime to ignore the
          
	active thread percentage and allow full MPG support. Users enable it at their
          
	own risk as NVSHMEM might deadlock.
          
  NVSHMEM_CUMEM_GRANULARITY    536870912 (type: size, default: 536870912)
          
	Granularity for cuMemAlloc/cuMemCreate
          
  NVSHMEM_PROXY_REQUEST_BATCH_MAX 32 (type: int, default: 32)
          
	Maxmum number of requests that the proxy thread processes in a single iteration
          
	of the progress loop.
          

          
Collectives options:
          
  NVSHMEM_DISABLE_NCCL         false (type: bool, default: false)
          
	Disable use of NCCL for collective operations
          
  NVSHMEM_BARRIER_DISSEM_KVAL  2 (type: int, default: 2)
          
	Radix of the dissemination algorithm used for barriers
          
  NVSHMEM_BARRIER_TG_DISSEM_KVAL 2 (type: int, default: 2)
          
	Radix of the dissemination algorithm used for thread group barriers
          
  NVSHMEM_FCOLLECT_LL_THRESHOLD 2048 (type: size, default: 2048)
          
	Message size threshold up to which fcollect LL algo will be used
          

          
  NVSHMEM_REDUCE_SCRATCH_SIZE  524288 (type: size, default: 524288)
          
	Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by
          
	runtime for every team to implement reduce and reducescatter collectives
          

          
  NVSHMEM_BCAST_ALGO           0 (type: int, default: 0)
          
	Broadcast algorithm to be used.
          
	  * 0 - use default algorithm selection strategy
          

          
  NVSHMEM_REDMAXLOC_ALGO       1 (type: int, default: 1)
          
	Reduction algorithm to be used for MAXLOC operation.
          
	  * 1 - default, flag alltoall algorithm
          
	  * 2 - flat reduce + flat bcast
          
	  * 3 - topo-aware two-level reduce + topo-aware bcast
          

          

          
Transport options:
          
  NVSHMEM_REMOTE_TRANSPORT     "ibrc" (type: string, default: "ibrc")
          
	Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none
          
  NVSHMEM_ENABLE_NIC_PE_MAPPING false (type: bool, default: false)
          
	When not set or set to 0, a PE is assigned the NIC on the node that is closest
          
	to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a
          
	round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they
          
	are specified.
          
  NVSHMEM_DISABLE_LOCAL_ONLY_PROXY false (type: bool, default: false)
          
	When running on an NVLink-only configuaration (No-IB, No-UCX), completely
          
	disable the proxy thread. This will disable device side global exit and device
          
	side wait timeout polling (enabled by NVSHMEM_TIMEOUT_DEVICE_POLLING build-time
          
	variable) because these are processed by the proxy thread.
          
  NVSHMEM_IB_ENABLE_IBGDA      false (type: bool, default: false)
          
	Set to enable GPU-initiated communication transport.
          

          
Hidden options:
          
  NVSHMEM_INFO_HIDDEN          true (type: bool, default: false)
          
	Print hidden environment variable options at startup
          
  NVSHMEM_DISABLE_NVLS_SHARING true (type: bool, default: true)
          
	Disable NVLS SHARP resource sharing for user-defined teams
          
  NVSHMEM_HEAP_KIND            "DEVICE" (type: string, default: "DEVICE")
          
	Specify the memory kind used by the NVSHMEM symmetric heap.
          
	Allowed values: VIDMEM, SYSMEM
          
  NVSHMEM_ENABLE_RAIL_OPT      false (type: bool, default: false)
          
	Enable Rail Optimization when heap is in SYSMEM
          
  NVSHMEM_BOOTSTRAP_TWO_STAGE  false (type: bool, default: false)
          
	Ignore CUDA device setting during initialization,forcing two-stage
          
	initialization
          
  NVSHMEM_DEBUG_SUBSYS         "" (type: string, default: "")
          
	Comma separated list of debugging message sources. Prefix with '^' to exclude.
          
	Values: INIT, COLL, P2P, PROXY, TRANSPORT, MEM, BOOTSTRAP, TOPO, UTIL, ALL
          
  NVSHMEM_ENABLE_ERROR_CHECKS  false (type: bool, default: false)
          
	Enable error checks
          
  NVSHMEM_DISABLE_MNNVL        false (type: bool, default: false)
          
	Disable MNNVL connectivity for GPUs even when available
          
  NVSHMEM_CUMEM_HANDLE_TYPE    "FILE_DESCRIPTOR" (type: string, default: "FILE_DESCRIPTOR")
          
	Handle type for cuMemCreate. Supported are - FABRIC or FILE_DESCRIPTOR
          
  NVSHMEM_BYPASS_ACCESSIBILITY_CHECK false (type: bool, default: false)
          
	Bypass peer GPU accessbility checks
          
  NVSHMEM_FCOLLECT_NTHREADS    512 (type: int, default: 512)
          
	Sets number of threads per block for fcollect collective.
          
	By default, if no env is set, default value is min(max_occupancy per CTA, msg
          
	size per PE).
          
	If env is specified, value overrides the default irrespective of max occupancy
          
	per CTA
          

          
  NVSHMEM_REDUCESCATTER_NTHREADS 512 (type: int, default: 512)
          
	Sets number of threads per block for reducescatter collective.
          
	By default, if no env is set, default value is min(max_occupancy per CTA, msg
          
	size per PE).
          
	If env is specified, value overrides the default irrespective of max occupancy
          
	per CTA
          

          
  NVSHMEM_MAX_CTAS             1 (type: int, default: 1)
          
	Sets number of blocks per grid for host onstream collective.
          
	By default, if no env is set, default value to 1 CTA
          
	If env is specified, value overrides the default value
          

          
  NVSHMEM_REDUCE_RECEXCH_KVAL  2 (type: int, default: 2)
          
	Radix of the recursive exchange reduction algorithm
          
  NVSHMEM_FCOLLECT_LL128_THRESHOLD 0 (type: size, default: 0)
          
	Message size threshold up to which the fcollect LL128 algo will be used.
          
	LL128 will be used only when FCOLLECT_LL_THRESHOLD < size
          
  NVSHMEM_FCOLLECT_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
          
	Message size threshold up to which fcollect NVLS algo will be used
          

          
  NVSHMEM_REDUCESCATTER_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
          
	Message size threshold up to which reducescatter NVLS algo will be used
          

          
  NVSHMEM_BCAST_TREE_KVAL      2 (type: int, default: 2)
          
	Radix of the broadcast tree algorithm
          
  NVSHMEM_FCOLLECT_ALGO        0 (type: int, default: 0)
          
	Fcollect algorithm to be used.
          
	  * 0 - use default algorithm selection strategy
          

          
  NVSHMEM_REDUCE_ALGO          0 (type: int, default: 0)
          
	Allreduce algorithm to be used.
          
	   * 0 - use default algorithm selection strategy
          

          
  NVSHMEM_REDUCESCATTER_ALGO   0 (type: int, default: 0)
          
	Reduce Scatter algorithm to be used.
          
	  * 0 - use default algorithm selection strategy
          

          
  NVSHMEM_ASSERT_ATOMICS_SYNC  false (type: bool, default: false)
          
	Bypass flush on wait_until at target
          
  NVSHMEM_BYPASS_FLUSH         false (type: bool, default: false)
          
	Bypass flush in proxy when enforcing consistency
          

          
NVTX options:
          
  NVSHMEM_NVTX                 "off" (type: string, default: "off")
          
	Set to enable NVTX instrumentation. Accepts a comma separated list of
          
	instrumentation groups. By default the NVTX instrumentation is disabled.
          
	  init                : library setup
          
	  alloc               : memory management
          
	  launch              : kernel launch routines
          
	  coll                : collective communications
          
	  wait                : blocking point-to-point synchronization
          
	  wait_on_stream      : point-to-point synchronization (on stream)
          
	  test                : non-blocking point-to-point synchronization
          
	  memorder            : memory ordering (quiet, fence)
          
	  quiet_on_stream     : nvshmemx_quiet_on_stream
          
	  atomic_fetch        : fetching atomic memory operations
          
	  atomic_set          : non-fetchong atomic memory operations
          
	  rma_blocking        : blocking remote memory access operations
          
	  rma_nonblocking     : non-blocking remote memory access operations
          
	  proxy               : activity of the proxy thread
          
	  common              : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
          
	  all                 : all groups
          
	  off                 : disable all NVTX instrumentation
      

Build and install DeepEP

This step is relatively simple; you just need a PyTorch environment:

conda create -n deepep python=3.12
conda activate deepep
pip install torch torchvision torchaudio

Then enter the DeepEP repository and run:

python setup.py build
ln -s build/lib.linux-x86_64-cpython-312/deep_ep_cpp.cpython-312-x86_64-linux-gnu.so

Run the intra-node test (a single node is enough):

python tests/test_intranode.py

The test results are as follows:


          
# export PYTHONPATH=$(pwd)
          
# python tests/test_intranode.py
          
[config] num_tokens=4096, hidden=7168, num_topk=8
          
[layout] Kernel performance: 0.050 ms
          

          
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
          
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
          
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
          
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
          
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
          
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
          
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
          
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
          
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
          

          
[tuning] SMs 24, NVL chunk 4: 276.67 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8: 288.95 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12: 268.19 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16: 262.28 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20: 254.48 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24: 251.96 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28: 248.25 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32: 246.71 GB/s (NVL)
          
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, 288.95 GB/s (NVL)
          

          
[tuning] SMs 24, NVL chunk 4: 295.29 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8: 267.37 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12: 249.41 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16: 245.60 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20: 240.31 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24: 237.01 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28: 232.88 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32: 229.54 GB/s (NVL)
          
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 4, 295.29 GB/s (NVL)
          

          
[tuning] SMs 24, NVL chunk 1: 159.08 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2: 285.78 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3: 322.23 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4: 331.61 GB/s (NVL)
          
[tuning] Best combine: SMs 24, NVL chunk 4: 331.61 GB/s (NVL)
          

          

          
[rank0]:[W225 14:50:19.764550397 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
      

Run the inter-node test (requires at least two nodes):

# node 0
export MASTER_ADDR=master_ip
export WORLD_SIZE=2
export RANK=0
python tests/test_internode.py

# node 1
export MASTER_ADDR=master_ip
export WORLD_SIZE=2
export RANK=1
python tests/test_internode.py

Measured results:


          
# python tests/test_internode.py
          
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
          
[layout] Kernel performance: 0.050 ms
          

          
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
          
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
          
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
          
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
          
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
          
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
          
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
          
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
          
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
          
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
          
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
          
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
          

          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 10.97 GB/s (RDMA), 35.81 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 17.77 GB/s (RDMA), 58.01 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.53 GB/s (RDMA), 73.54 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 25.73 GB/s (RDMA), 83.97 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 26.87 GB/s (RDMA), 87.71 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 32.59 GB/s (RDMA), 106.39 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 36.38 GB/s (RDMA), 118.74 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 38.04 GB/s (RDMA), 124.16 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 11.20 GB/s (RDMA), 36.54 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 18.50 GB/s (RDMA), 60.37 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 23.93 GB/s (RDMA), 78.12 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 28.92 GB/s (RDMA), 94.41 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 32.54 GB/s (RDMA), 106.21 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 35.75 GB/s (RDMA), 116.68 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 37.07 GB/s (RDMA), 120.98 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 39.43 GB/s (RDMA), 128.70 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 10.57 GB/s (RDMA), 34.51 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 17.82 GB/s (RDMA), 58.17 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 22.11 GB/s (RDMA), 72.17 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 26.70 GB/s (RDMA), 87.16 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 32.15 GB/s (RDMA), 104.95 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 36.08 GB/s (RDMA), 117.78 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 37.26 GB/s (RDMA), 121.60 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 39.20 GB/s (RDMA), 127.95 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 11.29 GB/s (RDMA), 36.84 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 18.36 GB/s (RDMA), 59.93 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 23.36 GB/s (RDMA), 76.25 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 29.11 GB/s (RDMA), 95.03 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 33.19 GB/s (RDMA), 108.35 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 36.45 GB/s (RDMA), 118.98 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 38.80 GB/s (RDMA), 126.65 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 40.93 GB/s (RDMA), 133.61 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 11.23 GB/s (RDMA), 36.66 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 17.81 GB/s (RDMA), 58.13 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 24.11 GB/s (RDMA), 78.70 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 28.71 GB/s (RDMA), 93.71 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 33.50 GB/s (RDMA), 109.33 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 35.70 GB/s (RDMA), 116.53 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 38.64 GB/s (RDMA), 126.14 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 40.59 GB/s (RDMA), 132.48 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 11.20 GB/s (RDMA), 36.56 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 18.50 GB/s (RDMA), 60.39 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 24.05 GB/s (RDMA), 78.50 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 29.41 GB/s (RDMA), 96.00 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 32.84 GB/s (RDMA), 107.18 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 35.91 GB/s (RDMA), 117.20 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 38.63 GB/s (RDMA), 126.09 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 39.95 GB/s (RDMA), 130.40 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 11.01 GB/s (RDMA), 35.94 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 18.05 GB/s (RDMA), 58.93 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 23.78 GB/s (RDMA), 77.63 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 28.78 GB/s (RDMA), 93.94 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 33.62 GB/s (RDMA), 109.73 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 36.42 GB/s (RDMA), 118.87 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 38.38 GB/s (RDMA), 125.26 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 40.04 GB/s (RDMA), 130.70 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 11.28 GB/s (RDMA), 36.83 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 18.35 GB/s (RDMA), 59.90 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 23.94 GB/s (RDMA), 78.15 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 29.36 GB/s (RDMA), 95.83 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 33.50 GB/s (RDMA), 109.33 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 35.51 GB/s (RDMA), 115.91 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 38.34 GB/s (RDMA), 125.15 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 39.71 GB/s (RDMA), 129.61 GB/s (NVL)
          
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 32: 40.93 GB/s (RDMA), 133.61 GB/s (NVL)
          

          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 17.94 GB/s (RDMA), 58.57 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 24.68 GB/s (RDMA), 80.56 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 33.81 GB/s (RDMA), 110.36 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 39.38 GB/s (RDMA), 128.52 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 44.11 GB/s (RDMA), 143.96 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 48.23 GB/s (RDMA), 157.41 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 51.26 GB/s (RDMA), 167.31 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 51.84 GB/s (RDMA), 169.20 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 17.93 GB/s (RDMA), 58.51 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 25.27 GB/s (RDMA), 82.49 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 34.88 GB/s (RDMA), 113.85 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 41.21 GB/s (RDMA), 134.53 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 45.80 GB/s (RDMA), 149.49 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 50.08 GB/s (RDMA), 163.46 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 52.69 GB/s (RDMA), 171.99 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 53.79 GB/s (RDMA), 175.58 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 17.96 GB/s (RDMA), 58.63 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 25.22 GB/s (RDMA), 82.32 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 34.93 GB/s (RDMA), 114.03 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 41.28 GB/s (RDMA), 134.75 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 45.32 GB/s (RDMA), 147.94 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 50.14 GB/s (RDMA), 163.66 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 53.04 GB/s (RDMA), 173.13 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 53.74 GB/s (RDMA), 175.42 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 17.88 GB/s (RDMA), 58.36 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 25.23 GB/s (RDMA), 82.36 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 34.79 GB/s (RDMA), 113.54 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 41.19 GB/s (RDMA), 134.43 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 46.38 GB/s (RDMA), 151.38 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 50.38 GB/s (RDMA), 164.45 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 52.95 GB/s (RDMA), 172.83 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 53.61 GB/s (RDMA), 174.99 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 17.47 GB/s (RDMA), 57.02 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 24.37 GB/s (RDMA), 79.53 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 33.86 GB/s (RDMA), 110.52 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 40.59 GB/s (RDMA), 132.49 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 45.57 GB/s (RDMA), 148.75 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 48.75 GB/s (RDMA), 159.12 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 50.47 GB/s (RDMA), 164.72 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 53.81 GB/s (RDMA), 175.64 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 18.07 GB/s (RDMA), 58.97 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 25.25 GB/s (RDMA), 82.43 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 34.92 GB/s (RDMA), 113.99 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 40.75 GB/s (RDMA), 133.00 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 46.85 GB/s (RDMA), 152.91 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 49.97 GB/s (RDMA), 163.09 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 52.37 GB/s (RDMA), 170.92 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 54.19 GB/s (RDMA), 176.89 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 17.92 GB/s (RDMA), 58.49 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 25.43 GB/s (RDMA), 83.00 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 34.62 GB/s (RDMA), 112.99 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 41.16 GB/s (RDMA), 134.35 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 46.43 GB/s (RDMA), 151.56 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 49.62 GB/s (RDMA), 161.96 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 51.97 GB/s (RDMA), 169.65 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 53.93 GB/s (RDMA), 176.02 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 18.03 GB/s (RDMA), 58.85 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 25.24 GB/s (RDMA), 82.39 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 35.02 GB/s (RDMA), 114.31 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 41.45 GB/s (RDMA), 135.30 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 45.33 GB/s (RDMA), 147.95 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 49.53 GB/s (RDMA), 161.68 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 50.26 GB/s (RDMA), 164.06 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 54.14 GB/s (RDMA), 176.72 GB/s (NVL)
          
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 24, RDMA chunk 32: 54.19 GB/s (RDMA), 176.89 GB/s (NVL)
          

          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 28.80 GB/s (RDMA), 94.01 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 35.33 GB/s (RDMA), 115.31 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 39.71 GB/s (RDMA), 129.60 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 42.86 GB/s (RDMA), 139.88 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 45.25 GB/s (RDMA), 147.71 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 46.71 GB/s (RDMA), 152.46 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 48.84 GB/s (RDMA), 159.43 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 28.52 GB/s (RDMA), 93.09 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 36.09 GB/s (RDMA), 117.79 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 39.77 GB/s (RDMA), 129.82 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 43.46 GB/s (RDMA), 141.85 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 45.53 GB/s (RDMA), 148.61 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 47.19 GB/s (RDMA), 154.02 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 49.68 GB/s (RDMA), 162.14 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 27.93 GB/s (RDMA), 91.17 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 35.50 GB/s (RDMA), 115.87 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 40.53 GB/s (RDMA), 132.28 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 43.10 GB/s (RDMA), 140.69 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 45.95 GB/s (RDMA), 149.98 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 47.53 GB/s (RDMA), 155.13 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 50.00 GB/s (RDMA), 163.19 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 27.71 GB/s (RDMA), 90.45 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 35.08 GB/s (RDMA), 114.50 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 39.93 GB/s (RDMA), 130.32 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 43.41 GB/s (RDMA), 141.71 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 46.71 GB/s (RDMA), 152.48 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 48.25 GB/s (RDMA), 157.48 GB/s (NVL)
          
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 50.77 GB/s (RDMA), 165.71 GB/s (NVL)
          
[tuning] Best combine: SMs 24, NVL chunk 4, RDMA chunk 32: 50.77 GB/s (RDMA), 165.71 GB/s (NVL)
          

          

          
[rank0]:[W225 16:44:30.570331162 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
      

Run the low-latency test (requires at least two nodes):


          
# python tests/test_low_latency.py
          
Allocating buffer size: 2116.291072 MB ...
          
[rank 3] Dispatch + combine bandwidth: 42.96 GB/s, avg_t=513.21 us, min_t=508.26 us, max_t=522.21 us
          
[rank 5] Dispatch + combine bandwidth: 42.95 GB/s, avg_t=513.34 us, min_t=507.52 us, max_t=518.69 us
          
[rank 0] Dispatch + combine bandwidth: 42.95 GB/s, avg_t=513.41 us, min_t=505.60 us, max_t=522.69 us
          
[rank 1] Dispatch + combine bandwidth: 42.97 GB/s, avg_t=513.13 us, min_t=505.92 us, max_t=520.77 us
          
[rank 7] Dispatch + combine bandwidth: 42.97 GB/s, avg_t=513.08 us, min_t=506.66 us, max_t=519.55 us
          
[rank 2] Dispatch + combine bandwidth: 42.96 GB/s, avg_t=513.24 us, min_t=508.83 us, max_t=518.02 us
          
[rank 6] Dispatch + combine bandwidth: 42.97 GB/s, avg_t=513.07 us, min_t=505.66 us, max_t=520.06 us
          
[rank 4] Dispatch + combine bandwidth: 42.95 GB/s, avg_t=513.39 us, min_t=504.03 us, max_t=523.84 us
          
[rank 1] Dispatch bandwidth: 37.80 GB/s, avg_t=198.75 us | Combine bandwidth: 43.78 GB/s, avg_t=332.01 us
          
[rank 0] Dispatch bandwidth: 38.72 GB/s, avg_t=193.99 us | Combine bandwidth: 42.75 GB/s, avg_t=340.03 us
          
[rank 3] Dispatch bandwidth: 39.60 GB/s, avg_t=189.69 us | Combine bandwidth: 43.21 GB/s, avg_t=336.44 us
          
[rank 5] Dispatch bandwidth: 40.54 GB/s, avg_t=185.31 us | Combine bandwidth: 43.33 GB/s, avg_t=335.49 us
          
[rank 2] Dispatch bandwidth: 38.80 GB/s, avg_t=193.61 us | Combine bandwidth: 43.27 GB/s, avg_t=335.99 us
          
[rank 4] Dispatch bandwidth: 40.37 GB/s, avg_t=186.09 us | Combine bandwidth: 43.17 GB/s, avg_t=336.73 us
          
[rank 6] Dispatch bandwidth: 41.84 GB/s, avg_t=179.55 us | Combine bandwidth: 43.08 GB/s, avg_t=337.43 us
          
[rank 7] Dispatch bandwidth: 43.50 GB/s, avg_t=172.67 us | Combine bandwidth: 42.51 GB/s, avg_t=341.97 us
          
[rank 3] Dispatch send/recv time: 29.08 us | Combine send/recv time: 39.44 us
          
[rank 0] Dispatch send/recv time: 29.10 us | Combine send/recv time: 39.68 us
          
[rank 2] Dispatch send/recv time: 29.50 us | Combine send/recv time: 40.20 us
          
[rank 5] Dispatch send/recv time: 30.64 us | Combine send/recv time: 41.12 us
          
[rank 6] Dispatch send/recv time: 28.97 us | Combine send/recv time: 39.28 us
          
[rank 4] Dispatch send/recv time: 28.88 us | Combine send/recv time: 39.44 us
          
[rank 1] Dispatch send/recv time: 29.80 us | Combine send/recv time: 40.03 us
          
[rank 7] Dispatch send/recv time: 29.49 us | Combine send/recv time: 39.66 us
          
[rank0]:[W225 16:48:30.888565952 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
      

Summary of Test Results

Intra-node / inter-node bandwidth tests

| num_tokens | hidden | num_topk | Dtype | CommType | Bandwidth (GB/s) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 7168 | 8 | FP8 | Dispatch (intranode) | 288.95 (NVLink) |
| 4096 | 7168 | 8 | BF16 | Dispatch (intranode) | 295.29 (NVLink) |
| 4096 | 7168 | 8 | N/A | Combine (intranode) | 331.61 (NVLink) |
| 4096 | 7168 | 8 (num_topk_groups=2) | FP8 | Dispatch (internode) | 40.93 (RDMA) |
| 4096 | 7168 | 8 (num_topk_groups=2) | BF16 | Dispatch (internode) | 54.19 (RDMA) |
| 4096 | 7168 | 8 (num_topk_groups=2) | N/A | Combine (internode) | 50.77 (RDMA) |

Note that this article uses H100 GPUs (full-spec, with a peak bidirectional NVLink bandwidth of 900 GB/s), so the NVLink numbers are noticeably higher than DeepSeek's official results measured on H800 (cut-down, peak bidirectional NVLink bandwidth of 400 GB/s), which is expected.
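
For a rough sense of how close these numbers get to the hardware limit, here is a back-of-the-envelope estimate; the ~450 GB/s unidirectional figure is an assumption (half of the 900 GB/s bidirectional peak), not an official achievable-bandwidth specification:

```python
# Back-of-the-envelope NVLink utilization on H100.
# Assumption: ~450 GB/s usable per direction (half of the 900 GB/s bidirectional peak).
assumed_unidirectional_peak = 450.0  # GB/s

measured = {
    "Dispatch FP8 (intranode)": 288.95,
    "Dispatch BF16 (intranode)": 295.29,
    "Combine (intranode)": 331.61,
}
for name, bw in measured.items():
    print(f"{name}: {bw:.2f} GB/s ~= {bw / assumed_unidirectional_peak:.0%} of the assumed peak")
```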

Low-latency test

| | Dispatch | Combine | Dispatch + Combine |
| --- | --- | --- | --- |
| Bandwidth | 37.8~43 GB/s | 42.5~43.78 GB/s | 42.96 GB/s |
| Latency | 172~198 us | 332~342 us | 513.24 us |
| Send/Recv latency | 28.88~30.64 us | 39.28~41.12 us | N/A |

Looking forward to what DeepSeek open-sources next.

