Introduction to DeepEP
On day two of its Open Source Week, DeepSeek released DeepEP, an expert-parallel (EP) communication library for accelerating MoE model training and inference. It offers highly optimized all-to-all communication, supports both intra-node (NVLink) and inter-node (RDMA) transfers, and ships kernels tuned for high throughput (training and the inference prefill phase) as well as low latency (the inference decode phase). It also supports native FP8 dispatch and flexible control over GPU resources (SMs), which makes it easier to overlap communication with computation.
GitHub: https://github.com/deepseek-ai/DeepEP
A Recap of the DeepSeek MoE Architecture
DeepSeek first published and open-sourced DeepSeekMoE, an MoE architecture for language models, in January 2024. Through fine-grained expert segmentation and shared-expert isolation, DeepSeekMoE achieves markedly higher expert specialization and better performance than mainstream MoE architectures. A schematic of DeepSeekMoE is shown below (figure from the paper: https://arxiv.org/pdf/2401.06066):
Figure (a) shows an MoE layer with the conventional top-2 routing strategy.
Figure (b) illustrates the fine-grained expert segmentation strategy.
Figure (c) shows the integration of shared-expert isolation, which together forms the complete DeepSeekMoE architecture.
Note: all three architectures keep the number of expert parameters and the computational cost unchanged.
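To make the figure concrete, below is a minimal single-GPU PyTorch sketch of a DeepSeekMoE-style layer: a few shared experts that every token passes through, plus fine-grained routed experts selected per token by a top-k router. All sizes and the plain softmax router are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch of a DeepSeekMoE-style layer (fine-grained routed experts +
# always-on shared experts). Real implementations add load-balancing losses,
# capacity handling, and fused/parallel expert kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, hidden, inter):
        super().__init__()
        self.up = nn.Linear(hidden, inter, bias=False)
        self.gate = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))  # SwiGLU-style FFN

class DeepSeekMoELikeLayer(nn.Module):
    def __init__(self, hidden=512, inter=128, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        self.routed = nn.ModuleList([Expert(hidden, inter) for _ in range(n_routed)])
        self.shared = nn.ModuleList([Expert(hidden, inter) for _ in range(n_shared)])
        self.router = nn.Linear(hidden, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: [num_tokens, hidden]
        scores = self.router(x).softmax(dim=-1)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)  # [T, k]
        out = torch.zeros_like(x)
        for e in self.shared:                               # shared experts: every token
            out = out + e(x)
        for eid, e in enumerate(self.routed):               # routed experts: only routed tokens
            token_ids, slot = (topk_idx == eid).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += topk_w[token_ids, slot].unsqueeze(-1) * e(x[token_ids])
        return out

if __name__ == "__main__":
    layer = DeepSeekMoELikeLayer()
    print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```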
DeepSeek trained DeepSeekMoE 16B (2.8B activated parameters) on 2T tokens, using only about 40% of the compute of DeepSeek 7B and LLaMA 2 7B while matching their benchmark performance.
The 16B-parameter MoE model has been open-sourced.
HuggingFace:
https://huggingface.co/deepseek-ai/deepseek-moe-16b-base
https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat
In the same paper, DeepSeek also explored scaling DeepSeekMoE to 145B parameters (22.2B activated), showing performance comparable to DeepSeek 67B while using only 28.5% (possibly as little as 18.2%) of the compute.
The DeepSeekMoE release did not attract much attention. The release that truly shook the industry came in May 2024: DeepSeek V2, a mixture-of-experts (MoE) language model characterized by economical training and efficient inference. It has 236B total parameters, of which 21B are activated per token, and supports a context length of 128K tokens.
DeepSeek-V2 adopts an innovative architecture built around Multi-head Latent Attention (MLA) and DeepSeekMoE. DeepSeek V2 set off the LLM API price war, which I covered at the time in 《写在云厂商 LLM API 价格调整后》 and 《盘点国内外大模型推理服务 API 价格》.
Compared with DeepSeek 67B, DeepSeek-V2 delivers significantly stronger performance while cutting training cost by 42.5%, reducing the KV cache by 93.3%, and raising the maximum generation throughput by 5.76x.
The DeepSeek V2 model has 60 layers with a hidden size of 5120. All FFNs except the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 160 routed experts, each with an intermediate hidden size of 1536; among the routed experts, 6 are activated per token.
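For reference, here are the same V2 MoE hyperparameters collected into a small Python dict, using the Hugging Face config field names that also appear in the V2-to-V3 diff later in this post (an illustrative subset, not the full config):

```python
# Key MoE-related fields of the DeepSeek-V2 config, as summarized above.
deepseek_v2_moe_config = {
    "num_hidden_layers": 60,
    "hidden_size": 5120,
    "first_k_dense_replace": 1,      # only the first layer keeps a dense FFN
    "n_shared_experts": 2,
    "n_routed_experts": 160,
    "moe_intermediate_size": 1536,   # per-expert intermediate hidden size
    "num_experts_per_tok": 6,        # top-6 routed experts per token
}
print(deepseek_v2_moe_config)
```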
Because DeepSeek-V2 activates fewer parameters per token and requires fewer FLOPs than DeepSeek 67B, training it is, in theory, more economical. Although training an MoE model introduces extra communication overhead, operator and communication optimizations let DeepSeek-V2 training reach a relatively high model FLOPs utilization (MFU). In DeepSeek's actual training on an H800 cluster, each trillion training tokens took 300.6K GPU hours for DeepSeek 67B but only 172.8K GPU hours for DeepSeek-V2; in other words, the sparse DeepSeek-V2 saves 42.5% of the training cost compared with the dense DeepSeek 67B.
To deploy DeepSeek-V2 efficiently, its parameters are first converted to FP8 precision. In addition, KV-cache quantization compresses each element of the KV cache to an average of 6 bits. Thanks to MLA and these optimizations, the deployed DeepSeek-V2 needs significantly less KV cache than DeepSeek 67B and can therefore serve much larger batch sizes. Evaluated on the prompt and generation length distribution of the deployed DeepSeek 67B service, DeepSeek-V2 reaches a generation throughput of more than 50K tokens per second on a single node with 8 H800 GPUs, 5.76x the maximum generation throughput of DeepSeek 67B, plus a prompt prefill throughput of more than 100K tokens per second.
DeepSeek V2 cut prices on genuine efficiency gains (1 RMB per million input tokens, 2 RMB per million output tokens) rather than fighting a price war on subsidies like other vendors.
On September 5, 2024, DeepSeek officially released DeepSeek-V2.5, which retains the general conversational ability of the original Chat model and the strong coding ability of the Coder model while being better aligned with human preferences. DeepSeek-V2.5 also improves substantially on writing, instruction following, and other tasks. DeepSeek-V2.5 shares exactly the same architecture as V2, and the API remains compatible. The lineage from V2 to V2.5 is shown in the figure below:
DeepSeek V2 paper: https://arxiv.org/pdf/2405.04434
HuggingFace weights (V2, V2-Lite, and V2.5):
https://huggingface.co/collections/deepseek-ai/deepseek-v2-669a1c8b8f2dbc203fbd7746
On December 26, 2024, DeepSeek launched and open-sourced the all-new DeepSeek-V3. With 671B total parameters and 37B activated per token, it beats other open-source models such as Qwen2.5-72B and Llama-3.1-405B on many benchmarks and is on par with leading closed-source models like GPT-4o and Claude-3.5-Sonnet.
The DeepSeek V3 API is priced at 0.5 RMB per million input tokens on a cache hit (2 RMB on a cache miss) and 8 RMB per million output tokens.
Compared with DeepSeek V2.5, V3 keeps the same basic MLA/MoE architecture but changes the following hyperparameters (a config-diff sketch follows the list):
- first_k_dense_replace increased from 1 to 3;
- hidden_size increased from 5120 to 7168;
- intermediate_size increased from 12288 to 18432;
- moe_intermediate_size increased from 1536 to 2048;
- n_routed_experts increased from 160 to 256;
- n_shared_experts decreased from 2 to 1;
- norm_topk_prob changed from false to true;
- num_hidden_layers increased from 60 to 61;
- num_nextn_predict_layers set to 1;
- quantization_config added, enabling FP8 quantization;
- num_experts_per_tok increased from 6 to 8;
- routed_scaling_factor changed from 16.0 to 2.5;
- scoring_func changed from softmax to sigmoid;
- topk_group increased from 3 to 4 (related to num_experts_per_tok);
- topk_method changed from group_limited_greedy to noaux_tc;
- vocab_size increased from 102400 to 129280.
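To reproduce this diff yourself, you can pull both configs from Hugging Face and compare them field by field. This is a minimal sketch: the repo IDs below are my assumption of the released checkpoint names (check the collection linked earlier), and it needs network access plus `transformers` installed.

```python
# Compare the DeepSeek-V2.5 and DeepSeek-V3 configs field by field.
from transformers import AutoConfig

v25 = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2.5", trust_remote_code=True).to_dict()
v3 = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True).to_dict()

for key in sorted(set(v25) | set(v3)):
    if v25.get(key) != v3.get(key):
        print(f"{key}: {v25.get(key)} -> {v3.get(key)}")
```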
DeepSeek V3 and the later R1 share exactly the same architecture and hyperparameters, so I will not repeat them here.
This quick review of DeepSeek's main models shows that the team has consistently pushed for the best possible trade-off between efficiency and quality, repeatedly raising the ceiling with innovative architectures and strong engineering.
DeepEP Performance Test
As the previous section shows, an MoE layer is sparse: not every expert participates in the computation, and the router decides which experts each token activates. In an implementation, communication therefore centers on two steps, dispatch (sending each token to its selected experts) and combine (gathering the expert outputs back to the tokens). Because the experts are fine-grained, each expert's compute is relatively small, so the communication is the main optimization target. DeepEP is the library that accelerates exactly this MoE communication.
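To make the two steps concrete, here is a conceptual single-process PyTorch sketch of dispatch and combine. It only illustrates the data-movement pattern; DeepEP performs the same grouping and reduction across GPUs and nodes with NVLink/RDMA all-to-all kernels.

```python
# Toy dispatch/combine for an MoE layer (single process, no distributed comm).
import torch

num_tokens, hidden, num_experts, top_k = 16, 8, 4, 2
x = torch.randn(num_tokens, hidden)
scores = torch.randn(num_tokens, num_experts).softmax(dim=-1)
topk_w, topk_idx = scores.topk(top_k, dim=-1)            # routing decision

# Dispatch: group (token, expert) pairs by expert id so each expert sees a
# contiguous batch of the tokens routed to it.
flat_expert = topk_idx.reshape(-1)                        # [num_tokens * top_k]
flat_token = torch.arange(num_tokens).repeat_interleave(top_k)
order = torch.argsort(flat_expert)                        # sort pairs by expert
dispatched = x[flat_token[order]]                         # expert-major token batch
tokens_per_expert = torch.bincount(flat_expert, minlength=num_experts)

# (each expert would now run its FFN on its slice of `dispatched`)
expert_out = dispatched * 2.0                             # stand-in for expert FFNs

# Combine: scatter expert outputs back to token order, weighted by the router
# probabilities, summing the top-k contributions per token.
out = torch.zeros_like(x)
out.index_add_(0, flat_token[order],
               expert_out * topk_w.reshape(-1)[order].unsqueeze(-1))
print(tokens_per_expert, out.shape)
```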
As in the earlier post 《DeepSeek 开源周(一):FlashMLA 在 H100 上的性能实测》, the tests in this post must run on H100 or H800 machines, and only on the NVLink (SXM) variant, not the PCIe variant.
My environment (two H100 servers with identical specs):
| Item | Details |
| --- | --- |
| GPU | NVIDIA H100 80GB HBM3 ×8; driver 535.161.08; CUDA 12.6 |
| NIC | Mellanox ConnectX-7 (400Gb) ×8 (InfiniBand); driver MLNX_OFED_LINUX-23.10-0.5.5.0 |
| OS | Ubuntu-Server 22.04.3 LTS amd64; kernel 5.15.0-91-generic |
| Python | 3.12.7 |
| PyTorch | 2.6.0 |
Before starting, make sure the GPU driver is installed correctly and the nvidia-fabricmanager service is running; ideally run a few CUDA samples to verify.
One more important point: the following steps require root on the physical host. If you only have access to a container, you cannot complete the kernel-module update below. The host also needs one reboot along the way, so if it is running important training jobs, checkpoint and migrate them first. If you are unsure whether you have a physical host or a container, a simple test is whether you can start another container inside the environment: if you can, it is a physical host; otherwise it is a container.
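As a complement to the CUDA samples, a quick PyTorch sanity check (a minimal sketch; it assumes PyTorch with CUDA support is already installed) can confirm that the driver is visible and that peer-to-peer access works between GPUs:

```python
# Quick GPU sanity check before starting.
import torch

assert torch.cuda.is_available(), "CUDA is not available - check the driver install"
n = torch.cuda.device_count()
print(f"visible GPUs: {n}")
for i in range(n):
    print(i, torch.cuda.get_device_name(i))
if n >= 2:
    # On NVSwitch systems this also depends on nvidia-fabricmanager being healthy.
    print("GPU0 <-> GPU1 peer access:", torch.cuda.can_device_access_peer(0, 1))
```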
Install required dependencies
apt install build-essential git cmake ninja-build devscripts
Build and install GDRCopy
GDRCopy is a low-latency GPU memory copy library developed by NVIDIA, built on top of GPUDirect RDMA.
The following steps require root and must be run on the physical host:
git clone https://github.com/NVIDIA/gdrcopy
cd gdrcopy
make -j8
make prefix=/opt/gdrcopy install
cd packages
CUDA=/usr/local/cuda ./build-deb-packages.sh
apt install ./*.deb
# If the gdrdrv dkms build fails because the module source tree is named 2.5
# while the package expects 2.5-1, link it and retry the install:
cd /var/lib/dkms/gdrdrv/
ln -sf 2.5 2.5-1
cd -
apt install ./*.deb
gdrcopy_copybw
The output:
# gdrcopy_copybw
GPU id:0; name: NVIDIA H100 80GB HBM3; Bus id: 0000:18:00
GPU id:1; name: NVIDIA H100 80GB HBM3; Bus id: 0000:2a:00
GPU id:2; name: NVIDIA H100 80GB HBM3; Bus id: 0000:3a:00
GPU id:3; name: NVIDIA H100 80GB HBM3; Bus id: 0000:5d:00
GPU id:4; name: NVIDIA H100 80GB HBM3; Bus id: 0000:9a:00
GPU id:5; name: NVIDIA H100 80GB HBM3; Bus id: 0000:ab:00
GPU id:6; name: NVIDIA H100 80GB HBM3; Bus id: 0000:ba:00
GPU id:7; name: NVIDIA H100 80GB HBM3; Bus id: 0000:db:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7fc901e00000
map_d_ptr: 0x7fc92a85c000
info.va: 7fc901e00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7fc92a85c000
writing test, size=131072 offset=0 num_iters=10000
write BW: 14752.2MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 410.663MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
Build and install NVSHMEM
NVIDIA NVSHMEM is a programming interface that implements the partitioned global address space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface for allocating memory that is symmetrically distributed across the GPUs. In addition to the CPU-side API, it offers a CUDA kernel-side API that allows CUDA threads to access any location in the symmetrically distributed memory.
The NVSHMEM source must be downloaded from NVIDIA's developer site (registration required):
https://developer.nvidia.com/nvshmem-archive
You need to register with an email address and log in to get the download link.
Download version 3.1.7. After downloading, extract it and apply the patch shipped with DeepEP:
git clone https://github.com/deepseek-ai/DeepEP
tar xvf nvshmem_src_3.1.7-1.txz
cd nvshmem_src/
git apply ../DeepEP/third-party/nvshmem.patch
The following steps require rebooting the physical host; treat them as a high-risk operation:
# Enable IBGDA
echo "options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords=\"PeerMappingOverride=1;\"" > /etc/modprobe.d/nvidia.conf
update-initramfs -u
reboot
After the reboot, build NVSHMEM:
cd nvshmem_src/
MPI_HOME=/usr/mpi/gcc/openmpi-4.1.7a1 \
CUDA_HOME=/usr/local/cuda \
GDRCOPY_HOME=/opt/gdrcopy \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/opt/nvshmem
cd build/
make -j8
make install
Set the environment variables:
export NVSHMEM_DIR=/opt/nvshmem
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH"
export PATH="${NVSHMEM_DIR}/bin:$PATH"
Verify that this step succeeded:
# nvshmem-info -a
NVSHMEM v3.1.7
Build Information:
CUDA API 12060
CUDA Driver 12020
Build Timestamp Feb 25 2025 14:32:50
Build Variables
NVSHMEM_DEBUG=OFF NVSHMEM_DEVEL=OFF NVSHMEM_DEFAULT_PMI2=OFF
NVSHMEM_DEFAULT_PMIX=OFF NVSHMEM_DEFAULT_UCX=OFF NVSHMEM_DISABLE_COLL_POLL=ON
NVSHMEM_ENABLE_ALL_DEVICE_INLINING=OFF NVSHMEM_GPU_COLL_USE_LDST=OFF
NVSHMEM_IBGDA_SUPPORT=ON NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=OFF
NVSHMEM_IBDEVX_SUPPORT=OFF NVSHMEM_IBRC_SUPPORT=ON
NVSHMEM_MPI_SUPPORT=ON NVSHMEM_NVTX=ON NVSHMEM_PMIX_SUPPORT=OFF
NVSHMEM_SHMEM_SUPPORT=OFF NVSHMEM_TEST_STATIC_LIB=OFF
NVSHMEM_TIMEOUT_DEVICE_POLLING=OFF NVSHMEM_TRACE=OFF NVSHMEM_UCX_SUPPORT=OFF
NVSHMEM_USE_DLMALLOC=OFF NVSHMEM_USE_NCCL=OFF NVSHMEM_USE_GDRCOPY=ON
NVSHMEM_VERBOSE=OFF CUDA_HOME=/usr/local/cuda GDRCOPY_HOME=/usr/local/gdrdrv
LIBFABRIC_HOME=/usr/local/libfabric MPI_HOME=/usr/local/ompi
NCCL_HOME=/usr/local/nccl NVSHMEM_PREFIX=/usr/local/nvshmem PMIX_HOME=/usr
SHMEM_HOME=/usr/local/ompi UCX_HOME=/usr/local/ucx
Standard options:
NVSHMEM_VERSION false (type: bool, default: false)
Print library version at startup
NVSHMEM_INFO false (type: bool, default: false)
Print environment variable options at startup
NVSHMEM_DISABLE_NVLS false (type: bool, default: false)
Disable NVLS SHARP resources for collectives, even if available for platform
NVSHMEM_SYMMETRIC_SIZE 1073741824 (type: size, default: 1073741824)
Specifies the size (in bytes) of the symmetric heap memory per PE. The
size is implementation-defined and must be at least as large as the integer
ceiling of the product of the numeric prefix and the scaling factor. The
character suffixes for the scaling factor are as follows:
* k or K multiplies by 2^10 (kibibytes)
* m or M multiplies by 2^20 (mebibytes)
* g or G multiplies by 2^30 (gibibytes)
* t or T multiplies by 2^40 (tebibytes)
For example, string '20m' is equivalent to the integer value 20971520, or 20
mebibytes. Similarly the string '3.1M' is equivalent to the integer value
3250586. Only one multiplier is recognized and any characters following the
multiplier are ignored, so '20kk' will not produce the same result as '20m'.
Usage of string '.5m' will yield the same result as the string '0.5m'.
An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM
library shall report by either returning a nonzero value from
nvshmem_init_thread or causing program termination.
NVSHMEM_DEBUG "" (type: string, default: "")
Set to enable debugging messages.
Optional values: VERSION, WARN, INFO, ABORT, TRACE
Bootstrap options:
NVSHMEM_BOOTSTRAP "PMI" (type: string, default: "PMI")
Name of the default bootstrap that should be used to initialize NVSHMEM.
Allowed values: PMI, MPI, SHMEM, plugin, UID
NVSHMEM_BOOTSTRAP_PMI "PMI" (type: string, default: "PMI")
Name of the PMI bootstrap that should be used to initialize NVSHMEM.
Allowed values: PMI, PMI-2, PMIX
NVSHMEM_BOOTSTRAP_PLUGIN "" (type: string, default: "")
Absolute path to or name of the bootstrap plugin file to load when
NVSHMEM_BOOTSTRAP=plugin is specified
NVSHMEM_BOOTSTRAP_MPI_PLUGIN "nvshmem_bootstrap_mpi.so" (type: string, default: "nvshmem_bootstrap_mpi.so")
Absolute path to or name of the MPI bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN "nvshmem_bootstrap_shmem.so" (type: string, default: "nvshmem_bootstrap_shmem.so")
Absolute path to or name of the SHMEM bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMI_PLUGIN "nvshmem_bootstrap_pmi.so" (type: string, default: "nvshmem_bootstrap_pmi.so")
Absolute path to or name of the PMI bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMI2_PLUGIN "nvshmem_bootstrap_pmi2.so" (type: string, default: "nvshmem_bootstrap_pmi2.so")
Absolute path to or name of the PMI-2 bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_PMIX_PLUGIN "nvshmem_bootstrap_pmix.so" (type: string, default: "nvshmem_bootstrap_pmix.so")
Absolute path to or name of the PMIx bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
NVSHMEM_BOOTSTRAP_UID_PLUGIN "nvshmem_bootstrap_uid.so" (type: string, default: "nvshmem_bootstrap_uid.so")
Absolute path to or name of the UID bootstrap plugin file.
NVSHMEM will search for the plugin based on linux linker priorities. See man
dlopen
Additional options:
NVSHMEM_CUDA_PATH "" (type: string, default: "")
Path to directory containing libcuda.so (for use when not in default location)
NVSHMEM_DEBUG_ATTACH_DELAY 0 (type: int, default: 0)
Delay (in seconds) during the first call to NVSHMEM_INIT to allow for attaching
a debuggger (Default 0)
NVSHMEM_DEBUG_FILE "" (type: string, default: "")
Debugging output filename, may contain %h for hostname and %p for pid
NVSHMEM_MAX_TEAMS 32 (type: long, default: 32)
Maximum number of simultaneous teams allowed
NVSHMEM_MAX_P2P_GPUS 128 (type: int, default: 128)
Maximum number of P2P GPUs
NVSHMEM_MAX_MEMORY_PER_GPU 137438953472 (type: size, default: 137438953472)
Maximum memory per GPU
NVSHMEM_DISABLE_CUDA_VMM false (type: bool, default: false)
Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled
on x86 and disabled on P9. CUDA VMM feature in NVSHMEM requires CUDA RT version
and CUDA Driver version to be greater than or equal to 11.3.
NVSHMEM_DISABLE_P2P false (type: bool, default: false)
Disable P2P connectivity of GPUs even when available
NVSHMEM_IGNORE_CUDA_MPS_ACTIVE_THREAD_PERCENTAGE false (type: bool, default: false)
When doing Multi-Process Per GPU (MPG) run, full API support is available only
if sum of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE of processes running on a GPU is <=
100%. Through this variable, user can request NVSHMEM runtime to ignore the
active thread percentage and allow full MPG support. Users enable it at their
own risk as NVSHMEM might deadlock.
NVSHMEM_CUMEM_GRANULARITY 536870912 (type: size, default: 536870912)
Granularity for cuMemAlloc/cuMemCreate
NVSHMEM_PROXY_REQUEST_BATCH_MAX 32 (type: int, default: 32)
Maxmum number of requests that the proxy thread processes in a single iteration
of the progress loop.
Collectives options:
NVSHMEM_DISABLE_NCCL false (type: bool, default: false)
Disable use of NCCL for collective operations
NVSHMEM_BARRIER_DISSEM_KVAL 2 (type: int, default: 2)
Radix of the dissemination algorithm used for barriers
NVSHMEM_BARRIER_TG_DISSEM_KVAL 2 (type: int, default: 2)
Radix of the dissemination algorithm used for thread group barriers
NVSHMEM_FCOLLECT_LL_THRESHOLD 2048 (type: size, default: 2048)
Message size threshold up to which fcollect LL algo will be used
NVSHMEM_REDUCE_SCRATCH_SIZE 524288 (type: size, default: 524288)
Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by
runtime for every team to implement reduce and reducescatter collectives
NVSHMEM_BCAST_ALGO 0 (type: int, default: 0)
Broadcast algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDMAXLOC_ALGO 1 (type: int, default: 1)
Reduction algorithm to be used for MAXLOC operation.
* 1 - default, flag alltoall algorithm
* 2 - flat reduce + flat bcast
* 3 - topo-aware two-level reduce + topo-aware bcast
Transport options:
NVSHMEM_REMOTE_TRANSPORT "ibrc" (type: string, default: "ibrc")
Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none
NVSHMEM_ENABLE_NIC_PE_MAPPING false (type: bool, default: false)
When not set or set to 0, a PE is assigned the NIC on the node that is closest
to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a
round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they
are specified.
NVSHMEM_DISABLE_LOCAL_ONLY_PROXY false (type: bool, default: false)
When running on an NVLink-only configuaration (No-IB, No-UCX), completely
disable the proxy thread. This will disable device side global exit and device
side wait timeout polling (enabled by NVSHMEM_TIMEOUT_DEVICE_POLLING build-time
variable) because these are processed by the proxy thread.
NVSHMEM_IB_ENABLE_IBGDA false (type: bool, default: false)
Set to enable GPU-initiated communication transport.
Hidden options:
NVSHMEM_INFO_HIDDEN true (type: bool, default: false)
Print hidden environment variable options at startup
NVSHMEM_DISABLE_NVLS_SHARING true (type: bool, default: true)
Disable NVLS SHARP resource sharing for user-defined teams
NVSHMEM_HEAP_KIND "DEVICE" (type: string, default: "DEVICE")
Specify the memory kind used by the NVSHMEM symmetric heap.
Allowed values: VIDMEM, SYSMEM
NVSHMEM_ENABLE_RAIL_OPT false (type: bool, default: false)
Enable Rail Optimization when heap is in SYSMEM
NVSHMEM_BOOTSTRAP_TWO_STAGE false (type: bool, default: false)
Ignore CUDA device setting during initialization,forcing two-stage
initialization
NVSHMEM_DEBUG_SUBSYS "" (type: string, default: "")
Comma separated list of debugging message sources. Prefix with '^' to exclude.
Values: INIT, COLL, P2P, PROXY, TRANSPORT, MEM, BOOTSTRAP, TOPO, UTIL, ALL
NVSHMEM_ENABLE_ERROR_CHECKS false (type: bool, default: false)
Enable error checks
NVSHMEM_DISABLE_MNNVL false (type: bool, default: false)
Disable MNNVL connectivity for GPUs even when available
NVSHMEM_CUMEM_HANDLE_TYPE "FILE_DESCRIPTOR" (type: string, default: "FILE_DESCRIPTOR")
Handle type for cuMemCreate. Supported are - FABRIC or FILE_DESCRIPTOR
NVSHMEM_BYPASS_ACCESSIBILITY_CHECK false (type: bool, default: false)
Bypass peer GPU accessbility checks
NVSHMEM_FCOLLECT_NTHREADS 512 (type: int, default: 512)
Sets number of threads per block for fcollect collective.
By default, if no env is set, default value is min(max_occupancy per CTA, msg
size per PE).
If env is specified, value overrides the default irrespective of max occupancy
per CTA
NVSHMEM_REDUCESCATTER_NTHREADS 512 (type: int, default: 512)
Sets number of threads per block for reducescatter collective.
By default, if no env is set, default value is min(max_occupancy per CTA, msg
size per PE).
If env is specified, value overrides the default irrespective of max occupancy
per CTA
NVSHMEM_MAX_CTAS 1 (type: int, default: 1)
Sets number of blocks per grid for host onstream collective.
By default, if no env is set, default value to 1 CTA
If env is specified, value overrides the default value
NVSHMEM_REDUCE_RECEXCH_KVAL 2 (type: int, default: 2)
Radix of the recursive exchange reduction algorithm
NVSHMEM_FCOLLECT_LL128_THRESHOLD 0 (type: size, default: 0)
Message size threshold up to which the fcollect LL128 algo will be used.
LL128 will be used only when FCOLLECT_LL_THRESHOLD < size
NVSHMEM_FCOLLECT_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
Message size threshold up to which fcollect NVLS algo will be used
NVSHMEM_REDUCESCATTER_NVLS_THRESHOLD 16777216 (type: size, default: 16777216)
Message size threshold up to which reducescatter NVLS algo will be used
NVSHMEM_BCAST_TREE_KVAL 2 (type: int, default: 2)
Radix of the broadcast tree algorithm
NVSHMEM_FCOLLECT_ALGO 0 (type: int, default: 0)
Fcollect algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDUCE_ALGO 0 (type: int, default: 0)
Allreduce algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_REDUCESCATTER_ALGO 0 (type: int, default: 0)
Reduce Scatter algorithm to be used.
* 0 - use default algorithm selection strategy
NVSHMEM_ASSERT_ATOMICS_SYNC false (type: bool, default: false)
Bypass flush on wait_until at target
NVSHMEM_BYPASS_FLUSH false (type: bool, default: false)
Bypass flush in proxy when enforcing consistency
NVTX options:
NVSHMEM_NVTX "off" (type: string, default: "off")
Set to enable NVTX instrumentation. Accepts a comma separated list of
instrumentation groups. By default the NVTX instrumentation is disabled.
init : library setup
alloc : memory management
launch : kernel launch routines
coll : collective communications
wait : blocking point-to-point synchronization
wait_on_stream : point-to-point synchronization (on stream)
test : non-blocking point-to-point synchronization
memorder : memory ordering (quiet, fence)
quiet_on_stream : nvshmemx_quiet_on_stream
atomic_fetch : fetching atomic memory operations
atomic_set : non-fetchong atomic memory operations
rma_blocking : blocking remote memory access operations
rma_nonblocking : non-blocking remote memory access operations
proxy : activity of the proxy thread
common : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all : all groups
off : disable all NVTX instrumentation
Build and install DeepEP
This step is relatively simple; you just need a PyTorch environment:
conda create -n deepep python=3.12
conda activate deepep
pip install torch torchvision torchaudio
Then, from the DeepEP repository root, run:
python setup.py build
ln -s build/lib.linux-x86_64-cpython-312/deep_ep_cpp.cpython-312-x86_64-linux-gnu.so
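After the build and symlink, a quick import check from the repo root (with PYTHONPATH pointing at it, as done before the tests below) confirms the extension loads; Buffer is the main user-facing class shown in the DeepEP README.

```python
# Quick import check; run from the DeepEP repo root with PYTHONPATH set.
import deep_ep
print(deep_ep.Buffer)
```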
Run the intra-node test (a single node is enough)
python tests/test_intranode.py
The test output:
# export PYTHONPATH=$(pwd)
# python tests/test_intranode.py
[config] num_tokens=4096, hidden=7168, num_topk=8
[layout] Kernel performance: 0.050 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[tuning] SMs 24, NVL chunk 4: 276.67 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8: 288.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12: 268.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16: 262.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20: 254.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24: 251.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28: 248.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32: 246.71 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, 288.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4: 295.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8: 267.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12: 249.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16: 245.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20: 240.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24: 237.01 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28: 232.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32: 229.54 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 4, 295.29 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1: 159.08 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2: 285.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3: 322.23 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4: 331.61 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 4: 331.61 GB/s (NVL)
[rank0]:[W225 14:50:19.764550397 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Run the inter-node test (requires at least two nodes)
# node 0
export MASTER_ADDR=master_ip
export WORLD_SIZE=2
export RANK=0
python tests/test_internode.py
# node 1
export MASTER_ADDR=master_ip
export WORLD_SIZE=2
export RANK=1
python tests/test_internode.py
Test results:
# python tests/test_internode.py
[config] num_tokens=4096, hidden=7168, num_topk_groups=2, num_topk=8
[layout] Kernel performance: 0.050 ms
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 10.97 GB/s (RDMA), 35.81 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 17.77 GB/s (RDMA), 58.01 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 22.53 GB/s (RDMA), 73.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 25.73 GB/s (RDMA), 83.97 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 26.87 GB/s (RDMA), 87.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 32.59 GB/s (RDMA), 106.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 36.38 GB/s (RDMA), 118.74 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 38.04 GB/s (RDMA), 124.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 11.20 GB/s (RDMA), 36.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 18.50 GB/s (RDMA), 60.37 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 23.93 GB/s (RDMA), 78.12 GB/s (NVL)
^@[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 28.92 GB/s (RDMA), 94.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 32.54 GB/s (RDMA), 106.21 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 35.75 GB/s (RDMA), 116.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 37.07 GB/s (RDMA), 120.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 39.43 GB/s (RDMA), 128.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 10.57 GB/s (RDMA), 34.51 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 17.82 GB/s (RDMA), 58.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 22.11 GB/s (RDMA), 72.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 26.70 GB/s (RDMA), 87.16 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 32.15 GB/s (RDMA), 104.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 36.08 GB/s (RDMA), 117.78 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 37.26 GB/s (RDMA), 121.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 39.20 GB/s (RDMA), 127.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 11.29 GB/s (RDMA), 36.84 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 18.36 GB/s (RDMA), 59.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 23.36 GB/s (RDMA), 76.25 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 29.11 GB/s (RDMA), 95.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 33.19 GB/s (RDMA), 108.35 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 36.45 GB/s (RDMA), 118.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 38.80 GB/s (RDMA), 126.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 40.93 GB/s (RDMA), 133.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 11.23 GB/s (RDMA), 36.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 17.81 GB/s (RDMA), 58.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 24.11 GB/s (RDMA), 78.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 28.71 GB/s (RDMA), 93.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 33.50 GB/s (RDMA), 109.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 35.70 GB/s (RDMA), 116.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 38.64 GB/s (RDMA), 126.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 40.59 GB/s (RDMA), 132.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 11.20 GB/s (RDMA), 36.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 18.50 GB/s (RDMA), 60.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 24.05 GB/s (RDMA), 78.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 29.41 GB/s (RDMA), 96.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 32.84 GB/s (RDMA), 107.18 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 35.91 GB/s (RDMA), 117.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 38.63 GB/s (RDMA), 126.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 39.95 GB/s (RDMA), 130.40 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 11.01 GB/s (RDMA), 35.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 18.05 GB/s (RDMA), 58.93 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 23.78 GB/s (RDMA), 77.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 28.78 GB/s (RDMA), 93.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 33.62 GB/s (RDMA), 109.73 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 36.42 GB/s (RDMA), 118.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 38.38 GB/s (RDMA), 125.26 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 40.04 GB/s (RDMA), 130.70 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 11.28 GB/s (RDMA), 36.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 18.35 GB/s (RDMA), 59.90 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 23.94 GB/s (RDMA), 78.15 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 29.36 GB/s (RDMA), 95.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 33.50 GB/s (RDMA), 109.33 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 35.51 GB/s (RDMA), 115.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 38.34 GB/s (RDMA), 125.15 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 39.71 GB/s (RDMA), 129.61 GB/s (NVL)
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 16, RDMA chunk 32: 40.93 GB/s (RDMA), 133.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 4: 17.94 GB/s (RDMA), 58.57 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 24.68 GB/s (RDMA), 80.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 33.81 GB/s (RDMA), 110.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 39.38 GB/s (RDMA), 128.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 44.11 GB/s (RDMA), 143.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 48.23 GB/s (RDMA), 157.41 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 51.26 GB/s (RDMA), 167.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 51.84 GB/s (RDMA), 169.20 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4: 17.93 GB/s (RDMA), 58.51 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8: 25.27 GB/s (RDMA), 82.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12: 34.88 GB/s (RDMA), 113.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16: 41.21 GB/s (RDMA), 134.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20: 45.80 GB/s (RDMA), 149.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24: 50.08 GB/s (RDMA), 163.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28: 52.69 GB/s (RDMA), 171.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32: 53.79 GB/s (RDMA), 175.58 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4: 17.96 GB/s (RDMA), 58.63 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8: 25.22 GB/s (RDMA), 82.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12: 34.93 GB/s (RDMA), 114.03 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16: 41.28 GB/s (RDMA), 134.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20: 45.32 GB/s (RDMA), 147.94 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24: 50.14 GB/s (RDMA), 163.66 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28: 53.04 GB/s (RDMA), 173.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32: 53.74 GB/s (RDMA), 175.42 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4: 17.88 GB/s (RDMA), 58.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8: 25.23 GB/s (RDMA), 82.36 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12: 34.79 GB/s (RDMA), 113.54 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16: 41.19 GB/s (RDMA), 134.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20: 46.38 GB/s (RDMA), 151.38 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24: 50.38 GB/s (RDMA), 164.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28: 52.95 GB/s (RDMA), 172.83 GB/s (NVL)
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32: 53.61 GB/s (RDMA), 174.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4: 17.47 GB/s (RDMA), 57.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8: 24.37 GB/s (RDMA), 79.53 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12: 33.86 GB/s (RDMA), 110.52 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16: 40.59 GB/s (RDMA), 132.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20: 45.57 GB/s (RDMA), 148.75 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24: 48.75 GB/s (RDMA), 159.12 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28: 50.47 GB/s (RDMA), 164.72 GB/s (NVL)
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32: 53.81 GB/s (RDMA), 175.64 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4: 18.07 GB/s (RDMA), 58.97 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8: 25.25 GB/s (RDMA), 82.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12: 34.92 GB/s (RDMA), 113.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16: 40.75 GB/s (RDMA), 133.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20: 46.85 GB/s (RDMA), 152.91 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24: 49.97 GB/s (RDMA), 163.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28: 52.37 GB/s (RDMA), 170.92 GB/s (NVL)
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32: 54.19 GB/s (RDMA), 176.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4: 17.92 GB/s (RDMA), 58.49 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8: 25.43 GB/s (RDMA), 83.00 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12: 34.62 GB/s (RDMA), 112.99 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16: 41.16 GB/s (RDMA), 134.35 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20: 46.43 GB/s (RDMA), 151.56 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24: 49.62 GB/s (RDMA), 161.96 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28: 51.97 GB/s (RDMA), 169.65 GB/s (NVL)
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32: 53.93 GB/s (RDMA), 176.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4: 18.03 GB/s (RDMA), 58.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8: 25.24 GB/s (RDMA), 82.39 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12: 35.02 GB/s (RDMA), 114.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16: 41.45 GB/s (RDMA), 135.30 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20: 45.33 GB/s (RDMA), 147.95 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24: 49.53 GB/s (RDMA), 161.68 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28: 50.26 GB/s (RDMA), 164.06 GB/s (NVL)
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32: 54.14 GB/s (RDMA), 176.72 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 24, RDMA chunk 32: 54.19 GB/s (RDMA), 176.89 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 8: 28.80 GB/s (RDMA), 94.01 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12: 35.33 GB/s (RDMA), 115.31 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16: 39.71 GB/s (RDMA), 129.60 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20: 42.86 GB/s (RDMA), 139.88 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24: 45.25 GB/s (RDMA), 147.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28: 46.71 GB/s (RDMA), 152.46 GB/s (NVL)
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32: 48.84 GB/s (RDMA), 159.43 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8: 28.52 GB/s (RDMA), 93.09 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12: 36.09 GB/s (RDMA), 117.79 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16: 39.77 GB/s (RDMA), 129.82 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20: 43.46 GB/s (RDMA), 141.85 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24: 45.53 GB/s (RDMA), 148.61 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28: 47.19 GB/s (RDMA), 154.02 GB/s (NVL)
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32: 49.68 GB/s (RDMA), 162.14 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8: 27.93 GB/s (RDMA), 91.17 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12: 35.50 GB/s (RDMA), 115.87 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16: 40.53 GB/s (RDMA), 132.28 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20: 43.10 GB/s (RDMA), 140.69 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24: 45.95 GB/s (RDMA), 149.98 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28: 47.53 GB/s (RDMA), 155.13 GB/s (NVL)
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32: 50.00 GB/s (RDMA), 163.19 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8: 27.71 GB/s (RDMA), 90.45 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12: 35.08 GB/s (RDMA), 114.50 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16: 39.93 GB/s (RDMA), 130.32 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20: 43.41 GB/s (RDMA), 141.71 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24: 46.71 GB/s (RDMA), 152.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28: 48.25 GB/s (RDMA), 157.48 GB/s (NVL)
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32: 50.77 GB/s (RDMA), 165.71 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 4, RDMA chunk 32: 50.77 GB/s (RDMA), 165.71 GB/s (NVL)
[rank0]:[W225 16:44:30.570331162 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Run the low-latency test (requires at least two nodes)
# python tests/test_low_latency.py
Allocating buffer size: 2116.291072 MB ...
[rank 3] Dispatch + combine bandwidth: 42.96 GB/s, avg_t=513.21 us, min_t=508.26 us, max_t=522.21 us
[rank 5] Dispatch + combine bandwidth: 42.95 GB/s, avg_t=513.34 us, min_t=507.52 us, max_t=518.69 us
[rank 0] Dispatch + combine bandwidth: 42.95 GB/s, avg_t=513.41 us, min_t=505.60 us, max_t=522.69 us
[rank 1] Dispatch + combine bandwidth: 42.97 GB/s, avg_t=513.13 us, min_t=505.92 us, max_t=520.77 us
[rank 7] Dispatch + combine bandwidth: 42.97 GB/s, avg_t=513.08 us, min_t=506.66 us, max_t=519.55 us
[rank 2] Dispatch + combine bandwidth: 42.96 GB/s, avg_t=513.24 us, min_t=508.83 us, max_t=518.02 us
[rank 6] Dispatch + combine bandwidth: 42.97 GB/s, avg_t=513.07 us, min_t=505.66 us, max_t=520.06 us
[rank 4] Dispatch + combine bandwidth: 42.95 GB/s, avg_t=513.39 us, min_t=504.03 us, max_t=523.84 us
[rank 1] Dispatch bandwidth: 37.80 GB/s, avg_t=198.75 us | Combine bandwidth: 43.78 GB/s, avg_t=332.01 us
[rank 0] Dispatch bandwidth: 38.72 GB/s, avg_t=193.99 us | Combine bandwidth: 42.75 GB/s, avg_t=340.03 us
[rank 3] Dispatch bandwidth: 39.60 GB/s, avg_t=189.69 us | Combine bandwidth: 43.21 GB/s, avg_t=336.44 us
[rank 5] Dispatch bandwidth: 40.54 GB/s, avg_t=185.31 us | Combine bandwidth: 43.33 GB/s, avg_t=335.49 us
[rank 2] Dispatch bandwidth: 38.80 GB/s, avg_t=193.61 us | Combine bandwidth: 43.27 GB/s, avg_t=335.99 us
[rank 4] Dispatch bandwidth: 40.37 GB/s, avg_t=186.09 us | Combine bandwidth: 43.17 GB/s, avg_t=336.73 us
[rank 6] Dispatch bandwidth: 41.84 GB/s, avg_t=179.55 us | Combine bandwidth: 43.08 GB/s, avg_t=337.43 us
[rank 7] Dispatch bandwidth: 43.50 GB/s, avg_t=172.67 us | Combine bandwidth: 42.51 GB/s, avg_t=341.97 us
[rank 3] Dispatch send/recv time: 29.08 us | Combine send/recv time: 39.44 us
[rank 0] Dispatch send/recv time: 29.10 us | Combine send/recv time: 39.68 us
[rank 2] Dispatch send/recv time: 29.50 us | Combine send/recv time: 40.20 us
[rank 5] Dispatch send/recv time: 30.64 us | Combine send/recv time: 41.12 us
[rank 6] Dispatch send/recv time: 28.97 us | Combine send/recv time: 39.28 us
[rank 4] Dispatch send/recv time: 28.88 us | Combine send/recv time: 39.44 us
[rank 1] Dispatch send/recv time: 29.80 us | Combine send/recv time: 40.03 us
[rank 7] Dispatch send/recv time: 29.49 us | Combine send/recv time: 39.66 us
[rank0]:[W225 16:48:30.888565952 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Summary of test results
Intra-node / inter-node bandwidth tests
| num_tokens | hidden | num_topk | Dtype | Comm type | Bandwidth (GB/s) |
| --- | --- | --- | --- | --- | --- |
| 4096 | 7168 | 8 | FP8 | Dispatch, intra-node | 288.95 (NVLink) |
| 4096 | 7168 | 8 | BF16 | Dispatch, intra-node | 295.29 (NVLink) |
| 4096 | 7168 | 8 | N/A | Combine, intra-node | 331.61 (NVLink) |
| 4096 | 7168 | 8 (2 groups) | FP8 | Dispatch, inter-node | 40.93 (RDMA) |
| 4096 | 7168 | 8 (2 groups) | BF16 | Dispatch, inter-node | 54.19 (RDMA) |
| 4096 | 7168 | 8 (2 groups) | N/A | Combine, inter-node | 50.77 (RDMA) |
Note that these tests use full-spec H100s (NVLink peak bidirectional bandwidth 900 GB/s). Compared with DeepSeek's official results measured on the export-restricted H800 (NVLink peak bidirectional bandwidth 400 GB/s), the NVLink bandwidth here is noticeably higher, which is expected.
Low-latency test
| | Dispatch | Combine | Dispatch + Combine |
| --- | --- | --- | --- |
| Bandwidth | 37.8-43.5 GB/s | 42.5-43.78 GB/s | ~42.96 GB/s |
| Latency (avg) | 172-198 us | 332-342 us | ~513 us |
| Send/recv time | 28.88-30.64 us | 39.28-41.12 us | N/A |
Looking forward to what DeepSeek open-sources next.