AI 新云操作指南与架构分析【技术篇】



AI 新云操作指南与架构分析

AI Neocloud Playbook and Anatomy

关键词 :H100 租赁价格削减、AI 新云巨头和竞争者、H100 集群物料清单与集群部署、日常运营、成本优化、拥有成本与回报

H100 Rental Price Cuts, AI Neocloud Giants and Emerging Neoclouds, H100 Cluster Bill of Materials and Cluster Deployment, Day to Day Operations, Cost Optimizations, Cost of Ownership and Returns

作者 :Dylan Patel 和 Daniel Nishball

原文发表于 :2024 年 10 月 3 日

原文链接 (需向 Semianalysis 付费才能查看全部内容,关注本公众号并后台回复“AI 新云”或“Neocloud”即可获取完整报告):https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy

AI 新云的崛起引起了整个计算行业的广泛关注。从企业到初创公司,大家都在使用它们来获取 GPU 计算能力。即使拥有自己的数据中心建设和运营团队,微软每月仍在通过 AI 新云花费约 2 亿美元用于 GPU 计算。英伟达通过直接投资、大量分配其 GPU 资源,并在各种演讲和活动中给予赞誉,促进了多个 AI 新云的快速增长。

The rise of the AI Neoclouds has captivated the attention of the entire computing industry. Everyone is using them for access to GPU compute, from enterprises to startups. Even Microsoft is spending ~$200 million a month on GPU compute through AI Neoclouds despite having their own datacenter construction and operation teams. Nvidia has heralded the rapid growth of several AI Neoclouds through direct investments, large allocations of their GPUs, and accolades in various speeches and events.

所谓 AI 新云 ,指的是一种 专注于提供 GPU 计算租赁的新型云计算服务提供商 。这些纯 GPU 云为客户提供尖端的性能和灵活性,但支撑其运作的经济模型仍在不断演变,市场也在逐步了解其商业模式如何运作。

An AI Neocloud is defined as a new breed of cloud compute provider focused on offering GPU compute rental . These pure play GPU clouds offer cutting edge performance and flexibility to their customers, but the economics powering them are still evolving just as the market is learning how their business models work.

在这篇深度探讨的前半部分,我们将揭开运行新云的各个层面,包括如何制定集群物料清单 (BoM)、应对部署、资金和日常运营的复杂性。我们还将提供关于 BoM 和集群架构的几项重要建议。

In the first half of this deep dive, we will peel back the layers of running a Neocloud, from crafting a cluster Bill of Materials (BoM), to navigating the complexities of deployment, funding, and day-to-day operations. We will provide several key recommendations in terms of BoM and cluster architecture.

在报告的下半部分,我们将解释 AI 新云的经济模式,并详细讨论这些新云的市场策略、总体拥有成本 (TCO)、利润率、商业案例,以及在各种情况下的潜在投资回报。

In the second half of the report, we explain the AI Neocloud economy and discuss in detail these Neoclouds’ go to market strategies, total cost of ownership (TCO), margins, business case and potential return on investment for a variety of situations.

最后,我们将探讨 H100 GPU 租赁价格的快速变化,涵盖多个超大规模云服务商和新云提供商,讨论过去一个月按需定价的显著下降,以及 H100 GPU 合约定价的期限结构变化,并展望随着 Blackwell GPU 部署的市场演变。

Lastly, we will address the rapid shifts in H100 GPU rental pricing from a number of different hyperscalers and Neoclouds, discussing the meaningful declines in on-demand pricing in just the past month, as well as shifts in the term structure for H100 GPU contract pricing and how the market will evolve with upcoming deployments of Blackwell GPUs.

我们还提供更细致、更高频率的 GPU 定价数据,涵盖多种 SKU,您可以在我们的 AI GPU 租赁价格追踪器中找到。关于未来计算能力、计算成本以及对多个当前和未来 GPU SKU 的租赁价格预测的详细数据和模型,您可以参考我们的 AI 云 TCO 模型。

Further granularity and higher frequency data on GPU pricing across many SKUs is available in our AI GPU Rental Price Tracker. Granular data and modeling of future compute capacity, cost of compute and estimations of future GPU rental pricing for multiple current and future GPU SKUs can be found in our AI Cloud TCO model.


目录 Table of Contents

前言:巨头和竞争者 The Giant and the Emerging

  • 第一部分:如何构建 AI 新云 Part 1: How to build an AI Neocloud
  • 理解集群物料清单 (BoM) Understanding Cluster Bill of Materials
  • 机箱级物料清单 Compute Chassis Bill of Materials
  • 集群级别 - 网络物料清单 Cluster Level - Networking Bill of Materials
  • 优化后端网络 Optimizing the Back-end Network
  • 光纤与电缆网络的优化 Optimizing Optical vs Electrical Networking
  • 虚拟模块化交换机 Virtual Modular Switch
  • 超额订阅后端网络优化 Oversubscribing Backend Network Optimization
  • AI 新云存储 AI Neocloud Storage
  • 更多网络管理和软件包 More Network Management and Software Packages
  • 集群 BoM 资本支出总结:参考架构 vs SemiAnalysis 优化架构 Summary of Cluster BoM Capex: Reference Architecture vs SemiAnalysis Optimized Architecture
  • 驱动程序、用户体验和软件 Drivers, User Experience and Software
  • 多租户架构 Multitenancy
  • 裸机或虚拟化 Bare Metal or Virtualization
  • 监控和常见错误 Monitoring and Common Errors
  • 更多技巧与测试 More Tips and Tests
  • 集群部署与验收测试 Cluster Deployment and Acceptance Test
  • 日常运营 Day to Day Operations
  • 第二部分:AI 新云经济 Part 2: The AI Neocloud Economy
  • 需求生态系统 Demand Ecosystem
  • 决策链、购物流程 Decision Chain, Shopping Process
  • 定价与合同类型 Pricing and Contract Types
  • 定价趋势 Pricing Trends
  • 集群资本拥有成本 Cluster Capital Cost of Ownership
  • 资本成本 Cost of Capital
  • 集群运营拥有成本 Cluster Operating Cost of Ownership
  • 项目和股权回报、总拥有成本与新云商业案例 Project and Equity Returns, Total Cost of Ownership and Neocloud Business Case
  • 上市时间 Time to Market
  • 未来定价展望 Future Pricing Thoughts

巨头与竞争者

The Giant and the Emerging

AI 新云市场由四类主要提供商组成:传统的超大规模云服务商(Hyperscalers)、新云巨头、正在崛起的新云,以及经纪人/平台/聚合商。

The AI Neocloud market is served by four main categories of providers: Traditional Hyperscalers, Neocloud Giants, Emerging Neoclouds, and Brokers/Platforms/Aggregators.

AI 新云市场规模庞大,是推动 GPU 需求增长的最重要动力之一。总体来看,新云的需求将增长到占总需求的三分之一以上。

The AI Neocloud market is huge and is the most meaningful incremental driver of GPU demand. In broad strokes, we see the Neoclouds growing to more than a third of total demand.

提供 AI 云服务的传统超大规模云服务商包括 Google Cloud (GCP)、Microsoft Azure、Amazon Web Services (AWS)、Oracle、腾讯、百度、阿里巴巴。相比之下,尽管 Meta、xAI、字节跳动和特斯拉也拥有强大的 GPU 集群并计划大幅扩容,但它们目前并不提供 AI 服务,因此不属于这一类。

Traditional hyperscalers offering AI cloud services include Google Cloud (GCP), Microsoft Azure, Amazon Web Services (AWS), Oracle, Tencent, Baidu, Alibaba. In contrast, Meta, xAI, ByteDance and Tesla, despite also having formidable GPU fleets and considerable capacity expansion plans, do not currently offer AI services, and thus do not fall into this group.

传统超大规模云服务商多元化的商业模式使其资本成本最低,但由于它们拥有集成的生态系统、数据湖以及现有的企业客户群,定价通常比其他服务商高出许多。此外,超大规模云服务商在其云业务上通常拥有高利润率,因此定价远高于 AI 云的合理需求水平。

Traditional hyperscalers’ diversified business models allow them the lowest cost of capital, but their integrated ecosystems, data lakes, and existing enterprise customer bases mean very premium pricing compared to others. Hyperscalers also tend to earn high margins on their cloud business and so pricing is set much higher than reasonable for AI cloud purposes.

AI 新云巨头不同于传统的超大规模云服务商,它们几乎完全专注于 GPU 云服务。最大的几家公司在所有数据中心的现有或计划中的 GPU 规模总和远超 10 万 H100 等效 GPU,一些公司计划为 OpenAI 配备数十万台 Blackwell GPU。三大新云巨头分别是 Crusoe、Lambda Labs 和 Coreweave,后者是其中规模最大的。与超大规模云服务商相比,它们的资本成本更高,但通常可以以更合理的利率获取资本,这使得它们的拥有成本低于正在崛起的新云服务商。

AI Neocloud Giants, unlike traditional hyperscalers, focus almost exclusively on GPU Cloud services. The largest have current or planned capacity in the next few years well in excess of 100k H100 equivalents in aggregate across all their sites, with some planning for hundreds of thousands of Blackwell GPUs for OpenAI. The main three Neocloud Giants are Crusoe, Lambda Labs, and Coreweave, which is by far the largest. They have a higher cost of capital compared to the hyperscalers but usually have better access to capital at reasonable rates vs Emerging AI Neoclouds, which means a lower comparative cost of ownership for Neocloud Giants.

正在崛起的 AI 新云包括几十家仍处于初期阶段的小型云提供商,它们的容量有限,并且在数据中心基础设施的运营方面经验不足。这些新兴企业的资本成本通常较高,是我们今天重点讨论的类别。此外,还有许多区域性玩家被归入主权 AI 的范畴,所谓主权 AI 是指那些专注于为美国或中国以外的次级区域提供 AI 云服务的新云。

Emerging AI Neoclouds include a long tail of a couple dozen clouds that we track that still have small amounts of capacity and are relatively inexperienced at running datacenter infrastructure. These upstarts usually have a higher cost of capital and are the category we will focus most of our time on today. Also included amongst Emerging Neoclouds are many regional players that fall under the Sovereign AI umbrella, which is defined as any AI Neocloud that focuses its business model on the provision of AI Cloud services to secondary regions outside of the US or China.

这些区域在 AI 技术上远远落后,涵盖欧洲、印度、中东、马来西亚等地。它们的客户通常希望将 GPU 计算保留在美国或中国以外,原因包括监管、隐私、数据安全或其他商业考虑。尽管大多数新兴新云的 GPU 规模不足 1 万台,或尚未部署 GPU,但其中一些公司有极具雄心的计划,可能很快跻身新云巨头之列。

These regions are currently far behind in AI technology and include Europe, India, the Middle East, Malaysia, etc. In particular, their customers generally would like to keep their GPU compute out of the US or China due to regulatory, privacy, data security or other business reasons. While most Emerging Neoclouds either have less than 10k GPUs or have yet to deploy GPUs, many of them have extremely ambitious plans that could soon catapult a few of them into the same league as the Neocloud Giants.

最后,经纪人、平台和聚合商主要负责需求和供给的聚合,但通常是轻资本运作,避免直接承担 GPU 租赁价格的风险,因此它们本身不拥有任何 GPU。这类模式主要分为两种:一类是类似 Shopify 的平台模式,帮助 GPU 拥有者和数据中心代为营销和匹配计算资源;另一类是类似 Amazon 的市场聚合模式,允许 GPU 拥有者向不同买家提供计算服务。

Lastly, there are the Brokers, Platforms and Aggregators, who generally aggregate demand and supply but tend to be capital light and shy away from taking direct GPU rental price exposure, and as such do not own any GPUs themselves. There are two main business models within this category: Platform models that provide a Shopify-like platform to help GPU owners and data centers market and matchmake their compute resources on their behalf, and Aggregators that use an Amazon-like Marketplace model for GPU owners that allows them to offer compute to different buyers.

平台可以提供基础设施即服务 (IaaS) 以及设置和采购支持,帮助那些希望拥有 GPU 计算资源但缺乏集群部署或营销经验的主机。相比于类似 Amazon 的市场聚合模式,经纪人和平台通常需要更多人工介入,类似于房地产经纪人,会从交易金额中抽取一定的佣金。与任何经纪或市场服务一样,收入中的经纪费用对终端客户而言可能并不透明。

Platforms can provide IaaS infrastructure as well as setup and procurement support for hosts that would like to own GPU compute, but do not have any expertise deploying or marketing clusters. Brokers and Platforms generally require more human touchpoints compared to just an Amazon-like marketplace aggregator and are similar to real estate agents that help you find homes for a cut of the transaction value. As with any Brokering or Marketplace service, the broker’s cut of the revenue can be opaque to the end customer.

还有一种有趣的新兴商业模式超出了上述类别,即风险投资集群(VC Clusters)。这种模式下,风险投资公司 (VC) 或类似实体为其投资组合公司或其他关联公司专门设立集群。知名案例包括 Andromeda、Computefund.ai 以及 Andreessen Horowitz 计划中的 GPU 集群。通过内部集群,这些风险投资公司可以为计算租赁提供非常灵活的选择,允许短期内租用 512 台或 1 千台 GPU 的大型集群,租金远低于其他新云提供商,交换条件是股份。他们还可以为关联公司提供更慷慨的租赁条款。

Another interesting emerging business model that sits outside the above categories are VC Clusters, whereby a Venture Capital (VC) or VC-like entity sets up clusters for the exclusive use of portfolio or other affiliated companies. Notable examples include Andromeda, Computefund.ai, and Andreessen Horowitz’s planned GPU Cluster. With in-house clusters, these VCs can provide very flexible options for compute rental – offering large 512 or 1k GPU clusters for short periods of time well below what other Neoclouds would charge, in exchange for equity. They can also offer generous rental terms to lean further into portfolio or affiliated companies.

来源:SemiAnalysis

第一部分:如何构建 AI 新云

Part 1: How to build an AI Neocloud

理解集群物料清单

Understanding Cluster Bill of Materials

我们先从简单的框架开始。如果你想创建一个 AI 新云,你会怎么做?这是我们的分步指南,从物料清单 (BoM) 开始,到最终设置新云。

Let’s start with a simple framing. So, you want to start an AI Neocloud? What would you do? This is our step-by-step guide, starting with the BoM and concluding with setting up the Neocloud.

理解并定制 AI 集群的报价和物料清单是新云部署中最重要的因素之一,做好这一步可以决定盈利还是陷入财务困境。我们建议从 CEO 到工程师和销售人员都应了解其物料清单中的每一项。

Understanding and customizing an AI Cluster quote and Bill of Materials (BoM) is one of the most important factors in a Neocloud deployment, and getting it right can be the difference between strong profit margins or financial distress. We recommend that everyone from the CEO to engineers and sales staff understand every single line item in their BoM.

今天部署的大多数新云集群包含 2048 个或更少的 GPU。最常见的物理集群规模为 2048、1024、512 和 256 个 GPU,部署成本与 GPU 数量呈线性关系。在此分析中,我们将以 1024 个 GPU 的部署作为分析的基准。

Most Neocloud Clusters being deployed today have 2048 or fewer GPUs. The most common physical cluster sizes are 2048, 1024, 512, and 256 GPUs, with deployment costs for clusters of 2048 GPUs and under scaling linearly with respect to the number of GPUs. For this analysis, we will focus on a 1024 GPU deployment as a common denominator for emerging Neoclouds.

OEM 和英伟达在报价时自然会倾向于向上销售。物料清单通常分为四类:机箱级、机架级、集群级和软件级。

OEMs and Nvidia will naturally seek to upsell when quoting out a BoM. The BoM is usually subdivided up into four categories: compute chassis level, rack level, cluster level and software level.


来源:SemiAnalysis

计算机机箱物料清单 Compute Chassis Bill of Materials

我们将从最低抽象级别——计算机机箱的物料清单(BoM)开始,它是集群中最昂贵的部分。默认的计算机机箱 BoM 报价往往采用顶级组件——OEM 厂商如 Supermicro、Dell 等,通常会最初报价接近顶级的英特尔 Emerald Rapids CPU,并搭配 2 TB 的 RAM 和 30 TB 本地 NVMe SSD 闪存。

We will start at the lowest level of abstraction, the compute chassis bill of materials (BoM), the most expensive part of the cluster. The default compute chassis BoM quote tends to use top of the line components - OEMs such as Supermicro, Dell, etc. will initially quote a near top-of-the-line Intel Emerald Rapids CPU, and a system build that comes with 2TB of RAM and 30 TBytes of local NVMe SSD flash storage.

优化此报价是 AI 云计算的最简单优化步骤。首先可以选择中等水平的英特尔 CPU,因为许多客户的工作负载对 CPU 的要求不高。大模型训练是非常依赖 GPU 的工作负载,而 CPU 的工作强度则非常轻微。CPU 主要执行简单任务,例如控制 GPU 的 PyTorch 等进程、初始化网络和存储调用,或者运行虚拟化程序。

Fine tuning this quote is the easiest optimization available to an AI Neocloud. The first step in this optimization is to choose a mid-level Intel CPU given that many customers’ workloads will not use the CPU much anyways. LLM training is a very GPU intensive workload but for the CPU, the workload intensity is incredibly light. A CPU will mostly be running simple tasks such as the PyTorch and other processes that are controlling the GPU, initializing network and storage calls, and potentially running a hypervisor.

来源:SuperMicro

总体而言,虽然 AMD 的 CPU 在大多数仅依赖 CPU 的任务中表现优越,但我们推荐使用英特尔 CPU,因为在英特尔 CPU 上实现 NCCL 性能更容易,虚拟化更简单,整体体验也更稳定。

In general, while AMD CPUs are superior for most CPU only tasks, we recommend using Intel CPUs as on Intel it is easier to get NCCL performance correct, easier to do virtualization, and the experience overall is less buggy.

例如,在 AMD CPU上,你需要使用 NCCL_IB_PCI_RELAXED_ORDERING 并调整不同的 NUMA NPS 设置,以实现可接受的性能。如果你计划进行虚拟化,还需要正确绑定虚拟核心到正确的 NUMA 区域,否则设备到主机及主机到设备的带宽和延迟表现会不理想。当然,如果你有足够的技术能力,这些问题是可以解决的。

For example, on AMD CPUs, you need to use NCCL_IB_PCI_RELAXED_ORDERING and play around with different NUMA NPS settings to achieve acceptable performance. If you plan on doing virtualization, you need to correctly pin your virtual cores to the correct NUMA regions or else your Device to Host and Host to Device bandwidth and latency will not be ideal. To be clear, if you are skilled, this is doable.
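
下面是一个最小化的示意(仅在上述前提下供参考):在 AMD CPU 节点上通过环境变量开启 relaxed ordering,并用 numactl 把每个 GPU 的工作进程绑定到对应的 NUMA 节点。其中 GPU 与 NUMA 节点的映射、train.py 等名称均为假设示例,需按 nvidia-smi topo -m 的实际拓扑调整。

A minimal sketch (for illustration only, under the assumptions above): on an AMD CPU node, set the NCCL relaxed-ordering variable and use numactl to bind each GPU’s worker process to its NUMA node. The GPU-to-NUMA mapping and the train.py script are placeholders; adjust them to the real topology reported by nvidia-smi topo -m.

```python
import os
import subprocess

# Assumed mapping of local GPU rank -> NUMA node on a hypothetical AMD (NPS=4) system.
# Verify the real topology with `nvidia-smi topo -m` and `lscpu` before using.
GPU_TO_NUMA = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}

def launch_worker(local_rank: int, cmd: list[str]) -> subprocess.Popen:
    env = os.environ.copy()
    # Needed on many AMD platforms for acceptable NCCL/InfiniBand throughput.
    env["NCCL_IB_PCI_RELAXED_ORDERING"] = "1"
    env["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    numa = GPU_TO_NUMA[local_rank]
    # Bind CPU and memory allocation to the NUMA node closest to this GPU/NIC pair.
    numactl = ["numactl", f"--cpunodebind={numa}", f"--membind={numa}"]
    return subprocess.Popen(numactl + cmd, env=env)

if __name__ == "__main__":
    # train.py is a placeholder for whatever per-GPU worker you actually launch.
    procs = [launch_worker(r, ["python", "train.py"]) for r in range(8)]
    for p in procs:
        p.wait()
```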

许多标准配置提供 2 TB 的 CPU DDR5 RAM,但大多数客户不会使用这么多。RAM 是计算机机箱 BoM 中第四贵的部件。我们建议将标准的 2 TB RAM 降至 1 TB。大多数新云客户的工作负载对 CPU RAM 的需求并不大。

Many standard offerings have 2TB of CPU DDR5 RAM, but most of your customers will not be using that much. RAM is the 4th most expensive part of the compute chassis BoM. We recommend downgrading from the standard 2 TBytes to only 1TByte of RAM. Most customers of your Neocloud are not likely to ask about RAM capacity as their workloads are not CPU RAM limited at all.


来源:SemiAnalysis


来源:SuperMicro

除了核心计算组件,另一个潜在的成本节约点是移除标准报价中包含的两个 NVIDIA Bluefield-3 DPU。这些 DPU 最初是为传统 CPU 云计算开发的,旨在节省成本,使其能租用更多 CPU 核心,而不是让这些核心用于网络虚拟化。

Moving beyond core compute components, another potential cost saving is to remove the two NVIDIA Bluefield-3 DPU present in a standard quote. These DPUs were originally developed and pitched more as a cost savings technique for traditional CPU clouds that would allow them to rent out more CPU cores instead of encumbering those CPU cores with having it run network virtualization.

但是你的新云客户不会使用大量的 CPU 计算资源,因此使用部分主机 CPU 核心进行网络虚拟化并无大碍。许多情况下,你将直接交付裸金属服务器给客户,这样一来,根本不需要任何网络虚拟化。此外,Bluefield-3 DPU 相当昂贵,购买一个 54 核 CPU 的成本还低于一个 Bluefield-3。因此,可以跳过 Bluefield-3,使用标准的 ConnectX 前端即可。

But your Neocloud customers are not going to be using much CPU compute anyway, so it doesn’t matter if you are using some of the host CPU cores for network virtualization. In many cases you will be handing over bare metal servers to your customers anyways, obviating the need for any network virtualization. Moreover, Bluefield-3 DPUs are considerably expensive, to the extent that buying another 54-core CPU is cheaper than purchasing a Bluefield-3. Skip the Bluefield-3 altogether and go with standard ConnectX for the front end.


来源:Nvidia

通过前几步优化,我们估算每个计算节点(即一台服务器)的成本从 27 万美元降至 25.64 万美元,节省约 5%。在一个拥有 1024 个 H100 GPU 的集群中,共有 128 个计算节点,这意味着可以节省 174 万美元。随着批量的增加,价格还会进一步降低。如需帮助谈判和设计,请联系我们。

Putting these first few cost optimizations together, we estimate that there is a savings of $13.6k, bringing the cost of one compute node (i.e. one server) down from $270k USD to $256.4k USD - roughly a 5% savings. In a 1024 H100 cluster with 128 compute nodes, that is a savings of $1.74M USD. This pricing goes even lower with solid volume. Contact us for help negotiating and designing.


来源:SemiAnalysis
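
用文中给出的估算值做一个简单核算(仅为示意):

A quick sanity check of the arithmetic above, using only the estimates quoted in the text:

```python
# Figures quoted above (estimates, in USD).
reference_node_cost = 270_000
optimized_node_cost = 256_400
nodes = 128  # 1024 H100s / 8 GPUs per HGX server

per_node_savings = reference_node_cost - optimized_node_cost   # ~ $13.6k
savings_pct = per_node_savings / reference_node_cost           # ~ 5%
cluster_savings = per_node_savings * nodes                     # ~ $1.74M

print(f"per node: ${per_node_savings:,} ({savings_pct:.1%}), cluster: ${cluster_savings:,}")
```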

在典型的 BoM 中,每个 H100 计算服务器将配备 8 个 400Gbit/s 的 ConnectX-7 NIC,总带宽达 3,200 Gbit/s。一些新云只选择了 4 个 NIC,这将导致后端网络带宽减少 50%。

In a typical BoM, each H100 compute server will have eight 400Gbit/s ConnectX-7 NICs leading to a total bandwidth per server of 3,200Gbit/s. Some Neoclouds have only opted for four NICs which would be a 50% reduction in backend networking bandwidth.

虽然我们认为对于某些工作负载而言,这可能提供更好的总拥有成本(TCO)性能比,但大多数新云的目标客户不会接受低于 8x400 Gbit/s InfiniBand 带宽的计算服务器,因为这会影响工作负载性能。这也是许多公司对 Google Cloud 反感的主要原因之一。Google Cloud 在部署 H100 时使用 8x200G 的以太网,这在某些情况下会影响性能,即便谷歌因此可以节省成本。

While we believe that this might present a better performance per total cost of ownership for certain workloads, most Neoclouds’ target customers are not interested in having anything less than 8x400Gbit/s InfiniBand bandwidth per compute server, because it does impact workload performance. This is one of the primary reasons why many firms are allergic to Google Cloud. Google Cloud deploys H100s with 8x200G Ethernet using Falcon/GRD. This impacts performance in some cases even if Google does get to save money.


来源:Nvidia

我们暂时跳过机架级别,继续讨论集群级别的物料清单(BoM),从网络开始,这是计算节点之后集群最大的成本驱动因素。

Skipping rack level for now, we will move onto the cluster level BoM, starting with networking, which is the largest cluster cost driver after compute nodes.

集群级别 - 网络物料清单 Cluster Level - Networking Bill of Materials

在 H100 集群中,有三种不同的网络:

  1. 前端网络(以太网)
  2. 后端网络(InfiniBand 或 RoCEv2 以太网)
  3. 带外管理网络

There are three different networks in H100 clusters:

  • Frontend Networking (Ethernet)
  • Backend Networking (InfiniBand or RoCEv2 Ethernet)
  • Out of Band Management Networking

我们先来快速回顾一下。前端网络是一个普通的以太网络,用于连接互联网、SLURM/Kubernetes,以及用于加载训练数据和模型 checkpoint 的网络存储。通常,该网络的速度为每个 GPU 25-50 Gb/s,因此在一台 HGX H100 服务器上,前端网络带宽总量为 200-400 Gbit/s。

As a quick refresher, the frontend networking is simply a normal ethernet network that is used to connect to the internet, SLURM/Kubernetes, and to networked storage for loading training data and model checkpoints. This network typically runs at 25-50Gb/s per GPU, so on an HGX H100 server, this will amount to 200-400Gbit/s per server.

相比之下,后端计算网络用于扩展 GPU-GPU 通信,从几十个机架到成千上万个机架。这个网络可以使用 Nvidia 的 InfiniBand、Nvidia 的 Spectrum-X 以太网,或使用像 Broadcom 等厂商提供的以太网。Nvidia 的解决方案相对更昂贵,尽管以太网的性能/总拥有成本较优,我们仍然建议新云使用 InfiniBand 或 Spectrum X,因为它们提供最好的性能,并且更容易销售,客户通常认为 InfiniBand 代表最佳性能。许多客户假设以太网性能“远低于 InfiniBand”,尽管这与事实并不完全一致,更多是因为优化 NCCL 需要新云及客户具备良好的工程能力和时间。而且很多人认为 Nvidia 会优先分配购买他们网络解决方案的客户。

In contrast, the backend compute fabric is used to scale out GPU-GPU communications from tens of racks to thousands of racks. This network could either use Nvidia’s InfiniBand or Nvidia’s Spectrum-X Ethernet, or Ethernet from a switch vendor such as Broadcom through a variety of vendors including Arista, Cisco, and various OEM/ODMs. The options from Nvidia are more expensive compared to the Broadcom Ethernet solutions. Despite Ethernet’s better perf per TCO, we would still recommend that Neoclouds use InfiniBand or Spectrum-X since it has the best performance and will be the easiest to sell, as customers associate InfiniBand with the best performance. Customers often assume Ethernet has “way lower performance” even though this does not reflect reality. It mostly stems from the fact that there are engineering optimizations that the Neocloud and customer must do to optimize NCCL. We’ve done these before and it isn’t easy unless you have good engineering talent and time. Furthermore, many believe Nvidia gives preferred allocations to those buying their networking solutions.

最后是带外管理网络,它用于重新加载操作系统,监控节点状态(如风扇速度、温度、功率消耗等)。服务器、PDU、交换机、CDU 上的基板管理控制器(BMC)通常通过该网络进行监控和控制。

Lastly, there is your out of band management network. This is used for re-imaging your operating system, monitoring node health such as fan speed, temperatures, power draw, etc. The baseboard management controller (BMC) on servers, PDUs, switches, CDUs are usually connected to this network to monitor and control servers and various other IT equipment.

Nvidia 及 OEM/系统集成商通常为每台服务器提供 2x200GbE 的前端网络连接,使用 Nvidia Spectrum 以太网 SN4600 交换机部署网络。不过我们建议避免这种配置,因为每台 HGX 服务器 400Gbit/s 的带宽远超客户的实际需求。客户仅会使用前端网络进行存储和互联网连接调用,以及通过 SLURM 和 Kubernetes 进行带内管理。由于前端网络不会用于敏感的低延迟、高带宽的梯度通信,400 Gbit/s 对于每台服务器来说过于夸张。因此,我们建议在前端网络部署中使用 Arista、思科等厂商的通用以太网交换机,并将每台 HGX 服务器的前端带宽限制为 2x100GbE。

For the frontend network, Nvidia and the OEM/system integrator will usually have 2x200GbE frontend network connectivity on the server, deploying the network using Nvidia Spectrum Ethernet SN4600 switches. However, we would recommend against doing this as having 400Gbit/s per HGX server is way more network bandwidth than what your customer is likely to use. Customers will only be using the frontend network for storage and internet network calls as well as for in-band management for SLURM and Kubernetes. Because the front-end network will not be used for latency sensitive and bandwidth intensive gradient all reduce collective communications, 400Gbit/s per server is going to be overkill. As such, for the overall front-end network deployment we recommend using a generic ethernet switch from vendors like Arista, Cisco, or various OEMs/ODMs instead and only having 2x100GbE per HGX Server.

下一个可以轻松优化的部分是带外管理网络。默认 BoM 包含了 Nvidia Spectrum SN2201 1GbE 交换机,但这些交换机的价格溢价相当高,难以为一个如此简单的管理网络功能(如带外管理)辩护。这类似于购买昂贵的品牌药物而不是普通药品。使用任何通用的 1 GbE 交换机都可以降低带外网络成本,因此我们建议使用通用的 1 GbE 交换机。

The next low hanging fruit would be the out of band management networking. The default BoM includes SN2201 Nvidia Spectrum 1GbE switches, but the pricing of these switches is at a considerable premium which is hard to justify for something as simple as out of band networking. This would be the equivalent of buying branded Advil instead of the generic ibuprofen. Using any generic out of band switch will reduce out of band network costs, and as such, we would recommend using a generic 1GbE switch.

来源:SemiAnalysis

后端网络优化 Optimizing the Back-end Network

后端网络的选择更加复杂,需要对高性能网络有更深入的理解。后端网络的流量模式与传统云网络完全不同,通常会运行大量集体通信操作(例如All Reduce、All Gather、Reduce Scatter),这些操作的突发性数据流量非常大。

The Backend network is where the choices get more complicated and require a far deeper understanding of high-performance networking, which can at times be lacking amongst newer Emerging Neocloud firms. This network will run elephant-sized bursts of All Reduce, All Gather, Reduce Scatter, i.e. your collective communications. Due to the burstiness of these collectives, the back-end network has a completely different traffic pattern compared to traditional cloud networking.

首先我们来谈谈 Nvidia 的参考网络拓扑。该拓扑采用了两层 8 轨优化的 fat tree 架构,具有非阻塞连接。在一个非阻塞的 fat tree 网络中,如果你将节点随机分为对,所有对之间可以同时以全带宽通信。尽管在实际应用中,由于拥堵、不完美的自适应路由及额外的交换机跳数延迟,这并非总是能够实现。

First, we will talk about the Nvidia reference network topology. The reference topology is a two tier 8-rail optimized fat tree with non-blocking connectivity. In a non-blocking fat tree network, if you arbitrarily divide nodes into pairs, then all pairs should be able to communicate to each other at full bandwidth at the same time. Although in practice, this is often not exactly the case due to congestion, imperfect adaptive routing and additional latency of additional switch hops.

来源:Nvidia

当网络采用 8 轨优化时,不是把 4 台服务器的全部 32 个 GPU 连接到同一个机架顶部(ToR)交换机,而是让 32 台服务器上 8 个 GPU 索引中的每一个都有自己的叶交换机。换句话说,所有服务器的 GPU#0 连接到叶交换机 #0,所有服务器的 GPU#1 连接到叶交换机 #1,依此类推。

When a network is 8-rail optimized, instead of all 32 GPUs from 4 servers connecting into a Top of Rack (ToR) switch, each of the 8 GPU indices across the 32 servers has its own leaf switch, i.e. all GPU #0 from all 32 servers connect to leaf switch #0, all GPU #1 from all 32 servers connect to leaf switch #1, and so on.
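
下面是一个小的示意代码,演示上文描述的 8 轨连线逻辑(假设每个 pod 为 32 台服务器、每台 8 个 GPU):

A small sketch of the 8-rail wiring logic described above, assuming pods of 32 servers with 8 GPUs each:

```python
SERVERS_PER_POD = 32
RAILS = 8  # GPU indices 0-7, one leaf switch ("rail") per index

def leaf_switch(pod: int, server: int, gpu_index: int) -> str:
    """In a rail-optimized pod, every GPU with the same index lands on the same leaf."""
    assert 0 <= server < SERVERS_PER_POD and 0 <= gpu_index < RAILS
    return f"pod{pod}-leaf{gpu_index}"

# All GPU #3 NICs in pod 0 share one leaf switch, so intra-rail traffic is one hop.
assert leaf_switch(0, 0, 3) == leaf_switch(0, 31, 3) == "pod0-leaf3"
# Different GPU indices are physically separated onto different leaves.
assert leaf_switch(0, 0, 3) != leaf_switch(0, 0, 4)
```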

这种优化的主要好处是减少拥堵。如果同一台服务器的所有 GPU 都连接到同一个 ToR 交换机,当它们同时向网络发送流量时,使用相同链路穿越胖树网络的概率会很高,从而导致拥堵。用于 AI 训练的 GPU 通常会同时发送数据,因为需要通过集体通信操作来交换梯度并更新参数。

The main benefit of a rail optimized network is to reduce congestion. If all GPUs from the same server were connected to the same ToR switch, when they all try to send traffic into the network at the same time, then the probability that they would attempt to use the same links to traverse the fat tree network would be very high, resulting in congestion. GPUs used for AI training should be expected to routinely send data all at once as collective operations are needed to exchange gradients and update new parameters.

下面第一张图展示了一个 8 轨优化的网络,集体通信产生的 8 个并行流分别连接到 8 个不同的叶交换机;第二张图展示了非轨优化设计,同一台服务器的 GPU 连接到同一个 ToR 交换机。

The first diagram below illustrates an 8-rail optimized network in which there are 8 parallel flows from collective communication used to connect to 8 different leaf switches, while the second diagram illustrates a non-rail optimized design with servers connecting to a ToR switch.

来源:SemiAnalysis

来源:SemiAnalysis

Nvidia 的参考架构还将集群划分为 4 个 pod(也称为可扩展单元或 SU),每个 pod 包含 32 台 HGX 服务器(256 个 H100 GPU)和 8 条轨道。每个 GPU 索引总是与同一 pod 中另一台服务器的同一 GPU 索引仅一跳距离。这很重要,因为它减少了主干交换机的网络流量,而主干交换机经常成为拥堵热点(即使在非阻塞网络中)。

The Nvidia reference architecture also divides the cluster into 4 pods (also known as scalable units or SU), with each pod containing 32 HGX servers (256 H100s) and 8 rails. Each GPU index is always one hop away from the same GPU index in another server within the pod. This is important because it reduces network traffic on the spine switches which can easily be a congestion hotspot (even on non-blocking networks).

尽管很多人认为 8 轨优化仅适用于单租户环境,但它在多租户环境(如GPU 新云)中尤为重要。在 8 轨优化网络中,每个工作负载的 8 个流都被物理分离,避免了路由/交换冲突。在我们即将发布的 Nvidia NCCL 和 AMD RCCL 集合通信深度探讨中,我们将讨论 8 轨优化配置的优势,以及为什么拥塞可能是一个严重问题,尤其是对于 AI 新云等多租户环境。

Contrary to popular belief, being rail optimized and reducing top level traffic/congestion is especially important in multi-tenant environments such as GPU Neoclouds, where you will very often have multiple tenants/customers. In an 8-rail optimized network, all 8 flows from each workload are physically separated, thus routing/switching collisions cannot occur. In our upcoming Nvidia NCCL and AMD RCCL collective deep dive, we will discuss the benefits of rail optimized configurations and why congestion can be a serious problem, especially for multi-tenant environments such as AI Neoclouds.

不幸的是,拥塞并不是可以通过 nccl-tests 轻易测量的,而是需要真实世界的并发工作负载来观察嘈杂邻居/拥塞问题如何影响端到端工作负载吞吐量。如果没有租户之间的物理隔离,嘈杂邻居总是会存在。基于我们对拥塞的观察,我们强烈建议采用某种形式的 8 轨优化拓扑。

Unfortunately, congestion is not something that can be easily measured through nccl-tests and instead requires real world concurrent workloads to see how the noisy neighbor/congestion problems affect end to end workload throughput. Without physical isolation between tenants, noisy neighbors will always exist. Given what we have seen on congestion, we would strongly recommend some form of 8-rail optimized topology.

8 轨优化拓扑的另一个好处是,由于大部分流量将局限于叶子交换机,因此可以对网络的主干层实施超卖,这是一种我们将在本文后面讨论的架构优化。

One other benefit of a rail optimized topology is that since most of the traffic will be local to the leaf switches, it is possible to oversubscribe the spine layer of your network, an architectural optimization that we will discuss later on in this article.

来源:Nvidia

优化光纤与电气网络
Optimizing Optical vs Electrical Networking

使用光纤进行网络连接的优势在于它的传输距离更长,但缺点是增加了功耗和光学收发器的高昂成本,特别是通过 Nvidia 直接购买时,这对 InfiniBand 网络来说基本上是必须的。优化物理网络拓扑和机架布局可以减少光学收发器的使用,仅在确实需要长距离传输时才使用它们。

The use of optics for networking has the advantage of much longer reach, but the drawback is in its added power requirements and very high cost of optical transceivers, particularly when purchasing through Nvidia directly, which is basically a must for InfiniBand networking. Optimizing the physical network topology and rack layout can allow you to reduce the use of optical transceivers, saving them only for when the longer reach is actually required.

来源:Daniel Gross

在 Nvidia 参考设计中,叶子交换机位于单独的网络机架上,主干交换机位于专用网络机架上,这意味着需要使用 100% 的光纤。

In the Nvidia Reference Design, the leaf switches are on a separate networking rack and the spine switches are on a dedicated networking rack meaning that using 100% optics is required.

来源:SemiAnalysis

为此,可以考虑一种 非阻塞的机架顶部(ToR)设计拓扑 。大多数来自传统网络背景的人会立即认出这种设计,因为它是传统网络中最常见的设计,在机架中间或顶部有一个交换机连接机架中的所有服务器。由于从 ToR 交换机到服务器的距离小于 3 米,我们可以使用"便宜的"无源铜缆(称为直连铜缆,DAC)来连接服务器和叶子交换机。对于这种设计,我们建议将 InfiniBand 交换机放在中间,以缩短 DAC 线缆需要传输的距离。

One network topology that can be considered to this end is a non-blocking Top of Rack (ToR) design . Most people coming from a traditional networking background will instantly recognize this design as it is the most common design in traditional networking where there is a switch in the middle or at the top of the rack that connects to all the servers in the rack. Since distances from the ToR switch to the server are less than 3 meters, we can use “cheap” passive copper cables called Direct Attach Copper (DAC) cables to connect from the server to the leaf switch. For this design, we recommend placing the InfiniBand switch in the middle to shorten the distance that the DAC cables need to travel.


来源:SemiAnalysis

从叶子交换机到顶层主干交换机,我们将不得不使用光纤。这很昂贵,但至少 50% 的连接现在将被更便宜的 DAC 铜缆替代。

From the leaf switch to the top tier spine switches we will have to use optics. This is expensive, but at least 50% of your connections will now be replaced with cheaper DAC copper cables.

来源:SemiAnalysis

不幸的是,对于这种设计,你将无法实现 8 轨优化网络,因此即使你的主干层是非阻塞的,你也会经常遇到拥塞热点,因为现在有 8 个数据流跨越多层交换机,这意味着每个数据流都需要动态使用不同的路径来避免拥塞。在一个完美的世界里,如果你有完美的自适应路由,ToR 作为一种拓扑将工作得很好,因为路由将始终避开拥塞的路线。但在现实中,由于完美的自适应路由并不存在,实施这种拓扑将严重损害网络性能。

Unfortunately for this design, you will not be able to implement 8-rail optimized networking, and as such you will commonly run into congestion hotspots at your spine layer even if it is non-blocking, as there are now 8 flows going across multiple levels of switches, meaning that each flow will need to dynamically use different paths to avoid congestion. In a perfect world where you have perfect adaptive routing, ToR will work well as a topology since the routing will always avoid a congested route. But in reality, because perfect adaptive routing does not exist, implementing this topology will hurt network performance a lot.

来源:Nvidia

下图是我们模拟的这种非阻塞机架顶部结构的热图,其中浅蓝色表示由于拥塞而带宽较低,深蓝色表示接近全线速率。如你所见,使用 ToR 拓扑可以达到线速率,但由于所有 8 个数据流都进入一个交换机,仍然存在相当大的拥塞,这些数据流的吞吐量变得更加不稳定,带宽也更低。

In the diagram below is our simulated heatmap of this non-blocking top of rack fabric, where the lighter blue color indicates less bandwidth due to congestion and dark blue means near full line rate. As you can see, using a ToR topology it is possible to reach line rate, but there is still considerable congestion due to all 8 flows going into one switch, with throughput becoming far more jittery and these flows getting less bandwidth due to congestion.

来源:SemiAnalysis

尽管这种设计对于新云等多租户环境的性能并不特别好,但成本节省是巨大的,可以节省 34.8% 的后端 InfiniBand 结构成本。

Even though the performance of this design is not particularly good for multi-tenant environments like Neoclouds, the cost savings are huge, saving 34.8% of the backend InfiniBand fabric cost.

来源:SemiAnalysis

虚拟模块化交换机
Virtual Modular Switch

世间安得双全法,不负优化不负省?

那么,如果我们能够兼顾两全 - 8 轨优化的性能优势和 ToR 的成本节省呢?

Now, what if we could have the best of both worlds - the performance benefit of 8-rail optimized while also having the cost saving of ToR?

这就是虚拟模块化交换机的用武之地。它具有与 Nvidia 参考设计相同的逻辑拓扑,但由于巧妙的平面规划和交换机位置规划,可以使用铜缆从叶子交换机连接到主干交换机。

This is where a virtual modular switch comes in. It has the same logical topology as the Nvidia reference design but can use copper from the leaf switches to the spine switches due to clever floor planning and switch location planning.

来源:SemiAnalysis

基本思路是将交换机机架彼此紧邻放置,使主干交换机位于中间机架,而叶子交换机位于左右两侧机架,如下图所示。这样,叶子交换机和主干交换机之间的连接可以全部使用铜缆,而服务器和叶子交换机之间的连接仍将使用光纤。

The basic idea here is to place the switch racks directly between each other such that the spine switches are in the middle rack while the leaf switches are the left and right rack as illustrated in the diagram below. This way, the connections between the leaf and the spine switches can be all copper while the connections between the servers and the leaf switches will still use optics.

由于拓扑仍然是 8 轨优化的,8 个数据流中的每一个都将被物理分离,大大减少了拥塞。

Since the topology is still 8-rail optimized, each one of the 8 flows will be physically separated, significantly reducing congestion.

这种设计应该能给我们两全其美,但这种拓扑有什么缺点呢?

This design should give us the best of both worlds, but what are the drawbacks of this topology?

不幸的是,这些交换机到交换机的 DAC 铜缆通常弯曲半径较差,而且非常粗,会阻碍气流。我们曾见过这种设计在生产中部署,如果你做好线缆管理,这些问题是可以克服的。这个问题也可以通过使用有源铜缆(ACC)来解决,ACC 几乎和多模光纤一样细,并且有很好的弯曲半径。不幸的是,我们听说的一个潜在问题是 Nvidia 的 LinkX NDR ACC 线缆的错误率不是很好。

Unfortunately, these switch-to-switch DAC copper cables often tend to have a poor bend radius and are very thick, leading to blocking of air flow. We have seen designs like this being deployed in production before, and if you cable manage it well, these issues can be overcome. This problem can also be tackled using active copper cables (ACC), which are almost as thin as multimode fiber and have a good bend radius. Unfortunately, one potential issue that we heard about is that the error rate on Nvidia’s LinkX NDR ACC cables is not very good.

来源:SemiAnalysis

使用这种非阻塞虚拟模块化交换机设计,我们可以在后端网络上节省 24.9% 的成本,同时保持相同的性能。另一个巨大的好处是,无源铜缆通常比光学收发器更可靠。收发器的故障率很高,激光器是主要的故障组件。这种高故障率会带来更换收发器零件、集群停机和维修所需劳动力方面的成本。

Using this non-blocking virtual modular switch design, we can save 24.9% on the Backend network compared to the reference architecture while maintaining the same performance. One other huge benefit is that passive copper is generally way more reliable than optical transceivers. Transceiver failure rate is high with the lasers being the primary component of failure. This high failure rate introduces costs in terms of the replacement transceiver parts, cluster downtime and labor needed for repairs.

来源:SemiAnalysis

基于超卖的后端网络优化

Oversubscribing Backend Network Optimization

我们可以通过突破非阻塞网络的限制来进一步优化成本。由于在 8 轨优化设计中,大部分流量都局限于 32 台服务器的单元内,并且 InfiniBand 具有相当不错的自适应路由,你可以在叶子交换机到主干的连接中设计一定程度的超卖比。即使集群将被单个租户用于运行单个工作负载,这也是有好处的。当使用 1024 个 GPU 时,你永远不会有单个模型副本大于 256 个 GPU。这意味着张量、专家和流水线并行(这些往往更加带宽密集)将在 32 台服务器的单元内运行。

We can take cost optimizations a step further by stepping out of our constraint of having a non-blocking network. Since most of the traffic is local to the pod of 32 servers in an 8-rail optimized design, and because InfiniBand has decent enough adaptive routing, you can design in an oversubscription from the leaf switches to the spine. This has benefits even if the cluster will be used by single tenant running only one workload. When using 1024 GPUs, you will never have a single model replica be larger than 256 GPUs. That means that tensor, expert and pipeline parallelisms, which tend to be more bandwidth intensive, will run inside a pod of 32 servers.

这些流量将保持在第一级交换机的本地,而带宽要求较低的数据并行梯度 all-reduce 通信则通过主干交换机进行。由于主干层的带宽需求处于较低水平,并且 InfiniBand 有相当不错的自适应路由,你可以在设计中直接对主干层采用超卖。

That traffic will stay local to the first level of switches, while your less bandwidth intensive data parallelism gradient all-reduces will happen across the spine switches. Since bandwidth requirements at the spine layer are on the lower end of the spectrum and there is decent enough adaptive routing with InfiniBand, you can design in an oversubscription at the spine layer.

在 Meta 的 24k H100 集群(详见《Meta 面向生成式 AI 的基础设施建设》)上,他们在单元之间实施了 7:1 的超卖比,但我们认为设计一个更保守的超卖比更有意义,我们建议对小型集群使用 2:1 的超卖比。

On Meta’s 24k H100 cluster, they implemented a 7:1 oversubscription between pods, but we believe that designing in a more conservative oversubscription makes more sense, and we recommend using just a 2:1 oversubscription for small clusters.

来源:SemiAnalysis

这种设计的好处是,对于 1024 个 H100,你只需要 8 个主干交换机,而不是 16 个。当将 2:1 超卖与虚拟模块化交换机设计结合使用时,我们可以在中间机架中使用更少的交换机。这意味着线缆管理更容易。另一个好处是叶子交换机上会有空闲端口,因此将来如果你有更多的单元间流量,你可以轻松添加更多主干交换机并减少超卖的程度。

The benefit of this design is that instead of requiring 16 spine switches for 1024 H100s, you only need 8 spine switches. When combining a 2:1 oversubscription with the Virtual Modular Switch design, we can have fewer switches in the middle rack. This means cable management is much easier. Another benefit is empty ports on your leaf switches so in the future, when you have heavier inter-pod traffic, you can easily add more spine switches and reduce the degree of oversubscription.
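
在假设使用 64 口 400G(NDR)叶/主干交换机的前提下,可以粗算一下非阻塞与 2:1 超卖所需的主干交换机数量(仅为示意,并非最终设计):

Assuming 64-port 400G (NDR) leaf and spine switches, a rough back-of-the-envelope of the spine switch count for non-blocking vs 2:1 oversubscription (illustrative only, not a final design):

```python
GPUS = 1024
GPUS_PER_SERVER = 8
SERVERS_PER_POD = 32
RAILS = 8
SWITCH_PORTS = 64  # assumed 64-port NDR (400G) switches

pods = GPUS // (SERVERS_PER_POD * GPUS_PER_SERVER)   # 4 pods
leaves = pods * RAILS                                # 32 leaf switches
downlinks_per_leaf = SERVERS_PER_POD                 # 32 GPUs of one rail per pod

def spine_count(oversubscription: float) -> int:
    uplinks_per_leaf = int(downlinks_per_leaf / oversubscription)
    total_uplinks = leaves * uplinks_per_leaf
    return total_uplinks // SWITCH_PORTS

print(spine_count(1.0))  # non-blocking -> 16 spine switches
print(spine_count(2.0))  # 2:1 oversub  -> 8 spine switches
```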

来源:SemiAnalysis

我们估计,与参考架构相比,使用 2:1 超卖比的虚拟模块化交换机可以节省 31.6% 的成本,相比仅使用非阻塞虚拟模块化交换机设计的 24.9% 节省有所提高。与非阻塞设计相比,这种方案的唯一缺点(非阻塞设计成本更高这一点除外)是你需要较为合理地把客户分配到物理服务器上,并避免跨单元边界的碎片化。我们相信,有一个称职的团队就可以轻松实现这一点。

We estimate that the cost saving for 2:1 oversubscription with the virtual modular switch will be 31.6% compared to the reference architecture, an improvement over the 24.9% savings when only using the non-blocking virtual modular switch design. The only drawback compared to a non-blocking design (other than the non-blocking design’s higher cost) is that you need to allocate your customers to physical servers decently well and avoid fragmentation across pod boundaries. We believe that with a competent team, this can be easily achieved.

来源:SemiAnalysis

Nvidia 还通过 CS9500 系列为 NDR InfiniBand 提供了自己的物理模块化交换机。你可以使用这种交换机创建相同的 8 轨优化胖树拓扑,如果需要还可以进行超卖。这种模块化交换机最多可支持 2048 个 400 Gbit/s 外部端口,因此可扩展到连接多达 2048 个 H100。主干交换机 ASIC 位于机架背面,而叶子交换机 ASIC 和 OSFP 笼位于机架正面。主干交换机 ASIC 通过类似于 NVL72 背板的铜背板连接到叶子交换机 ASIC。不幸的是,Nvidia 只提供了液冷解决方案。

Nvidia also offers their own physical modular switch for NDR InfiniBand through the CS9500 series. You can use this switch to create the same 8-rail optimized fat tree topology and also do an oversubscription if preferred. This modular switch can support up to 2048 400Gbit/s external ports and thus is expandable to connect up to 2048 H100s. The spine switch ASICs are on the backside of the rack while the leaf switch ASICs and OSFP cages are on the front side of the rack. The spine switch ASICs are connected to the leaf switch ASICs through a copper backplane similar to the NVL72 backplane. Unfortunately, only a liquid cooling solution is offered.

CS9500 需要液冷是我们建议大多数新云部署虚拟模块化交换机而不是物理模块化交换机的原因。当前由 GB200 驱动的对液冷就绪机房的需求,以及机房供应的普遍紧缩意味着新云初创者将难以找到价格合理的容量。由于 Nvidia 根据最终用户的价值来定价,并且这种物理模块化交换机对大型集群部署(考虑O(10k)到O(100k))可能非常有价值,我们认为这比自己制作虚拟模块化交换机要昂贵得多。

The CS9500’s liquid cooling requirement is why we recommend just deploying a virtual modular switch instead of a physical modular switch for most Neoclouds. The current GB200 driven demand for liquid cooling-ready colocation, and the crunch of colocation supply in general means there will not be much reasonably priced capacity for emerging Neoclouds. Since Nvidia prices based on value to the end user, and as this physical modular switch may be very valuable to large cluster deployments (think O(10k) to O(100k)), we believe that this costs more than just making your own virtual modular switch.

来源:FRONTERA

不幸的是,使用 InfiniBand 的一个缺点是,要拥有一个像样的 REST 接口,你需要购买 UFM 管理许可证。统一结构管理器(UFM)是 Nvidia 提供的一个软件包,用于处理网络管理、性能优化和监控。对于 2048 个 GPU 以下的集群,建议使用 UFM,对于更大规模的集群,则是硬性要求。UFM 许可证按每个 NIC 端点收费,这意味着对于 1024 个 GPU 的集群,你需要购买 1024 个许可证。

Unfortunately, one of the downsides of using InfiniBand is that to have a decent REST interface, you need to buy UFM management licenses. Unified Fabric Manager (UFM) is a software package offered by Nvidia that handles network management, performance optimization and monitoring. Using UFM is recommended for clusters below 2048 GPUs and is a hard requirement for a cluster of larger size. UFM licenses are charged on a per NIC endpoint basis, meaning that for a 1024 GPU cluster, you will need to buy 1024 licenses.

购买 UFM 的一个替代方案是使用开放子网管理器,它只能通过终端命令行界面使用,但幸运的是,你可以创建一个简单的 REST 服务器来封装命令行,并使用 python 的 subprocess 库来执行命令。对于你的第一个集群,我们建议直接购买 UFM 许可证,但对于未来的集群,这是我们建议新云考虑的一种成本节省方案。

An alternative to purchasing UFM would be to use the open subnet manager, which is only available through a terminal command line interface, but fortunately you can create a simple REST server that wraps the command line and uses the Python subprocess library to execute the commands for you. For your first cluster, we would recommend just buying a UFM license, but for future clusters, this is something we recommend Neoclouds look into for cost savings.
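
下面是上述思路的一个最小化示意(假设环境中装有 Flask 与 infiniband-diags):用 Python subprocess 把只读的 InfiniBand 命令行工具包装成一个简单的 REST 服务;真正的子网管理器操作需替换为你自己的工具链,生产环境还应加上认证与 TLS。

A minimal sketch of the idea above (assuming Flask and infiniband-diags are installed): wrapping read-only InfiniBand CLI tools behind a simple REST service via the Python subprocess library. Real subnet-manager operations would need to be substituted with your own tooling, and production use should add authentication and TLS.

```python
import subprocess
from flask import Flask, jsonify

app = Flask(__name__)

# Whitelist of read-only fabric commands we are willing to expose over REST.
COMMANDS = {
    "ibstat": ["ibstat"],        # local HCA/port state
    "links": ["iblinkinfo"],     # fabric-wide link report
}

@app.route("/fabric/<name>", methods=["GET"])
def run_fabric_command(name):
    if name not in COMMANDS:
        return jsonify(error="unknown command"), 404
    result = subprocess.run(
        COMMANDS[name], capture_output=True, text=True, timeout=60
    )
    return jsonify(
        command=" ".join(COMMANDS[name]),
        returncode=result.returncode,
        stdout=result.stdout,
        stderr=result.stderr,
    )

if __name__ == "__main__":
    # Bind to localhost only; put real auth/TLS in front of this in production.
    app.run(host="127.0.0.1", port=8080)
```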

AI 新云存储
AI Neocloud Storage

我们将讨论 H100 集群中第二昂贵的部分,网络化 NVMe 存储。这是所有客户都想要的,而且对于运行 SLURM 来说实际上是必需的。存储部署基本上只有两个方面:物理存储服务器和存储软件供应商许可证,如 Weka 或 Vast Data 等。这些是最受欢迎的供应商,因为它们与 OEM 有渠道合作关系。

We will talk about the next most expensive part of an H100 cluster, networked NVMe storage. This is something that all customers want and is practically a requirement for running SLURM. There are basically only two line items for a storage deployment, your physical storage servers and your storage software vendor licenses such as with Weka or Vast Data, etc. These are the most popular vendors due to their channel partnerships with OEMs.

来源:Weka

为了实现高可用性,大多数存储软件供应商建议至少部署 8 台存储服务器。事实上,大多数新云只部署最少的 8 台存储服务器。使用 8 台存储服务器,在大块大小下,你可以在所有存储服务器上获得 250 GByte/s 到 400 GByte/s 的聚合存储带宽。这足以满足 1024 个 H100 上可能运行的大多数合理或不合理的 AI 工作负载。

For high availability, most storage software vendors recommend you deploy at least 8 storage servers. Indeed, most Neoclouds only deploy the bare minimum of 8 storage servers. With 8 storage servers, you will get between 250GByte/s to 400GByte/s of aggregated storage bandwidth at big block sizes across all storage servers. That’s more than enough to cater to most reasonable or unreasonable AI workloads one could possibly run on 1024 H100s.

来源:SuperMicro

由于存储的交付周期很短,我们建议对 1024 个 H100 集群从 2 PB 的总存储容量开始,因为如果你发现客户正在利用你部署的容量,你可以轻松扩展存储。我们的建议是在存储部署中留出足够的端口、NVMe 驱动器托架、电源和机架空间,以便于扩展。存储成本的大部分在于存储软件许可证,而不是物理存储服务器本身。

Because lead times for storage are very short, we recommend you start off with 2 PetaBytes of total storage capacity for a 1024 H100 cluster, as you can easily expand storage if you see your customers are utilizing your deployed capacity. Our recommendation is to leave enough ports, NVMe drive bays, power and rack space within your storage deployment to allow for easy expansion. Most of the storage cost is in the storage software license and not the physical storage servers themselves.


来源:SemiAnalysis

尽管你的存储服务器可以在 InfiniBand 后端计算结构上运行,但那些尝试过的人已经损失了很多头发!这种部署通常会将 GPU 0 的 IB NIC 也绑定为存储 NIC。在存储性能测试中,这将提供出色的延迟和高带宽,但在实际工作负载中,这将导致 GPU 0 成为瓶颈,因为将 IB NIC 用于存储会造成冲突。当存储集群中的磁盘发生故障时,将触发重建,这将在计算结构上产生大量网络流量,造成更多拥塞。你可以购买单独的专用存储结构,但这有点大材小用,因为你可以将存储流量放在前端网络上。

Although your storage servers could run on InfiniBand backend compute fabric, those who have tried have lost a lot of hair! This deployment will typically bind your IB NIC for GPU 0 to also act as your storage NIC. In hero storage benchmarking, this will deliver great latency and high bandwidth, but in real world workloads, this will cause your GPU 0 to be a straggler as utilizing the IB NIC for storage will create collisions. When disks fail in your storage cluster, a rebuild will be triggered, which will cause a meaningful amount of network traffic on your compute fabric, causing even more congestion. You could buy a separate dedicated storage fabric but this is overkill since you can just have storage traffic on your frontend networking.

我们建议将存储服务器和流量放在前端网络上。前端网络经常处于利用不足的状态,因为它主要用于互联网流量、SLURM/Kubernetes 管理和拉取容器镜像。

Our recommendation is that you put your storage servers and traffic on the frontend network. The frontend network often sits underutilized as it is used primarily for internet traffic, SLURM/Kubernetes management and pulling container images.

更多网络管理和软件包 More Network Management and Software Packages

在带内管理方面,为了运行高可用性 UFM 和 CPU 管理节点,我们建议至少部署三个 CPU 节点。在这三个节点中,两个需要 ConnectX NIC 来管理 InfiniBand 结构。第三个 CPU 节点将仅用于其他非 InfiniBand 管理任务。此外,还需要其他杂项 IT 设备,如物理防火墙、42U 机架、监控 PDU 等,但这些项目的价格并不会显著增加集群的总资本支出成本。

In terms of in-band management to run high availability UFM and CPU management nodes, we recommend deploying at least three CPU nodes. Out of these three nodes, two will require a ConnectX NIC to manage the InfiniBand fabric. The third CPU node will only be used for other non-InfiniBand management tasks. Furthermore, there are other miscellaneous IT equipment required such as physical firewalls, 42U Racks, monitored PDUs, among other items, but the price point for these items doesn’t add significantly to total cluster capex cost.

在默认的 Superpod 参考架构中,Nvidia 及其 OEM 合作伙伴会试图向你推销名为"Nvidia AI Enterprise"或"Base Command Manager (BCM)"的产品,其建议零售价为每个 GPU 每年 4,500 美元。BCM 是一个提供 AI 工作流和集群管理的软件包,但由于大多数客户会满足自己的工作流需求,这对新云业务来说并不是一个有价值的软件,但销售代表仍然会将其作为初始采购订单的一部分进行营销。这是我们 SemiAnalysis 优化集群物料清单(BoM)中另一个巨大的成本节省来源。

In the default Superpod Reference Architecture, Nvidia and their OEM partners will try to sell you something called “Nvidia AI Enterprise” or “Base Command Manager (BCM)”, for which the MSRP is at $4,500 per GPU per year. BCM is a software package that provides AI Workflow & Cluster management, but as most clients will cater to their own workflow needs, this is not a valuable piece of software to a Neocloud business, but sales reps will nonetheless market this as part of their initial purchase order. This is another source of huge cost savings in our SemiAnalysis Optimized Cluster BoM.

集群 BoM 资本支出总结:参考架构 vs SemiAnalysis 优化架构 Summary of Cluster BoM Capex: Reference Architecture vs SemiAnalysis Optimized Architecture

如下所示,使用 Nvidia Superpod 参考架构(RA),集群的全包成本约为每台计算服务器 31.8 万美元(不包括存储),但使用 SemiAnalysis 优化架构和 2:1 超卖比,全包成本仅为每台计算服务器 28.3 万美元(也不包括存储)。通过谈判帮助和进一步的成本削减,特别是在更大的集群上,我们已经帮助新云进行了更多优化。

As you can see below, with the Nvidia Superpod Reference Architecture (RA), the all-in cost for the cluster comes up to ~$318k per compute server (excluding storage), but using the SemiAnalysis Optimized Architecture with a 2:1 oversubscription, total all-in cost will just be $283k per compute server (also excluding storage). We have helped Neoclouds optimize even further beyond what is shown here through negotiation help and further cost cutting, especially on larger clusters.

来源:SemiAnalysis

驱动程序、用户体验和软件 Drivers, User Experience and Software

如果你来自大型科技公司或国家 HPC 实验室,用户需求很简单。用户希望 GPU 正常工作、网络正常运行、驱动程序正确安装、共享存储正常运行,以及像 SLURM 或 Kubernetes 这样的调度程序。然而,现实是绝大多数新云无法满足这些用户需求,这导致了糟糕的用户体验。

If you come from big tech or from a national HPC lab, user requirements are straightforward. Users want functioning GPUs, networking, properly installed drivers, a functioning shared storage and a scheduler such as SLURM or Kubernetes. However, the reality is that a vast majority of Neoclouds are not able to meet these user requirements which makes for poor user experiences.

首先是运行 GPU 所需的 GPU 驱动程序 - 我们需要 cuda-drivers-5xx 和 fabricmanager-5xx,以及 cuda-toolkit-12-x。

Starting with GPU drivers required to run the GPUs – we need cuda-drivers-5xx and fabricmanager-5xx, and cuda-toolkit-12-x.

Cuda-drivers-5xx 是 ubuntu/Linux 与 GPU 接口所需的内核空间 Nvidia 驱动程序。接下来是 fabricmanager-5xx,这是一个负责配置节点内 NV link 结构的软件包。没有 fabricmanager-5xx 包,节点内的 8 个 GPU 将无法通过 NV link 相互通信。Cuda-toolkit-12-x 是包含所有用户空间工具和 API 的工具包,如 NVCC,它是将 CUDA C++ 代码编译成 PTX 汇编和 Nvidia 机器代码的编译器。

Cuda-drivers-5xx is the kernel space Nvidia driver package needed for Ubuntu/Linux to interface with the GPUs. Next is fabricmanager-5xx, a software package responsible for configuring the intra-node NVLink fabric. Without the fabricmanager-5xx package, the 8 GPUs within a node would not be able to communicate with one another over NVLink. Cuda-toolkit-12-x is the toolkit that contains all the userspace tools and APIs like NVCC, the compiler that compiles CUDA C++ code into PTX assembly and Nvidia machine code.

对于网络,需要在每个 GPU 服务器上安装 Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED)驱动程序。这个包是 ConnectX-7 InfiniBand NIC 进行 RDMA(远程直接内存访问)和 OS 内核绕过所需的驱动程序。为了让你的 GPU 直接与 NIC 通信,你还需要 GPUDirect RDMA,这是包含在 cuda-drivers-5xx 中但默认未启用的额外内核驱动程序。没有这个驱动程序,GPU 将需要在 CPU RAM 中缓冲消息,然后这些消息才能到达 NIC。启用 GPUDirect RDMA 的命令是"sudo modprobe nvidia-peermem"。要进一步优化 GPU 到 NIC 的通信,你需要下载一个名为 Nvidia HPC-X 的包。

For the networking, Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) drivers are required to be installed on each GPU server. This package contains the drivers for the ConnectX-7 InfiniBand NICs to do RDMA (Remote Direct Memory Access) and OS kernel bypassing. For your GPUs to talk directly to your NIC, you also need GPUDirect RDMA, an additional kernel driver that is included in cuda-drivers-5xx but not enabled by default. Without this driver, the GPUs will need to buffer messages in CPU RAM before these messages can go to the NIC. The command to enable GPUDirect RDMA is “sudo modprobe nvidia-peermem”. To further optimize your GPU to NIC communication, you need to download a package called Nvidia HPC-X.

如果没有上述 GPUDirect RDMA 和 HPC-X 包,你的 GPU 只能以 80 Gbit/s 的速度发送和接收流量,而不是每个 GPU 400 Gbit/s 的线速率。启用这些包后,你的点对点发送和接收速率应该达到 391 Gbit/s,接近 400 Gbit/s 的线速率。

Without the aforementioned GPUDirect RDMA and HPC-X packages, your GPUs will only be able to send and receive traffic at 80Gbit/s out of the line rate of 400Gbit/s per GPU. With these packages enabled, your point to point send and receive rate should reach 391Gbit/s out of the line rate of 400Gbit/s.
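
一个简单的示意检查脚本(非权威方法):确认 nvidia_peermem 内核模块已加载,并用 infiniband-diags 的 ibstat 查看各端口速率;端到端验证仍应使用 nccl-tests 或 perftest。

A simple illustrative check (not an authoritative method): confirm the nvidia_peermem kernel module is loaded and use ibstat from infiniband-diags to read per-port rates; end-to-end validation should still be done with nccl-tests or perftest.

```python
import subprocess
from pathlib import Path

def peermem_loaded() -> bool:
    # GPUDirect RDMA shows up as the nvidia_peermem kernel module.
    return any(line.startswith("nvidia_peermem ")
               for line in Path("/proc/modules").read_text().splitlines())

def ib_port_rates() -> list[str]:
    # `ibstat` comes with infiniband-diags; each port reports a line like "Rate: 400".
    out = subprocess.run(["ibstat"], capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines() if "Rate:" in line]

if __name__ == "__main__":
    if not peermem_loaded():
        print("nvidia_peermem not loaded; run: sudo modprobe nvidia-peermem")
    print(ib_port_rates())
    # For an end-to-end check, run nccl-tests (all_reduce_perf) or perftest's
    # ib_write_bw between two nodes and compare against ~391 Gbit/s per GPU.
```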

接下来,用户会希望有调度和启动软件包。在新云市场,70% 的用户希望 SLURM 开箱即用,另 20% 希望 Kubernetes 开箱即用,最后 10% 主要希望安装自己的调度程序。

Next, users will want a scheduling and launching software package. In the Neocloud market, 70% of users want SLURM working out of the box, another 20% want Kubernetes working out of the box and the last 10% mostly want to install their own scheduler.

对新云来说,拥有开箱即用的 SLURM 或 Kubernetes 非常重要,因为最终用户通常没有安装这类调度程序的经验。这是因为来自大型科技公司或国家/大学实验室背景的用户通常有专门负责安装和操作这些 SLURM 软件的人。最终用户花 1-2 天时间自己安装 SLURM 的成本是相当高的,因为他们实际上是在为安装期间闲置的 GPU 集群付费。

It is quite important for Neoclouds to have SLURM or Kubernetes working out of the box as the end user is usually not experienced in installing these types of schedulers. This is because users who come from big tech or a national/university lab background usually had a dedicated person in charge of installing and operating the SLURM software. The cost for an end user having to spend 1-2 days to install SLURM themselves is significant as they will effectively be paying for a GPU cluster that is sitting idle during the installation time.

最后,100% 的客户还必须能够在需要时手动获得 GPU 节点的交互式终端(即 ssh) - 有管理的 SLURM 提供了这个功能。使用 SLURM,你可以运行 “srun --gres=gpu:8 -w NODE_NAME --pty bash” 来获得任何节点的交互式终端。

Finally, 100% of customers also must be able to manually get an interactive terminal (i.e. ssh) into their GPU nodes if needed - having managed SLURM provides this feature. With SLURM, you can run “srun --gres=gpu:8 -w NODE_NAME --pty bash” to get an interactive terminal into any node.

像 Crusoe 和 TogetherAI 这样的新云是金标准。因为它们开箱即用地安装了所有必需的 InfiniBand 驱动程序、GPU 驱动程序和调度软件,它们可以比竞争对手收取更高的溢价,并且客户流失率更低。

Neoclouds like Crusoe and TogetherAI are the gold standard. Because they have all the required InfiniBand drivers, GPU drivers, and scheduling software installed out of the box, they can charge a premium over their competitors and have lower churn.

来源:TogetherAI

对于最低可行体验的下一个用户要求是拥有快速响应的共享主目录和共享数据存储目录。所有 GPU 节点和登录节点都将在 /home/$USER/ 和 /data 挂载共享存储。这实际上意味着当最终用户可以启动任何 GPU 节点的交互式终端时,该节点将具有相同的主目录和文件。这非常棒,因为它意味着分配给用户的每个 GPU 节点都是可互换的,用户不需要关心他们使用的具体是哪个 GPU 服务器。此外,在启动多节点训练作业时,用户的所有代码都会自动出现在每个 GPU 节点上,因此用户不需要手动通过 ssh 复制代码(scp)到每个节点。

The next user requirement for a minimum valuable experience is having a snappy shared home directory and shared data storage directory. All GPU nodes and login nodes will have shared storage mounted at /home/$USER/ and at /data. What this really means is that when the end user can launch an interactive terminal into any GPU node, the node will have the same home directory and files. This is fantastic as it means that every GPU Node allocated to the user is fungible and the user need not care about exactly which GPU server they are using. Furthermore, when launching multi-node training jobs, all of the user’s code is automatically on every GPU node so the user doesn’t need to manually copy code over ssh (scp) to each node.

来源:SemiAnalysis

对于新云存储,用户挫败感的两个主要来源是文件卷随机卸载和用户遇到大量小文件(LOSF)问题。随机卸载问题的解决方案是使用一个名为"autofs"的程序,它会自动保持你的共享文件系统挂载。

With Neocloud storage, the two main sources of user frustration are file volumes randomly unmounting and users encountering the lots of small files (LOSF) problem. The solution to the random unmounting issue is to use a program called “autofs” that will automatically keep your shared filesystem mounted.

接下来,LOSF 问题很容易避免,因为它只有在你决定自己构建存储解决方案(如 NFS 服务器)而不是购买 Weka 或 Vast 等存储软件供应商时才会出现问题。如果集群存在 LOSF 问题,最终用户很快就会注意到,因为即使只是将 PyTorch 导入 Python 也会导致完全卡顿。

Next, the LOSF problem can easily be avoided as it is only an issue if you decide to roll your own storage solution like an NFS-server instead of paying for a storage software vendor like Weka or Vast. An end user will very quickly notice an LOSF problem on the cluster as the time to even import PyTorch into Python will lead to a complete lag out if an LOSF problem exists on the cluster.
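
一个用来直观感受 LOSF 行为的小脚本(仅为示意;目录路径与文件数量都是假设值,请在共享挂载点上的测试目录运行):

A small script for getting a feel for LOSF behaviour (illustrative only; the directory path and file counts are made-up values, run it against a test directory on the shared mount):

```python
import time
from pathlib import Path

MOUNT = Path("/data/losf_test")   # hypothetical directory on the shared filesystem
N_FILES = 10_000
CHUNK = 4 * 1024                  # 4 KiB small files vs one ~40 MiB file

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def write_small():
    for i in range(N_FILES):
        (MOUNT / f"small_{i}.bin").write_bytes(b"\0" * CHUNK)

def write_big():
    (MOUNT / "big.bin").write_bytes(b"\0" * CHUNK * N_FILES)

if __name__ == "__main__":
    MOUNT.mkdir(parents=True, exist_ok=True)
    print(f"{N_FILES} small files: {timed(write_small):.1f}s")
    print(f"one big file:        {timed(write_big):.1f}s")
    # On a healthy parallel filesystem both are fast; on an LOSF-prone setup the
    # metadata cost of the small-file loop dominates and the gap is dramatic.
```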

下面的图表是我们在 Crusoe 集群上测试时产生的,展示了优化且没有 LOSF 问题的集群存储解决方案应该如何表现。如你所见,即使在扩大 GPU 数量时,将 PyTorch 导入 Python 进程所需的时间也保持相对稳定。

The below diagram, produced during our testing on Crusoe’s cluster, demonstrates how a cluster storage solution that is optimized and free of the LOSF problem should behave. As you can see, the time to complete importing PyTorch into the python process stays relatively flat even when scaling up GPU count.

来源:SemiAnalysis
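
一个与上述测量思路一致的小脚本(仅为示意):对 import torch 计时并打印所在节点与 rank,可通过 srun 在多个节点上并发启动以复现该测试。

A small script in the spirit of the measurement above (illustrative only): it times `import torch` and prints the rank and node, and can be launched concurrently on many nodes via srun to reproduce the test.

```python
import os
import time

start = time.perf_counter()
import torch  # noqa: E402 - the import itself is what we are timing
elapsed = time.perf_counter() - start

# SLURM sets these when the script is launched via `srun python import_timer.py`.
rank = os.environ.get("SLURM_PROCID", "?")
host = os.environ.get("SLURMD_NODENAME", os.uname().nodename)
print(f"rank={rank} host={host} import torch took {elapsed:.2f}s "
      f"(torch {torch.__version__})")
# On a well-behaved shared filesystem this stays roughly flat as the number of
# concurrent ranks grows; on an LOSF-prone one it blows up.
```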

这与运行在未优化共享存储上的集群有天壤之别,在那里,在 Python 多节点训练运行中导入 PyTorch 所需的时间会急剧增加,经常导致集群完全无法使用。注意 Crusoe(金标准)与另一个存在 LOSF 问题的集群的行为差异。

This is a world of difference to a cluster that is running on unoptimized shared storage, where the time required to import PyTorch in a Python multi node training run explodes, often causing the cluster to be completely unusable. Notice the difference between Crusoe, the gold standard, and how another cluster with LOSF issues would behave.

来源:SemiAnalysis

多租户 Multitenancy

除非整个客户(租户)长期租用整个物理集群,否则每个物理集群可能会有多个并发客户。这意味着你需要提供前端以太网和后端 InfiniBand 网络的隔离,并实现客户之间的存储隔离。每个客户通常会将每个 GPU 服务器作为一个整体单元租用,这意味着不需要严格的计算服务器虚拟化,因为每个物理服务器只有一个客户。花时间细分节点是不值得的。使用标准 vLAN 可以轻松设置前端以太网网络的隔离。在 vLAN 中,虽然物理以太网结构是共享的,但每个客户的节点只能与分配给同一客户的其他节点通信。

Unless an entire customer (tenant) rents the whole physical cluster out for a long term, each physical cluster will probably have multiple concurrent customers. This means that you need to provide isolation of the frontend Ethernet and backend InfiniBand networks as well as implement isolation of storage between customers. Each customer will typically be renting each GPU server as a whole unit, which means compute server virtualization is not strictly needed as there is only one customer per physical server. Spending time on subdividing nodes is not worth it. Isolation is easy to set up for the frontend ethernet network using standard vLANs. With vLANs, while the physical ethernet fabric is shared, each customer’s nodes are only able to talk to other nodes that are assigned to the same customer.

来源:SemiAnalysis

与以太网 vLAN 相比,InfiniBand 多租户的设置和自动化并不那么容易,但学习曲线很快。在 InfiniBand 世界中,网络隔离是通过分区键(pKeys)实现的 - 本质上与 vLAN 是相同的概念。每个客户通过 pKeys 获得自己的隔离 InfiniBand 网络,只有具有相同 pKeys 的节点才能相互通信。

InfiniBand multi-tenancy is not as easy to set up and automate when compared to Ethernet vLAN, but the learning curve is very quick. In the InfiniBand universe, network isolation is accomplished using Partition Keys (pKeys) - essentially the same concept as vLAN. Each customer gets its own isolated InfiniBand network through pKeys and only nodes with the same pKeys can talk to each other.

来源:SemiAnalysis

pKeys 的创建和附加可以通过 UFM UI 仪表板轻松完成,或者通过使用 UFM REST API 完成。对于许多工程师来说,这实际上可能比自动化以太网 vLAN 更容易,因为 InfiniBand pKeys 有一个易于使用的 POST/GET/DELETE API。

The creation and attachment of pKeys can be done easily either through the UFM UI dashboard or through the UFM REST APIs. For many engineers, this may in fact be easier than automating Ethernet vLANs since there is an easy to use POST/GET/DELETE API for InfiniBand pKeys.
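As a rough sketch of what that automation can look like, the calls below create a pKey and attach port GUIDs to it as full members. The endpoint path, payload fields, credentials and GUID values are assumptions for illustration only and should be verified against the UFM REST API documentation for your UFM version.

# Hypothetical example: create pKey 0x7001 with two member port GUIDs
curl -k -u admin:password -X POST https://ufm-host/ufmRest/resources/pkeys \
  -H "Content-Type: application/json" \
  -d '{"pkey": "0x7001", "guids": ["043f720300f00a12", "043f720300f00a13"], "membership": "full", "index0": true}'

# List configured pKeys to verify the partition was created
curl -k -u admin:password https://ufm-host/ufmRest/resources/pkeys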

不幸的是,我们从自己的测试经验中看到,一些新云的 pkeys 设置不正确,允许一个客户的用户能够在 InfiniBand 网络上看到其他租户的节点。我们强烈建议客户亲自验证他们的 InfiniBand 网络是否与其他客户正确隔离。

Unfortunately, we have seen from our own testing experience that some Neoclouds have pKeys that are not properly set up, allowing one customer’s users to see other tenants’ nodes on the InfiniBand network. We highly recommend that customers personally verify that their InfiniBand network is properly isolated from other customers.

来源:Nvidia

来源:Nvidia

多租户在存储方面尤其重要。幸运的是,存储也相当简单管理,因为 AI 领域的主要存储提供商 Weka 和 Vast 都将多租户作为一级原语支持。

Multi-tenancy is especially important when it comes to storage. Fortunately, storage is also quite simple to manage, as the major storage providers in the AI space, Weka and Vast, both support multi-tenancy as a first-class primitive.

来源:SemiAnalysis

在 Weka 和 Vast Data 的软件中,你可以轻松创建租户(在 Weka 中称为组织),并为每个存储卷设置只分配给一个租户的访问控制策略。只要策略设置正确,软件就能提供强有力的保证,让每个客户的用户只能访问自己的存储卷。

Within Weka’s and Vast Data’s software, you can easily create Tenants (called Organizations in Weka) and set up an access control policy for each storage volume so that it is assigned to just one tenant. This software provides strong guarantees that, if the policies are set up correctly, each customer’s users will only be able to access their own storage volumes.

来源:Vast Data

来源:Weka

裸机或虚拟化 Bare Metal or Virtualization

对于 H100 SXM,最小计算单位是一台服务器,这意味着每台服务器在任何时候只会有一个客户。这意味着可以进行裸机部署,同时仍然保持安全性。裸机是可能的,而且确实很常见,但我们看到使用虚拟机有额外的好处,如更好的平均恢复时间和更强的可靠性。

For H100 SXM, the lowest unit of compute is one server, which means that each server will only ever have one customer at a time. This means that it is possible to do bare metal deployments while still maintaining security. Bare metal is possible and is indeed common, but we do see that utilizing VMs has added benefits such as superior mean time to recovery, and stronger reliability.

使用虚拟机时,如果客户使用的物理 GPU 服务器出现故障,那么新云可以轻松地将客户迁移或在热备用服务器上启动一个新的虚拟机。

When using VMs, if a physical GPU server being used by a customer breaks, then the Neocloud is able to easily migrate or spin up a new VM for the customer on a hot spare.

来源:SemiAnalysis

可以使用开源虚拟机监控程序(如 qemu-kvm)来创建 GPU 虚拟机:启动虚拟机时将 vCPU 固定到物理 CPU,并留出几个未固定的核心来运行虚拟机监控程序。

GPU VMs can be created using an open-source hypervisor such as qemu-kvm, which will start your VM with vCPUs pinned to physical CPUs, leaving a couple of cores unpinned to run the hypervisor.

你还需要将 vLAN 以太网接口绑定到 GPU 虚拟机。使用常见的虚拟机监控程序创建 CPU 虚拟机是一项现在大多数计算机科学毕业生都能完成的简单任务。要将虚拟机变成 GPU 虚拟机,你还需要为 GPU 和 InfiniBand NIC 进行 PCIe 直通。幸运的是,对于新云来说,NVIDIA 还没有想出如何对其 GPU 和 NIC 的 PCIe 直通收费。我们还看到新云使用 SR-IOV 创建虚拟 InfiniBand NIC 并传递到虚拟机中,而不仅仅是物理 InfiniBand NIC,尽管使用 SR-IOV 并不是严格必需的。

You will also need to bind your vLAN Ethernet interface to your GPU VM. Creating CPU VMs using a common hypervisor is a simple task that most Computer Science grads can do nowadays. To make a VM into a GPU VM, you also need to do PCIe passthrough for your GPUs and InfiniBand NICs. Fortunately for Neoclouds, NVIDIA has yet to figure out a way to charge for PCIe passthrough on their GPUs and NICs. We have also seen Neoclouds use SR-IOV to create virtual InfiniBand NICs and pass them through into the virtual machine instead of just the physical InfiniBand NIC, although using SR-IOV is not strictly needed.
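To make the above concrete, here is a very stripped-down sketch of what a qemu-kvm launch with VFIO passthrough can look like. The PCIe addresses, core count, memory size, disk path and bridge name are all hypothetical, and production setups typically drive this through libvirt rather than raw qemu flags.

# Hypothetical GPU VM launch: host CPU model, VFIO passthrough of one GPU
# (0000:17:00.0) and one InfiniBand NIC (0000:18:00.0), plus a vLAN-backed bridge.
qemu-system-x86_64 \
  -enable-kvm -cpu host -smp 26 -m 220G \
  -drive file=/var/lib/vms/customer1.qcow2,if=virtio \
  -device vfio-pci,host=0000:17:00.0 \
  -device vfio-pci,host=0000:18:00.0 \
  -netdev bridge,id=net0,br=br-vlan100 \
  -device virtio-net-pci,netdev=net0 \
  -nographic

vCPU-to-physical-core pinning would then be applied on top of this, for example via libvirt's vcpupin settings or taskset, as described above.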

来源:SemiAnalysis

你需要记住执行的另一个额外步骤是通过 NCCL_TOPO_FILE 变量手动传入 /etc/nccl.conf 中的 NUMA 区域和 PCIe 拓扑文件,因为 NCCL 和 Nvidia 驱动程序现在在 GPU 虚拟机内运行,因此无法自动检测 NUMA 区域和 PCIe 拓扑。如果没有这一步,NCCL 性能将只能达到应有带宽的 50%。

One additional step to remember is to manually pass in the NUMA regions and PCIe topology file through the NCCL_TOPO_FILE variable in /etc/nccl.conf, since NCCL and the Nvidia drivers now operate inside the GPU VM and are therefore unable to auto-detect the NUMA regions and the PCIe topology. Without this step, NCCL will operate at 50% of the bandwidth it should be achieving.

NCCL PCIe 拓扑文件,来源:SemiAnalysis
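For illustration, a minimal /etc/nccl.conf inside the GPU VM might look like the following; the topology file path is hypothetical, and the XML itself is normally generated from the host's real PCIe/NUMA layout.

# /etc/nccl.conf inside the GPU VM (path to the topology XML is hypothetical)
NCCL_TOPO_FILE=/etc/nccl-topo-h100-8gpu.xml
NCCL_DEBUG=WARN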

相比裸金属(Bare Metal),使用虚拟机(Virtual Machines)的一个缺点是,由于启用了 IOMMU,CPU 到 GPU 的传输带宽和延迟会稍差一些。但我们认为使用虚拟机是值得的,因为它能为终端用户提供更快的平均恢复时间,而且 HostToDevice(HtoD)传输通常与计算重叠进行,因此用户可能察觉不到明显的影响。

One of the downsides of Virtual Machines compared to Bare Metal is that the CPU to GPU transfer bandwidth and latency are slightly worse due to the enablement of the IOMMU. But we believe it is worth using Virtual Machines due to the faster mean time to recovery for the end user, and because HostToDevice (HtoD) transfers are often overlapped with compute anyways, there may not even be a noticeable effect for the end user.

由于 CPU 内存为 1-2 TB,kvm-qemu 管理程序在默认情况下需要很长时间才能启动虚拟机。相比之下,cloud-hypervisor 通过多线程并行预加载内存的优化,将 1TB 的预加载时间从 80 秒减少到了 6 秒。这个优化由 Crusoe Cloud 创建,并且幸运地被上游集成。根据我们的测试,Crusoe 的虚拟机能够在不到 90 秒的时间内启动。

Since there is 1-2TB of CPU RAM, the kvm-qemu hypervisor out of the box takes a long time to boot the VM. In contrast, cloud-hypervisor has an optimization that prefaults the memory in parallel using multiple pthreads, reducing the memory prefault time for 1TB from 80 seconds to just 6 seconds. This optimization was created by Crusoe Cloud and was fortunately upstreamed. From our testing, Crusoe’s VMs are able to boot up in less than 90 seconds.

快速启动的一个重要好处是,当客户的 GPU 服务器不可避免地出现故障时,新云运营商可以非常迅速地将虚拟机部署到备用节点上,并将其添加到客户的 SLURM 集群中,使客户能够迅速恢复训练。

The important benefit of a fast boot up is that when a customer’s GPU server inevitably fails, the Neocloud operator can very quickly deploy a VM to their hot spare node and add it into the customer’s SLURM cluster, allowing the customer to be able to very quickly resume training.

来源:SemiAnalysis

监控与常见错误 Monitoring and Common Errors

在监控仪表板方面,最低限度,我们建议通过 Grafana 和 Prometheus 安装 Nvidia Datacenter Manager 仪表板,以允许用户跟踪 GPU 温度、功耗以及活动的 XID 错误。

In terms of monitoring dashboards, at a bare minimum we recommend having the Nvidia Data Center GPU Manager (DCGM) dashboard through Grafana and Prometheus, allowing users to track GPU temperatures, power usage and active XID errors.

来源:SemiAnalysis Internal GPU Dashboard

此外,我们还建议新云安装 ipmi-exporter 来监控整体风扇速度、温度和其他 BMC(基板管理控制器)指标。在运行 CPU 部署时,标准做法是使用集中化的仪表板来跟踪这些指标。

Furthermore, we also recommend that Neoclouds install ipmi-exporter to monitor overall fan speeds, temperatures and other BMC metrics. It is standard practice when running CPU deployments to have some sort of centralized dashboard with all of these metrics.

来源:Grafana

监控软件架构包括在每个 GPU 节点上安装 IPMI exporter 和 DCGM exporter,然后在 CPU 管理节点上部署一个 Prometheus scraper 与 GPU exporter 通信,并将数据存储在 InfluxDB 数据库中。接下来,Grafana Web 服务器可以连接到 Prometheus,以可视化收集到的数据。

The software architecture for the monitoring involves having an IPMI exporter and DCGM exporter on each GPU node, then on a CPU management node deploying a Prometheus scraper to talk to the GPU exporters and store the data in an InfluxDB database. Next, the Grafana web server can be connected to Prometheus to visualize the collected data.
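A minimal Prometheus scrape configuration for this layout could look like the sketch below; the node names are placeholders, and the ports assume the common defaults of 9400 for dcgm-exporter and 9290 for ipmi_exporter, so adjust them to your actual deployment.

# prometheus.yml (sketch): scrape the DCGM and IPMI exporters on each GPU node
scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ['gpu-node-01:9400', 'gpu-node-02:9400']
  - job_name: ipmi
    static_configs:
      - targets: ['gpu-node-01:9290', 'gpu-node-02:9290']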

高级新云运营商还会有一个 promtail 日志记录器,用于汇总每个服务器的诊断消息(dmesg)日志。需要及时标记的两个常见问题是“电缆断开”以及“NIC 和/或收发器温度过高”。这些消息通常表明 InfiniBand 链路存在波动,需要在客户流失前及时处理。

Advanced Neocloud operators will also have a promtail logger that aggregates each server’s diagnostic message (dmesg) logs. Two common concerning dmesg messages that should be promptly flagged are cables being unplugged and NIC and/or transceiver temperatures overheating. Either of these messages probably indicates that you have a flapping InfiniBand link that needs to be promptly addressed before customers start churning.

来源:SemiAnalysis

另一个常见错误是,GPU 在 dmesg 或 DCGM XID 中没有报告任何错误,却输出了错误的矩阵乘法结果。这类错误称为“静默数据损坏”(SDC)。检测 GPU 上是否存在 SDC 的最简单方法是使用 Nvidia 的 DCGMI 4 级诊断工具(sudo dcgmi diag -r 4)。该工具可以捕获 95% 的常见 SDC,但仍有 5% 的 SDC 无法被检测到,导致非常漫长的调试过程和非常愤怒的客户。

Another common error is when GPUs report no errors at all through dmesg or through DCGM XID errors but output wrong matrix multiplication results. These errors are called silent data corruptions (SDCs). The easiest way to figure out if there are SDCs on your GPUs is with the Nvidia DCGMI level 4 diagnostics tool (sudo dcgmi diag -r 4). The tool will catch 95% of the most common SDCs but will unfortunately miss the remaining 5%, leading to very long debugging processes and very angry customers.

NCCL 死锁和停滞是很常见的问题,可能会导致训练作业停滞 30-35 分钟,直到 PyTorch 的 NCCL watchdog 杀掉整个训练作业。我们认为,新云运营商可以在这里为客户创造价值:添加自己的后台 NCCL 检查程序,检查活动的 SLURM 作业在过去 4 分钟内功耗是否超过 150W。如果功耗低于 150W,很可能意味着 NCCL 正在挂起、存在某种死锁,自动程序应发送电子邮件提醒客户重启 SLURM 作业。

NCCL deadlocking and stalling are both very common issues that can cause a training job to stall for 30-35 minutes before PyTorch’s NCCL watchdog kills the whole training job. We believe this is an area where Neoclouds can add value for their customers by adding their own background NCCL checker that inspects active SLURM jobs and checks whether the GPUs have been drawing more than 150W within the last 4 minutes. If power usage is below 150W, NCCL is probably hanging in some sort of deadlock, and a bot should automatically email the customer alerting them to restart their SLURM job.
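A rough sketch of such a checker, run on a node with an active SLURM job, is shown below. It simply samples per-GPU power draw with nvidia-smi over four minutes and raises an alert if every GPU stays under the 150W threshold; the polling interval and the mail-based alert hook are assumptions to adapt to your own tooling.

#!/bin/bash
# Hypothetical NCCL-hang heuristic: if every GPU on this node draws <150W
# for ~4 minutes while a job is running, the job is probably stuck in a collective.
THRESHOLD=150
for i in 1 2 3 4; do
  MAX_POWER=$(nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits \
              | sort -n | tail -1 | cut -d. -f1)
  if [ "$MAX_POWER" -ge "$THRESHOLD" ]; then
    exit 0   # at least one GPU is busy, nothing to flag
  fi
  sleep 60
done
echo "All GPUs on $(hostname) under ${THRESHOLD}W for 4 minutes; NCCL may be hung" \
  | mail -s "Possible NCCL hang on $(hostname)" ops@example.com   # placeholder alert hook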

一些最常见的问题性 InfiniBand UFM 错误代码包括 110(符号错误)、112(链路断开)、329(链路断开)、702(端口不健康)和 918(符号位错误警告)。我们建议遇到这些错误代码时立即联系工程师进一步调查,然而,现实中这些问题可能已经为新云的许多客户带来了严重的影响,他们可能已经在不断地联系新云运营商。

Some of the most common problematic InfiniBand UFM error codes to track are 110 (Symbol error), 112 (Link downed), 329 (Link went down), 702 (Port is considered unhealthy), and 918 (Symbol bit error warning). We generally recommend that users immediately ping an engineer to investigate further should they encounter any of the above error codes when tracking UFM errors. Realistically, however, these issues will probably already be causing serious problems for many of the Neocloud’s customers, who will already be spam pinging the Neocloud operator.

我们强烈建议新云运营商使用支持工单系统(如 Jira)来跟踪所有硬件故障和客户问题。如果没有工单和客户管理系统,问题可能会被忽略,导致客户流失增加。

We highly recommend that Neocloud operators have a support ticketing system like Jira to keep track of all hardware failures and customer issues. Without a ticketing and customer management system, issues will fall through the cracks and cause increased customer churn.

TensorWave Jira Portal,来源:TensorWave

更多提示与测试 More Tips and Tests

我们不常见到新云运营商使用的另一个功能是 SLURM topology.conf。SLURM 拓扑配置功能可以启动用户的 SLURM 训练作业,并为每个等级分配一个 SLURM_ID,以减少骨干级别的流量。如果某些重要消息的 SLURM_ID 分配不理想,可能会导致 20-30% 的速度下降。我们将在即将进行的 Nvidia NCCL 和 AMD RCCL 集体通信深入分析中讨论更多内容。

Another feature that we don’t see many Neocloud operators use is SLURM’s topology.conf. The SLURM topology configuration feature will launch users’ SLURM training jobs with a SLURM_ID assigned to each rank so as to reduce spine-level traffic. For certain important messages, having a SLURM_ID assigned suboptimally will result in a 20-30% slowdown. We will talk more about this in our upcoming Nvidia NCCL and AMD RCCL collective communication deep dive.
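A small hypothetical topology.conf for a two-tier leaf/spine fabric might look like the following, with the switch and node names invented for illustration; SLURM also needs TopologyPlugin=topology/tree set in slurm.conf for this file to take effect.

# topology.conf (hypothetical): describe which nodes hang off which leaf switch
SwitchName=leaf01 Nodes=gpu-node-[01-16]
SwitchName=leaf02 Nodes=gpu-node-[17-32]
SwitchName=spine01 Switches=leaf[01-02]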

通常,我们建议使用 nccl-tests 来跨集群进行分析,并与 Nvidia 和 OEM 的参考数据进行对比,以查看是否存在性能不足或退化情况。

In general, we recommend that you use nccl-tests to profile across your cluster and compare against Nvidia and your OEM’s reference numbers to see if there are any performance shortfalls or degradations.
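As one example of what such a check can look like, an all-reduce sweep over the standard nccl-tests binaries across two nodes might be launched roughly as follows; the hostnames, slot counts and MPI launcher flags are placeholders for your environment.

# Build nccl-tests, then sweep all_reduce from 16MiB to 256MiB across 16 GPUs (2 nodes x 8)
git clone https://github.com/NVIDIA/nccl-tests && cd nccl-tests && make MPI=1
mpirun -np 16 -H gpu-node-01:8,gpu-node-02:8 \
  ./build/all_reduce_perf -b 16M -e 256M -f 2 -g 1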

为了使 NCCL 测试更容易,我们正在开发一个称为 ClusterMAX-NCCL 的一行命令工具,用于运行并将集群与一组参考结果进行对比。

In order to make NCCL testing easy, we are developing a one liner function called ClusterMAX-NCCL to run and compare your cluster against a set of reference results.

ClusterMAX-NCCL 在所有不同类型的集合操作中,测试了从 16MiB 到 256MiB 的所有重要消息大小。我们最近推出了支持单节点 NCCL 测试的测试工具的测试版。以下是一行命令来加载和运行 ClusterMAX-NCCL:

In ClusterMAX-NCCL, we test against all the important message sizes from 16MiB to 256MiB for all the different types of collectives. We have recently launched a beta version of this tool that supports single node NCCL testing. Below is the one-liner to load and run ClusterMAX-NCCL:

docker run --gpus all --ipc=host --shm-size 192G -v $(pwd)/results:/workspace/results semianalysiswork/clustermax-nccl

如果节点配置正确,您应该会看到类似于以下的结果:

If your node is configured properly, you should see results similar to the below:

来源:SemiAnalysis

提供具有竞争力的定价、强大的可靠性以及正确设置的集群是大多数新云的核心价值区分。我们在此之外看到的唯一差异化价值来自一家名为 TogetherAI 的新云,Flash Attention 的发明者 Tri Dao 就在这家公司工作。TogetherAI 为他们的 GPU 客户提供了一套专属的超优化 CUDA 内核,这些内核可以轻松集成到客户现有的训练代码中,从而为客户提供 10-15% 的训练吞吐量性能提升。

Delivering competitive pricing, strong reliability and a properly set up cluster is the bulk of the value differentiation for most Neoclouds. The only differentiated value we have seen outside this set is from a Neocloud called TogetherAI where the inventor of Flash Attention, Tri Dao, works. TogetherAI provides their GPU customers a set of exclusive hyper optimized CUDA kernels that are made to be easily integrated into the customer’s existing training code, thus providing the customer with a quick 10-15% performance increase in training throughput.

基本上,通过将训练速度提高 10-15%,客户可以节省 10-15% 的 GPU 费用,或者在相同的 GPU 预算下,用更多的 token 来训练模型,从而提升模型的性能。我们认为,除非复制 Tri Dao,否则 TogetherAI 所创造的价值无法在其他地方复现。

Basically, by being able to speed up training by 10-15%, the customer can save 10-15% of their GPU spending or alternatively take the same GPU dollar budget and train their model on 10-15% more tokens leading to a model performance boost. We don't believe the value created by Together can be replicated elsewhere without cloning Tri Dao.

来源:TogetherAI

集群部署与验收测试 Cluster Deployment and Acceptance Test

集群部署通常依赖于 OEM 的机架级集成和部署团队。这些团队将在集成工厂对单个服务器和集群进行网络测试。我们建议集群广泛的高温烧机测试应持续至少 3-4 周,以捕获节点组件中的所有早期故障。集成团队非常常见地会推荐使用 LINPACK 作为其烧机和验收测试过程,但我们认为这并不是一个很好的测试方法,因为 LINPACK 并没有充分利用网络,也没有对 GPU 的 HBM 内存进行高强度测试,而只是利用并测试了 GPU 的 FP64 核心。相比之下,机器学习训练对网络、HBM 内存以及 BF16/FP16/FP8 张量核心的需求更高,因此我们认为需要一个真正针对相关组件进行高强度测试的烧机和验收测试。

Cluster deployments typically leverage OEMs’ rack-scale integration and deployment teams. These teams will integrate and test at the individual server level and at the cluster-wide level, with network testing carried out at the OEM’s integration factory. We recommend that the cluster-wide high-temperature burn-in last at least 3-4 weeks to catch all the infant-mortality failures among the nodes’ components. It is extremely common for integration teams to pitch LINPACK as their burn-in and acceptance process, but we don’t believe this is a very good test, as LINPACK does not utilize the network much, nor does it sweat the GPU’s HBM memory, instead only exercising the GPU’s FP64 cores. ML training, by contrast, is very network, HBM and BF16/FP16/FP8 tensor core intensive, and as such we believe a burn-in and acceptance test that actually burns in the relevant components is needed.

来源:SemiAnalysis

在集成工厂完成集成和烧机测试后,OEM 会将所有机架和电缆打包,运送到新云的数据中心,之后还需要大约两周的时间来将集群部署到这个托管数据中心。我们建议新云在集群现场部署完成后,再进行为期 2-3 天的烧机/验收测试,尽管集成工厂已经进行了烧机测试。这是为了确保硬件在运输或现场部署过程中没有损坏。一个常见的问题是,由于运输和安装过程中光纤连接端点积尘,导致 InfiniBand 链路不稳定。解决方法是清洁出现问题的光纤端点。不过,有时可能还存在更深层次的问题需要被发现和解决。

After the integration and burn-in is completed at the integration factory, the OEM will pack up all the racks and cabling and deliver them to the Neocloud’s datacenter, after which it will take another two weeks to deploy the cluster into the colocation data center. We recommend Neoclouds conduct another 2-3 day burn-in/acceptance test once the cluster has been set up on site, even though the integration factory burn-in has already been carried out. This is to make sure that no hardware was damaged during transportation or on-site deployment. A very common issue that crops up is flapping InfiniBand links due to dust that accumulated on the fiber connection endpoints during transportation and setup. The fix is to clean the fiber ends of the flapping endpoints. Sometimes, though, there are deeper issues that must be found and solved.

日常操作 Day to Day Operations

新云的日常操作大多是解决一个又一个问题。如果有良好的内部管理和调试工具,整个过程会顺利进行且充满满足感,但大多数新云的工程师时间都花在处理问题上,而不是构建更好的问题解决工具。

Day to day operations at Neoclouds mostly consist of whacking one mole after another. Having good internal management and debugging tooling will make this process run smoothly and even be quite satisfying/enjoyable, but at many Neoclouds there are not enough engineers to build these tools, as ironically most of the engineers’ time is spent whacking moles instead of building better mole-whacking tools.

在新云集群运营的日常工作中,通常会遇到各种问题,包括 InfiniBand 收发器波动、GPU “掉线”、GPU 高带宽内存(HBM)错误和静默数据损坏(SDC)。大多数时候,这些问题可以通过简单地重启物理服务器来解决。实际上,许多运营商会构建一个用户界面按钮,或者教客户自己硬重启服务器来解决问题。在其他情况下,解决方案可能涉及重新插入 InfiniBand 收发器或清理光纤电缆上的灰尘。还有一些情况下,需要联系 OEM 或系统集成商进行保修 RMA(退货授权)并更换整个服务器。

Some of the most common moles that will pop up around the cluster are flapping IB transceivers, GPUs “falling off the bus”, GPU HBM errors, and SDCs. Most of the time, these issues can be solved simply by hard rebooting the physical server; many operators build a UI button for this or teach the customer to hard power cycle the server themselves. In other cases, the resolution is to unplug and re-plug the InfiniBand transceiver or to clean the dust off of the fiber cables. Other cases will require calling up the OEM or system integrator for a warranty RMA to replace the entire server.

如上所述,新云集群的早期阶段故障非常频繁,因为大多数新云在将集群交付给客户之前并没有进行烧机测试。正如 Yi Tay 所观察到的,没有做烧机测试的集群在可靠性上要比做了烧机测试的集群差几个数量级。

As mentioned above, failures are very common during the early phase of a Neocloud cluster, as most Neoclouds do not burn in their clusters before handing them over to customers. As Yi Tay observed, clusters that do not do burn-in are orders of magnitude worse when it comes to reliability than clusters that do conduct burn-in testing.

这是 TogetherAI 和 Crusoe 表现优异的另一个维度。它们是少数在将集群交付给客户之前会进行多周烧机测试的新云。此外,雇佣并留住那些拥有多年操作 Nvidia GPU 和 InfiniBand 网络经验的员工的公司,往往遇到的故障率也要低得多,因为设置可靠集群的许多知识是属于一种“部落知识”,通过经验积累掌握如何正确调试和预防 AI 集群中可能出现的错误。

This is another dimension where TogetherAI and Crusoe score strongly, as they are among the few Neoclouds that do multiple-week-long burn-ins prior to handing over clusters to customers. Furthermore, companies that have hired and retained people with years of prior experience operating Nvidia GPUs and InfiniBand networking tend to encounter much lower failure rates, since a lot of the knowledge required to set up reliable clusters is unwritten tribal knowledge about how to properly debug and prevent errors in AI clusters.

来源:Yi Tay

我们发现,顶级的 H100 运营商在拥有 512 块 H100 的集群上,平均故障间隔时间(MTBF)大约是 7 天。对于这些顶级运营商来说,大多数故障只需重启节点即可轻松修复。

We see that a top tier H100 operator typically experiences a mean time between failures of 7 days for a cluster that has 512 H100s. For these top tier operators, most of the time, failures are easily fixable by just restarting the node.

来源:SemiAnalysis
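To put that figure in per-node terms (our own back-of-envelope reading, not a number from the source): 512 H100s is 64 eight-GPU servers, so if failures are roughly independent, a cluster-level MTBF of about 7 days implies a per-node MTBF on the order of 64 × 7 ≈ 448 days, i.e. each individual server fails a bit less than once a year on average.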

接下来,我们将进入深入分析的第二部分,主要侧重于 AI 新云的经济学和商业案例。这部分分析对投资者和商业策略分析师特别有用,同时也能为买家提供一些见解,帮助他们更好地理解其中的定价动态和经济效应。我们确实认为所有 Neocloud 的管理人员、工程师、客户和投资者都应该了解并深入理解第一部分中的 AI 新云部署分析。

We will now turn to the second part of our deep dive in which we will focus mainly on the economics of and business case for AI Neoclouds. The analysis here will be particularly useful for investors and business strategy analysts but can also provide some insights to buyers to better understand the pricing dynamics and economics in effect here. We do think that all managers, engineers, customers and investors of Neoclouds should understand and internalize the AI Neocloud deployment deep dive in Part 1 as well.

由于篇幅限制,文章后半部分 AI 新云操作指南与架构分析【商业篇】将在本公众号未来几天发出,无需付费,敬请期待。

