Using and Managing GPUs in Kubernetes

Introduction

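With the rapid development of artificial intelligence and machine learning, demand for running model training and image processing workloads on Kubernetes keeps growing. Meeting that demand depends on Kubernetes support for, and management of, hardware accelerators such as GPUs. This article covers what to watch out for when getting GPU programs up and running on Kubernetes.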

Where Kubernetes GPU Support Falls Short

Kubernetes offers fine-grained control over a host's CPU, memory, and network, but as of this writing it cannot manage GPUs the way it manages CPUs. The limitations include:

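GPU resources can only be specified under limits. You may set limits alone, or set both requests and limits as long as the two values are equal; setting requests without limits is not allowed.
Pods and containers cannot share a GPU, and GPUs cannot be overcommitted (which is why our production programs run as a DaemonSet).
GPUs cannot be requested in fractional amounts.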
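The rules above can be written down as a small check. This is only an illustrative sketch: `validate_gpu_resources` is a hypothetical helper, not part of Kubernetes, and it assumes GPU counts are given as plain Python integers rather than Kubernetes quantity strings.

```python
from typing import Dict, List, Optional

GPU_RESOURCE = "nvidia.com/gpu"


def validate_gpu_resources(requests: Optional[Dict[str, object]],
                           limits: Optional[Dict[str, object]]) -> List[str]:
    """Return a list of rule violations; an empty list means the spec is acceptable."""
    errors: List[str] = []
    req = (requests or {}).get(GPU_RESOURCE)
    lim = (limits or {}).get(GPU_RESOURCE)

    # GPUs are whole devices: no fractions, no negative counts.
    for label, value in (("requests", req), ("limits", lim)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{label}: GPU count must be a non-negative whole number, got {value!r}")

    # requests may not appear without limits.
    if req is not None and lim is None:
        errors.append("requests is set for GPUs but limits is missing")

    # When both are present they must be equal.
    if req is not None and lim is not None and req != lim:
        errors.append(f"requests ({req!r}) and limits ({lim!r}) must be equal")

    return errors


print(validate_gpu_resources(None, {GPU_RESOURCE: 1}))
print(validate_gpu_resources({GPU_RESOURCE: 1}, None))
```

The first call describes a valid spec (limits only), so no violations are returned; the second sets requests without limits and is rejected.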

How Kubernetes Manages GPU Resources

Extended Resources

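Unlike CPUs, hardware accelerators come in many types, such as GPUs, NICs, and FPGAs, and from more than one vendor. Supporting each of them individually in Kubernetes would be impractical, so Kubernetes treats all of these devices uniformly as extended resources.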
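One way to picture an extended resource is as an opaque name with a whole-number count attached. The sketch below is a simplified model of that bookkeeping, not the scheduler's actual code; the `fits` function and the `example.com/fpga` resource name are made up for illustration.

```python
from typing import Dict


def fits(allocatable: Dict[str, int],
         in_use: Dict[str, int],
         pod_limits: Dict[str, int]) -> bool:
    """True if every extended resource the pod asks for is still free on the node."""
    for name, wanted in pod_limits.items():
        free = allocatable.get(name, 0) - in_use.get(name, 0)
        if wanted > free:
            return False
    return True


# What a node advertises: resource names mapped to integer counts.
node_allocatable = {"nvidia.com/gpu": 2, "example.com/fpga": 1}
# What already-running pods have been given.
node_in_use = {"nvidia.com/gpu": 1}

print(fits(node_allocatable, node_in_use, {"nvidia.com/gpu": 1}))  # one GPU is still free
print(fits(node_allocatable, node_in_use, {"nvidia.com/gpu": 2}))  # only one left, two wanted
```

Because the counts are plain integers tied to a name, the same accounting works for any vendor's device without Kubernetes knowing what the device is.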

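The Pod API object has no built-in resource type for GPUs as it does for CPU; instead, GPU information is passed through the extended resource field just described. Below is the official example of a Pod that requests NVIDIA hardware: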

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

For a YAML manifest like the one above to actually obtain a GPU, a Device Plugin must first be installed on the node.

Device Plugin

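Device plugins are vendor-specific; here we use the Device Plugin provided by NVIDIA.
The official NVIDIA GPU device plugin has the following requirements:

Kubernetes nodes must have the NVIDIA drivers pre-installed
Kubernetes nodes must have nvidia-docker 2.0 pre-installed
Docker's default runtime must be set to nvidia-container-runtime rather than runc
NVIDIA driver version ~= 384.81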

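The installation procedure is covered in the NVIDIA device plugin's official documentation and is not repeated here. What follows looks at what the Device Plugin does and how it is implemented. It has two jobs:

Expose the number of GPUs on each node
Make it possible to run GPU-enabled containers on Kubernetes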

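Device Plugin workflow:

Step 1: The plugin sends a registration request to the kubelet's Device Plugin Manager.

Step 2: The plugin runs a gRPC service that the kubelet uses to communicate with it. (In practice this service has to be listening by the time the registration is sent, because the kubelet connects back to it.)

Step 3: The kubelet calls the ListAndWatch API to obtain the device list; the call returns a stream, so the plugin sends a new list whenever a device changes state or disappears.

Step 4: The kubelet reports the device information it has received to the API server.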
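The four steps can be mimicked with a toy in-process model. This is a sketch only: `ToyPlugin` and `ToyKubelet` are hypothetical classes, there is no real gRPC or Unix socket involved, and the ListAndWatch stream is approximated with a Python generator that yields one device list per state change.

```python
from typing import Dict, Iterator, List


class ToyPlugin:
    """Stand-in for a device plugin; 'snapshots' are the device lists it will stream."""
    resource_name = "nvidia.com/gpu"

    def __init__(self, snapshots: List[List[str]]) -> None:
        self.snapshots = snapshots
        self.serving = False

    def serve(self) -> None:
        # Stand-in for starting the plugin's gRPC server.
        self.serving = True

    def list_and_watch(self) -> Iterator[List[str]]:
        # A real plugin keeps the stream open; here each yield is one update.
        for devices in self.snapshots:
            yield list(devices)


class ToyKubelet:
    """Stand-in for the kubelet's device manager plus its node status reporting."""

    def __init__(self) -> None:
        self.plugins: Dict[str, ToyPlugin] = {}
        self.node_capacity: Dict[str, int] = {}  # what would be reported to the API server

    def register(self, plugin: ToyPlugin) -> None:
        if not plugin.serving:
            raise RuntimeError("plugin must be serving before it registers")
        self.plugins[plugin.resource_name] = plugin

    def sync(self) -> None:
        for name, plugin in self.plugins.items():
            for devices in plugin.list_and_watch():
                self.node_capacity[name] = len(devices)


# Two GPUs at first, then one disappears.
plugin = ToyPlugin(snapshots=[["GPU-0", "GPU-1"], ["GPU-0"]])
plugin.serve()
kubelet = ToyKubelet()
kubelet.register(plugin)
kubelet.sync()
print(kubelet.node_capacity)
```

After the second update the advertised capacity drops to one GPU, which is the count the scheduler would then work with.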

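Whether for NVIDIA or any other kind of hardware, anyone implementing their own device plugin for Kubernetes must follow the Device Plugin specification and implement the ListAndWatch and Allocate APIs shown in the code below.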

// DevicePlugin is the service advertised by Device Plugins
service DevicePlugin {
	// GetDevicePluginOptions returns options to be communicated with Device
	// Manager
	rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

	// ListAndWatch returns a stream of List of Devices
	// Whenever a Device state change or a Device disappears, ListAndWatch
	// returns the new list
	rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

	// GetPreferredAllocation returns a preferred set of devices to allocate
	// from a list of available ones. The resulting preferred allocation is not
	// guaranteed to be the allocation ultimately performed by the
	// devicemanager. It is only designed to help the devicemanager make a more
	// informed allocation decision when possible.
	rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}

	// Allocate is called during container creation so that the Device
	// Plugin can run device specific operations and instruct Kubelet
	// of the steps to make the Device available in the container
	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

	// PreStartContainer is called, if indicated by Device Plugin during registration phase,
	// before each container start. Device plugin can run device specific operations
	// such as resetting the device before making devices available to the container
	rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}
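To make the Allocate call more concrete, here is a rough Python model of what a GPU plugin might return for each container. It is a sketch under assumptions: the dataclass only loosely mirrors the response message, and passing the chosen devices through the `NVIDIA_VISIBLE_DEVICES` environment variable is assumed here as one common strategy for NVIDIA's runtime, not something stated in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ContainerAllocateResponse:
    """Simplified stand-in for the per-container part of an Allocate response."""
    envs: Dict[str, str] = field(default_factory=dict)


def allocate(container_requests: List[List[str]],
             known_devices: Set[str]) -> List[ContainerAllocateResponse]:
    """Build one response per container from the device IDs the kubelet picked."""
    responses: List[ContainerAllocateResponse] = []
    for device_ids in container_requests:
        unknown = [d for d in device_ids if d not in known_devices]
        if unknown:
            raise ValueError(f"unknown device IDs: {unknown}")
        # Assumption: the container runtime reads this variable to decide
        # which GPUs to expose inside the container.
        responses.append(
            ContainerAllocateResponse(envs={"NVIDIA_VISIBLE_DEVICES": ",".join(device_ids)})
        )
    return responses


result = allocate([["GPU-0", "GPU-1"]], known_devices={"GPU-0", "GPU-1", "GPU-2"})
print(result[0].envs)
```

The point of the design is that the kubelet decides which device IDs go to which container, while the plugin only translates those IDs into whatever the runtime needs to expose them.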

Summary

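Overall, managing GPUs and similar hardware through a Device Plugin is still coarse-grained, and the control it offers is not very fine. In many cases, therefore, a Device Plugin is not strictly required to put GPUs to use. I have seen many companies that neither specify a GPU count in their YAML files nor install a Device Plugin in the cluster: their programs run as a DaemonSet and each machine has a single GPU, so each program effectively has a GPU to itself. As for making the GPU device and driver available inside the Docker container, that can be done by setting the NVIDIA_DRIVER_CAPABILITIES environment variable in the YAML file: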

# Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html
containers:
- env:
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: compute,utility,video
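For context, the fragment above could sit inside a DaemonSet roughly like the one generated below. This is a hedged sketch: the name `gpu-worker`, the labels, and the image `example.com/gpu-worker:latest` are placeholders, and it assumes the nodes already use nvidia-container-runtime as Docker's default runtime as described earlier. The script prints JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json
from typing import Any, Dict


def build_daemonset(name: str, image: str) -> Dict[str, Any]:
    """Build a DaemonSet manifest with no GPU limits and no device plugin dependency."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,  # placeholder image name
                            "env": [
                                {
                                    "name": "NVIDIA_DRIVER_CAPABILITIES",
                                    "value": "compute,utility,video",
                                }
                            ],
                        }
                    ]
                },
            },
        },
    }


manifest = build_daemonset("gpu-worker", "example.com/gpu-worker:latest")
print(json.dumps(manifest, indent=2))
```

Since a DaemonSet places one pod per node, on single-GPU machines this gives each pod the GPU to itself without Kubernetes tracking the device at all.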

Source: https://xie.infoq.cn/article/360e3286b7670358825081446