Problem Description
On a host with the NVIDIA driver and Docker installed, starting a GPU container directly fails with the following error:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
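Problem Analysis
The error means Docker has no device driver registered for the gpu capability. Install the nvidia-docker2 or nvidia-container-runtime package so that Docker containers can use the NVIDIA driver.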
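As a rough pre-check, the sketch below tests whether an NVIDIA container hook binary is on PATH, which is what Docker relies on to handle --gpus. The function name gpu_hook_present is hypothetical, and the binary names are an assumption about how these packages are typically laid out, so adjust them for your environment.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if an NVIDIA container hook binary is on PATH.
# Assumption: the packages install nvidia-container-runtime-hook
# (some releases ship the same hook as nvidia-container-toolkit).
gpu_hook_present() {
    command -v nvidia-container-runtime-hook >/dev/null 2>&1 ||
        command -v nvidia-container-toolkit >/dev/null 2>&1
}
```

For example, `gpu_hook_present && echo "hook found" || echo "hook missing"` gives a quick indication of whether one of the packages in this article still needs to be installed.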
Solution
I. Install nvidia-docker2
1. Set up the repository and GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
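The distribution variable above is simply the ID and VERSION_ID fields of /etc/os-release joined together (for example, centos7). The sketch below wraps that logic in two functions so the resulting repository URL can be inspected before downloading anything. The function names and the optional file-path argument are illustrative additions, not part of the original procedure.

```shell
#!/bin/sh
# Print "$ID$VERSION_ID" from an os-release style file.
# The path argument is an illustrative addition; it defaults to /etc/os-release.
# The body runs in a subshell so the sourced variables do not leak to the caller.
get_distribution() (
    os_release="${1:-/etc/os-release}"
    . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID"
)

# Print the repository file URL used in the step above for a distribution string.
repo_url() {
    printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.repo\n' "$1"
}
```

Running `repo_url "$(get_distribution)"` prints the URL that the curl command above would fetch on the current host.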
2. Refresh the cache
yum clean expire-cache
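Because installing nvidia-docker2 overwrites an existing /etc/docker/daemon.json (see step 4), it may be worth backing the file up before installing. The following is a minimal sketch; the function name backup_daemon_json and the timestamped .bak suffix are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: copy daemon.json to a timestamped backup and print the
# backup path. Does nothing (and prints nothing) if the file does not exist.
# The path argument is an illustrative addition; it defaults to the usual location.
backup_daemon_json() (
    src="${1:-/etc/docker/daemon.json}"
    if [ -f "$src" ]; then
        dest="$src.bak.$(date +%Y%m%d%H%M%S)"
        cp -p "$src" "$dest" && printf '%s\n' "$dest"
    fi
)
```

Calling `backup_daemon_json` with no arguments before the install step keeps a copy of any custom settings, which can then be merged back into the new file by hand.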
3. Install nvidia-docker2
yum install -y nvidia-docker2
4. Check the daemon.json file. ⚠️ The installation creates /etc/docker/daemon.json automatically and overwrites any existing daemon.json.
cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
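To confirm that the file really declares the nvidia runtime, a simple text match is often enough. The sketch below is only a pattern check, not a JSON parser, and the function name has_nvidia_runtime is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given daemon.json contains an "nvidia" key.
# This is a plain text match; it does not validate the JSON structure.
# The path argument is an illustrative addition; it defaults to the usual location.
has_nvidia_runtime() {
    grep -Eq '"nvidia"[[:space:]]*:' "${1:-/etc/docker/daemon.json}" 2>/dev/null
}
```

After Docker has been restarted, `docker info --format '{{json .Runtimes}}'` should also list the runtime if it was registered through daemon.json.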
5. Restart Docker
systemctl restart docker
6. Verify
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Fri Dec 10 02:06:20 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:04:01.0 Off |                    0 |
| N/A   36C    P0    15W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
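For scripted checks, it can be easier to extract a single field from this output than to read the whole table. The sketch below pulls the driver version out of nvidia-smi's header line. The function name driver_version is hypothetical, and the pattern assumes the "Driver Version: x.y.z" wording shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the driver version.
# Assumes the header contains "Driver Version: <digits and dots>" as shown above.
# Prints nothing if no such line is found.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

For example, `docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi | driver_version` prints only the version string when the container can reach the GPU, and prints nothing when the command fails.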
II. Install nvidia-container-runtime
1. Set up the repository and GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
2. Refresh the cache
yum clean expire-cache
3. Install nvidia-container-runtime
yum install nvidia-container-runtime
4. Restart Docker
systemctl restart docker
5. Verify
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Fri Dec 10 02:29:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:04:01.0 Off |                    0 |
| N/A   51C    P0    18W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If you have any other questions, please contact Volcano Engine technical support.