Unable to run TensorFlow with vGPU
Running ESXi 6.5sp3

ESXi: NVIDIA-GRID-vSphere-6.5-440.53-440.56-442.06
Created a new VM with Ubuntu 18.04
In the VM I installed: NVIDIA-Linux-x86_64-440.56-grid.run
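For reference, the guest driver install went roughly like this (a sketch; the exact installer prompts and any GRID licensing setup may differ in your environment):

[code]
# Kernel headers and build tools are needed so the installer can build the kernel module
apt-get update && apt-get install -y build-essential linux-headers-$(uname -r)

# Run the GRID guest driver installer (filename from this setup)
chmod +x NVIDIA-Linux-x86_64-440.56-grid.run
./NVIDIA-Linux-x86_64-440.56-grid.run

# Confirm the kernel module is loaded and the device nodes exist
lsmod | grep nvidia
ls -l /dev/nvidia*
[/code]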

In the VM, nvidia-smi works:

[code]
root@tfe-1:~# nvidia-smi
Sun Feb 23 08:39:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.56 Driver Version: 440.56 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID P4-4C On | 00000000:02:00.0 Off | N/A |
| N/A N/A P8 N/A / N/A | 336MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[/code]



Now I try to run a Docker container (with CUDA, cuDNN, and TensorFlow) on top of the VM:

[code]
docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.12-tf1-py3
[/code]


I get this warning:

[code]
================
== TensorFlow ==
================

NVIDIA Release 19.12-tf1 (build 9258376)
TensorFlow Version 1.15.0

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
[/code]



The key line is "WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available."
When I run TensorFlow inside the container I get:

[code]
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-02-23 08:38:41.963456: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
>>> tf.test.gpu_device_name()
2020-02-23 08:38:54.074797: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194840000 Hz
2020-02-23 08:38:54.075181: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5507dd0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-23 08:38:54.075213: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-02-23 08:38:54.077045: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-02-23 08:38:54.077082: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-23 08:38:54.077113: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
[/code]
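The missing libcuda.so.1 and /dev/nvidia0 suggest the container never got the driver libraries or the device nodes from the VM; the driver itself is clearly fine on the VM side. A quick way to confirm where the break is (a sketch comparing the VM with the container):

[code]
# In the VM (outside the container): driver library and device nodes should be present
ldconfig -p | grep libcuda.so     # expect libcuda.so.1 from the 440.56 driver
ls -l /dev/nvidia*                # expect at least /dev/nvidia0 and /dev/nvidiactl

# Inside a container started with plain "docker run": both are typically missing,
# because nothing mounted the driver libraries or device nodes into the container
ldconfig -p | grep libcuda.so
ls -l /dev/nvidia*
[/code]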



Troubleshooting so far:
-----------------
    - Installed nvidia-modprobe (`apt install nvidia-modprobe`) in both the VM and the container
    - Checked inside the container:


[code]
root@5a278668fe9c:/workspace# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@5a278668fe9c:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
root@5a278668fe9c:/usr# nvidia-smi
bash: nvidia-smi: command not found
[/code]
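A minimal sketch of what I would check next, assuming the NVIDIA container toolkit is (or gets) installed on the VM: ask the container runtime helper what it can see, and start the container with the GPU explicitly requested instead of plain docker run:

[code]
# On the VM: does the container runtime helper see the vGPU and the 440.56 driver?
# (nvidia-container-cli ships with libnvidia-container / the NVIDIA container toolkit)
nvidia-container-cli info

# With Docker 19.03+ and the toolkit installed, request the GPU explicitly
docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -it --rm -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.12-tf1-py3 nvidia-smi
[/code]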

#1
Posted 02/23/2020 08:55 AM   
I'm having the same issue. From what I have found, this is because Docker is not running the container with the "nvidia" runtime; it is still using the default "runc" runtime. I'm having trouble figuring out which documentation is correct for getting the nvidia runtime installed; the various docs I've read seem to contradict each other about which versions of which components need to be installed.

[code]
nvidia-smi
Thu Mar 5 13:56:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:0B:00.0 Off | 0 |
| N/A 28C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[/code]


[code]
sudo docker info
...
Server Version: 19.03.5
Storage Driver: overlay2
...
Runtimes: runc
Default Runtime: runc
[/code]
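From what I can tell, once the NVIDIA runtime is actually installed you don't have to make it the default; it can be selected per container (a sketch, assuming nvidia-docker2 or nvidia-container-toolkit is present):

[code]
# Select the NVIDIA runtime for a single container (nvidia-docker2 style)
sudo docker run --runtime=nvidia --rm nvcr.io/nvidia/tensorflow:19.12-tf1-py3 nvidia-smi

# Or, with Docker 19.03+ and nvidia-container-toolkit, request GPUs directly
sudo docker run --gpus all --rm nvcr.io/nvidia/tensorflow:19.12-tf1-py3 nvidia-smi
[/code]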


I'll update as I find any useful info.

#2
Posted 03/05/2020 07:05 PM   
Got it working! Basically, I needed to install the NVIDIA container runtime and edit the Docker daemon.json so that containers use the nvidia runtime by default.
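The runtime install itself went roughly like this (a sketch based on the nvidia-docker repository instructions for Ubuntu 18.04 at the time; the repo URLs and package name are from memory, not from this post, so adjust for your distribution):

[code]
# Add the nvidia-docker apt repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the package that provides /usr/bin/nvidia-container-runtime, then restart Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
sudo systemctl restart docker
[/code]

My /etc/docker/daemon.json ends up like this: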

[code]
{
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "2"
    },
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "storage-driver": "overlay2"
}
[/code]
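After another Docker restart, something like this should confirm the runtime is active (the image tag is the one from this thread):

[code]
sudo systemctl restart docker

# The default runtime should now be nvidia
sudo docker info | grep -i runtime

# And the container should see the GPU without any extra flags
sudo docker run --rm nvcr.io/nvidia/tensorflow:19.12-tf1-py3 nvidia-smi
[/code]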

#3
Posted 03/09/2020 07:59 PM   