ESXi 6.7 + Tesla V100 + 430.27 not working
Hello,

we have ESXi 6.7 installed on our server. Now I want to pass the Tesla V100 through to one VM.

I installed the latest host driver for ESXi:
NVIDIA-VMware_ESXi_6.7_Host_Driver-430.27-1OEM.670.0.0.8169922.x86_64.vib and rebooted the machine.
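For reference, the install sequence looks roughly like this (the datastore path is only an example, not the actual path on my host):

```shell
# Install the vGPU Manager VIB on the ESXi host (esxcli requires an absolute path)
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-VMware_ESXi_6.7_Host_Driver-430.27-1OEM.670.0.0.8169922.x86_64.vib

# Reboot so the NVIDIA kernel module can load
reboot

# After the reboot, confirm the VIB is installed and the module is loaded
esxcli software vib list | grep -i nvidia
vmkload_mod -l | grep nvidia
```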

But when I run nvidia-smi, I get an error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Can anybody help?

#1
Posted 07/16/2019 07:50 AM   
Hi

If you're using passthrough, you don't need to install the .vib on the ESXi host.

Remove the GPU from passthrough and use a vGPU profile instead. Then run nvidia-smi again.
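If you prefer the command line over the vSphere Web Client, the host default graphics type can be checked and switched with esxcli (a sketch; "Shared Direct" in the UI corresponds to SharedPassthru here, and the X.Org service restart may require a host reboot instead on some builds):

```shell
# Show the current host graphics configuration
esxcli graphics host get

# Switch the default type from Shared to SharedPassthru (vGPU)
esxcli graphics host set --default-type SharedPassthru

# Restart the X.Org service so the change takes effect
/etc/init.d/xorg restart

# List graphics devices and their active type
esxcli graphics device list
```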

Regards

MG

#2
Posted 07/16/2019 08:18 AM   
Hello, thank you for your reply.

maybe "passthrough" was not the correct word. I want to "attach" the vGPU to more than one VM like the Document

430.27-430.30-431.02-grid-software-quick-start-guide.pdf
Chapter 3. INSTALLING AND CONFIGURING NVIDIA VGPU MANAGER AND THE GUEST DRIVER describes.

I registered on the NVIDIA licensing portal and downloaded the package NVIDIA-GRID-vSphere-6.7-430.27-430.30-431.02.zip for ESXi 6.7.

The installation with "esxcli software vib install -v NVIDIA-VMware_ESXi_6.7_Host_Driver-430.27-1OEM.670.0.0.8169922.x86_64.vib" was also reported as successful.

But nvidia-smi throws the error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

#3
Posted 07/16/2019 12:30 PM   
It would be helpful to know which server hardware you are using. If it is a Dell, you need to modify your BIOS to restrict MMIO.
You should also run "dmesg" on the host to get more information about what the issue might be.
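For example, something like the following on the ESXi host (the exact grep patterns are just suggestions):

```shell
# Filter the kernel log for driver messages; NVRM lines often carry the actual error
dmesg | grep -iE 'nvidia|nvrm'

# The vmkernel log keeps more history than dmesg
grep -i nvidia /var/log/vmkernel.log | tail -n 50
```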

#4
Posted 07/16/2019 02:44 PM   
The machine is a Dell PowerEdge R740xd.

dmesg | grep NVIDIA shows:

2019-07-15T11:46:44.718Z cpu94:2101167)ALERT: NVIDIA: module load failed during VIB install/upgrade.
2019-07-15T11:46:44.722Z cpu109:2101168)NVIDIA: Starting vGPU Services.
2019-07-15T11:46:44.728Z cpu0:2101171)NVIDIA: Starting Xorg service.
2019-07-15T11:46:45.225Z cpu39:2101248)NVIDIA: Starting the DCGM node engine.

It looks like the VIB installation was not correct.

I cannot find information about "BIOS restrict MMIO". Is it what this NVIDIA support page describes?

https://nvidia.custhelp.com/app/answers/detail/a_id/4119/~/incorrect-bios-settings-on-a-server-when-used-with-a-hypervisor-can-cause-mmio

#5
Posted 07/17/2019 06:23 AM   
OK, I found the solution. Passthrough was enabled in ESXi. I disabled it and can now see information about my GPU with nvidia-smi.

#6
Posted 07/17/2019 11:59 AM   
Okay, it still does not work.

I disabled the passthrough setting for the PCI device on the ESXi host.
Using the vSphere Web Client, I changed the Host Graphics and Graphics Devices settings to Shared Direct.
Then I restarted the ESXi host.

Now nvidia-smi works:

[root@bigdata:~] nvidia-smi 
Fri Jul 19 10:21:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.27 Driver Version: 430.27 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 37C P0 28W / 250W | 39MiB / 32767MiB | 0% Default |
+-------------------------------+----------------------+----------------------+


vmkload_mod works:

[root@bigdata:~] vmkload_mod -l | grep nvidia
nvidia 13 17840


and dmesg has no errors:

[root@bigdata:~] dmesg | grep nvidia
2019-07-18T16:52:41.547Z cpu0:2097152)VisorFSTar: 1856: nvidia_v.v00 for 0x48fd082 bytes
2019-07-18T16:52:48.072Z cpu37:2098396)Loading module nvidia ...
2019-07-18T16:52:48.098Z cpu37:2098396)Elf: 2101: module nvidia has license NVIDIA
2019-07-18T16:52:48.471Z cpu37:2098396)nvidia-nvlink core initialized
2019-07-18T16:52:48.471Z cpu37:2098396)Device: 192: Registered driver 'nvidia' from 21
2019-07-18T16:52:48.472Z cpu37:2098396)Mod: 4962: Initialization of nvidia succeeded with module ID 21.
2019-07-18T16:52:48.472Z cpu37:2098396)nvidia loaded successfully.
2019-07-18T16:52:48.477Z cpu27:2098226)Device: 327: Found driver nvidia for device 0x47bd4309e61c8101
2019-07-18T16:52:55.943Z cpu78:2098417)NVRM: nvidia_associate vmgfx0
2019-07-18T16:53:25.704Z cpu60:2100286)Starting service nvidia-init
2019-07-18T16:53:25.704Z cpu60:2100286)Activating Jumpstart plugin nvidia-init.
2019-07-18T16:53:35.787Z cpu109:2100286)Jumpstart plugin nvidia-init activated.


But I still can't attach the graphics card. The menu option for adding PCI devices is greyed out:
https://www.directupload.net/file/d/5518/ejz8rkxv_png.htm

Any ideas?

#7
Posted 07/19/2019 10:00 AM   
Is ECC memory disabled? Is the correct license (Enterprise Plus) present on vSphere?

#8
Posted 07/21/2019 05:15 PM   
It was the license; we only have the Standard license for ESXi. I will use passthrough to one VM until we have the right licenses.

#9
Posted 07/23/2019 09:21 AM   