430.30 GRID vGPU driver fails to load on Linux (Tesla T4 x2)
Guest:
$ lspci | grep -i nvidia
02:02.0 VGA compatible controller: NVIDIA Corporation Device 1eb8 (rev a1)

$ tail /var/log/nvidia-installer.log
ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.


$ dmesg | tail
[ 585.162303] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 585.163241] vgaarb: device changed decodes: PCI:0000:02:02.0,olddecodes=none,decodes=none:owns=none
[ 585.163387] NVRM: The NVIDIA GPU 0000:02:02.0 (PCI ID: 10de:1eb8)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 430.30 driver release.

NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: in this release's README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.

[ 585.163715] nvidia: probe of 0000:02:02.0 failed with error -1
[ 585.163733] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 585.163734] NVRM: None of the NVIDIA devices were initialized.
[ 585.163946] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
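For anyone who wants to script this check across several guests, here is a minimal Python sketch that pulls the rejected device and driver version out of dmesg text like the output above. The function name `find_unsupported_gpu` and the returned field names are made up for illustration, and the patterns are fitted only to the message wording shown in this post, so other driver releases may need adjustments.

```python
import re

# Patterns fitted to the NVRM lines quoted in this post (an assumption:
# other driver releases may word the message differently).
GPU_RE = re.compile(
    r"NVRM: The NVIDIA GPU (\S+) \(PCI ID: ([0-9a-f]{4}):([0-9a-f]{4})\)"
)
VERSION_RE = re.compile(r"NVRM:\s+NVIDIA (\S+) driver release")

# Excerpt copied from the dmesg output in this post.
SAMPLE = """\
[ 585.163387] NVRM: The NVIDIA GPU 0000:02:02.0 (PCI ID: 10de:1eb8)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 430.30 driver release.
[ 585.163715] nvidia: probe of 0000:02:02.0 failed with error -1
"""


def find_unsupported_gpu(dmesg_text):
    """Return details of the rejected GPU, or None if no such message is found."""
    gpu = GPU_RE.search(dmesg_text)
    if gpu is None or "is not supported by the" not in dmesg_text:
        return None
    version = VERSION_RE.search(dmesg_text)
    return {
        "pci_address": gpu.group(1),
        "vendor_id": gpu.group(2),
        "device_id": gpu.group(3),
        # The version line may be missing if the log was truncated.
        "driver": version.group(1) if version else None,
    }


if __name__ == "__main__":
    print(find_unsupported_gpu(SAMPLE))
```

On a real guest the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` instead of the embedded sample.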

$ uname -a
Linux <REMOVED> 3.10.0-957.12.1.el7.x86_64 #1 SMP Tue Apr 23 12:06:18 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-36.0.1) (GCC)



Host:
uname -a
VMkernel <REMOVED> 6.7.0 #1 SMP Release build-11675023 Jan 7 2019 19:29:34 x86_64 x86_64 x86_64 ESXi

esxcli software vib list | grep -i nvidia
NVIDIA-VMware_ESXi_6.7_Host_Driver 430.27-1OEM.670.0.0.8169922 NVIDIA VMwareAccepted 2019-07-22

nvidia-smi
Thu Jul 25 17:35:33 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.27 Driver Version: 430.27 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:05:00.0 Off | 0 |
| N/A 46C P8 18W / 70W | 7648MiB / 15359MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:07:00.0 Off | 0 |
| N/A 49C P8 17W / 70W | 75MiB / 15359MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2111564 C+G <REMOVED> 7566MiB |
+-----------------------------------------------------------------------------+


nvidia-smi vgpu
Thu Jul 25 17:36:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.27 Driver Version: 430.27 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 Tesla T4 | 00000000:05:00.0 | 0% |
| 3251642093 GRID T4-8A | 2111565 <REMOVED> | 0% |
+---------------------------------+------------------------------+------------+
| 1 Tesla T4 | 00000000:07:00.0 | 0% |
+---------------------------------+------------------------------+------------+


PowerCLI:
( get-vmhost ).ExtensionData.Config | select GraphicsInfo, SharedPassthruGpuTypes

GraphicsInfo SharedPassthruGpuTypes
------------ ----------------------
{NVIDIATesla T4, NVIDIATesla T4} {grid_t4-8q, grid_t4-8c, grid_t4-8a, grid_t4-4q...}


( get-vmhost ).ExtensionData.Config.GraphicsConfig

HostDefaultGraphicsType SharedPassthruAssignmentPolicy DeviceType
----------------------- ------------------------------ ----------
sharedDirect performance {0000:05:00.0, 0000:07:00.0}



( ( $myvm | get-view ).Config.Hardware.Device | where-object Key -eq 13000 ).Backing

Vgpu
----
grid_t4-8a

#1
Posted 07/25/2019 05:53 PM   
It appears the Linux driver does not support the 'a' profile (grid_t4-8a). After switching to a 'q' profile the driver loads, although I have not finished testing it fully.
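As a rough way to catch this before installing the guest driver, the sketch below splits a vGPU type name such as `grid_t4-8a` into board, framebuffer size and series letter, and flags the series that failed here. The helper names and the set `SERIES_SEEN_FAILING_ON_LINUX` are hypothetical and reflect only what was observed in this thread ('a' failed, 'q' loaded). It is not an official support matrix, so check the vGPU documentation for your release.

```python
import re

# Matches the simple type names seen in this thread, e.g. "grid_t4-8a"
# (PowerCLI) or "GRID T4-8A" (nvidia-smi vgpu). Other naming forms are
# not handled by this sketch.
PROFILE_RE = re.compile(
    r"^grid[_ ](?P<board>[a-z0-9]+)-(?P<fb_gb>\d+)(?P<series>[a-z])$"
)

# Assumption based only on this thread: the 'a' profile did not load with
# the 430.30 Linux guest driver, while a 'q' profile did.
SERIES_SEEN_FAILING_ON_LINUX = {"a"}


def parse_vgpu_profile(name):
    """Split a vGPU type name into board, framebuffer size (GB) and series."""
    match = PROFILE_RE.match(name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised vGPU type name: {name!r}")
    return {
        "board": match["board"],
        "fb_gb": int(match["fb_gb"]),
        "series": match["series"],
    }


def linux_guest_warning(name):
    """Return a warning string for series seen failing on Linux, else None."""
    profile = parse_vgpu_profile(name)
    if profile["series"] in SERIES_SEEN_FAILING_ON_LINUX:
        return (
            f"{name}: '{profile['series']}' series profile failed to load "
            "on a Linux guest in this thread; consider a 'q' profile"
        )
    return None


if __name__ == "__main__":
    for vgpu_type in ("grid_t4-8a", "grid_t4-8q"):
        print(vgpu_type, "->", linux_guest_warning(vgpu_type))
```

The profile string can be taken from the PowerCLI `Backing.Vgpu` value or from the `nvidia-smi vgpu` listing on the host.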

#2
Posted 07/25/2019 07:54 PM   