NVIDIA driver on ESX 6.5 causing PSOD
Hi,

We have a Dell R730 running VMware ESXi 6.5.0 (build 13932383) with a Tesla M10 GPU installed.
Yesterday a colleague installed the NVIDIA ESX VIB on it and configured GPU passthrough.
This morning the host crashed with a purple screen of death (PSOD), shown below.

IOMMU Fault detected for (vmgfx1/nvidia)
NOTE: Backtrace likely does not yield the culprit.

PanicvPanicInt
Panic_NoSave@vmkernel
IOMMUProcessFaults@vmkernel
helpFunc@vmkernel#nover
CpuSched_StartWorld@vmkernel#


I opened a case with VMware support, who responded with the following:

We had a vmkernel panic with IOMMU fault.

The IOMMU fault happened because the PCI device (0000:06:00.0, which is the nvidia graphics pcie device) tried to access the memory address (IOaddr: 0x6055b0d000) via a DMA operation. The nvidia device (vmgfx1/nvidia) is NOT intended to access that memory, so the IOMMU unit faulted the illegal memory access and panicked the system.

The illegal DMA memory access may be caused by a buggy nvidia driver or nvidia firmware running inside the card.
Kindly check with NVIDIA whether there is a further update on the driver and firmware that can be performed.


We don't have an active NVIDIA support contract, so I'm hoping someone here has experienced something similar and has a solution.

Thanks.

#1
Posted 08/09/2019 03:52 AM   