NVIDIA
HP Server Crashed with Tesla M60, 376.84 driver for Windows Server 2016 (GPU)
We have a 3-node cluster, and all 3 nodes crashed and generated dump files. The crash analysis shows that all 3 nodes crashed with the same error code.
vGPU was enabled on all 3 nodes.

These are the crash dump details:

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffff8d03a76a5010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80d3a752678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.
Debugging Details:
------------------
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2
FAULTING_IP:
nvlddmkm+982678
fffff80d`3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d`3a5c6c20)]
DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT
BUGCHECK_STR: 0x116

Child-SP RetAddr Call Site
00 ffff8a00`aaa17a58 fffff806`44b3a298 nt!KeBugCheckEx
01 ffff8a00`aaa17a60 fffff806`44b1d13f dxgkrnl!TdrBugcheckOnTimeout+0xec
02 ffff8a00`aaa17aa0 fffff806`44b1a2ef dxgkrnl!ADAPTER_RENDER::Reset+0x153
03 ffff8a00`aaa17ad0 fffff806`44b39a85 dxgkrnl!DXGADAPTER::Reset+0x307
04 ffff8a00`aaa17b20 fffff806`44b39bc7 dxgkrnl!TdrResetFromTimeout+0x15
05 ffff8a00`aaa17b50 fffff802`e8ae2599 dxgkrnl!TdrResetFromTimeoutWorkItem+0x27
06 ffff8a00`aaa17b80 fffff802`e8b32965 nt!ExpWorkerThread+0xe9
07 ffff8a00`aaa17c10 fffff802`e8bd0e26 nt!PspSystemThreadStartup+0x41
08 ffff8a00`aaa17c60 00000000`00000000 nt!KiStartSystemThread+0x16
-----------------------------
02 ffff8a00`aaa17aa0 fffff806`44b1a2ef dxgkrnl!ADAPTER_RENDER::Reset+0x153
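
Arg3 in the dump above is an NTSTATUS value that has been sign-extended to 64 bits. The following is a minimal sketch, not part of the original dump analysis, of how it could be split into its standard severity, facility, and code fields. The function name `decode_ntstatus` and the small `KNOWN_STATUS` lookup table are hypothetical helpers introduced only for illustration.

```python
# Sketch: decode the NTSTATUS reported in Arg3 of the 0x116 bugcheck.
# decode_ntstatus and KNOWN_STATUS are illustrative names, not a WinDbg API.

SEVERITY_NAMES = {0: "SUCCESS", 1: "INFORMATIONAL", 2: "WARNING", 3: "ERROR"}

# Only the value seen in this dump is listed here.
KNOWN_STATUS = {0xC000009A: "STATUS_INSUFFICIENT_RESOURCES"}


def decode_ntstatus(raw: int) -> dict:
    """Split an NTSTATUS into its bit fields (only the low 32 bits are used)."""
    value = raw & 0xFFFFFFFF             # drop the 64-bit sign extension
    return {
        "status": f"0x{value:08X}",
        "severity": SEVERITY_NAMES[value >> 30],   # bits 31-30
        "customer": bool((value >> 29) & 1),       # bit 29
        "facility": (value >> 16) & 0x0FFF,        # bits 27-16
        "code": value & 0xFFFF,                    # bits 15-0
        "name": KNOWN_STATUS.get(value, "unknown"),
    }


arg3 = int("ffffffffc000009a", 16)       # Arg3 as printed by !analyze -v
info = decode_ntstatus(arg3)
print(info)
```

The low 32 bits, 0xC000009A, correspond to STATUS_INSUFFICIENT_RESOURCES, which suggests that the last operation attempted during the failed reset ran out of some resource. The dump alone does not say which one.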

1. All 3 crash dumps point to the same stack and register values.
FAULTING_IP:
nvlddmkm+982678
fffff80d`3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d`3a5c6c20)]
2. The WinDbg stack points to VIDEO_TDR_FAILURE (116).
37: kd> !analyze -v
>#*******************************************************************************
>#* Bugcheck Analysis *
>#*******************************************************************************
VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffffdd84719ea010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80fe60e2678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.
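
The claim that the dumps fault at the same place can be cross-checked from the Arg2 values alone. The sketch below assumes that the second dump also faults at `nvlddmkm+982678`; the post prints that offset only once, so this is an assumption. It subtracts the offset from each Arg2 to get the implied driver load address. The labels "dump A" and "dump B" and the helper `implied_base` are hypothetical names used for illustration.

```python
# Sketch: check that both Arg2 values are consistent with the same offset
# inside nvlddmkm.sys. Names and labels here are illustrative only.

FAULT_OFFSET = 0x982678                  # from "nvlddmkm+982678"
PAGE_SIZE = 0x1000                       # driver images load page-aligned

arg2_values = {
    "dump A": 0xFFFFF80D3A752678,        # Arg2 of the first dump
    "dump B": 0xFFFFF80FE60E2678,        # Arg2 of the second dump
}


def implied_base(arg2: int, offset: int = FAULT_OFFSET) -> int:
    """Module load address implied by a faulting address and its offset."""
    return arg2 - offset


bases = {name: implied_base(value) for name, value in arg2_values.items()}
aligned = {name: base % PAGE_SIZE == 0 for name, base in bases.items()}

# The jmp target printed in the first dump gives an independent check:
# nvlddmkm+0x7f6c20 resolves to fffff80d`3a5c6c20.
jmp_base = 0xFFFFF80D3A5C6C20 - 0x7F6C20

for name, base in bases.items():
    print(f"{name}: implied base 0x{base:016X}, page aligned: {aligned[name]}")
print("first dump matches jmp target:", jmp_base == bases["dump A"])
```

If both implied bases come out page-aligned, the two Arg2 values are consistent with the same offset in the driver loaded at different addresses on each node. That points to the same code path in nvlddmkm.sys rather than a random fault.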



According to the Microsoft documentation, this bug check can be caused by the following:

https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x116---video-tdr-error

Refer to the Resolution section:
*Over-clocked components, such as the motherboard
*Incorrect component compatibility and settings (especially memory configuration and timings)
*Defective parts (memory modules, motherboards, etc.)
*Insufficient system power
*Insufficient system cooling


We are using HP servers with the following specification:
HP ProLiant DL380 Gen9, ROM version P89 v2.30 (09/13/2016).


Moreover, when we tried to upgrade to the latest driver version, 385.54 (release date: 25.9.2017), we were unable to run virtual GPU (RemoteFX) because the GPU does not show up in the Hyper-V settings. Once we reverted to the old driver, 376.84, we could see the physical GPUs under Hyper-V settings again.


Can anyone tell us whether they have experienced the same issue with this driver version?

#1
Posted 11/06/2017 05:45 PM   
Hi Venky,

As you are running RemoteFX on Tesla M60, I assume you have the required vPC licenses, so please open a support ticket with ESP. You should run the supported driver from the GRID 5.0 package (R384 branch).

Regards

Simon

#2
Posted 11/07/2017 08:07 PM   
Hello Simon,

Thanks for getting back on this.
We don't use any licenses, as we just use the driver from nvidia.com rather than any GRID software or something like that.
We have been using RemoteFX with a previous version of the Tesla M60 drivers. Lately we observed some crashes, decided to update the driver version, and ran into this issue where we were no longer able to use the vGPU.

Regards,
Venky

#3
Posted 11/09/2017 08:06 PM   
Hi Venky,

Please check our Licensing/EULA, as you need to buy licenses for your deployment with RemoteFX and Tesla M60!
See here:
http://images.nvidia.com/content/grid/pdf/161207-GRID-Packaging-and-Licensing-Guide.pdf

And btw it doesn't matter what driver you're using.

Regards

Simon

#4
Posted 11/10/2017 03:56 AM   