Dell R730 with Tesla M60 on XenServer 7.0 unexpectedly reboots when a few VMs with vGPU are started
Hello.
We have three new Dell R730 servers with Tesla M60 cards. They are installed with XenServer 7.0, with all patches up to 21. All three have the following problem:
As soon as a few VMs start, the servers simply reboot without showing any information. The following is logged in the event log:
A bus fatal error was detected on a component at slot 6.
A fatal error was detected on a component at bus 0 device 2 function 0.
The M60 is installed in Slot 6.
The power plug was already replaced (it was not the correct one). The same happens if we move the card to slot 4.
There is no XenServer crash dump.

http://nvidia-esp.custhelp.com/app/answers/detail/a_id/4249
did not fix it.

Any hints where to search?
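
For reference, the bus/device/function triple from the event log can be written in the address form that `lspci` expects. This is a minimal sketch, assuming the logged numbers are decimal; `to_bdf` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: format a bus/device/function triple (assumed decimal)
# as the bus:device.function address used by lspci.
to_bdf() {
    printf '%02x:%02x.%x\n' "$1" "$2" "$3"
}

# "bus 0 device 2 function 0" from the event log
to_bdf 0 2 0
```

The resulting address can then be passed to `lspci -s 00:02.0 -vv` on the XenServer host to see which device reported the error.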

#1
Posted 12/08/2016 12:18 PM   
New one on me....

#2
Posted 12/08/2016 12:36 PM   
Do you have SUMS support or is this pre-sales POC?

Best wishes,
Rachel

#3
Posted 12/08/2016 12:38 PM   
Depends :)
We have some systems already licensed, but for this one we need to test which licenses are necessary, and for that we need a working environment :)

#4
Posted 12/08/2016 12:40 PM   
Details I'd like added:
* the VDA / XD versions
* The NVIDIA driver versions
* The BIOS version

With any new M60, I'd recommend checking that the mode switch has been applied correctly: http://nvidia.custhelp.com/app/answers/detail/a_id/4106/kw/modeswitch

#5
Posted 12/08/2016 12:53 PM   
VDA 7.11
XD 7.11

Drivers:
XenServer:
NVIDIA vGPU (version 361.45.09)
NVIDIA vGPU (version 367.43)
VM:
369.17_grid_win8_win7_server2012R2_server2008R2_64bit_international

Bios:
2.2.5

In compute mode the VMs didn't start; they only start in graphics mode :)

#6
Posted 12/08/2016 01:18 PM   
Hi jhmeier

Can you please let us know why you have 2 different host drivers and only 1 VM driver listed above? The drivers are released in pairs (Host / VM). If you are using multiple drivers, I would have expected to see them listed in pairs.

361.45.09 is from the GRID 3.1 package, and should only be paired with 362.56. As you can see by version comparison, it's quite a way behind the current release.

The latest drivers for Xen are 367.64 paired only with 369.71, available from here: https://nvidia.flexnetoperations.com/control/nvda/login
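
The pairing rule can be sketched as a small lookup. This is only an illustration: the function names `expected_guest_driver` and `check_driver_pair` are hypothetical, and the table holds just the two host/VM pairs named in this post, not a full release matrix.

```shell
#!/bin/sh
# Hypothetical helper: map a host (hypervisor) driver version to the VM
# driver it is released with. Only the two pairs named in this thread are
# listed; anything else is reported as unknown.
expected_guest_driver() {
    case "$1" in
        361.45.09) echo "362.56" ;;   # GRID 3.1 pair
        367.64)    echo "369.71" ;;   # latest pair for Xen per this post
        *)         echo "unknown"; return 1 ;;
    esac
}

# Report whether a host/VM driver combination matches a known pair.
check_driver_pair() {
    host="$1"; guest="$2"
    want=$(expected_guest_driver "$host") || {
        echo "no known pair for host driver $host"
        return 2
    }
    if [ "$guest" = "$want" ]; then
        echo "OK: $host pairs with $guest"
    else
        echo "MISMATCH: host $host expects guest $want, found $guest"
        return 1
    fi
}

# Example: host 367.64 with the 369.17 VM driver listed earlier
check_driver_pair 367.64 369.17 || true
```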

Does the problem occur when you start only a single VM, or only when multiple VMs are started? Are you able to start any VMs at all with a vGPU assigned, or do none power on successfully?

When you run nvidia-smi on the Xen Hosts, what are the results?

When you created your Master Image, the VM obviously had a vGPU assigned for you to install the NVIDIA drivers. Did you experience any issues then?

Are you running just XenDesktop or XenApp as well and which Operating Systems are you using?

Does it do it with Passthrough as well?

What is your provisioning method? MCS or PVS?



Regards

Ben

#7
Posted 12/08/2016 02:11 PM   
Hi,
We have two host drivers because we started with the old version and updated to the new one. As far as I know, it's not possible to remove one version from XenCenter (except with a full reinstallation).
Thus we have 367.64 with 369.17 in use.

As far as I can see, it has only happened when a few VMs were started.

Nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.43                 Driver Version: 367.43                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 0000:05:00.0     Off |                  Off |
| N/A   35C    P8    25W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 0000:06:00.0     Off |                  Off |
| N/A   31C    P8    23W / 150W |     14MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

No problems with the master, but the master image was created on other hosts (which do not have the problem).
We are using both XA and XD, but in this case it's XD 7.11 with Windows 7.

Provisioning is done with MCS.
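
The host driver that is actually loaded can be read from the banner line of the nvidia-smi output. A minimal sketch, assuming the banner keeps the "Driver Version:" label shown in the output in this post; `host_driver_version` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the
# version that follows "Driver Version:" in the banner line.
host_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# On a host: nvidia-smi | host_driver_version
# Here the banner line from the output in this post stands in for the command.
echo '| NVIDIA-SMI 367.43 Driver Version: 367.43 |' | host_driver_version
```

For the banner in this post, this prints 367.43.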

#8
Posted 12/09/2016 06:47 AM   
I just crashed a server with only the master VM running and no other VMs, so it also happens with just one VM.

#9
Posted 12/09/2016 08:50 AM   
To remove the NVIDIA driver from XenServer -

- Query the NVIDIA driver: rpm -qa | grep -i nvidia

Let's assume it comes back with: NVIDIA-vgx-xenserver-7.0-361.45.09 (Adjust to your version if it is different)

- Remove NVIDIA driver: rpm -ev NVIDIA-vgx-xenserver-7.0-361.45.09

I typically put a reboot in here after removal.

- Copy new NVIDIA driver to Xen host using WinSCP and install: rpm -iv /change-to-your-path/NVIDIA-vgx-xenserver-7.0. . . .rpm (Adjust to your driver version)

Or you can use the GUI method of mounting using a .iso

Reboot after install completes.
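
The query and remove steps can be sketched as a dry run that only prints the removal commands instead of executing them. The function names are hypothetical, and the sample list below (including the placeholder `some-other-package-1.0`) stands in for real `rpm -qa` output.

```shell
#!/bin/sh
# Filter NVIDIA package names out of `rpm -qa` style output read from stdin.
# `|| true` keeps the exit status at 0 when nothing matches.
nvidia_packages() {
    grep -i nvidia || true
}

# Print (do not run) the removal command for each NVIDIA package found.
print_removal_commands() {
    nvidia_packages | while IFS= read -r pkg; do
        echo "rpm -ev $pkg"
    done
}

# On a real host: rpm -qa | print_removal_commands
# Here a made-up package list stands in for `rpm -qa`.
printf '%s\n' 'some-other-package-1.0' 'NVIDIA-vgx-xenserver-7.0-361.45.09' \
    | print_removal_commands
```

If no NVIDIA package is installed, the dry run prints nothing, which avoids running `rpm -ev` against a package name that no longer exists.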

=====
=====

When you say "crashed a server", do you mean the R730 rebooted?

I take it all 3 XenServer hosts are correctly licensed? Which vGPU profiles are you using?

Have you made any changes to the R730 BIOS?

Can you try a Passthrough profile for me and let me know what happens?

Regards

Ben

#10
Posted 12/09/2016 09:54 AM   
Yes, the R730 rebooted.
Yes, all are licensed. We are using M60-0B. I am preparing a test with M60-1B, but deployment takes some time.
No. I found a hint that there should be a Dell document with BIOS settings for GRID, but I can't find it.

Would the M60-1B test also be OK?

#11
Posted 12/09/2016 10:09 AM   
Ok, thanks for the additional info.

Switching from M60-0B to M60-1B only increases the framebuffer from 512MB to 1GB. The reason I asked for a Passthrough test is that Passthrough does not use the driver in the hypervisor, whereas every vGPU profile does.

I don't think increasing the framebuffer will stop this issue.

Let me do some investigation ...

Regards

Ben

#12
Posted 12/09/2016 10:23 AM   
Hi

Can you please review this and let me know what you think: http://nvidia.custhelp.com/app/answers/detail/a_id/4163/~/nvidia-grid-vgpu-on-dell-r730-%2F-r720-servers,-on-upgrade-to-citrix-xenserver

May well be worth an update to your current BIOS version... You can also help validate that by trying a Passthrough profile.

There are also some other suggestions at the bottom of that page.

Regards

Ben

#13
Posted 12/09/2016 10:29 AM   
Thanks for the hint. I had already checked that: BIOS etc. is all up to date.
The major difference from most of those hints is that our VMs do start, whereas in most of the described scenarios the VMs don't boot.

#14
Posted 12/09/2016 10:51 AM   
I just tried to remove one of the old NVIDIA supplemental packs:
error: package NVIDIA-vgx-xenserver-7.0-361.45.09 is not installed
I guess they are removed during the upgrade, but not completely, so the old version is still visible in XenCenter.

#15
Posted 12/09/2016 11:28 AM   