Tesla M60 Freeze, 100% Load Issue
Hi,

Maybe someone experienced a similar issue. We use:
  • HP ProLiant DL380 Gen9 servers, latest firmware (881936_001_spp-2017.07.2-SPP2017072.2017_0922.6)
  • Tesla M60, latest drivers (NVIDIA-vGPU-xenserver-7.1-384.73.x86_64.rpm)
  • Windows 7 Enterprise 64-bit, latest driver (385.41_grid_win8_win7_server2012R2_server2008R2_64bit_international.exe)
  • XenDesktop (Win7) and XenApp (Windows Server 2012 R2), 7.13
  • XenServer 7.1, latest updates applied
  • GRID M60-0B profiles (512MB)


Since we updated to the latest driver NVIDIA-GRID-XenServer-7.1-384.73-385.41, we see various VMs just freezing while people are working on them. The Win7 OS crashes. We also see the following error in Citrix XenCenter when the Delivery Controller tries to boot new VMs: "An emulator required to run this VM failed to start." The same applies to the XenApp servers: they freeze, the VMs hang, and finally they crash.

In the XenServer console, nvidia-smi shows that one card is at 100% utilization.


Mon Oct 2 11:54:15 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.73                 Driver Version: 384.73                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:86:00.0 Off |                  Off |
| N/A   45C    P8    25W / 150W |   3066MiB /  8191MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           On   | 00000000:87:00.0 Off |                  Off |
| N/A   48C    P0    58W / 150W |     18MiB /  8191MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
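
To figure out which VM the stuck card belongs to, we work from dom0, roughly like this (a sketch only: the vgpu subcommand comes with the vGPU manager build of nvidia-smi and its output fields vary between releases; the xe mapping is the fallback):

# Per-vGPU view of the physical GPUs (shows the vGPU instances and their utilization)
nvidia-smi vgpu -q

# Map a domain ID back to a VM name with the XenServer CLI
list_domains
xe vm-list params=name-label,dom-id,power-state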


One could think this is some kind of memory exhaustion, but it happens out of the blue when neither the memory nor the GPU is fully loaded. Here is the state just before it happens:

timestamp name pci.bus_id driver_version pstate pcie.link.gen.max pcie.link.gen.current temperature.gpu utilization.gpu [%] utilization.memory [%] memory.total [MiB] memory.free [MiB] memory.used [MiB]
02.10.2017 09:03 Tesla M60 00000000:87:00.0 384.73 P0 3 3 40 1% 0% 8191 MiB 3093 MiB 5098 MiB
02.10.2017 09:03 Tesla M60 00000000:87:00.0 384.73 P0 3 3 40 3% 0% 8191 MiB 3093 MiB 5098 MiB
02.10.2017 09:03 Tesla M60 00000000:87:00.0 384.73 P0 3 3 41 16% 1% 8191 MiB 3093 MiB 5098 MiB
02.10.2017 09:03 Tesla M60 00000000:87:00.0 384.73 P0 3 3 41 100% 0% 8191 MiB 3093 MiB 5098 MiB
02.10.2017 09:04 Tesla M60 00000000:87:00.0 384.73 P0 3 3 43 100% 0% 8191 MiB 3093 MiB 5098 MiB
02.10.2017 09:04 Tesla M60 00000000:87:00.0 384.73 P0 3 3 43 100% 0% 8191 MiB 3093 MiB 5098 MiB
02.10.2017 09:04 Tesla M60 00000000:87:00.0 384.73 P0 3 3 44 100% 0% 8191 MiB 3093 MiB 5098 MiB

As you can see, the load was not particularly high before this happened.
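
For reference, the table above comes from a query loop along these lines (standard nvidia-smi query fields; the 15-second interval and the output file are just our choice):

nvidia-smi \
  --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used \
  --format=csv -l 15 >> /root/m60-util.csv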

The sad thing is that users lose their work when the VMs crash. On top of that, the VMs cannot start again; the only thing that resolves the issue on a temporary basis is to reboot the XenServer. Even that does not help for long, since the problem quickly reappears. We had to remove all GPUs from our VMs for now.
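
For anyone hitting the same symptom, this is the usual dom0 cleanup to try before a full host reboot (the VM UUID is a placeholder; in our case only a reboot reliably brings things back once the card is stuck):

# Force-stop a frozen VM and clear its power state
xe vm-shutdown force=true uuid=<vm-uuid>
xe vm-reset-powerstate force=true uuid=<vm-uuid>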

Citrix claims this issue is not their problem. Everything points to NVIDIA at the moment.

We first saw this issue on 27.09.2017.
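
In case it helps anyone with the same problem, the standard data to gather from dom0 for a ticket like this (nvidia-bug-report.sh ships with the vGPU manager; the log path is the usual XenServer one):

# Full driver/host state bundle for the support ticket
nvidia-bug-report.sh

# Recent NVIDIA kernel module messages from the host log
grep -i nvrm /var/log/messages | tail -n 100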

#1
Posted 10/02/2017 12:25 PM   
As an addition: we don't use HDX 3D Pro, we use standard VDA deployments for our XenDesktop environment.

#2
Posted 10/02/2017 12:42 PM   
Hi

Have you tried a different vGPU profile size? Maybe the 1B profile?
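
If you want to test a different profile from the CLI rather than XenCenter, it's roughly this (UUIDs are placeholders and the VM must be shut down first; take the exact type name from vgpu-type-list):

# Find the UUID of the 1B vGPU type and the VM's current vGPU
xe vgpu-type-list
xe vgpu-list vm-uuid=<vm-uuid>

# Remove the old vGPU and attach one with the new type
xe vgpu-destroy uuid=<old-vgpu-uuid>
xe vgpu-create vm-uuid=<vm-uuid> gpu-group-uuid=<gpu-group-uuid> vgpu-type-uuid=<m60-1b-type-uuid>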

I take it you have SUMS; have you raised it with NVIDIA?

Failing both of the above, can you not roll back to the previous driver that was working, to give you stability whilst you troubleshoot on a dev platform?

Regards

#3
Posted 10/02/2017 05:56 PM   
Thanks for your reply. Yes, we have SUMS, and yes, we've raised the issue with NVIDIA (no solution so far). Rolling back could be an option; for now we simply removed the GPUs, since we needed a quick solution. The 1GB profile could be tried as well, but then I can only run 64 users, so I would need more M60s for that. Since we use Win7 in standard VDA mode, we thought the 512MB profile would be just right. Our test environment does not have any M60s in it at the moment; those cards are quite expensive :-).

#4
Posted 10/02/2017 08:48 PM   
PM Sent ...

#5
Posted 10/03/2017 07:46 AM   
Something else to consider after my PM ... You may be better off investigating M10s rather than M60s. These are cheaper than M60s but have twice the framebuffer and twice the number of GPUs, so you would be able to give your users 1GB whilst maintaining your current VM / server density. The M10 offers less performance than the M60, but if you're only allocating 512MB, then these are clearly not high-performance users. Also, if you're only allocating 512MB, then you're not even using NVENC, as this is only available on 1GB profiles and higher.

If you want better density per physical server, then you might want to look at a XenApp model (again using the M10), obviously depending on the applications being used, security requirements, etc.

Have a look at an M10 on your dev platform and see what you think ... Use my PM as guidance for locating one for testing ...

Regards

#6
Posted 10/03/2017 10:08 AM   
Yes, got your point. For the test environment we will go for the M10, I think. Since it is not the same card, we might not have the same issue.

We will assign the 1GB profile to some test users now, just to see if we can reproduce the issue. We also use XenApp to push apps to the XenDesktops, but XenApp alone will not work for our users, I'm afraid; we need to keep both XenDesktop and XenApp.

You are right, our users are not high-end users in that sense. We can flatten CPU usage in general with the NVIDIA cards, and users definitely have a better GUI experience. We also use Bloomberg, Thomson Reuters etc., which benefit from the cards as well.

What is a bit disappointing is that such a severe issue is happening in the first place, and NVIDIA support has been rather limited so far.

Best regards

#7
Posted 10/04/2017 06:30 AM   
When did you raise the call with NVIDIA Support (Date / Time)?

Have you had any response back yet? I take it you have a case number?

Feel free to PM me that if you like ...

Regards

#8
Posted 10/04/2017 08:06 AM   
09/27/2017, 03:37 AM
ticket ID: 170927-000048

Guess what, no solution yet; very, very disappointed by NVIDIA's support, I must say.

Regards

#9
Posted 10/06/2017 07:39 AM   
I'll ask someone to take a look and see what's happening with the ticket ...

Regards

#10
Posted 10/06/2017 08:15 AM   