XenServer 7.4 error message Gpumon_interface.Internal_error("(Failure \"No vGPU Available\")")
We are having issues with XenMotion, and periodically with system restarts, getting the error Gpumon_interface.Internal_error("(Failure \"No vGPU Available\")").

Our environment is as follows:

15 XenServer hosts with the latest patch of 7.4
HP DL380 Gen10
dual Xeon Gold 6152 CPUs (44 cores)
768 GB RAM
dual Tesla M10 adapters (8 GPUs)

We have 140 Server 2012 R2 VMs, each with 11 cores and 48 GB of memory, running XenApp 7.15 CU2.

NVIDIA-SMI output from one of the hosts
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.57 Driver Version: 390.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M10 Off | 00000000:39:00.0 Off | N/A |
| N/A 54C P0 21W / 53W | 8141MiB / 8191MiB | 19% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M10 Off | 00000000:3A:00.0 Off | N/A |
| N/A 44C P8 10W / 53W | 4077MiB / 8191MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M10 Off | 00000000:3B:00.0 Off | N/A |
| N/A 38C P0 20W / 53W | 4077MiB / 8191MiB | 13% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M10 Off | 00000000:3C:00.0 Off | N/A |
| N/A 39C P0 19W / 53W | 8141MiB / 8191MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla M10 Off | 00000000:88:00.0 Off | N/A |
| N/A 41C P0 19W / 53W | 4077MiB / 8191MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla M10 Off | 00000000:89:00.0 Off | N/A |
| N/A 40C P0 19W / 53W | 4077MiB / 8191MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla M10 Off | 00000000:8A:00.0 Off | N/A |
| N/A 29C P0 19W / 53W | 4077MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla M10 Off | 00000000:8B:00.0 Off | N/A |
| N/A 30C P8 10W / 53W | 4077MiB / 8191MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4803 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 0 14386 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 1 26677 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 2 30559 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 3 14713 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 3 15610 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 4 31087 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 5 23430 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 6 9862 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
| 7 6231 C+G /usr/lib64/xen/bin/vgpu 4064MiB |
+-----------------------------------------------------------------------------+

nvidia-smi vgpu output

Thu Sep 6 08:56:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.57 Driver Version: 390.57 |
|-------------------------------+--------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|===============================+================================+============|
| 0 Tesla M10 | 00000000:39:00.0 | 43% |
| 14386 GRID M10-4A | 28 | 37% |
| 4803 GRID M10-4A | 48 | 7% |
+-------------------------------+--------------------------------+------------+
| 1 Tesla M10 | 00000000:3A:00.0 | 6% |
| 26677 GRID M10-4A | 49 | 13% |
+-------------------------------+--------------------------------+------------+
| 2 Tesla M10 | 00000000:3B:00.0 | 7% |
| 30559 GRID M10-4A | 27 | 8% |
+-------------------------------+--------------------------------+------------+
| 3 Tesla M10 | 00000000:3C:00.0 | 19% |
| 14713 GRID M10-4A | 45 | 19% |
| 15610 GRID M10-4A | 53 | 0% |
+-------------------------------+--------------------------------+------------+
| 4 Tesla M10 | 00000000:88:00.0 | 18% |
| 31087 GRID M10-4A | 50 | 15% |
+-------------------------------+--------------------------------+------------+
| 5 Tesla M10 | 00000000:89:00.0 | 2% |
| 23430 GRID M10-4A | 31 | 1% |
+-------------------------------+--------------------------------+------------+
| 6 Tesla M10 | 00000000:8A:00.0 | 0% |
| 9862 GRID M10-4A | 51 | 3% |
+-------------------------------+--------------------------------+------------+
| 7 Tesla M10 | 00000000:8B:00.0 | 0% |
| 6231 GRID M10-4A | 33 | 0% |
+-------------------------------+--------------------------------+------------+


xen_commandline: watchdog dom0_max_vcpus=16 crashkernel=192M,below=4G console=vga vga=mode-0x0311 iommu=dom0-passthrough cpufreq=xen:performance dom0_mem=17179869184B,max:17179869184B

#1
Posted 09/06/2018 02:23 PM   
After researching this more, I can see that one of the VMs having this issue shows up under XenServer as having the GPU attached, but looking at the output of nvidia-smi vgpu, I can see that it does not have a vgpu process attached to it.
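
For reference, this is roughly how I'm cross-checking the two views from the host console (commands from memory; output fields may vary by XenServer and driver version):

# XenServer's view: which vGPU objects exist and which VM each is attached to
xe vgpu-list

# Driver's view: which vGPU instances are actually running on this host
nvidia-smi vgpu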

Does anyone have any ideas of anything I could try, or any further troubleshooting I can do?

#2
Posted 09/10/2018 03:10 PM   
Hi

The only thing I can think of at the moment is maybe a configuration or version drift between drivers, XS Hosts, or XA VMs.
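
To help rule that out, something along these lines (run from the pool master, and per host for the driver check) should make any drift easy to spot; commands are from memory, so adjust to your environment:

# XenServer version / patch level per host
xe host-list params=name-label,software-version

# Host vGPU manager driver version (run on each host); compare with the guest driver in the XA VMs
nvidia-smi --query-gpu=driver_version --format=csv,noheader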

Are you using MCS or PVS to provision the XA VMs? Or are they individually provisioned?

What's the Placement Policy on your Pool configured to? (Performance or Density)
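
If the CLI is handier than XenCenter, the GPU group's allocation-algorithm parameter should show it (depth-first = Density, breadth-first = Performance); parameter names are from memory, so double-check them on 7.4:

# list GPU groups to get the UUID
xe gpu-group-list

# <gpu-group-uuid> is a placeholder for the UUID returned above
xe gpu-group-param-get uuid=<gpu-group-uuid> param-name=allocation-algorithm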

Just to double-check, are all your DL380s an identical physical spec?

Just thinking out loud .... 15 Hosts is a decent size Pool, XS 7.4 supports a maximum of 16 Hosts .... How heavily loaded are your XA VMs, and how hard are they driving the XS Hosts underneath? It's been a while, but I'm wondering whether you may benefit from a dedicated Pool Master to help with VM placement / Pool management? ...

Lastly ... Are you using WLB? This doesn't take into account GPU metrics, so may be trying to put VMs on servers that have full GPUs ...


Regards

Ben

#3
Posted 09/10/2018 06:22 PM   
A dedicated pool master is not really needed, IMO. I generally give it about a 20% lighter load than other hosts, but really see no obvious strain on it by making the pool master do more than just work as an administrator. Newer versions of XenServer leverage more VCPUs, and that was one of the main reasons not to overload a host in earlier XS distributions. With 7.X, this really is no longer a concern, as far as I've seen from the field.

I typically have around 80-100 VMs per host within the five-host pool I have for XenDesktop and still see nothing obvious. My recommendation would be to start with a lighter load and slowly ramp it up (keep moving VMs over) and see if you detect any obvious behavioral differences. I'm all for trying it empirically and seeing, as opposed to going just by theory and what the book claims!

The top and xentop utilities are very useful for keeping track of where your resources are going. If need be, you can probably compensate to a high degree by increasing the amount of memory allocated to dom0.
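
If you do go down that route, here's a rough sketch of how I'd check and then bump the dom0 allocation; the xen-cmdline path and syntax are from memory, so verify against the Citrix docs for your XS release before running it:

# run inside dom0 to see how much of the current allocation is actually in use
free -m

# <size> is a placeholder in MiB; a host reboot is needed for the change to take effect
/opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=<size>M,max:<size>M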

Hope this helps some...

-=Tobias

#4
Posted 09/10/2018 06:56 PM   
Tobias - Thanks for the additional information, much appreciated!

Dan - Forget the Pool Master part, clearly not part of the issue :-)


Regards

Ben

#5
Posted 09/11/2018 09:10 AM   