NVIDIA
GRID vGPU enabled desktops will not power up on ESXi 6.5 host
I'll try not to make this long winded. Yesterday I reinstalled ESXi 6.0 (Dell R730 customized version) on one of our GRID (M10/M60) enabled servers to see how it behaved with our new vSphere 6.5 server appliance. I found that after reinstalling the software, patching the server, and attempting to start an M10 or M60 enabled VM on it, the VM would move to one of the other GRID servers in the cluster to power on. My first thought was to check the graphics settings, and I found that the active type for the graphics on the "fresh install" server was set to Basic and the configured type was Shared. On the two functioning servers in the cluster, the active type was Shared and the configured type was blank. I also found that the xorg service would not stay started for more than a couple of seconds before stopping. There is no error when powering on a desktop (my guess is that this is because the VM moves itself to another server in the cluster).

I needed to interact with the graphics settings to change the active type and hopefully get the xorg service to start, so I upgraded the host to ESXi 6.5. That allowed me to interact with the graphics and change them to Shared/Shared Direct, but the active type is still Basic. Also, the xorg service will not stay on; it stops itself as soon as I refresh the screen.

The VMs still jump off the server when powered on like it's the plague. I have compared every other setting and everything matches up. The biggest problem is that xorg won't stay running. If I run nvidia-smi on the fresh install server, I get "nvidia-smi: not found", but I get information from the other servers. I'm not really sure what to look at next, as there are no graphics-related errors appearing, but I feel like there is a checkbox or a setting somewhere that I am missing. Any help would be greatly appreciated.
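For anyone hitting the same symptoms, a quick diagnostic sketch run over SSH on the ESXi 6.5 host (command syntax per that release; the log path is the usual ESXi default, adjust if yours differs):

```shell
# Check whether an NVIDIA kernel module is loaded at all:
vmkload_mod -l | grep -i nvidia

# Check the xorg service state and look at its log for why it keeps stopping:
/etc/init.d/xorg status
tail -n 50 /var/log/Xorg.log

# List GPU devices with their configured vs. active graphics type:
esxcli graphics device list
```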

#1
Posted 02/24/2018 10:49 PM   
Just another note. When I run esxcli hardware pci list -c 0x0300 -m 0xf on a working host, the module has nvidia listed, but on the host with the issue, the module is None.

I got the command from this KB article: https://kb.vmware.com/s/article/2064775. It says it's for 5.0, but I figured it might mean something.
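The same check can be narrowed to just the relevant output fields; a sketch (flags as in the KB, the grep pattern is my own addition):

```shell
# List display-class PCI devices (class 0x0300) with verbose mapping info,
# keeping only the lines that identify each device and its driver module.
# "Module Name: None" means no driver has claimed the GPU.
esxcli hardware pci list -c 0x0300 -m 0xf | grep -E "Device Name|Module Name"
```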

#2
Posted 02/24/2018 11:09 PM   
Okay, now I'm almost certain that it's the VIB. When I run esxcli software vib list on the working server, I can see the NVIDIA vGPU ESXi host driver. I'm going to install it on the fresh server and see what happens.

EDIT: Okay, that removed the Basic active type and got the xorg service started. I rebooted, but the desktops I start on the server still jump to another one.
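For reference, the install sequence I'd expect here (the .vib path is a placeholder for your actual GRID host driver bundle, not a real filename):

```shell
# Put the host in maintenance mode before touching the driver:
esxcli system maintenanceMode set --enable true

# Install the NVIDIA vGPU host driver VIB (placeholder path):
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA-vGPU-Host-Driver.vib

# Reboot, then confirm the VIB is present and the GPUs respond:
reboot
esxcli software vib list | grep -i nvidia
nvidia-smi
```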

#3
Posted 02/24/2018 11:44 PM   
Well, using the forum search would probably help. It's not the VIB; it's a bug with ESXi 6.0 U3, as I suspect from your description of the xorg behavior...
https://gridforums.nvidia.com/default/topic/1207/nvidia-virtual-gpu-drivers/vmware-esxi-6-0-update-3-support/


Regards

Simon

#4
Posted 02/25/2018 07:43 AM   
Thanks for the reply. I saw that post and ran the gauntlet on that issue last year, but I am on 6.5 now and the xorg service now starts and holds. I could try that process again, but it would seem that I wouldn't need it if the service is no longer having an issue, correct?

Edit: I think I'm 99% of the way there. I turned off DRS on the cluster and the VMs didn't move off of the fresh install host. So, I'm guessing that there is something that needs to be tweaked in DRS.

#5
Posted 02/25/2018 03:56 PM   