Odd performance blips inside GRID XenDesktop VMs
I am running a VMware 6.5 farm with about 120 GRID-accelerated VDIs.

6 HP WS460c blades with 4 x Tesla M6 in an HP C7000 blade chassis
VMware 6.5, patched up to current
120 Windows 10 VDIs using LTSB 2016 (1607)
Citrix XenDesktop 7.18
MCS image - non-persistent VDI
GRID M6-1B profile on each VM

Our primary office site is connected to a co-located data center by redundant 1 Gb links. We typically see in-session latency and round-trip times between 4 and 8 ms. Each VM has 4 vCPUs and 16 GB of RAM, along with the GPU, and the disk runs on the VMware Paravirtual controller. Our Citrix policy is set up to use the video codec for actively changing regions at high quality. We are using NVENC for moving images within the session.

At random points throughout the day, users complain about their VDI freezing for less than a second and then resuming. It's very obvious when they are running AV material, because both audio and video pause and then resume like nothing happened. Sometimes this freeze will cause the audio to desync from an online video on sites such as YouTube or Vimeo. Typing will also be delayed, and a number of words will suddenly run across the screen. All users see the issue occur, but not at the same time.

Storage is an all-flash array, and as far as I can tell, average read and write latencies are under 1 ms. Network load seems low, and according to our monitoring tool, we aren't seeing dropped packets or spikes in round-trip latency on our routers. CPU load on the VMs and on the ESXi hosts themselves looks entirely within tolerable limits.

We have one additional host that is used primarily for testing and updating our images, and we upgraded it to the latest GRID 7.2 drivers, as we are on 5.x in production right now. This had no meaningful impact on the issue. The worst part about the problem is that I have yet to devise a method of reliably recreating it. I can hammer on a session with multiple videos and applications running simultaneously, and it won't flinch. Yet other users complain that they have to watch their typing catch up with them multiple times per hour.

When the issue occurs, the entire VM appears to freeze momentarily, and then resume. If anyone has any insight, I would greatly appreciate it.

#1
Posted 03/14/2019 11:13 PM   
Hi

If the entire VM freezes, then that sounds like a connectivity / networking issue to me. It doesn't sound performance related, and your hardware specs sound fine.

4-8 ms is very good, so it's not latency related.

What endpoint devices are being used? Which version of Receiver / Workspace app do you have on them?

Have you tried accessing the VMs from a different geographical location / over a completely different connection to see if it still occurs?

Regards

Ben

#2
Posted 03/15/2019 02:00 PM   
I've dumped logs from every piece of networking equipment between the clients and the virtual desktops. I have yet to find so much as a dropped packet.

The endpoints are Intel NUC workstations with Intel i3-6100s in them. They have 8 GB of RAM, and we run ThinKiosk on top of Windows 10 to deliver the VDI. Citrix Receiver is 4.12, but we have been able to replicate the issue in testing on the latest Workspace app 1812.

Connecting to the VDI externally through our NetScaler Gateway doesn't alter the situation.

#3
Posted 03/15/2019 03:21 PM   
Well, that all sounds fine.

- Have you tried experimenting with different Citrix Policies?
- What resolution and how many monitors are your users running?
- Have you tried connecting from a different endpoint?

Regards

Ben

#4
Posted 03/18/2019 05:22 PM   
Yes, we've tried different Citrix video policies, and they all produced similar results.

Typical users have two 1920x1080 monitors.

We've tried multiple endpoints. We have some Mac, Linux, and PC clients in the office, and it's universal.

I'm beginning to think it might be a CPU ready time issue, but more testing is required.
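
For reference while I test: vCenter's real-time charts report CPU Ready as a summation in milliseconds per sample, while esxtop shows it as %RDY, so a conversion is needed before comparing against the usual rule-of-thumb thresholds. A minimal sketch, assuming the 20-second interval of the vCenter real-time chart:

```python
# Convert a vCenter "CPU Ready" summation value (milliseconds per
# sample interval) into the %RDY figure esxtop reports.
# Assumes the real-time performance chart's 20-second sample interval.

def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Percentage of the sample interval a vCPU spent ready but not scheduled."""
    return ready_ms / (interval_s * 1000.0) * 100.0

# Example: 1,000 ms of ready time in a 20 s sample is 5% RDY,
# often cited as the point where users start noticing stutter.
print(cpu_ready_percent(1000))  # 5.0
```
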

#5
Posted 03/19/2019 08:53 PM   
Hi

Which CPUs are you running and what's the over-commit ratio on them?

Have you been through the BIOS settings to make sure they're configured appropriately for your deployment and are the same across all hosts? I appreciate that checking this may require power cycling the hosts, so it's not always a quick / easy thing to do.

Lastly, has this issue been there since the start of the deployment? Or has it gradually gotten worse over time / through updates or through additional user density?

Sorry for all the questions, just trying to understand exactly what you've tried and the environment. As you can appreciate, there are many, many variables within the stack that can cause issues.

Regards

Ben

#6
Posted 03/20/2019 08:38 AM   
It's been worse since we added the patches for Spectre and Meltdown to the hosts. We originally deployed these hosts with Windows 7 and eventually migrated to Windows 10. Performance degraded slightly with Windows 10, but these random pauses in the sessions didn't start until after we deployed the Spectre and Meltdown patches.

All the hosts have the CPUs configured for the highest power profile possible, with C-states and P-states disabled. Each host has dual Intel E5-2695 v4 CPUs with 18 physical cores each. Hyperthreading is enabled. We have about 82 vCPUs configured on 72 logical processors. If you count the additional logical CPUs in the calculation, we are at a 1.125:1 commit ratio. It goes up to 2.3:1 if you only count physical cores.
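
For what it's worth, a quick sanity check of those ratios (with the round figure of 82 vCPUs, the logical-CPU ratio actually comes out nearer 1.14:1; 1.125:1 is exact for 81 vCPUs):

```python
# Recompute the over-commit ratios for this host configuration:
# two E5-2695 v4 sockets, 18 physical cores each, hyperthreading on.

def overcommit_ratios(vcpus, sockets=2, cores_per_socket=18, threads_per_core=2):
    physical = sockets * cores_per_socket    # 36 physical cores
    logical = physical * threads_per_core    # 72 logical CPUs
    return vcpus / logical, vcpus / physical

per_logical, per_physical = overcommit_ratios(82)
print(round(per_logical, 2), round(per_physical, 2))  # 1.14 2.28
```
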

#7
Posted 03/20/2019 05:33 PM   
A new wrinkle to this.

Yesterday I deployed some machines to a hypervisor that is used for image updates and testing, and discovered that the freezing disappears when I turn off NVENC for session encoding. Has anyone else experienced momentary pauses in their in-session video stream while using the GPU to encode it? CPU usage and overall utilization go up in the session, which I'd like to avoid, but the random 1-2 second pauses have dissipated.

#8
Posted 03/21/2019 12:19 PM   
Hi

Thanks for the additional information.

Regarding your BIOS tuning, I have a couple of articles that you may find interesting. The first one is a slightly older article that discusses Single Core vs All Core Turbo and would be applicable to your configuration. Note your Base Clock, All Core Turbo Clock vs Single Core Turbo Clock:

https://www.pugetsystems.com/blog/2015/07/09/Actual-CPU-Speeds---What-You-See-Is-Not-Always-What-You-Get-675/

The second article was written a few weeks ago and discusses various BIOS settings to help engage Turbo:

https://www.mycugc.org/blogs/tobias-kreidl/2019/03/07/tale-of-two-servers-bios-settings-affect-apps-gpu

The above aside, that's an interesting discovery with NVENC. I've not experienced that issue before. Out of interest, have you tried monitoring the encoders directly on the GPU through nvidia-smi just to see how loaded up they are? Or, a nicer way to visualise the utilisation would be to use NGPUTOP:

https://github.com/JeremyMain/ngputop/releases
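
If you'd rather script it than watch it live, nvidia-smi can also emit the encoder statistics as CSV. A minimal sketch below; the `encoder.stats.*` query field names should be checked against `nvidia-smi --help-query-gpu` on the GRID driver you're actually running, as older builds may not expose them:

```python
import csv
import io
import subprocess

# Encoder statistics fields listed by `nvidia-smi --help-query-gpu`;
# verify these names exist on the installed driver before relying on them.
QUERY = "encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency"

def parse_encoder_stats(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    sessions, fps, latency_us = (int(v.strip()) for v in next(csv.reader(io.StringIO(line))))
    return {"sessions": sessions, "avg_fps": fps, "avg_latency_us": latency_us}

def poll_encoder() -> dict:
    """Query the local GPU once (requires nvidia-smi on the PATH)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True)
    return parse_encoder_stats(out.splitlines()[0])

# Parser demo on a sample line: 4 sessions, 30 fps, 4200 us average latency.
print(parse_encoder_stats("4, 30, 4200"))
```

A sustained average FPS well below the session frame rate, or a climbing average latency, would point the finger at the encoder. On the host itself, `nvidia-smi dmon -s u` also prints a rolling enc/dec utilisation column, which is roughly what NGPUTOP visualises.
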

Regards

Ben

#9
Posted 03/21/2019 06:12 PM   