NVIDIA
3D load impacts NVENC performance
We are observing drops in the achievable XenDesktop frame rate when the 3D load increases. We can achieve 50 FPS using NVENC with little 3D load, but this drops to 15 FPS when the 3D load increases (for example, due to increased model complexity or running something like the Unigine Heaven demo).

The issue can be observed on a dual display XenDesktop by playing a video on one monitor then introducing the Unigine Heaven demo on another.

We've isolated this to some sort of interference between the graphics processing and hardware encoding units on the vGPU. As the 3D GPU utilisation increases the hardware encoder utilisation drops as does the frame rate. The GPU is nowhere near maximum load and should have plenty in reserve.

Tech docs indicate that hardware encoder performance should not be affected by the CUDA load, with some exceptions such as temporal AQ. However, there is clearly significant interference occurring.

Setup is Tesla M60, XenDesktop, Windows, vGPU, vSphere/ESXi, everything recent at time of post. Dev/Test network with little load.

This seems to be the last barrier preventing us from achieving near bare-metal performance for dynamic 3D model visualisation over NVIDIA Grid/XenDesktop.

Having reviewed the docs, some possible theories/strategies I've come up with include:
* Frame buffer reading is being held up by slower redraws
* This is a software/driver issue and we should report it
* vGPU Frame Rate Limiter setting might help
* Temporal AQ is being used and is impacted by CUDA load.
* Triple buffering might help??


Any assistance is appreciated.

AB

#1
Posted 04/11/2017 02:40 PM   
Looking into this further we've noticed that the performance is initially high at 60 FPS then after a short period (10s or so) of viewing a more complex part of the model (3D City model in this case) the performance drops dramatically to around 20 FPS.

After scrolling away from the complex area of the model (i.e. the CBD) to a less complex part of the model (suburbs/rural) where there is less 3D complexity the performance returns to 60 FPS.

Previous theories are probably invalid. Now looking at:
* Cooling setup
* Power management, in particular correct BIOS support and fan control (this is a somewhat cobbled together Dell eval system)
* Software/Drivers



We'll instrument the Tesla to see what insights we can gain.

#2
Posted 04/11/2017 10:59 PM   
Hi AHB,

Which FPS do you mean in your tests? Is it the FPS within the application or the session?

Regards

Simon

#3
Posted 04/12/2017 07:26 AM   
FPS is as reported by Citrix HDX Monitor which is polling the virtual machine from the native client OS.

Problem is not temperature - Tesla is running cool at around 44C.

Problem appears to be that the M60 hardware encoder is dying when the 3D load increases. When the vGPU load passes around 35% the hardware encoder utilization drops sharply from 20% to below 10% and the FPS plummets. Encoder utilization starts climbing again when vGPU utilization reduces below around 35%.

All data are as measured by GPUProfiler v1.04.
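
The threshold behaviour described above can be checked against logged utilisation samples, however they were collected. The following is a minimal sketch in Python, not the method used in this thread: the function name `split_by_gpu_load` and the sample values are hypothetical and only illustrate the pattern reported (encoder utilisation near 20% below roughly 35% vGPU load, under 10% above it).

```python
from statistics import mean


def split_by_gpu_load(samples, threshold=35.0):
    """Return mean encoder utilisation below and at/above a GPU load threshold.

    samples: iterable of (gpu_util_percent, encoder_util_percent) pairs.
    Returns (mean_below, mean_at_or_above); a side with no samples is None.
    """
    below = [enc for gpu, enc in samples if gpu < threshold]
    above = [enc for gpu, enc in samples if gpu >= threshold]
    return (
        mean(below) if below else None,
        mean(above) if above else None,
    )


# Hypothetical samples, NOT measurements from this thread.
samples = [(20, 21), (30, 20), (34, 19), (36, 9), (45, 8), (50, 7)]

low, high = split_by_gpu_load(samples)
print(f"mean encoder utilisation below threshold: {low}%")
print(f"mean encoder utilisation at/above threshold: {high}%")
```

If the two means differ sharply for real logged data, that supports the idea of a load threshold rather than a gradual decline.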

#4
Posted 04/13/2017 02:28 AM   
You can try turning off the desktop compositor (such as Aero in Win7).

I have a similar problem (low FPS on capture). I tested Win7 with a 3D application producing a constant load ("UnigineHeaven") in a small window (640x360, not fullscreen). The load on the graphics card is low because the GRID driver's "Frame Limiter" does not allow more than 67 FPS. "UnigineHeaven" always shows 66 FPS (top-right corner), and "Fraps" also shows 66 FPS for the application. K280Q/K2 load is <50% and temperature is <50C (as reported by nvidia-smi in Xen). I also tried downgrading the Win7 driver from 369.95 to 369.71 or 369.17, with the same results, but I cannot verify whether the downgrade actually replaces all the API DLLs with the older versions.

  • Aero composer is ON

  • Console high-frequency capture from Xen is OK at 60 FPS (undocumented parameter "intervaltime=16666", see https://gridforums.nvidia.com/default/topic/258/).

    Capture with NvFBCToSysGrabFrame() inside the Win7 guest is crappy. The same API (NvFBCTo*(), NVIDIA Capture SDK) is used by remoting protocols. I am using an external dedicated encoder to rule out any influence from the card's encoder (see https://gridforums.nvidia.com/default/topic/752/). The encoder input FPS results follow (10-second average samples, and yes, "UnigineHeaven" is still running at 66 FPS):
    FPS 57.41
    FPS 47.06
    FPS 43.80
    FPS 46.23
    FPS 48.43
    FPS 47.92
    FPS 46.09
    FPS 47.32
    FPS 47.10
    FPS 46.66
    FPS 44.73
    FPS 46.34
    FPS 48.62
    FPS 53.17
    FPS 56.52
    FPS 62.76
    FPS 64.16
    FPS 56.71
    FPS 60.14
    FPS 55.91
    FPS 60.16
    FPS 57.83
    FPS 56.31
    FPS 48.70
    FPS 40.77
    FPS 38.91
    FPS 34.92
    FPS 31.47
    FPS 28.30
    FPS 25.09
    FPS 21.98
    FPS 19.53
    FPS 21.81
    FPS 21.03
    FPS 19.45
    FPS 19.38
    FPS 19.76
    FPS 22.76
    FPS 23.20
    FPS 21.38
    FPS 21.20
    FPS 17.39
    FPS 13.96
    FPS 12.75
    FPS 13.45
    FPS 13.30
    FPS 13.66
    FPS 13.64
    FPS 13.12
    FPS 12.13
    FPS 13.70
    FPS 15.30
    FPS 16.58
    FPS 17.78
    FPS 18.49
    FPS 18.47
    FPS 17.64
    FPS 18.27
    FPS 17.87
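
To put numbers on the decline in the samples above, here is a small Python sketch that summarises them. The sample values are copied from the list above; the function name `summarise` and the 60 and 30 FPS thresholds are illustrative choices, not anything defined by the capture SDK.

```python
# 10-second average capture FPS samples, copied from the list above
# (NvFBCToSysGrabFrame() capture with Aero on).
SAMPLES = [
    57.41, 47.06, 43.80, 46.23, 48.43, 47.92, 46.09, 47.32, 47.10, 46.66,
    44.73, 46.34, 48.62, 53.17, 56.52, 62.76, 64.16, 56.71, 60.14, 55.91,
    60.16, 57.83, 56.31, 48.70, 40.77, 38.91, 34.92, 31.47, 28.30, 25.09,
    21.98, 19.53, 21.81, 21.03, 19.45, 19.38, 19.76, 22.76, 23.20, 21.38,
    21.20, 17.39, 13.96, 12.75, 13.45, 13.30, 13.66, 13.64, 13.12, 12.13,
    13.70, 15.30, 16.58, 17.78, 18.49, 18.47, 17.64, 18.27, 17.87,
]


def summarise(samples, good=60.0, bad=30.0):
    """Summarise FPS samples against two illustrative thresholds."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "at_or_above_good": sum(1 for s in samples if s >= good),
        "below_bad": sum(1 for s in samples if s < bad),
    }


summary = summarise(SAMPLES)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Each sample covers 10 seconds, so the count of samples below a threshold multiplied by 10 gives the approximate time spent below that frame rate.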


  • Aero composer is OFF

  • There is no FPS problem in NvFBCToSysGrabFrame() or direct console capture.

    UPDATE/DEMO: The attached video capture from the encoder output shows the FPS capture problem. The video begins with console capture with Aero (exactly 60 FPS, independent of the application), continues with NvFBCToSysGrabFrame() capture with Aero (problematic 60-40 FPS) and ends with NvFBCToSysGrabFrame() capture without Aero (67 FPS as generated by the application, the expected behavior). The captured video is fixed at 60 FPS, so when only 40 FPS is captured/rendered from the encoder the "3d flag" flaps faster.

    My opinion: something is still broken between the M$$$ graphics composer and the N$$$ capture SDK. There are also similar full-screen capture problems in Win10 (see https://gridforums.nvidia.com/default/topic/1046/). I believe that N$$$ is capable of repairing this (or this https://gridforums.nvidia.com/default/topic/382/) in a few years. The process has already started: hiring (http://www.nvidia.com/object/careers.html) new GRID managers, moving/expanding GRID QA to India, and renaming the product to "NVIDIAVirt" (https://twitter.com/NVIDIAVirt). I am expecting prices on new cards and license fees to double, and support for license-less cards (K1/K2) to be dropped ASAP. Nvidia was awarded "The Yahoo Finance Company of the Year", congratulations! (http://finance.yahoo.com/news/nvidia-the-yahoo-finance-company-of-the-year-173130275.html) "The Way It's Meant to be Played" with customers.

    #5
    Posted 04/13/2017 08:35 AM   
    There look to be two issues that are contributing to the visual degradation during the 3D model flyover:
    1. Reduction in frame rate under 3D load
    2. Visual jitter/flicker/tearing of 3D primitives within the model (e.g. a structure such as a column in a building).

    I've run Unigine Heaven demo at 'High' quality on 1920x1200 single display and the FPS ranges from 30 to 60 and is generally close to the native FPS value within Unigine. Display quality is generally very good with none of the tearing/jitter we are seeing with the 3D city model in TerraExplorer.

    I've investigated the following settings but none have resulted in any noticeable improvement:
    * Disable Aero theme on client/VDA ends.
    * Disable Off Screen Surfaces (using the .ini file setting on the receiver client)
    * Increase Display Memory Limit to max (using policy)
    * Set NVIDIA Frame Rate Limiter setting to 30 FPS (using registry on VDA virtual host)
    * CPU/RAM - VDA host has 8xCores, Xeon E5-2667 V4, 3.2/3.6 GHz, 32 GB RAM, Tesla M60, client has dual Xeon, Quadro M2000, stacks of RAM - plenty of grunt here.


    Further things to investigate:
    * Storage bottleneck (check virtual and native storage setups are equivalent)
    * CPU Only Encoding (Disable NVENC in policy)
    * Disable HDX 3D Pro (re-install VDA without HDX 3D; depending on the outcome above, this might isolate the issue to frame buffer capture)
    * Passthrough GPU (isolate to vGPU)
    * Linux Client (isolate to Windows Receiver)
    * System Display Memory (unlikely, but worth a check)
    * Legacy Graphics Mode (clutching at straws)
    * Alternate card - M10, M4000, M2000 (more straw clutching)

    #6
    Posted 04/17/2017 08:28 PM   
    Try installing an FPS counter like "fraps" to determine whether the FPS problem is in the application or in the capture to the remote session (as sschaber wrote earlier). If the FPS problem is in the application, then you have possibly hit another problem. A long-term unresolved problem is the power management of GRID cards (see https://gridforums.nvidia.com/default/topic/378/). If global utilization is less than 30%, the card reduces its clocks (memory/GPU) and power in the hypervisor. This can lead the application to wrongly assume that the vGPU is slow and to reduce its graphics load (for example, switching to a wire model or dropping details to reduce scene complexity), which is a negative feedback loop. You can try starting a parallel dummy load session (like "unigineheaven" in a separate window) to test this case.

    My opinion: #Gridays are running now and "Erik Bornhorst" is presenting the NDA session "GRID Performance Engineering" 12:30-13:00 PDT (i.e. exactly now). Try to contact him or #NGCA (see https://gridforums.nvidia.com/default/topic/1153/). There is also a webinar: http://info.nvidianews.com/201605_NVVMCommunityWebinar_Reg.html. Try to contact him by PM https://gridforums.nvidia.com/member/1882163/ (last seen on Mar 17 2016) or @ErikBoh. You can try to create a support case if you own an M60 and pay SUM.

    #7
    Posted 04/19/2017 07:50 PM   
    Thanks for the advice Martin, the references you provided give us some other things to look at.

    FPS issue is not in the application as the native performance is seamless.

    We've established that configuring CPU encoding results in significantly improved performance (> 30 FPS from memory).

    We are pretty sure this is a hardware encoding issue with one of:
    * XenDesktop VDA
    * NVIDIA GRID Drivers
    * Hypervisor (ESXi)

    NVIDIA Australia are looking into it with us.

    #8
    Posted 04/21/2017 11:16 AM   