GPU scheduler for vGPU
Hello.

I have some questions and a proposal for the NVIDIA developers. There is very little information available about what the GPU scheduler actually does.

[Image: NVIDIA GRID vGPU scheduler diagram - http://blog.itvce.com/wp-content/uploads/2016/03/030716_0717_NVIDIAGRIDD19.png]

Is the scheduler only a simple round-robin?

Is it programmable?

Is it programmed from dom0 (e.g. the vgpu/libnvidia-vgpu process in Dom0)?


More sophisticated schedulers have existed for more than a decade.
If you look at network hardware you can see many far more advanced schedulers (https://en.wikipedia.org/wiki/Network_scheduler).
Because part of NVIDIA's background comes from Sun Microsystems, there is an even more relevant example of a processor scheduler in SunOS/Solaris. The Solaris combination of the Fair Share Scheduler (FSS), which implements sharing including hierarchical shares (zones/projects), and dynamic resource pools, which implement capping and pinning/binding, is very powerful, is simple to administer, and has been demonstrating its value for nearly 20 years.
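
To make that concrete, roughly the following Solaris commands (quoted from memory, so please check them against the Solaris resource-management documentation; the zone and pool names "cad-zone", "cad_pool" and "cad_pset" are made up for the example) combine FSS shares, a CPU cap and pinning to a processor set:

  # make FSS the default scheduling class on the host
  dispadmin -d FSS

  # give the zone a weighted share of CPU time, changeable at runtime
  prctl -n zone.cpu-shares -r -v 20 -i zone cad-zone

  # cap the zone at two CPUs worth of time
  zonecfg -z cad-zone "add capped-cpu; set ncpus=2; end"

  # pin the zone to a dedicated processor set via dynamic resource pools
  pooladm -e                          # enable the pools facility
  poolcfg -c 'create pset cad_pset (uint pset.min = 2; uint pset.max = 2)'
  poolcfg -c 'create pool cad_pool (string pool.scheduler = "FSS")'
  poolcfg -c 'associate pool cad_pool (pset cad_pset)'
  pooladm -c                          # commit the configuration
  zonecfg -z cad-zone set pool=cad_pool

Exactly this split - weighted shares for fairness, caps for isolation, pools/psets for binding - is what I would like to see per vGPU.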

Can the GPU scheduler be more sophisticated?

If yes, here are some practical goals:

- If the share is programmable, then the restriction that all vGPUs on one physical GPU must be of the same type (for example K120Q) should be removed!

- If the share is hierarchically programmable, then CUDA should be available in all vGPU types (see the hypothetical sketch after this list)!

- If the scheduler has a pinning/binding capability (to SMX units), then performance should improve thanks to fewer instruction and data cache misses!

- If the scheduler (probably the non-hierarchical one) can be moved to domU for the GRID 2.0 "full" profiles M6-8Q and M60-8Q, which removes the dom0 overhead and enables CUDA in domU, then the same feature should be available for K180Q and K280Q (yes, I am still optimistic that NVIDIA HQ will allow this feature and more to be backported to the K1/K2 GRID)!
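
To illustrate what "programmable shares" could look like, here is a purely hypothetical command-line sketch - no such tool exists today, "vgpu-schedctl" and every one of its options are invented only to make the idea concrete, in the spirit of the Solaris example above:

  # weighted shares per vGPU (FSS-style, hierarchical: per GPU, then per VM)
  vgpu-schedctl set-share --gpu 0 --vm cad01   --shares 40
  vgpu-schedctl set-share --gpu 0 --vm office1 --shares 10

  # hard cap, like zone.cpu-cap (unused time below the cap is redistributed)
  vgpu-schedctl set-cap   --gpu 0 --vm batch01 --cap 50

  # pinning/binding to SMX units, like Solaris processor sets
  vgpu-schedctl bind      --gpu 0 --vm cad01   --smx 0-3

With an interface like this the homogeneous-profile restriction would be unnecessary, because the scheduler would no longer depend on equally sized slots.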

Is there any observability API (performance-monitoring API) for the GPU scheduler, both per vGPU (in Dom0) and per process inside a vGPU (in DomU)?
( https://gridforums.nvidia.com/default/topic/809/nvidia-grid-vgpu/vgpu-utilization-per-vm/ )

Thanks for your answers, Martin

#1
Posted 03/13/2016 09:43 AM   
Hi Martin,

The restriction on homogeneous (all the same) vGPU types could, I guess, be lifted; however, in my head it's a bit like ordinary fixed-size arrays: a fixed size means many things can be done efficiently. I think the need to avoid memory fragmentation, particularly as GPUs are reassigned (I'm thinking of the day when vMotion and the like become possible), would also be a consideration.

Some restrictions are imposed by the need to ensure continual and ongoing testing, QA and regression testing. Back-porting always requires investment in extra QA and testing, not just for us but also for the OEMs' test labs. All sorts of things are possible, but we must maintain quality and reliability.

It is possible to pin and cap CPUs, but my own experiences have been extremely mixed, particularly with CAD/3D applications - reverse pinning PTC Creo actually improved performance while the intuitive pinning degraded it, because of some very strange semaphore behaviour IIRC. Too many configuration options can often mean users get themselves into a real muddle.

I'm not an expert in this area - I'm hoping someone who is will pop along. With every feature request, though, we need to know what the user story/business case is: why you _need_ to mix vGPU types, and evidence that it's worth a substantial expansion of the test matrix, etc.

Best wishes,
Rachel

#2
Posted 03/14/2016 01:25 PM   
[quote="RachelBerry"]... why you _need_ to mix vGPU types and evidence it's worth a substantial expansion in the test matrix etc...[/quote] There is "breadth-first" allocation mechanism for vGPU startup that is optimal for performance but first allocation determine vGPU profile for whole GPU and it is not movable. For example start new 4x k120q on K1 and next new k160q is unstartable and old k120q are unmovable. Yes, there is also "depth-first" but it has impact on performance for shared 4x k120q on one GPU. This leads to lower UX (user-experience, NVidia buzzword for this year) for this five VM/VDI example. Best regards, M.C>
RachelBerry said:... why you _need_ to mix vGPU types and evidence it's worth a substantial expansion in the test matrix etc...

There is "breadth-first" allocation mechanism for vGPU startup that is optimal for performance but first allocation determine vGPU profile for whole GPU and it is not movable. For example start new 4x k120q on K1 and next new k160q is unstartable and old k120q are unmovable. Yes, there is also "depth-first" but it has impact on performance for shared 4x k120q on one GPU. This leads to lower UX (user-experience, NVidia buzzword for this year) for this five VM/VDI example.

Best regards, M.C>

#3
Posted 03/14/2016 05:07 PM   
Hi Martin,

The breadth-first and depth-first allocations are functionality implemented in XenServer/XenCenter and the equivalent in VMware. I'm wondering whether what you really need is more control in the management tools.

I'm still somewhat wary that this could expand the QA matrix substantially; a lot of customers have enough users running similar apps that they can pool them easily. I haven't heard a large number of people telling me that requiring homogeneous VMs per pGPU is a big issue...

Best wishes,
Rachel

#4
Posted 03/14/2016 05:46 PM   
Hi Martin,

I had a word with the product management team at Citrix, and whilst they could possibly tweak the distribution, it would still only apply at start of day. As a longer-term goal they feel vMotion/XenMotion is the way forward, since that would balance load as needed (this is something Citrix, VMware and NVIDIA are all keen to achieve long term).

Best wishes,
Rachel

#5
Posted 03/17/2016 09:38 PM   
[size="XL"][b]Grid 5.0[/b][/size] New "QoS scheduler" for Pascal chips: [img]https://s26.postimg.org/cnmihinq1/pascal_qos_scheduler.jpg[/img] I do not known if this "QoS scheduler" for Pascal is only marketing branded stupid "fixed/equal share scheduler". [b]"... Pascal has a new hardware feature called Preemption that allows Compute on vGPU profiles. Preemption is a feature that allows task Context switching. It gives the GPU the ability to essentially pause and resume a task ..."[/b] - see [url]https://gridforums.nvidia.com/default/topic/1604/nvidia-grid-vgpu/compute-mode-quot-prohibited-quot-grid-m60-/post/5161/#5161[/url] - see [url]http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10[/url] - search for "preemtion" in [url]http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf[/url] - search for "preemtion" in [url]http://on-demand.gputechconf.com/gtc/2016/presentation/s6810-swapna-matwankar-optimizing-application-performance-cuda-tools.pdf[/url] - cuDeviceGetAttribute() - CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED - [url]https://devtalk.nvidia.com/default/topic/1023524/system-management-and-monitoring-nvml-/-vgpu-management-qos-api-/[/url] - docs [url]https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy[/url] - [b]BUT compute preemption isn't exposed as a programmer visible control ![/b] Now it is clear that NVidia rediscovered wheel - "[color=orange][b]preemption[/b][/color]" in Pascal chip. [b]Welcome to year 1964 ![/b] (see [url]https://en.wikipedia.org/wiki/Computer_multitasking#Preemptive_multitasking[/url]). This disclosure explains all pitfalls with vGPU and CUDA in previous chip generations that vGPU paravirtualized driver was unable to force switch SMX/SMM context and [b]heavy depends[/b] on guest drivers cooperative multitasking (limited by FRL) and guest operating system. [b]Unbelievable, shame, shame, shame on NVidia ![/b] CUDA is now enabled in all "GRID P*-*Q" profiles. New "observability": [img]https://s26.postimg.org/3lbtfps49/pascal_gpu_observability.jpg[/img] Per process utilization API (usable for >= r375) with finally disclosured functions nvmlDeviceGetProcessUtilization() and nvmlDeviceGetVgpuProcessUtilization() (see [url]https://devtalk.nvidia.com/default/topic/934756/system-management-and-monitoring-nvml-/per-process-statistics-nvidia-smi-pmon-/[/url]). Let's wait few more years, for pinning/binding on SMX/SMM/SMP to be cache effective, for mixing vGPU profiles on GPU ...
Grid 5.0

New "QoS scheduler" for Pascal chips:

[Image: Pascal "QoS scheduler" slide - https://s26.postimg.org/cnmihinq1/pascal_qos_scheduler.jpg]

I do not know whether this "QoS scheduler" for Pascal is just a marketing-branded, plain "fixed/equal share scheduler".

"... Pascal has a new hardware feature called Preemption that allows Compute on vGPU profiles. Preemption is a feature that allows task Context switching. It gives the GPU the ability to essentially pause and resume a task ..."
- see https://gridforums.nvidia.com/default/topic/1604/nvidia-grid-vgpu/compute-mode-quot-prohibited-quot-grid-m60-/post/5161/#5161
- see http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
- search for "preemtion" in http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
- search for "preemtion" in http://on-demand.gputechconf.com/gtc/2016/presentation/s6810-swapna-matwankar-optimizing-application-performance-cuda-tools.pdf
- cuDeviceGetAttribute() - CU_DEVICE_ATTRIBUTE_COMPUTE_PREEMPTION_SUPPORTED
- https://devtalk.nvidia.com/default/topic/1023524/system-management-and-monitoring-nvml-/-vgpu-management-qos-api-/
- docs https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy
- BUT compute preemption isn't exposed as a programmer-visible control!

Now it is clear that NVIDIA has reinvented the wheel - "preemption" - in the Pascal chip. Welcome to the year 1964! (see https://en.wikipedia.org/wiki/Computer_multitasking#Preemptive_multitasking). This disclosure explains all the pitfalls with vGPU and CUDA in previous chip generations: the paravirtualized vGPU driver was unable to force an SMX/SMM context switch and depended heavily on the guest drivers' cooperative multitasking (limited by the FRL) and on the guest operating system. Unbelievable; shame, shame, shame on NVIDIA!

CUDA is now enabled in all "GRID P*-*Q" profiles.

New "observability":

[Image: Pascal GPU observability slide - https://s26.postimg.org/3lbtfps49/pascal_gpu_observability.jpg]

A per-process utilization API (usable with drivers >= r375), with the finally disclosed functions nvmlDeviceGetProcessUtilization() and nvmlDeviceGetVgpuProcessUtilization() (see https://devtalk.nvidia.com/default/topic/934756/system-management-and-monitoring-nvml-/per-process-statistics-nvidia-smi-pmon-/).
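
For anyone who only wants the numbers without writing NVML code, roughly the following nvidia-smi commands in Dom0 expose the same counters (from memory - check "nvidia-smi vgpu -h" for the exact options available in your driver branch):

  # per-process utilization on the physical GPU (SM, memory, encoder, decoder)
  nvidia-smi pmon -s u

  # list active vGPUs and sample their engine utilization
  nvidia-smi vgpu -u

  # per-process utilization reported from inside each vGPU
  nvidia-smi vgpu -p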

Let's wait a few more years for pinning/binding to SMX/SMM/SM units that is cache-effective, and for mixing vGPU profiles on one GPU...

#6
Posted 08/30/2017 08:46 AM   
Hi Martin

CUDA is enabled for all profiles on Pascal GPUs (A, B & Q) (App, vPC & vDWS)

As for mixing FB profiles on the same physical GPU, a few of us raised this with NVIDIA engineering a while back; however, there are reasons why it hasn't been offered. As you say, hopefully this will be added as a feature as the technology develops.

Regards

Ben

#7
Posted 09/04/2017 06:40 PM   
CUDA/OpenCL is only in the P*-*Q profiles (https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#features-grid-vgpu). The digitally signed /usr/share/nvidia/vgpu/vgpuConfig.xml takes precedence over /usr/share/nvidia/vgx/*.conf (check with "egrep -i 'cuda|vgpuType|signature' /usr/share/nvidia/vgpu/vgpuConfig.xml" and "grep cuda_enabled /usr/share/nvidia/vgx/*.conf") (https://gridforums.nvidia.com/default/topic/258/nvidia-grid-vgpu/documentation-for-vgpu-configs/post/2087/#2087) ... you should post your /usr/share/nvidia/vgpu/vgpuConfig.xml and /usr/bin/nvidia-vgpud.

#8
Posted 09/07/2017 07:25 PM   
My apologies, you're correct. I've just re-checked and those were evaluation drivers not production. Production drivers do not have this functionality.

Please note, I've edited my post above to remove the incorrect driver information so as not to add confusion for anyone else reading this

#9
Posted 09/08/2017 09:17 AM   
Nvidia updated scheduler slides. As expected "QoS" title was removed (the new preemptive schedulers are far away from true QoS). You can use old "Shared/Best Effort/Time Sliced Scheduler" with cooperative multitasking OR you can use "Fixed/Equal Scheduler" with preemptive multitasking and with card [b]performance lost due to "empty/unused" slots[/b]. It is not possible to redistribute "unused" slots ! The "slots" per VM should be programmable (like set ratio/share (minimum guaranteed and redistribute unused) and set maximum (capping) !). (Scheduler is chosen by driver parameter ([url]https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy[/url]).) Updated summary (removed "QoS"): [img]https://s26.postimg.org/a307431cp/vgpu-scheduler-compare.jpg[/img] Shared/Best Effort/Time Sliced Scheduler based on cooperative multitasking: [img]https://s26.postimg.org/v16d2617d/vgpu-scheduler-shared.jpg[/img] Fixed/Equal Schedulers based on preemptive multitasking with performance lost ("empty/unused slots"!): [img]https://s26.postimg.org/qgk6n8hi1/vgpu-scheduler-equal.jpg[/img] [img]https://s26.postimg.org/3t4xh31y1/vgpu-scheduler-fixed.jpg[/img]
NVIDIA has updated the scheduler slides. As expected, the "QoS" title was removed (the new preemptive schedulers are a long way from true QoS). You can use the old "Shared/Best Effort/Time-Sliced" scheduler with cooperative multitasking, OR you can use the "Fixed/Equal Share" scheduler with preemptive multitasking and accept the loss of card performance due to "empty/unused" time slots. It is not possible to redistribute the unused slots! The slots per VM should be programmable (e.g. set a ratio/share as a guaranteed minimum with unused time redistributed, and set a maximum as a cap!). (The scheduler is chosen by a driver parameter: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy.)
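
For completeness, the driver parameter from the linked document is the RmPVMRL registry key set on the hypervisor host; as far as I can tell it is used roughly like this (0x00 best effort, 0x01 equal share, 0x11 fixed share - please double-check the values and the per-hypervisor procedure in the guide):

  # generic Linux/KVM host: add this line to /etc/modprobe.d/nvidia.conf, then reboot
  options nvidia NVreg_RegistryDwords="RmPVMRL=0x01"

  # VMware ESXi host: same key set via esxcli, then reboot
  esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"

A per-VM share/cap interface on top of this would be the logical next step.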

Updated summary (removed "QoS"):

[Image: updated scheduler comparison slide - https://s26.postimg.org/a307431cp/vgpu-scheduler-compare.jpg]

Shared/Best Effort/Time Sliced Scheduler based on cooperative multitasking:

[Image: best effort / time-sliced scheduler slide - https://s26.postimg.org/v16d2617d/vgpu-scheduler-shared.jpg]

Fixed/Equal Share schedulers, based on preemptive multitasking, with performance lost to "empty/unused" slots:

[Image: equal share scheduler slide - https://s26.postimg.org/qgk6n8hi1/vgpu-scheduler-equal.jpg]

[Image: fixed share scheduler slide - https://s26.postimg.org/3t4xh31y1/vgpu-scheduler-fixed.jpg]

#10
Posted 09/14/2017 10:45 PM   