vGPU on GTX or Quadro
My opinion: NVIDIA and the commercial hypervisor vendors have escalated the cost of vGPU technology unbelievably, and NVIDIA support is unusable for me. OK, this forced me to accept the challenge of solving the problem with my own simplified (NVIDIA-unsupported) Xen-based virtualization stack. After a few days of experiments (to make it compatible with GTX/Quadro), vGPU now runs with any Xen (a few public XenServer patches plus some more for the vgpu ioreq_server, since Citrix stopped distributing the vgpu sources as of XenServer 7.5), any compatible NVIDIA GTX or Quadro (a small 42-line "magic" script daemon and a one-line change in the driver install script), any Linux kernel, and any Linux distribution. I will not publish how - challenge yourself.
I used my notebook (i7-4710HQ/C220, VT-d aware, with Intel graphics as the display frontend and an NVIDIA GTX860M as the virtualization backend) to demonstrate a Xen vGPU virtualization server (Dom0: Xen 4.10.1 with Fedora 28 and kernel 4.16; Windows DomU: Win2008r2 with CUDA enabled, running GPU-Z, CUDA-Z, and Unigine Heaven; Linux DomU: CentOS 7.5 with CUDA enabled, running the nvenc Video Codec SDK sample):


[Image: screenshot of the Xen vGPU demonstration on the GTX860M]

UPDATES:
Verified with Grid 4.6 (runs unlicensed), 6.1 ....
Verified with Grid/vGPU SW 6.2 - xen 4.11, dom0 - Fedora 28/kernel 4.16 (kernels 4.17-4.19 do not work with xen)
Verified with vGPU SW 7.1 - xen 4.11, dom0 - Fedora 28/kernel 4.16/4.20, xen 4.11.1 - Fedora 29/kernel 4.20 (with Quadro K2200)
Verified with vGPU SW 8.0 - xen 4.12, dom0 - Fedora 29/kernel 5.0 (with Quadro K2200)
Verified with vGPU SW 9.0 - xen 4.12, dom0 - Fedora 29/kernel 5.0 (with Quadro K2200)

#1
Posted 05/18/2018 06:30 AM   
Some happy announcements:

I bought a cheap GTX1080 card today ($460) to test vGPU virtualization with TeslaP4 profiles, and I was successful. There is a performance advantage: the TeslaP4 has a 75W power limit (base clock ~860MHz), while my GTX1080 has a 180W power limit (base clock ~1632MHz, boosted in the Superposition benchmark to ~1873MHz (capped by the 180W limit) with frame_rate_limiter=0). Conclusion: GTX1080 performance roughly equals the TeslaT4, and its NVENC is 2x better (raw FPS) or 5x better (FPS recomputed per delivered VDI; see notes) than the TeslaT4's. The TeslaT4's NVENC is unusable for VDI.

[Image: GTX1080 running vGPU with TeslaP4 profiles]


I bought a GTX980 card from eBay ($160) to test vGPU virtualization with TeslaM60 profiles, and I was also successful. There is a performance advantage: the TeslaM60 has a 300W power limit (2x GM204 chips and 2x 8GB RAM, base clock ~899MHz), while my GTX980 has a 180W power limit (1x GM204 and 1x 4GB RAM, base clock ~1152MHz, boosted in the Superposition benchmark to ~1265MHz (capped by the 180W limit) with frame_rate_limiter=0).

[Image: GTX980 running vGPU with TeslaM60 profiles]


Notes:
  • screenshots are from the virtualized "win1" (Win2008r2), with PuTTY sessions to the virtualized "linux1" (centos75, running the license server) and to the Xen Dom0 (srv)
  • available average FPS for one VDI instance (H.264, low-latency high-performance single pass; reference NVENC speeds are taken from NVIDIA Video Codec SDK 9.0 (NVENC_Application_Note.pdf), GPU clocks from Wikipedia)
    [Image: NVENC vGPU performance comparison table]


UPDATES:
verified with vGPU SW 6.4 + XEN 4.12 + Fedora 29/kernel 5.0 (dom0) + GTX1080
verified with vGPU SW 8.0 - XEN 4.12 + Fedora 29/kernel 5.0 (dom0) + GTX1080
verified with vGPU SW 9.0 - XEN 4.12 + Fedora 29/kernel 5.0 (dom0) + GTX1080
updated NVENC performance per vGPU for vGPU SW 8.0 (RTX 6000/8000, profile *-1B4). The RTX 6000/8000 do not have "A" or "B" profiles (pay more), and there is a limit of 32 profiles per GPU. RTX8000-1Q with 4K and NVENC guarantees a fantastic 5 FPS while 16GB of vRAM is lost - the best profile ever! I do not believe that statistical multiplexing of VDI users is usable without UX impact at less than 15 FPS on average, especially in Windows 10 with animated effects running everywhere (plus the "running video", "running Google Maps", "running PowerPoint transition effects" that NVIDIA marketing presentations assert). I think the Turing generation is absolutely unsuitable for vGPU-based VDI.
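The per-VDI encoder budget above is simple division: the GPU's aggregate NVENC throughput shared across every concurrent session. A minimal sketch (the 160 FPS aggregate 4K H.264 rate is an assumed round number for illustration, not a figure from the SDK application note):

```python
# Hedged sketch: average NVENC frame budget per VDI session when one
# encoder is shared by N concurrent vGPU instances.

def per_vdi_fps(aggregate_fps: float, sessions: int) -> float:
    """Average encoder FPS available to each VDI session."""
    return aggregate_fps / sessions

# 32 x 1-GB profiles on one RTX8000 (the per-GPU profile limit):
print(per_vdi_fps(160, 32))  # -> 5.0, matching the "fantastic 5 FPS" above
```

With any plausible aggregate rate, dividing by 32 sessions lands far below the ~15 FPS floor I consider usable.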

#2
Posted 03/18/2019 05:17 PM   
FYI:

NVIDIA and/or the hypervisor vendors do not allow mixed vGPU types/profiles on one GPU chip (https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#homogeneous-grid-vgpus), and it will not be possible in the next vGPU SW 8.0 release either. From my engineering point of view this restriction makes no sense, because the profiles differ only in the amount of static vRAM allocated. The "best effort" scheduler works identically across all vGPU types/profiles (it is based on cooperative multitasking, with the frame rate limit as the task-switch trigger).
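A toy model of that scheduling behavior (my reading of "best effort", not NVIDIA's actual code): each vGPU cooperatively yields the engine after rendering a frame, and a vGPU that has hit its frame-rate-limit (FRL) quota for the interval is simply skipped, so profile type never enters the scheduling decision.

```python
# Toy cooperative "best effort" scheduler sketch: round-robin between
# vGPUs, yield after every frame, skip a vGPU once it reaches its FRL
# quota for the current interval. Names are illustrative only.

from collections import deque

def best_effort(vgpus, frl=60):
    """Return frames rendered per vGPU over one scheduling interval."""
    frames = {v: 0 for v in vgpus}
    queue = deque(vgpus)
    for _ in range(frl * len(vgpus)):
        v = queue.popleft()
        if frames[v] < frl:      # FRL not yet reached: render one frame
            frames[v] += 1
        queue.append(v)          # cooperative yield after each frame
    return frames

print(best_effort(["win1", "linux1", "linux10"]))
# when the GPU is not oversubscribed, every vGPU reaches its full quota
```

Nothing in this loop depends on the profile's vRAM size, which is why mixing profiles should be harmless.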
I added two lines to my magic script that expand the vRAM allocation (framebufferlength + reserved_fb). Practically any size should work (probably rounded to 64MB). The maximum resolution and number of heads in "P4-1Q" can be changed accordingly. You do not need an additional Windows reboot after a vRAM size change (one was needed after a profile change, because the PCI ID changes) - useful with disk clones. I successfully tested mixed vGPU types/profiles on one GPU chip - srv is the Xen virtualization Dom0 (Xen 4.12.0 + Linux 5.0), "P4-1Q with 2GB vRAM" for Windows (the following screenshot is from this Windows guest: GPU-Z + CUDA-Z + Heaven + Superposition ("medium" needs 1.3GB vRAM), performance limited by the FRL), P4-1Q for Linux (centos75 with a stream encoder), and "P4-1Q with 4GB vRAM" for Linux (linux10 with Xorg + Heaven):

[Image: mixed vGPU profiles running on one GPU chip]
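The "probably rounded to 64MB" guess above amounts to ordinary round-up-to-granularity arithmetic; a quick sketch (the 64 MiB granularity is the post's guess, not a documented value):

```python
# Sketch: round a requested framebuffer size (framebufferlength +
# reserved_fb) up to an assumed 64 MiB allocation granularity.

MIB = 1 << 20  # one mebibyte in bytes

def round_fb(size_bytes: int, granularity: int = 64 * MIB) -> int:
    """Round a requested framebuffer size up to the granularity."""
    return -(-size_bytes // granularity) * granularity  # ceiling division

print(round_fb(2 * 1024 * MIB) // MIB)  # 2 GiB request -> 2048 MiB (exact)
print(round_fb(1300 * MIB) // MIB)      # odd 1300 MiB request -> 1344 MiB
```

So a non-aligned request just costs up to 63 MiB of slack, which is consistent with "practically any size should work".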


PS: Compare the TeslaT4's FP32 (single) = 8.141 TFLOPS and FP64 (double) = 254.4 GFLOPS with the virtualized GTX1080 (measured with no other benchmark running; the GTX1080 lacks Tensor and ray-tracing cores):

[Image: CUDA-Z results on the virtualized GTX1080]


UPDATES:
Verified with vGPU SW 8.0 - xen 4.12, dom0 - Fedora 29/kernel 5.0 (with GTX1080)
Verified with vGPU SW 9.0 - xen 4.12, dom0 - Fedora 29/kernel 5.0 (with GTX1080)

#3
Posted 04/14/2019 03:09 PM   