Looking for advice on optimal config for latest-gen Citrix XenApp vGPU solution
Hi,


I'm designing a new setup for NVIDIA GRID vApps, aimed at user-density RDSH-based sessions, on the following hardware:

HP DL380 G10 with dual Xeon 6254
Best GPU: I suppose the Tesla T4 right now, for more flexibility, codec support and future-proofing. Ideally I would have wanted something like a T6 or T8 (a new-generation M10 with 64GB of memory), but that doesn't exist?
10Gb backbone network to connect everything.

I figure I will end up with 16 smaller or 8 larger virtual Citrix XenApp servers per physical host server.


1. What can you recommend?

2. Are there any reference configuration or case-study documents available?

3. How well can I scale per physical T4 GPU adapter?

4. How is vGPU memory used in this RDSH model? Does it limit the maximum number of XenApp sessions per virtual Citrix server (which will have 1GB or 2GB of memory assigned to its virtual machine)? Does it limit the resolutions users can run their XenApp sessions in?

5. Do I need VMware, or does XenServer work just as well for NVIDIA GRID? (VMware will require Enterprise licenses, etc.) Are there limitations?

Thanks in advance for any replies

Th

#1
Posted 08/08/2019 05:03 PM   
Hi

You can use either XenServer or vSphere for the Hypervisor.

XenServer licensing is included with XenDesktop / XenApp so it's a bit cheaper, but at the end of the day you're stuck using XenCenter to manage your deployment (which is an extremely outdated and massively underdeveloped management console). The only nice thing about XenCenter is the way it visualises the GPUs and makes it really easy to see where they are allocated. This is a feature vSphere is lacking.

By contrast, with vSphere you'll need Enterprise licensing for the hypervisor and at least Standard licensing for vCenter. However, you're getting a platform that's much, much nicer to manage and support, with better overall functionality. If you plan to use NetScaler VPX appliances, be aware that XenServer does not support live migration of them (they migrate and then crash), whereas vSphere does.

Honestly, apart from vSphere's poor vGPU management, the only reason you'd choose XenServer over vSphere is that it's cheaper. If they were the same price, or a lot closer, it would be vSphere every time, without hesitation.

As for your T4s, change the Scheduler on the GPUs to "Fixed" and allocate 8GB (using the 8A vGPU profile) to each of your XenApp VMs. You'll get 2 XenApp VMs per GPU; allocate your CPU and RAM resources to each XenApp VM accordingly. The downside of using XenApp is that you have no control over how the Framebuffer is allocated per user. One user could consume the entire 8GB if their workload required it. If you want more granular control and a fixed Framebuffer allocation per user, then you need to use XenDesktop.
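
For illustration, the framebuffer arithmetic behind that recommendation is easy to sanity-check. This is just a rough Python sketch; the T4 counts per host are example values, not a recommendation for your environment:

# Rough framebuffer arithmetic for XenApp / RDSH sizing (illustrative only).
# Assumption: one 8A (8GB) vGPU profile per XenApp VM on 16GB T4 cards.

T4_FRAMEBUFFER_GB = 16   # a T4 is a single 16GB GPU
PROFILE_GB = 8           # 8A profile = 8GB per XenApp VM

def xenapp_vms_per_gpu(gpu_fb_gb=T4_FRAMEBUFFER_GB, profile_gb=PROFILE_GB):
    # Framebuffer is a fixed resource, so VMs per GPU = framebuffer / profile size.
    return gpu_fb_gb // profile_gb

def xenapp_vms_per_host(t4s_per_host):
    return t4s_per_host * xenapp_vms_per_gpu()

for t4s in (1, 2, 4, 5):
    print(f"{t4s} x T4 per host -> {xenapp_vms_per_host(t4s)} XenApp VMs (8A profile)")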

How many concurrent users do you plan to support?

Regards

MG

#2
Posted 08/09/2019 07:37 AM   
Hi Mr GRID,

Thanks for your response. You seem like just the right man to talk to!

I'm planning for 350-400 concurrent users, capacity-wise. All intended worker profiles are office-level, so I'm looking to offload 'normal' applications. We don't have AutoCAD or other GPU power users to service.

Outside of the GPU part, I think I'll have 8 or 16 virtual XenApp servers per physical host in the current design.

Since the T4 cards are very costly and PCIe slots are limited in general, I was hoping to be able to service them with a maximum of 1 or 2 T4 cards per physical host. What is the exact criterion here, and how do I calculate the need for specific vGPU profiles? How does that work?

I also remember seeing an overview of all possible vGPU profiles for the M10 last year, but now I cannot seem to find the same document for the T4.


PS: for some odd reason I'm not getting update emails for your replies (not in the spam folder either), so it's a good thing I checked back manually.

#3
Posted 08/09/2019 03:26 PM   
Hi [quote="Profundido"]What is the exact criterium here or how do I calculate the need for specific vGPU profiles ? How does that work ?[/quote] With XenApp / RDSH, it's relatively strait forwards to design as the vGPU configuration options are pretty standard. Basically, 8GB is the number you should "[i]typically[/i]" be looking to use for XenApp, and the configuration options would be as follows: The most cost effective (cheapest) solution is to still use the M10 for XenApp deployments. The M10 has 4 GPUs on a single board, and you'll put 2 of those boards in a single 2U server. This will give you the capability of running 8 XenApp VMs per Server, each with an 8A vGPU profile. A more future proof configuration would be to replace 1 M10 with 2 T4s (or 2 M10s with 4 T4s). This will give you the same amount of Framebuffer to share between your XenApp VMs, but the T4 will provide better performance and functionality and the T4 is more power efficient as well. Then (as mentioned earlier) change the vGPU scheduler on the T4 to "Fixed" and allocate the same 8A vGPU profile to the XenApp VMs. You don't need to change the Scheduler on the M10, as each XenApp VM has its own dedicated GPU. You will want more than 2 T4s per server, or you'll need more servers to cater for that amount of users. So scale up, not out. If you have 400 users and want 16 XenApp VMs that equates to 25 users per XenApp VM. With 4 T4s installed, you'll have 8 XenApp VMs per DL380. To account for N+1 (physical resilience, image updates, user load balancing) you're going to need 3 DL380 servers each with 4 T4s installed to cater for those numbers,[b] assuming that you can actually support 25 users per XenApp VM without impacting the experience[/b]. User density on the XenApp VMs will vary depending on utilisation, so it's very important to test first in a POC before finalising any specifications or quantity of Servers required. The DL380 G10 will actually support up to 5 T4s (https://www.nvidia.com/object/vgpu-certified-servers.html), which means you could host 10 XenApp VMs per DL380. This would reduce your user density per XenApp VM down to 20, which may be a better number to target. vGPU Profile options for the M10 and T4 are available here: [b]M10[/b]: [url]https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#vgpu-types-tesla-m10[/url] [b]T4[/b]: [url]https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#vgpu-types-tesla-t4[/url] But as said, the best profile for higher density XenApp VMs will be to use the 8A profile. If you're supporting 10 XenApp VMs / 200 users per Server, don't forget to consider the CPU. You should be looking at something with more Cores, rather than higher Clock. Here are some better options to consider: [b]Platinum 8280[/b]: https://ark.intel.com/content/www/us/en/ark/products/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz.html [b]Platinum 8260[/b]: https://ark.intel.com/content/www/us/en/ark/products/192474/intel-xeon-platinum-8260-processor-35-75m-cache-2-40-ghz.html [b]Gold 6252N[/b]: https://ark.intel.com/content/www/us/en/ark/products/193951/intel-xeon-gold-6252n-processor-35-75m-cache-2-30-ghz.html Due to the nature of the workload, you don't need such a high Clock, and having more Cores will reduce the CPU overcommit. As a starting point for your POC, you should be looking at 8 vCPUs / 32GB RAM / 8A vGPU with the aim of supporting 20-25 users per XenApp VM. 
Your VMs should be running on All Flash / SSD storage as well (not cheap / slow spinning disks). You can then monitor the hardware utilisation for each component and tailor the specs to suit the user experience, performance and user density. Regards MG
Hi

Profundido said: What is the exact criterion here, and how do I calculate the need for specific vGPU profiles? How does that work?

With XenApp / RDSH, it's relatively straightforward to design, as the vGPU configuration options are pretty standard. Basically, 8GB is the number you should "typically" be looking to use for XenApp, and the configuration options would be as follows:

The most cost-effective (cheapest) solution is still to use the M10 for XenApp deployments. The M10 has 4 GPUs on a single board, and you'll put 2 of those boards in a single 2U server. This will give you the capability of running 8 XenApp VMs per server, each with an 8A vGPU profile.

A more future-proof configuration would be to replace 1 M10 with 2 T4s (or 2 M10s with 4 T4s). This will give you the same amount of Framebuffer to share between your XenApp VMs, but the T4 will provide better performance and functionality, and it is more power efficient as well. Then (as mentioned earlier) change the vGPU scheduler on the T4 to "Fixed" and allocate the same 8A vGPU profile to the XenApp VMs. You don't need to change the Scheduler on the M10, as each XenApp VM has its own dedicated GPU.

You will want more than 2 T4s per server, or you'll need more servers to cater for that number of users. So scale up, not out. If you have 400 users and want 16 XenApp VMs, that equates to 25 users per XenApp VM. With 4 T4s installed, you'll have 8 XenApp VMs per DL380.

To account for N+1 (physical resilience, image updates, user load balancing), you're going to need 3 DL380 servers, each with 4 T4s installed, to cater for those numbers, assuming that you can actually support 25 users per XenApp VM without impacting the experience. User density on the XenApp VMs will vary depending on utilisation, so it's very important to test first in a POC before finalising any specifications or the number of servers required.

The DL380 G10 will actually support up to 5 T4s (https://www.nvidia.com/object/vgpu-certified-servers.html), which means you could host 10 XenApp VMs per DL380. This would reduce your user density per XenApp VM down to 20, which may be a better number to target.
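
To put concrete numbers on those two options, here's a small sizing helper. It's only a sketch: the 400-user total, the T4s per host and the achievable users per VM are assumptions that still have to be validated in your POC:

import math

# Illustrative capacity planning for XenApp on T4s (8A profile = 2 VMs per T4).
# All inputs are assumptions to be confirmed in a POC.

def plan(total_users, t4s_per_host, users_per_vm, n_plus_one=True):
    vms_per_host = t4s_per_host * 2            # 16GB T4 / 8GB (8A) profile = 2 VMs per GPU
    users_per_host = vms_per_host * users_per_vm
    hosts = math.ceil(total_users / users_per_host)
    if n_plus_one:
        hosts += 1                             # spare host for resilience / image updates
    return vms_per_host, users_per_host, hosts

for t4s, users_per_vm in ((4, 25), (5, 20)):
    vms, per_host, hosts = plan(400, t4s, users_per_vm)
    print(f"{t4s} x T4: {vms} VMs/host, {per_host} users/host, {hosts} x DL380 incl. N+1")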

vGPU Profile options for the M10 and T4 are available here:

M10: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#vgpu-types-tesla-m10
T4: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#vgpu-types-tesla-t4

But as said, the best profile for higher-density XenApp VMs will be the 8A profile.

If you're supporting 10 XenApp VMs / 200 users per Server, don't forget to consider the CPU. You should be looking at something with more Cores, rather than higher Clock. Here are some better options to consider:

Platinum 8280: https://ark.intel.com/content/www/us/en/ark/products/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz.html
Platinum 8260: https://ark.intel.com/content/www/us/en/ark/products/192474/intel-xeon-platinum-8260-processor-35-75m-cache-2-40-ghz.html
Gold 6252N: https://ark.intel.com/content/www/us/en/ark/products/193951/intel-xeon-gold-6252n-processor-35-75m-cache-2-30-ghz.html

Due to the nature of the workload, you don't need such a high Clock, and having more Cores will reduce the CPU overcommit.

As a starting point for your POC, you should be looking at 8 vCPUs / 32GB RAM / 8A vGPU with the aim of supporting 20-25 users per XenApp VM. Your VMs should be running on All Flash / SSD storage as well (not cheap / slow spinning disks). You can then monitor the hardware utilisation for each component and tailor the specs to suit the user experience, performance and user density.
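
When the POC is running, it helps to log that utilisation over time rather than eyeballing it. Below is a minimal example of how you might do that with nvidia-smi from Python; it assumes nvidia-smi is available in the XenApp VM (or on the host) and it only uses standard query fields:

import csv
import subprocess
import time

# Periodically sample GPU load and framebuffer usage via nvidia-smi (illustrative).
# Run during the POC while users are loaded onto the XenApp VM.
QUERY = "timestamp,utilization.gpu,memory.used,memory.total"

def sample():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

with open("gpu_poc_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        for row in sample():
            writer.writerow([col.strip() for col in row.split(",")])
        f.flush()
        time.sleep(60)   # one sample per minute is enough for trending

You can then correlate the log against session counts to see where the GPU or the framebuffer becomes the bottleneck.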

Regards

MG

#4
Posted 08/10/2019 09:48 AM   
Thanks for the comprehensive answer. Very clear except for one thing:

Why is 8GB the number you should "typically" be looking to use for XenApp vGPU profiles? What is the impact if I end up choosing 4GB, for instance?

#5
Posted 08/11/2019 11:31 AM   
Hi

For XenApp / RDSH workloads, the 8GB figure started with the M10 (which was specifically created by NVIDIA to provide a low-cost entry point for workloads like these). Best practice was (and still is) to assign the entire 8GB of a GPU to a single RDSH VM. That way, the VM gets the full power of the GPU and doesn't have to share it with a competing VM via the Scheduler.

The more VMs you add to the same GPU, the less consistent the performance becomes, as the resources now need to be scheduled. This is especially true with RDSH, as you have multiple users per RDSH VM. The only way to then provide more consistent performance (bearing in mind that one user on the RDSH VM can still impact another) is to modify the Scheduler accordingly, trading peaky performance for consistent performance at a lower level. However, by doing that, neither VM will ever get the full power of the GPU, so the user experience will ultimately suffer. If you wanted to run the M10 and allocate 4GB to each RDSH VM, then each RDSH VM would only be getting 50% of the performance of an already not very powerful GPU, shared between multiple users on each RDSH VM.

With the T4, that same scenario gets slightly worse. As the M10 has 8GB GPUs, running 2x 4GB VMs on one of its GPUs only halves that GPU's performance. With the T4, even though it's more powerful than a single GPU on an M10, it's still a single 16GB GPU, so if you run 4x 4GB RDSH VMs on it, what you're actually doing is giving each RDSH VM a maximum of 25% of the GPU's performance (assuming you've configured the Scheduler to "Fixed" to give consistent performance). Each set of users on an RDSH VM then only gets up to 25% of the GPU, divided by however many users are on that VM using that GPU at the same time.
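
As a quick illustration of why that matters, the worst-case per-user share of a T4 under the "Fixed" scheduler is easy to work out (a sketch only; the user counts are just example figures):

# Worst-case per-user share of one T4 under the "Fixed" scheduler (illustrative).
# Each VM gets a fixed 1/N slice of the GPU, and that slice is shared by its users.

def per_user_share(vms_per_gpu, users_per_vm):
    vm_share = 1.0 / vms_per_gpu
    return vm_share / users_per_vm

for vms, users in ((2, 25), (4, 25)):   # 8A profile vs 4A profile on a 16GB T4
    print(f"{vms} VMs x {users} users: "
          f"{per_user_share(vms, users):.1%} of the GPU per user at full load")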

All of that, and that's before we even get on to encoding. The Framebuffer is the only bit of the GPU that isn't shared with other VMs, meaning that everything else is. If you overload the encoders, you'll further impact user experience. Even though the encoders on the Turing GPUs are much more efficient than those used on the older architectures, there are fewer of them, so it's still possible to overload them. A great way to do that is by running too many RDSH VMs on a GPU, as there is no hard limit to the number of users (individual sessions that require encoding) per VM. This is in contrast to Desktop-based VMs: as the Framebuffer is a fixed resource, each GPU can only support a finite number of VMs. With the M10, forgetting that pointless 512MB profile, the maximum number of VMs you can get per GPU is 8 (using the 1GB profile). This means that the Scheduler only has to share the resources between a maximum of 8 VMs (users), unlike RDSH, where you can easily get 20+ users per VM.

For best results running RDSH on the T4, use the 8A profile, assign it to 2 RDSH VMs, and change the scheduler to "Fixed" to give your users consistent performance (or as consistent as a VM shared by 20-25 users can be). That way, the users on one RDSH VM will get 50% of a T4 without the ability to impact the 20-25 users on the second RDSH VM sharing the GPU, which will be as good as, or better than, an entire 8GB M10 GPU.

If you were hoping to run 4x 4GB RDSH VMs, each with 20-25 users, on a T4 (totalling 100 users per T4), I'll save you the trouble of running a POC... Don't bother, the user experience won't be good enough. You'll need the configuration I've mentioned above :-) If 4 T4s don't fit your budget, then use 2 M10s (per server) instead, again with the 8A profile (that's 4 RDSH VMs per M10), but you'll still need 2 DL380 servers to hit your number (3 servers, if you want to include N+1 resilience), assuming that you can get 25 users per RDSH VM to hit your 400-user peak.

Regards

MG

#6
Posted 08/11/2019 04:18 PM   
Awesome information! That's exactly what I was looking for. I had to read it three times before it fully sank in :)

"Each set of users on the RDSH VMs, then only gets up to 25% of the GPU divided by however many users are on that VM using that GPU at the same time"

=> Ouch, yes. I can see how that will affect my scaling options as well as the maximum potential vGPU performance a single user can reach.

"If you were hoping to run 4 4GB RDSH VMs with 20 - 25 users on each T4 (totalling 100 users per T4), I'll save the the trouble of running a POC ... Don't bother, the user experience won't be good enough"

=> Thanks, I think you just saved me quite a bit of 'hard lessons learned' time :)


Is there any technical documentation where I can further educate myself on how the scheduler and framebuffer work at a technical level?

I will take all this information into account in my design and total-cost calculations. I can already see how it will affect my choices and scaling options.

#7
Posted 08/12/2019 08:35 AM   
Hi

Sure. All vGPU documentation is available here: https://docs.nvidia.com/grid/. When your POC begins, you'll be running the latest version (currently 9.0), so just select "Latest Release" for the most up-to-date features and functionality.

The piece of information you're looking for relating to the Scheduler is located here: https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#changing-vgpu-scheduling-policy

There's a lot of information in those documents and sometimes specific details aren't that easy to locate. In that case, just use the "search box" (top right) to scan through all of the documentation for specific key words to help find the information.

Regards

MG

#8
Posted 08/12/2019 09:14 AM   