Poor performance of highly specced VMs – vNUMA!

The scenario:

Database server with 10 vCPUs and 192 GB RAM

Physical host with 2 sockets of 10 cores, 256 GB RAM total (128 GB per NUMA node)

The customer was reporting poor performance on their database server, and an initial browse of the configuration seemed to show it had been set up well: multiple paravirtual SCSI adapters, enough CPU and RAM for the workload, storage performing well, and so on. Yet CPU usage sat almost constantly at 100%, and adding more cores did not help.

Eventually I spotted part of the problem – CPU Hot Add was enabled. Whilst this is a useful feature for smaller VMs, letting them start small and grow as the workload grows, enabling it on VMs with more than 8 vCPUs disables vNUMA and can lead to poor performance.

Why this is a problem

vNUMA is a technology which was introduced in vSphere 5.0 and improved upon in 6.5. It presents the underlying NUMA topology to the guest OS, so the OS can place data in memory on its local physical NUMA node – which is faster than reaching across to memory attached to the processor the VM is not scheduled on.

The Catch

By default, vNUMA calculates its topology from vCPU count alone – memory is not considered. This is fine as long as the VM uses less memory than is attached to a single processor in your host, because the vNUMA presentation will still match reality.

So on our example host above, a 10 vCPU VM with 128 GB RAM would be presented a single vNUMA node, as it fits within a single physical NUMA node. If we were to increase that to 12 vCPUs, it would present 2 vNUMA nodes, because the vCPU count now exceeds the cores on one physical processor. As the memory still sits within the bounds of a physical NUMA node, we do not have to worry.
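To make that concrete, here is a tiny Python sketch of the vCPU-only sizing described above. It is purely illustrative – the 10-cores-per-socket figure is just our example host, and nothing here is a vSphere API:

```python
import math

# Illustrative only: the default vNUMA autosizing described above
# looks at vCPU count alone. Value taken from the example host.
PHYSICAL_CORES_PER_SOCKET = 10

def default_vnuma_nodes(vcpus: int) -> int:
    """vNUMA nodes presented when sizing purely on vCPU count."""
    return math.ceil(vcpus / PHYSICAL_CORES_PER_SOCKET)

print(default_vnuma_nodes(10))  # 1 -> fits within one physical NUMA node
print(default_vnuma_nodes(12))  # 2 -> crosses onto the second socket
```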

However, our database server has 10 vCPUs and 192 GB RAM. Simply disabling CPU Hot Add and allowing vNUMA to take over will present 1 vNUMA node on the basis of the VM having 10 vCPUs. Because vNUMA hasn't taken memory into account, the VM ends up with more memory than exists in a physical NUMA node (128 GB). A single vNUMA node performs poorly here because the VM's memory actually spans two physical NUMA nodes, so part of it is always remote to the processor the vCPUs are scheduled on.
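Again just as an illustration, extending the sketch to compare the VM's per-node memory against the physical NUMA node size shows exactly where the single-node presentation falls over (the 128 GB figure comes from our example host):

```python
# Illustrative check, not an ESXi API: does the memory behind each
# presented vNUMA node actually fit inside a physical NUMA node?
MEMORY_PER_PHYSICAL_NODE_GB = 128  # 256 GB host split across 2 sockets

def memory_fits(vnuma_nodes: int, vm_memory_gb: int) -> bool:
    per_node = vm_memory_gb / vnuma_nodes
    return per_node <= MEMORY_PER_PHYSICAL_NODE_GB

# Our database server: 10 vCPUs -> 1 vNUMA node, but 192 GB RAM
print(memory_fits(1, 192))  # False: 192 GB > 128 GB, memory spills onto the remote node
print(memory_fits(2, 192))  # True: 96 GB per node fits comfortably
```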

The Solution

In this instance the solution is simple: configure the vCPU socket layout to match your physical CPU socket layout. Here, we configure 2 sockets of 5 vCPUs, which presents 2 NUMA nodes to the guest OS, each holding half the VM's memory.
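If you want to sanity-check a layout before setting it, here is a rough helper in the same illustrative spirit as the sketches above. The host values are hard-coded from our example, and the function is my own invention rather than anything from PowerCLI or the vSphere API:

```python
import math

# Hypothetical helper: suggest a virtual socket layout that mirrors the
# physical host, as recommended above. Host values are from the example.
HOST_SOCKETS = 2
HOST_CORES_PER_SOCKET = 10
HOST_MEMORY_PER_NODE_GB = 128

def suggest_layout(vcpus: int, memory_gb: int) -> tuple:
    """Return (virtual sockets, cores per socket) for the VM."""
    # Spread the memory across enough physical NUMA nodes to fit,
    # but never use more sockets than the host actually has.
    sockets_for_memory = math.ceil(memory_gb / HOST_MEMORY_PER_NODE_GB)
    sockets = min(max(sockets_for_memory, 1), HOST_SOCKETS)
    return sockets, math.ceil(vcpus / sockets)

print(suggest_layout(10, 192))  # (2, 5) -> 2 sockets of 5 vCPUs, ~96 GB per node
```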

With this configuration, thanks to better-optimised memory placement on the same available resources, CPU usage dropped from a constant 100% to around 70-80%. A nice saving without having to allocate more!

Takeaways:

  • On large (>8 vCPU) VMs – don’t enable CPU Hot Add
  • On VMs with more memory than a single NUMA node holds, or more vCPUs than a single physical processor has cores – manually set the vCPU socket configuration to match the number of sockets in your host system.
