I have an interesting scenario (HP vs DELL hardware) with potentially degraded performance (specific to the DELL R815 hardware), and I would like to know whether I am interpreting what I am seeing correctly, or whether I am simply being overcautious and don't actually have an issue.
Summary;
DELL Technical Details;
Hypervisor : VMware ESXi 4.1.0, build 582267
Hardware specification;
Dell PowerEdge R815
- Model : AMD Opteron(tm) Processor 6174
- Processor Speed : 2.2 GHz
- Processor Sockets : 4
- Processor Cores per Socket : 12
- Logical Processors : 48
- Memory : 256 GB
esxtop performance statistics;
DELL Memory (incl NUMA statistics);
Dell CPU;
Observations;
Example of the affected Guest VM
DELL Host is under no load whatsoever;
As a contrasting perspective from a heavily loaded HP DL585 G6 host, this is what I would “expect” to see;
HP Technical Details;
Hypervisor : VMware ESXi 4.1.0, build 582267
HP Hardware specification;
HP ProLiant DL585 G6
- Model : Six-Core AMD Opteron(tm) Processor 8435
- Processor Speed : 2.6 GHz
- Processor Sockets : 4
- Processor Cores per Socket : 6
- Logical Processors : 24
- Memory : 128 GB
esxtop performance statistics;
HP Memory (incl NUMA statistics);
HP CPU;
Observations;
HP host still has capacity, but is under much more load than the affected DELL host;
In both cases (HP and DELL) we do expect to see a certain level of ready time, but the levels seen on the DELL hardware are of concern, as is the inefficient use of NUMA local memory. This issue is not seen on the HP hardware, including earlier and later generation hardware.
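For anyone wanting to quantify the NUMA locality concern rather than eyeball it, esxtop's batch mode (`esxtop -b`) writes CSV that can be filtered for the local-memory percentage columns. Here is a minimal sketch of that idea; note the column names and the 90% threshold are assumptions for illustration (real esxtop headers are much longer), and the helper name is mine:

```python
import csv
import io

# Minimal sample standing in for `esxtop -b` CSV output; the real
# column headers are far longer, so this naming is an assumption.
SAMPLE = """Time,Group Memory(101:vm01) % Local,Group Memory(102:vm02) % Local
10:00:00,100,38
10:00:10,99,41
"""

def low_locality_vms(csv_text, threshold=90.0):
    """Return the column names of VMs whose NUMA local-memory percentage
    ever drops below the threshold (90% is a rule of thumb, not a hard limit)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Columns carrying the per-VM NUMA local-memory percentage.
    numa_cols = [i for i, name in enumerate(header) if "% Local" in name]
    worst = {i: 100.0 for i in numa_cols}
    for row in reader:
        for i in numa_cols:
            worst[i] = min(worst[i], float(row[i]))
    return [header[i] for i in numa_cols if worst[i] < threshold]

print(low_locality_vms(SAMPLE))  # flags only the vm02 column
```

On an affected host you would expect several VMs flagged well below 90%, matching the statistics above.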
So the questions are;
That sounds great, Jon.
I did get round to trying out three of the BIOS settings: DMA Virtualization Enabled, C1E Enabled and Power to MAX.
I did notice a small increase in boot speed, but I have not had time to test properly yet.
I got a call back from support yesterday who said they will look into it but thanks for your efforts in digging into this problem, much appreciated.
Pete
Just a quick update. I can't confirm yet if this issue is linked to the one reported, but after setting C1E to Disabled in the BIOS, my tests show a 50% boot time increase in my XP and Linux VMs. I'm running on build 469512.
Thanks for the feedback, Pete. I don't believe it is related, but it's useful to know the results of your testing.
Here is an interesting read regarding the C states;
http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2288.aspx
I have decided to DISABLE this on all of my ESXi hosts.
Testing so far looks really good with the hot patch that VMware provided me, and I don't see the NUMA issue that I previously saw. I did a clean build up to the June patch level, excluding the hot patch, and I can see the issue. As soon as I apply the hot patch, the NUMA balance is evenly spread and the NUMA local memory % is almost 100% on all running guests. I'm currently rebuilding one entire cluster to this level.
For testing purposes, I could give you the hot patch that I have been provided to see if you see the same benefits. I would, however, only use this for testing, and push VMware to provide you with the same hot patch if it works for you.
Cheers,
Jon
I also noticed this behavior over a year ago when we were still on 4.1 U1 and described it here:
http://communities.vmware.com/thread/313253
I think Dev09 made some important points there.
Some also discussed it in the comments of http://frankdenneman.nl/2010/09/esx-4-1-numa-scheduling/
Looking at the stats in resxtop on our 5.0 U1 hosts today, it seems a bit better, though I still have to wonder why some VMs occasionally have less than 90% memory N%L.
Not that we have any real issue here, though.
Yes, that's exactly the same issue I observed. With the hot patch that VMware provided me to address this, the issue disappears, almost all VMs are 100% local, and there is no longer the increased %ready time.
I wonder if this will be addressed for general release in version 5.1 that has just been announced - I will definitely be testing this as soon as it's available.
http://www.vmware.com/products/vsphere/esxi-and-esx/overview.html
I am facing the same issue with our R815 systems. We have updated the BIOS, firmware, etc. to the latest versions, and the ESX build number we are using is 768111.
The performance of the VMs is problematic. This environment is still under implementation, but we need to go live soon.
How can I get the hot patch to try? When should we expect the patch to reach us through the official channels?
The update to my SR was
"Here's an update on your case.
This fix is included in ESX 5.0 P04, which was released September 27 (KB 2032584), with the details of the fix in KB 2032586.
All patches get rolled up into the update release so it will also be in U2 which is scheduled for December this year"
The fix they are talking about is HotPatch for PR 875553.
Hope that helps someone. I don't see any of these in Update Manager, so maybe I'm missing something.
This issue has been fixed in the latest set of patches that have just been released (27/09/2012) - successfully tested.
If you are running ESXi 5.0.0 build # 821926 then you should see the issue resolved.
See: http://www.vmware.com/patchmgr/findPatch.portal (Build 821926)
PR787454: After performing vMotion operations on virtual machines, a NUMA imbalance might occur with all the virtual machines being assigned the same home node on the destination server.
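Since the fix is tied to a minimum build number, one way to keep track across many hosts is a simple build-number gate (e.g. against the output of `vmware -v`). A small sketch, assuming the fixed build is 821926 as above; the helper name is just for illustration:

```python
FIXED_BUILD = 821926  # ESXi 5.0 patch release of 27/09/2012 containing the fix

def has_numa_fix(build):
    """True if the host's build number is at or above the patched build."""
    return int(build) >= FIXED_BUILD

# Builds mentioned in this thread:
for b in (469512, 768111, 821926):
    print(b, has_numa_fix(b))
```

Only the hosts at 821926 or later should report True.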
Let me know if you need any additional information.
Cheers,
Jon
Yes, the fix is included in this patch release (successfully tested);
We had temporarily resolved the problem by setting the NUMA configuration as described in the previous posts and working a bit with the BIOS of the server. We have also upgraded to 821926. Do we need to set the NUMA configuration back to its original values?
FYI for anyone with this issue on 4.1. VMware told me it can't be fixed (or they won't fix it) on 4.1 so we're just SOL.
Also successfully tested here as of last night. After the patch, overall CPU load has gone from 60-70% to 14-24% per host, and as expected %RDY has dropped into single digits or below.
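One note for anyone reading the %RDY figures: esxtop reports %RDY summed across all of a VM's vCPUs, so it should be divided by the vCPU count before comparing against the common ~10% per-vCPU rule of thumb. A minimal sketch (the 10% threshold is the usual guideline, not a VMware-mandated limit, and the function names are mine):

```python
def per_vcpu_ready(rdy_percent, num_vcpus):
    """esxtop's group %RDY is summed over all of a VM's vCPUs,
    so normalize it to a per-vCPU figure."""
    return rdy_percent / num_vcpus

def is_concerning(rdy_percent, num_vcpus, threshold=10.0):
    """Apply the ~10% per-vCPU rule of thumb."""
    return per_vcpu_ready(rdy_percent, num_vcpus) > threshold

# A 4-vCPU VM showing %RDY of 60 is 15% per vCPU (a problem);
# the same VM at 24 after the patch is 6% per vCPU (acceptable).
print(is_concerning(60, 4), is_concerning(24, 4))  # True False
```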
Thanks jrmunday! Without your great writeups and your persistence with VMware support, these systems would have gone back to Dell and everyone would have been left with a bad impression of AMD here.
Latney Hoagland
No problem at all, I'm glad your issues are resolved and that I can finally contribute something to the community.
Let me know if you use the Dell vCenter plugin and need any help getting it working with OME / vCenter (including SNMP) ... it was a real pain to get working (with all the firmware update features), but actually really simple once you know what needs doing ... I just haven't had time to document this for others.
Cheers,
Jon