Hi there,
I am running ESXi 5.5 with the latest patches, and every so often my server just freezes. The only way to get it back is to power it off.
Can someone give me a starting point of where to look to determine the cause?
Thanks
Hello,
What is the manufacturer and model of the hardware?
Is there a PSOD when the host crashes or freezes?
Is it a newly provisioned ESXi host? If not, how long has the server been running, and was the issue first noticed recently?
Hi,
The VMkernel logs give a fair idea of what led to the crash. Please check them.
Hi there,
Can you tell me how I can get to these log files?
Thanks
You can ssh to the host and they are located in /var/log.
You can also export them using the vSphere client: File>Export>Export System Logs
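Once you have a copy of vmkernel.log (from /var/log or from the exported bundle), a short script can pull out the machine-check lines so they are easier to count. This is only a rough sketch: the regular expression assumes the "MCE: ... MCA ... error (...)" line layout that ESXi 5.5 writes, and the function names scan_mce and count_by_bank are made up for this example.

```python
import re
import sys
from collections import Counter

# Assumed layout of the MCE summary line in vmkernel.log, e.g.
#   <timestamp> cpu0:35332)MCE: 189: cpu0: bank4: MCA fatal error (CE): "..."
MCE_SUMMARY = re.compile(
    r'^(?P<ts>\S+) cpu\d+:\d+\)MCE: \d+: cpu(?P<cpu>\d+): bank(?P<bank>\d+): '
    r'MCA \w+ error \((?P<kind>\w+)\): "(?P<msg>[^"]*)"'
)
PHYS_ADDR = re.compile(r'physical address (0x[0-9a-fA-F]+)')


def scan_mce(lines):
    """Return one dict per MCE summary line found in an iterable of log lines."""
    events = []
    for line in lines:
        m = MCE_SUMMARY.match(line)
        if m is None:
            continue  # not an MCE summary line
        addr = PHYS_ADDR.search(m.group("msg"))
        events.append({
            "timestamp": m.group("ts"),
            "cpu": int(m.group("cpu")),
            "bank": int(m.group("bank")),
            "kind": m.group("kind"),  # e.g. "CE" for a corrected error
            "address": addr.group(1) if addr else None,
            "message": m.group("msg").strip(),
        })
    return events


def count_by_bank(events):
    """Count events per (cpu, bank, kind) to see whether one bank keeps reporting."""
    return Counter((e["cpu"], e["bank"], e["kind"]) for e in events)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_mce.py vmkernel.log
    with open(sys.argv[1], errors="replace") as fh:
        found = scan_mce(fh)
    for e in found:
        print(e["timestamp"], "cpu", e["cpu"], "bank", e["bank"], e["kind"], e["address"])
    for (cpu, bank, kind), n in count_by_bank(found).items():
        print("cpu%d bank%d %s: %d event(s)" % (cpu, bank, kind, n))
```

If the same CPU and bank show up again and again, that is worth mentioning to the hardware vendor along with the addresses.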
Hi again,
I had a look at the vmkernel.log file and found this. Any ideas?
Once it crashed I was unable to connect to the server with PuTTY, and I had to power it off and on with the power button.
2015-06-18T11:04:38.834Z cpu0:35332)MCE: 1082: cpu0: MCA error detected via Polling (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2015-06-18T11:04:38.834Z cpu0:35332)MCE: 180: cpu0: bank4: status=0x94514000b0080a13: (VAL=1, OVFLW=0, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x1a9699910 (valid), Misc:0x0 (invalid)
2015-06-18T11:04:38.834Z cpu0:35332)MCE: 189: cpu0: bank4: MCA fatal error (CE): "Corrected DRAM ECC Error on cpu 0 physical address 0x1a9699910 "
2015-06-18T11:07:33.338Z cpu0:32787)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x1a (0x439d802b1b80, 0) to dev "mpx.vmhba1:C0:T0:L0" on path "vmhba1:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2015-06-18T11:09:57.834Z cpu0:35332)MCE: 1082: cpu0: MCA error detected via Polling (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2015-06-18T11:09:57.834Z cpu0:35332)MCE: 180: cpu0: bank4: status=0x94514000b0080813: (VAL=1, OVFLW=0, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x23089fd90 (valid), Misc:0x0 (invalid)
2015-06-18T11:09:57.834Z cpu0:35332)MCE: 189: cpu0: bank4: MCA fatal error (CE): "Corrected DRAM ECC Error on cpu 0 physical address 0x23089fd90 "
VMB: 49: mbMagic: 2badb002, mbInfo 0x1010f0
VMB: 54: flags a6d
Thanks
From reviewing the kernel log, you have an issue with a machine check exception. What the kernel log is returning looks like a defective ram card or slot causing the crash. I would recommend running a full hardware diagnostics on the host and contacting your hardware vendor to assist with that.
If you have a support contract with VMware, please open a case so that you can upload the screenshot and logs for the event to find the cause of the crash. If you have automatic reboot enabled in power management, the host might not stay on the PSOD (purple screen).
If you have a screenshot of the crashed host, post it and I can take a look for you.
Please see this KB for relevant information on the kernel log:
VMware KB: Decoding Machine Check Exception (MCE) output after a purple screen error
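To illustrate what the KB is about, the status value in the "bank4: status=0x..." line can be unpacked by hand. The sketch below assumes the machine-check status register layout from Intel's architecture manual (VAL in bit 63 down to AR in bit 55, error code in the low 16 bits). The upper flags are common across x86 vendors, but the S and AR positions are an assumption for this particular host, and decode_status and is_corrected are names made up for this example.

```python
# Bit positions of the flags in a machine-check bank status register.
# Assumption: Intel-style layout; S and AR may mean something else on other CPUs.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten (shown as OVFLW in the log)
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the Misc register is valid
    "ADDRV": 58,  # the Addr register is valid
    "PCC": 57,    # processor context corrupted
    "S": 56,      # signaled (assumed Intel meaning)
    "AR": 55,     # action required (assumed Intel meaning)
}


def decode_status(status):
    """Split a 64-bit bank status value into its flag bits and error code."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF  # low 16 bits
    return decoded


def is_corrected(decoded):
    """A valid error with UC clear was corrected by the hardware."""
    return decoded["VAL"] and not decoded["UC"]


if __name__ == "__main__":
    # Status value taken from the first bank4 line in the log above.
    info = decode_status(0x94514000B0080A13)
    for name, value in info.items():
        print(name, "=", hex(value) if name == "mca_error_code" else int(value))
    print("corrected:", is_corrected(info))
```

For the value in the log this agrees with what ESXi already printed (VAL=1, OVFLW=0, UC=0, EN=1, PCC=0, S=0, AR=0, address valid, Misc invalid), which is why the error is reported as corrected (CE) rather than uncorrected.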
Hi there,
When you say "defective ram card" - do you mean one of the memory modules could be faulty?
"I would recommend running a full hardware diagnostics on the host" - could I use something like this: MemTest86 - Official Site of the x86 Memory Testing Tool
The server simply freezes, and pressing a key to wake the screen does nothing. There is no PSOD, and the only way to recover is to power the server off and back on.
Thanks
The entries in your log file stating "[...] MCA fatal error (CE): "Corrected DRAM ECC Error[...]" might point to defective RAM. Even though the error is reported as "Corrected", it might lead to a freeze or crash sooner or later.
You can use memtest to check your memory modules. Let it run for quite a while, as some errors only show up once the modules have heated up.
If your server vendor provides hardware diagnostic tools, you can run those as well. A brief look at the cooling system might also help: are all fans operating normally? The heatsinks and fans may need cleaning.