VMware Cloud Community
MSUKvm
Contributor

ESXi 5 crashing

Hi there,

I am running ESXi 5.5 with the latest patches, and every so often my server just freezes. The only way to get it back is to power it off.

Can someone give me a starting point of where to look to determine the cause?

Thanks

8 Replies
vuzzini
Enthusiast

Hello,

What is the manufacturer and model of the hardware?

Is there any PSOD when the host crashes or freezes?

Is it a newly provisioned ESXi host? If not, how long has the server been running, and was the issue noticed only recently?

Sandeep Vuzzini, Sr. DevOps Engineer
Praveenmna
Enthusiast

Hi,

The VMkernel logs give a fair idea about the crash. Please refer to them.

Praveen P, Senior Support Engineer
MSUKvm
Contributor

Hi there,

Can you tell me how I can get to these log files?

Thanks

homerzzz
Hot Shot

You can ssh to the host and they are located in /var/log.

You can also export them using the vSphere client:  File>Export>Export System Logs
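Once you have a copy of vmkernel.log on your workstation, a small script can pull out the machine check (MCE) entries so they are not buried among the other messages. This is only a rough sketch in Python: the function names are made up for illustration, and the regular expression is an assumption based on how MCE lines typically look in vmkernel.log, so adjust it if your lines differ.

```python
import re
from collections import Counter

# Assumed line shape: "<timestamp> cpuN:<id>)MCE: <n>: <message>"
MCE_LINE = re.compile(r'^(?P<ts>\S+)\s+cpu\d+:\d+\)MCE:\s+\d+:\s+(?P<rest>.*)$')
ADDRESS = re.compile(r'physical address (0x[0-9a-fA-F]+)')


def find_mce_events(lines):
    """Return (timestamp, message) tuples for every MCE line in `lines`."""
    events = []
    for line in lines:
        match = MCE_LINE.match(line.strip())
        if match:
            events.append((match.group('ts'), match.group('rest')))
    return events


def count_addresses(events):
    """Count how often each reported physical address appears."""
    counts = Counter()
    for _, message in events:
        found = ADDRESS.search(message)
        if found:
            counts[found.group(1)] += 1
    return counts


def scan_file(path):
    """Hypothetical helper: read a local copy of the log and list MCE events."""
    with open(path, errors="replace") as handle:
        return find_mce_events(handle)
```

Calling `scan_file("vmkernel.log")` on the copied file and passing the result to `count_addresses` gives a quick view of how often errors are reported and at which addresses.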

MSUKvm
Contributor

Hi again,

Had a look at the vmkernel.log file and found this - any ideas?

Once it crashed, I was unable to connect to the server with PuTTY and had to power it off and on with the power button.

2015-06-18T11:04:38.834Z cpu0:35332)MCE: 1082: cpu0: MCA error detected via Polling (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2015-06-18T11:04:38.834Z cpu0:35332)MCE: 180: cpu0: bank4: status=0x94514000b0080a13: (VAL=1, OVFLW=0, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x1a9699910 (valid), Misc:0x0 (invalid)

2015-06-18T11:04:38.834Z cpu0:35332)MCE: 189: cpu0: bank4: MCA fatal error (CE): "Corrected DRAM ECC Error on cpu 0 physical address 0x1a9699910 "

2015-06-18T11:07:33.338Z cpu0:32787)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x1a (0x439d802b1b80, 0) to dev "mpx.vmhba1:C0:T0:L0" on path "vmhba1:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2015-06-18T11:09:57.834Z cpu0:35332)MCE: 1082: cpu0: MCA error detected via Polling (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2015-06-18T11:09:57.834Z cpu0:35332)MCE: 180: cpu0: bank4: status=0x94514000b0080813: (VAL=1, OVFLW=0, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x23089fd90 (valid), Misc:0x0 (invalid)

2015-06-18T11:09:57.834Z cpu0:35332)MCE: 189: cpu0: bank4: MCA fatal error (CE): "Corrected DRAM ECC Error on cpu 0 physical address 0x23089fd90 "

VMB: 49: mbMagic: 2badb002, mbInfo 0x1010f0

VMB: 54: flags a6d

Thanks

jamesarmstrong1
Contributor

From reviewing the kernel log, you have an issue with a machine check exception. What the kernel log is returning looks like a defective ram card or slot causing the crash. I would recommend running a full hardware diagnostics on the host and contacting your hardware vendor to assist with that.

If you have a support contract with VMware, please open a case so that you can upload the screenshot and logs for the event and find the cause of the crash. If power management auto reboot is enabled, the host might not stay on the PSOD (purple screen).

If you have the screen shot of the crashed host post it and I can take a look for you.

Please see this KB for relevant information on the kernel log:

VMware KB: Decoding Machine Check Exception (MCE) output after a purple screen error

Excellence as a service!
MSUKvm
Contributor

Hi there,

When you say "defective ram card" - do you mean one of the memory modules could be faulty?

"I would recommend running a full hardware diagnostics on the host" - could I use something like MemTest86 (MemTest86 - Official Site of the x86 Memory Testing Tool) for this?


The server simply freezes, and pressing a key to wake the screen does nothing. There is no PSOD, and the only way to recover is to power the server off and back on.


Thanks

cykVM
Expert

The entries in your logfile stating "[...] MCA fatal error (CE): "Corrected DRAM ECC Error[...]" point to possibly defective RAM. Even though the error is reported as "Corrected", it might lead to a freeze or crash sooner or later.
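To double-check what the log prints, the status value can be decoded by hand. Below is a minimal sketch in Python, assuming the flag bit positions documented for the IA32_MCi_STATUS register in the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, S=56, AR=55). The S and AR bits in particular are Intel-specific and may mean something else on other CPUs, and the function names are only illustrative.

```python
# Flag bit positions, assumed from the Intel SDM layout of IA32_MCi_STATUS.
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVFLW": 62,  # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # Misc register is valid
    "ADDRV": 58,  # Addr register is valid
    "PCC": 57,    # processor context corrupt
    "S": 56,      # signaled (Intel-specific)
    "AR": 55,     # action required (Intel-specific)
}


def decode_mci_status(status):
    """Split a 64-bit machine check status value into its flags."""
    flags = {name: (status >> bit) & 1 for name, bit in FLAG_BITS.items()}
    flags["MCA_ERROR_CODE"] = status & 0xFFFF  # low 16 bits
    return flags


def is_corrected(flags):
    """A valid error with UC=0 was corrected by the hardware."""
    return flags["VAL"] == 1 and flags["UC"] == 0


flags = decode_mci_status(0x94514000b0080a13)  # first status from the log
print(flags)
print("corrected:", is_corrected(flags))
```

For the first status in the log this gives VAL=1, UC=0 and ADDRV=1, which agrees with what vmkernel printed: a valid, hardware-corrected error with a valid address.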

You can use memtest to check your memory modules. Let it run for quite a while, as some errors only show up once the modules heat up.

If your server has vendor tools for checking the hardware, you can also run those. A brief look at the cooling system might also lead to a solution: are all fans operating normally? It may help to clean the heatsinks and fans.