ESXi 5.1 host craches few minutes after HA is conf...

markdom · ‎11-12-2015

Hi everybody!

I have two hosts running ESXi 5.1 hypervisor. I manage both of them using vCenter. Everything is fine at this point. The issues comes when I create an HA Cluster and I move the hosts inside the cluster. In that moment HA is configured on the hosts and few minutes later a purple screen crash occur on both of them, not at the same time but they always crashes. The message wrote on the top of the purple screen is always like this:

#PF Exception 14 in world 8218:idle26 IP 0x418027ab7e7f addr 0x0 (The world value and CPU vary, but always is the same exception)

I analyzed the core dump file after the purple screen event and they always report some error when they try to access to the local disc just before the fail event.

2015-11-11T10:15:12.386Z cpu15:8667)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x41240bcea780, 0) to dev "device-name" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2015-11-11T10:15:12.386Z cpu15:8667)ScsiDeviceIO: 2331: Cmd(0x41240bcea780) 0x1a, CmdSN 0x1323 from world 0 to dev "device-name" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2015-11-11T10:20:12.385Z cpu8:8200)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x4124004081c0, 0) to dev "device-name" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2015-11-11T10:20:12.385Z cpu8:8200)ScsiDeviceIO: 2331: Cmd(0x4124004081c0) 0x1a, CmdSN 0x135b from world 0 to dev "device-name" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2015-11-11T10:25:12.384Z cpu10:8202)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x41240c326440, 0) to dev "device-name" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2015-11-11T10:25:12.384Z cpu10:8202)ScsiDeviceIO: 2331: Cmd(0x41240c326440) 0x1a, CmdSN 0x138a from world 0 to dev "device-name" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

The path "vmhba1:C0:T32:L0" referred in the error is the path to local disc. But here there is something weird. Because I have a Syslog Collector in my environment too, and when the hosts were outside the HA cluster, they reported the same, and they never had failed before. Do Anyone have some idea about what's happening?

SavkoorSuhas · ‎11-12-2015

Hi Mark,

You say the host crashes with a PSOD correct?

Can you please share the core dump logging? If you have a screenshot of the PSOD that would be great too.

Quick heads up: cd var/core should give you the core dump for the host.

Suhas

If you found this or any other answer useful please consider the use of the Helpful or Correct buttons to award points.

Don't Backup. Go Forward!
Rubrik

markdom · ‎11-12-2015

The servers have crashed several times. I'll attach the files for the last one. Thanks for your help.

markdom · ‎11-12-2015

This are the files for the second host

markdom · ‎11-13-2015

I'm sure there is something wrong with HA agent installed in the hosts, I've disabled Host Monitoring under cluster settings three hours ago and the hosts are working properly. Of course, that's not the solution, but it's an important point to think about. Now I'm analyzing the HA log file located in /var/log/fdm.log, maybe there is something useful there.

All

ESXi 5.1 host craches few minutes after HA is configured on it