Hi everybody!
I have two hosts running ESXi 5.1 hypervisor. I manage both of them using vCenter. Everything is fine at this point. The issues comes when I create an HA Cluster and I move the hosts inside the cluster. In that moment HA is configured on the hosts and few minutes later a purple screen crash occur on both of them, not at the same time but they always crashes. The message wrote on the top of the purple screen is always like this:
#PF Exception 14 in world 8218:idle26 IP 0x418027ab7e7f addr 0x0 (The world value and CPU vary, but always is the same exception)
I analyzed the core dump file after the purple screen event and they always report some error when they try to access to the local disc just before the fail event.
2015-11-11T10:15:12.386Z cpu15:8667)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x41240bcea780, 0) to dev "device-name" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2015-11-11T10:15:12.386Z cpu15:8667)ScsiDeviceIO: 2331: Cmd(0x41240bcea780) 0x1a, CmdSN 0x1323 from world 0 to dev "device-name" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2015-11-11T10:20:12.385Z cpu8:8200)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x4124004081c0, 0) to dev "device-name" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2015-11-11T10:20:12.385Z cpu8:8200)ScsiDeviceIO: 2331: Cmd(0x4124004081c0) 0x1a, CmdSN 0x135b from world 0 to dev "device-name" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2015-11-11T10:25:12.384Z cpu10:8202)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x41240c326440, 0) to dev "device-name" on path "vmhba1:C0:T32:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2015-11-11T10:25:12.384Z cpu10:8202)ScsiDeviceIO: 2331: Cmd(0x41240c326440) 0x1a, CmdSN 0x138a from world 0 to dev "device-name" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
The path "vmhba1:C0:T32:L0" referred in the error is the path to local disc. But here there is something weird. Because I have a Syslog Collector in my environment too, and when the hosts were outside the HA cluster, they reported the same, and they never had failed before. Do Anyone have some idea about what's happening?
Hi Mark,
You say the host crashes with a PSOD correct?
Can you please share the core dump logging? If you have a screenshot of the PSOD that would be great too.
Quick heads up: cd var/core should give you the core dump for the host.
Suhas
I'm sure there is something wrong with HA agent installed in the hosts, I've disabled Host Monitoring under cluster settings three hours ago and the hosts are working properly. Of course, that's not the solution, but it's an important point to think about. Now I'm analyzing the HA log file located in /var/log/fdm.log, maybe there is something useful there.