We recently had a fibre channel storage issue (it looks like it was a single bad cable) that affected all 360 VMs in two clusters attached to the same storage virtualisation device, an IBM SVC 2145. The VMs were so slow to respond that they were unusable, and many were logging symmpi errors in the Windows event logs.
VMware responded with the obvious - "storage issue" - but our storage team is adamant there was no problem with their equipment or zoning. I need to know how one single faulty cable could effectively bring down all VMs in 2 separate clusters.
Has anyone had a similar problem, or able to shed any light?
PS all hosts running vSphere update 1, with patches to December 2009.
Edit: Physical servers attached to the same SVD were apparently unaffected.
Hi,
do you share the LUNs between both clusters?
Perhaps you saw SCSI reservation issues during that timeframe?
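One way to check this after the fact is to scan the vmkernel log for reservation conflicts during the outage window. A minimal sketch follows; the sample log lines are purely illustrative (not real ESX output), and on a host you would read them from the vmkernel log file instead:

```python
# Hedged sketch: count SCSI reservation conflict entries in a log.
# The sample text below is made up for illustration; on ESX you would
# read the actual vmkernel log (e.g. via open("/var/log/vmkernel")).

sample_log = """\
Jan 12 09:14:02 vmkernel: SCSI reservation conflict on vmhba1:0:3
Jan 12 09:14:05 vmkernel: completed I/O on vmhba1:0:3
Jan 12 09:14:07 vmkernel: SCSI reservation conflict on vmhba1:0:7
"""

# Case-insensitive match, since exact capitalisation varies by build.
conflicts = [line for line in sample_log.splitlines()
             if "reservation conflict" in line.lower()]

print(f"{len(conflicts)} reservation conflict entries found")
```

A burst of such entries clustered in the outage window would point towards reservation contention rather than a pure bandwidth problem.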
I agree with Anatoly, usually FC is rock solid.
But there are some possible scenarios that could interfere with the SCSI I/O flow, and these can be verified by checking the SAN switch port statistics.
Imagine the following "rare condition":
FC wants to send a frame, but due to a bad connection the frame content and the frame checksum differ on the recipient side.
Because the sender does not receive an acknowledgement, it has to retransmit the frame (again and again) until content and checksum match.
This can happen several hundred times without any problem being visible at the SCSI level, simply because SAN switches work on a nanosecond scale while SCSI timeout values are configured in milliseconds or seconds.
But it would certainly have an impact on I/O performance.
Even worse, it could cause SCSI reservations to stay active much longer than planned.
And only the host that owns the reservation can perform I/O to the locked LUN.
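The scale argument above can be made concrete with a back-of-the-envelope model. All numbers below are assumptions for illustration (frame time, per-retry overhead), not measurements from this environment; the point is only that even hundreds of retransmissions per frame stay far below typical SCSI timeouts while still wrecking latency:

```python
# Illustrative model (assumed numbers): latency of a single frame that
# needs N retransmissions due to CRC errors on a bad link.

FRAME_TIME_US = 2.0      # assumed: ~2 us to transmit one 2 KB frame at 8 Gb/s
RETRY_COST_US = 50.0     # assumed: per-retry detection/turnaround overhead

def effective_latency_us(retries: int) -> float:
    """Total time for one frame delivered after `retries` failed attempts."""
    return FRAME_TIME_US + retries * (FRAME_TIME_US + RETRY_COST_US)

for retries in (0, 10, 100, 500):
    ms = effective_latency_us(retries) / 1000.0
    print(f"{retries:4d} retries -> {ms:.3f} ms per frame")
```

Even at 500 retries the frame completes in tens of milliseconds, well inside a SCSI timeout of 30+ seconds, so nothing errors out; the I/O just crawls, which matches the "unusable but not dead" symptom in the original post.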
Hope this helps a bit.
Greetings from Germany. (CET)
You could be oversubscribing the links. Perhaps take a look at traffic on the FC network: is it slammed? If so, a single link being down could mean high latency for disk I/O.
VKernel has some good software for locating bottlenecks within a VMware cluster. It could shed some light. Hope this helps.
I had a similar issue about a month ago. I'm using iSCSI storage, and I have two different SANs that I'm connecting to. One of my LUNs went down on my tier 2 storage, but all of my VMs on my tier 1 storage went down like yours did. I don't remember where I found the answer, but there is a critical patch released on 1/5/10 http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101629... that is supposed to resolve the issue of losing connections to all LUNs when one goes into an APD (all-paths-down) state. Similar behaviour occurs when removing LUNs from your storage. http://kennethvanditmarsch.wordpress.com/2009/12/02/vsphere-freezing-vms-after-deleting-a-volume-fro...
Hope this helps in your search.
Have just realised I never posted a follow-up to this thread. Apologies, and thanks for your replies.
The root cause was never positively identified, despite escalation to management within VMware and our storage vendor. Best guess is that it was the APD issue combined with improper zoning.
Cheers.