VMware Cloud Community
derikp
Contributor

ESXI Cluster dead in the water?

We have a 5-host ESXi cluster using vSAN and vCenter 6. Each host is hyper-converged and all-SSD: 2 cache drives, and data drives for each cache drive. To make a long story short, our power went out in our second data center. We were not able to do anything except shut the VMs down so that no writes would happen to them. Unfortunately we could not put the hosts in maintenance mode, nor did we get the hosts shut down. The issue now is that upon power restoration the servers started back up on their own, but all seems to be lost. The cache drives are showing a "normal, degraded" state on each ESXi host, and I can't find anywhere what exactly that means or whether it is even part of the issue. We did notice that some of the drives' LED lights were not on and the drives were not fully initialized, so we popped the drives out one at a time and let everything sit overnight to see if it would rebuild. This morning nothing has happened, and it looks like the disks aren't doing much. We have a support ticket in with VMware, but thought I would post here as well.

If anybody has any ideas how we can get up and running again that would be amazing.

Update: 03-02-2017 4:46pm. We ended up getting hold of support, and they had to increase the heap size in LSOM. After that it was a game of hurry up and wait, but we are finally back up and running.

3 Replies
MBrownWFP
Enthusiast

What is the status of the physical disks and disk groups when viewed in cluster > Manage > Settings > Virtual SAN Disk Management?

What is the VSAN Health service reporting? (cluster > Monitor > Virtual SAN > Health)

Working this with VMware support is your best bet but the info requested above may help others in the community to provide feedback on initial diagnosis and troubleshooting.

Matt

srodenburg
Expert

Glad to read it worked out for you. After 2 years of vSAN experience (at 6.5 at the moment), I have noticed that each time a host dies, a disk dies, or a power outage takes everything down, vSAN has issues. I never lost data (hats off for that), but boy oh boy, there will be a steaming pile of donkey-doodoo to clean up: inaccessible objects, other "half-dead stuff" that vSAN cannot repair by itself, etc.

In short, vSAN is fine and "simple storage" until a piece of hardware craps out. Then you are screwed (most of the time). Folks who are not as tech-savvy will need help straightening things out, because messing about in RVC is not for everybody. Neither is objtool on an ESXi command line. From that perspective, vSAN is a long way from proper, good-quality traditional storage arrays and their self-healing capabilities. vSAN is hardly self-healing and really needs to be improved in that area.

TheBobkin
Champion

If anyone is interested, here is the method for configuring this:

# esxcfg-advcfg -s 1024 /LSOM/heapSize

(Default is 256)

It is configured on a per-host basis.
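Since the setting is per-host, a small wrapper can help apply it consistently across a cluster. The following is only a rough sketch: the host names are hypothetical placeholders, it assumes SSH is enabled on each host, and it defaults to a dry run that only prints the commands. `esxcfg-advcfg -g` reads the current value and `-s` sets it; the value 1024 is the one quoted above. Changing this on a production cluster is best done under guidance from VMware support.

```shell
#!/bin/sh
# Sketch only: apply the LSOM heap size on several ESXi hosts.
# HOSTS holds hypothetical host names; replace them with your own.
HOSTS="${HOSTS:-esx01 esx02 esx03 esx04 esx05}"
HEAP_SIZE="${HEAP_SIZE:-1024}"   # value from the post above (default is 256)
DRY_RUN="${DRY_RUN:-1}"          # 1 = print the commands instead of running them

# Run (or, in dry-run mode, just print) a command on one host over SSH.
run_on_host() {
    host="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh root@$host $*"
    else
        ssh "root@$host" "$@"
    fi
}

# For every host: show the current value, then set the new one.
apply_heap_size() {
    for host in $HOSTS; do
        run_on_host "$host" esxcfg-advcfg -g /LSOM/heapSize
        run_on_host "$host" esxcfg-advcfg -s "$HEAP_SIZE" /LSOM/heapSize
    done
}

apply_heap_size
```

Run it as-is to review the commands, then set DRY_RUN=0 once the host list is correct.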

In my experience it can sometimes help get disks back into CMMDS after a reboot (and thus the SSD initialization phase), in cases where without this setting the disks would still be in an unmanaged state after the reboot.
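To see which disks have not made it back into CMMDS, the output of `esxcli vsan storage list` can be filtered. This is a minimal sketch that assumes the usual "Key: value" layout of that output, with a "Device" line and an "In CMMDS" line for each disk; the function name `not_in_cmmds` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: read `esxcli vsan storage list` output on stdin and print
# the devices whose "In CMMDS" field is false.
# Assumes each disk section has "Device: <name>" and "In CMMDS: true|false".
not_in_cmmds() {
    awk -F': ' '
        {
            key = $1; val = $2
            sub(/^[ \t]+/, "", key)      # drop the indentation before the key
            sub(/[ \t\r]+$/, "", val)    # drop trailing whitespace after the value
        }
        key == "Device" { dev = val }    # remember the disk this section describes
        key == "In CMMDS" && val == "false" { print dev }
    '
}

# Usage on a host (or against saved output):
#   esxcli vsan storage list | not_in_cmmds
```

An empty result would mean every listed disk reports itself as in CMMDS, under the layout assumption above.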

The only external KB article that references this is a tad useless when it comes to further info on the setting, though:

https://kb.vmware.com/kb/2146495

