VMware Cloud Community
KAhnemann
Contributor

VSAN cluster went down and only 3 of 4 nodes came back up

I could use a little help if someone has some guidance they want to provide. 

I have a 4-node hybrid cluster and the entire cluster went down hard today.  The VCSA runs on the vSAN cluster, so it is down as well.  I managed to bring the nodes up one at a time and 3 of them have synced up.  When I run esxcli vsan cluster get on a good node, it shows 3 nodes (master, agent and backup).  This would be fine, but the 4th node won't join, and I think it has data that the cluster needs.  I have about 6 VMs showing as "invalid" in the vSphere Web Client, and one of those is vCenter.

Is there a way to force-add my 4th node to the cluster?  I ran esxcli vsan cluster leave on the isolated node, rebooted, and then ran esxcli vsan cluster join -u with the UUID of the cluster, but it basically formed its own one-node cluster and made itself master.
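
For reference, this is roughly the sequence I ran on the isolated node (with our actual cluster UUID in place of the placeholder):

# esxcli vsan cluster leave
# reboot
# esxcli vsan cluster join -u <cluster UUID>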

I let the 3 node cluster resync this afternoon and it looks healthy, except for the 6 invalid VMs.

I can ping all the VMkernel NICs and there doesn't appear to be a network issue.  I just need this host added back to the cluster, and I bet it will sync up the missing data...
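
To be clear, I tested from the vSAN VMkernel interface itself with something like the below (vmk1 here is just a stand-in for whichever vmk is tagged for vSAN on my hosts):

# vmkping -I vmk1 <vSAN IP of a peer host>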

Thoughts?

5 Replies
TheBobkin
Champion

Hello KAhnemann,

Does the isolated node have a decommission state that differs from its Maintenance Mode state in the vSphere Client? You can check with:

# cmmds-tool find -t NODE_DECOM_STATE -f json

If it is stuck in a decommission state, you can take it out of that state with:

# esxcli vsan maintenancemode cancel

Is this a Multicast or Unicast cluster?

If it is Unicast, then start by ensuring that all nodes have the Unicast information of every other node (but NOT their own) in:

# esxcli vsan cluster unicastagent list

If entries are missing then you can manually populate them:

https://kb.vmware.com/s/article/2150303

"I can ping all the vmkernal nics and there doesn't appear to be a network issue.

Are you checking on the correct required ports though?

You can see which hosts' packets are being received and sent on these ports using:

Multicast:

# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 12345

# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 23451

Unicast:

# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 12321

https://blogs.vmware.com/vsphere/2014/09/virtual-san-networking-guidelines-multicast.html

You should see traffic on these when attempting to cluster leave/join.
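
If you're not sure which vmk to use with tcpdump-uw above, this lists the VMkernel interface(s) tagged for vSAN traffic on the host:

# esxcli vsan network list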

Bob

KAhnemann
Contributor

Thanks for the reply!

Based on the info below, it looks like my working nodes know about the isolated node, but the isolated node doesn't have their info in its unicastagent list.  Do you think I should add them on the isolated node?

ESX01 is part of working cluster

ESX03 is the isolated node

This is what I get on the isolated node

[root@ESX03:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
   "uuid": "5a228402-42a6-8a0d-49dd-a0d3c1f90ec8",
   "owner": "5a228402-42a6-8a0d-49dd-a0d3c1f90ec8",
   "health": "Healthy",
   "revision": "6",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
]
}

Cluster status on isolated node:

[root@ESX03:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-08-19T14:45:50Z
   Local Node UUID: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52594e99-fe07-4f0b-b47b-19299c4b286d
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
   Sub-Cluster Membership UUID: 09d7785b-67e0-dfe6-0a3c-a0d3c1f90ec8
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: None 0 0.0

Unicast info on isolated node (it's blank)

[root@ESX03:~] esxcli vsan cluster unicastagent list

Working node:

[root@ESX01:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
   "uuid": "50d36eb6-eff6-990c-a3d4-2c768a4e7f50",
   "owner": "50d36eb6-eff6-990c-a3d4-2c768a4e7f50",
   "health": "Healthy",
   "revision": "1",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
,{
   "uuid": "5a80cce5-0330-c66c-05c6-a0d3c102285c",
   "owner": "5a80cce5-0330-c66c-05c6-a0d3c102285c",
   "health": "Healthy",
   "revision": "18",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
,{
   "uuid": "5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0",
   "owner": "5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0",
   "health": "Healthy",
   "revision": "5",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
]
}

Cluster status on working nodes:

[root@ESX01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-08-19T14:51:11Z
   Local Node UUID: 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0
   Local Node Type: NORMAL
   Local Node State: BACKUP
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5a80cce5-0330-c66c-05c6-a0d3c102285c
   Sub-Cluster Backup UUID: 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0
   Sub-Cluster UUID: 52594e99-fe07-4f0b-b47b-19299c4b286d
   Sub-Cluster Membership Entry Revision: 8
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 5a80cce5-0330-c66c-05c6-a0d3c102285c, 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0, 50d36eb6-eff6-990c-a3d4-2c768a4e7f50
   Sub-Cluster Membership UUID: 1982795b-c83e-939a-f3c1-a0d3c102285c
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: b07a237c-c73a-4b2f-b02b-2d57e0b22c6f 38 2018-08-18T15:41:49.571

Unicast on working node

[root@ESX01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port  Iface Name
------------------------------------  ---------  ----------------  ------------  -----  ----------
50d36eb6-eff6-990c-a3d4-2c768a4e7f50          0              true  172.16.3.154  12321
5a228402-42a6-8a0d-49dd-a0d3c1f90ec8          0              true  172.16.3.153  12321
5a80cce5-0330-c66c-05c6-a0d3c102285c          0              true  172.16.3.152  12321

TheBobkin
Champion

Hello KAhnemann,

Yes, do populate the list on that node.

Are all nodes connected to vCenter and sitting under the vSphere cluster object? If they are, then there should be no need to prevent vCenter from pushing down Unicast agent lists (IIRC an early version of 6.5 U1 vCenter did have issues with this, so you could set this as a precaution if the manually added entries get wiped right away again, e.g. # esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates):

https://blogs.vmware.com/virtualblocks/2017/04/11/goodbye-multicast/
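
Purely as a sketch of that precaution (and remember to revert it once the node has rejoined):

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
# esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
# esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

The -g shows the current value, -s 1 stops vCenter from overwriting the unicastagent list while you populate it manually, and -s 0 puts it back to default afterwards.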

Here is all you should need to recreate the unicastagent list on node ESX03 (assuming the vSAN IPs are assigned sequentially and ESX01 is 172.16.3.151); substitute the vSAN vmk in use for vmk<??>:

# esxcli vsan cluster unicastagent add -t node -u 50d36eb6-eff6-990c-a3d4-2c768a4e7f50 -U true -a 172.16.3.154 -p 12321 -i vmk<??>

# esxcli vsan cluster unicastagent add -t node -u 5a80cce5-0330-c66c-05c6-a0d3c102285c -U true -a 172.16.3.152 -p 12321 -i vmk<??>

# esxcli vsan cluster unicastagent add -t node -u 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0 -U true -a 172.16.3.151 -p 12321 -i vmk<??>
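
Once those are in, check that it took with:

# esxcli vsan cluster unicastagent list
# esxcli vsan cluster get

The Sub-Cluster Member Count on ESX03 should go from 1 to 4 (and the other nodes should list it as a member) once it rejoins.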

Bob

KAhnemann
Contributor

Update:

I added those missing UUIDs to the unicastagent list on my isolated server.  It now looks like it's part of the cluster again and it's resyncing.  There's a ton of data, so I'll let you know how it goes.  So far, it looks like my invalid VMs are recovering!
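
For anyone following along, the resync progress can be watched from any of the hosts with something like this (assuming the build has the esxcli vsan debug namespace, which I believe is 6.6 and later):

# esxcli vsan debug resync list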

Thanks so much for the help!  If you're going to VMworld, I'll buy you a beer!

TheBobkin
Champion

Hello KAhnemann,

Glad to hear that got it clustered properly.

You can use vsan.fix_renamed_vms via RVC if any VMs are showing up named as their working-directory path:

https://www.virten.net/2017/07/vsan-6-6-rvc-guide-part-6-troubleshooting/#vsan-fix_renamed_vms
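
A rough sketch of running it, assuming the VCSA is reachable again (the datacenter path and VM name here are just placeholders to adjust to your environment):

rvc administrator@vsphere.local@<VCSA address>
> vsan.fix_renamed_vms /<VCSA address>/<Datacenter>/vms/<renamed VM>

You can also pass several VMs (or a wildcard under the vms folder) in one go.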

"Thanks so much for the help!  If you're going to VMworld, I'll buy you a beer!"

Happy to help. Unfortunately I won't be attending VMworld US, but if you see a random stranger wearing any vSAN apparel, please do buy them a beer and tell them it's from Bob :smileygrin:

Bob
