VMware Cloud Community
KAhnemann
Contributor

VSAN cluster went down and only 3 of 4 nodes came back up

I could use a little help if someone has some guidance they want to provide. 

I have a 4-node hybrid cluster and the entire cluster went down hard today.  The VCSA runs on the vSAN cluster, so it is down as well.  I managed to bring the nodes up one at a time and 3 of them have synced up.  When I run esxcli vsan cluster get on a good node, it shows 3 nodes (master, agent and backup).  This would be fine, but the 4th node won't join, and I think it has data that the cluster needs.  I have about 6 VMs showing as "invalid" in the vSphere Web Client, and one of those is vCenter.

Is there a way to force-add my 4th node to the cluster?  I ran esxcli vsan cluster leave on the isolated node, rebooted, and then ran esxcli vsan cluster join -u with the UUID of the cluster, but it basically formed its own one-node cluster and made itself master.
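
For reference, this is roughly the sequence I ran on the isolated node (with our actual cluster UUID in place of the placeholder):

# esxcli vsan cluster leave
# reboot
# esxcli vsan cluster join -u <cluster UUID>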

I let the 3 node cluster resync this afternoon and it looks healthy, except for the 6 invalid VMs.

I can ping all the VMkernel NICs and there doesn't appear to be a network issue.  I just need this host added back to the cluster, and I bet it will sync up the missing data...
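
To be clear, I tested from the vSAN VMkernel interface itself with something like the below (vmk1 here is just a stand-in for whichever vmk is tagged for vSAN on my hosts):

# vmkping -I vmk1 <vSAN IP of a peer host>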

Thoughts?

5 Replies
TheBobkin
Champion

Hello KAhnemann,

Does the isolated node have a decommission state that differs from its Maintenance Mode state in the vSphere Client? You can check with:

# cmmds-tool find -t NODE_DECOM_STATE -f json

If it is stuck in a decommission state, you can take it out of that state with:

# esxcli vsan maintenancemode cancel

Is this a Multicast or Unicast cluster?

If it is Unicast, then start by ensuring that all nodes have the Unicast information of every other node (but NOT their own) in:

# esxcli vsan cluster unicastagent list

If entries are missing then you can manually populate them:

https://kb.vmware.com/s/article/2150303

"I can ping all the vmkernal nics and there doesn't appear to be a network issue.

Are you checking on the correct required ports though?

You can see which hosts' packets are being received and sent on these ports using:

Multicast:

# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 12345

# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 23451

Unicast:

# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 12321

https://blogs.vmware.com/vsphere/2014/09/virtual-san-networking-guidelines-multicast.html

You should see traffic on these when attempting to cluster leave/join.
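
If you're not sure which vmk to use with tcpdump-uw above, this lists the VMkernel interface(s) tagged for vSAN traffic on the host:

# esxcli vsan network list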

Bob

KAhnemann
Contributor

Thanks for the reply!

Based on the info below, it looks like my working nodes know about the isolated node, but the isolated node doesn't have their info in its unicastagent list.  Do you think I should add them on the isolated node?

ESX01 is part of working cluster

ESX03 is the isolated node

This is what I get on the isolated node

[root@ESX03:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
   "uuid": "5a228402-42a6-8a0d-49dd-a0d3c1f90ec8",
   "owner": "5a228402-42a6-8a0d-49dd-a0d3c1f90ec8",
   "health": "Healthy",
   "revision": "6",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
]
}

Cluster status on isolated node:

[root@ESX03:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-08-19T14:45:50Z
   Local Node UUID: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52594e99-fe07-4f0b-b47b-19299c4b286d
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
   Sub-Cluster Membership UUID: 09d7785b-67e0-dfe6-0a3c-a0d3c1f90ec8
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: None 0 0.0

Unicast info on isolated node (it's blank)

[root@ESX03:~] esxcli vsan cluster unicastagent list

Working node:

[root@ESX01:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
   "uuid": "50d36eb6-eff6-990c-a3d4-2c768a4e7f50",
   "owner": "50d36eb6-eff6-990c-a3d4-2c768a4e7f50",
   "health": "Healthy",
   "revision": "1",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
,{
   "uuid": "5a80cce5-0330-c66c-05c6-a0d3c102285c",
   "owner": "5a80cce5-0330-c66c-05c6-a0d3c102285c",
   "health": "Healthy",
   "revision": "18",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
,{
   "uuid": "5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0",
   "owner": "5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0",
   "health": "Healthy",
   "revision": "5",
   "type": "NODE_DECOM_STATE",
   "flag": "2",
   "minHostVersion": "0",
   "md5sum": "3c2593056659ee3c9e97039a3eefea8e",
   "valueLen": "80",
   "content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
   "errorStr": "(null)"
}
]
}

Cluster status on working nodes:

[root@ESX01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2018-08-19T14:51:11Z
   Local Node UUID: 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0
   Local Node Type: NORMAL
   Local Node State: BACKUP
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5a80cce5-0330-c66c-05c6-a0d3c102285c
   Sub-Cluster Backup UUID: 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0
   Sub-Cluster UUID: 52594e99-fe07-4f0b-b47b-19299c4b286d
   Sub-Cluster Membership Entry Revision: 8
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 5a80cce5-0330-c66c-05c6-a0d3c102285c, 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0, 50d36eb6-eff6-990c-a3d4-2c768a4e7f50
   Sub-Cluster Membership UUID: 1982795b-c83e-939a-f3c1-a0d3c102285c
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: b07a237c-c73a-4b2f-b02b-2d57e0b22c6f 38 2018-08-18T15:41:49.571

Unicast on working node

[root@ESX01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port  Iface Name
------------------------------------  ---------  ----------------  ------------  -----  ----------
50d36eb6-eff6-990c-a3d4-2c768a4e7f50          0              true  172.16.3.154  12321
5a228402-42a6-8a0d-49dd-a0d3c1f90ec8          0              true  172.16.3.153  12321
5a80cce5-0330-c66c-05c6-a0d3c102285c          0              true  172.16.3.152  12321

TheBobkin
Champion

Hello KAhnemann,

Yes, do populate the list on that node.

Are all nodes connected to vCenter and sitting under the vSphere cluster object? If they are, then there should be no need to prevent vCenter from pushing down Unicast agent lists (IIRC an early version of 6.5 U1 vCenter did have issues with this, so you could set this as a precaution if the manually added entries get wiped right away again, e.g. # esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates):

https://blogs.vmware.com/virtualblocks/2017/04/11/goodbye-multicast/
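
Purely as a sketch of that precaution (and remember to revert it once the node has rejoined):

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
# esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
# esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

The -g shows the current value, -s 1 stops vCenter from overwriting the unicastagent list while you populate it manually, and -s 0 puts it back to default afterwards.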

Here is all you should need to recreate the unicastagent list on node ESX03 (assuming the vSAN IPs are assigned sequentially and ESX01 is 172.16.3.151); substitute the vSAN vmk in use for vmk<??>:

# esxcli vsan cluster unicastagent add -t node -u 50d36eb6-eff6-990c-a3d4-2c768a4e7f50 -U true -a 172.16.3.154 -p 12321 -i vmk<??>

# esxcli vsan cluster unicastagent add -t node -u 5a80cce5-0330-c66c-05c6-a0d3c102285c -U true -a 172.16.3.152 -p 12321 -i vmk<??>

# esxcli vsan cluster unicastagent add -t node -u 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0 -U true -a 172.16.3.151 -p 12321 -i vmk<??>
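
Once those are in, check that it took with:

# esxcli vsan cluster unicastagent list
# esxcli vsan cluster get

The Sub-Cluster Member Count on ESX03 should go from 1 to 4 (and the other nodes should list it as a member) once it rejoins.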

Bob

KAhnemann
Contributor

Update:

I added those missing UUIDs to the unicastagent list on my isolated server.  It now looks like it's part of the cluster again and it's resyncing.  There's a ton of data, so I'll let you know how it goes.  So far, it looks like my invalid VMs are recovering!
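
For anyone following along, the resync progress can be watched from any of the hosts with something like this (assuming the build has the esxcli vsan debug namespace, which I believe is 6.6 and later):

# esxcli vsan debug resync list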

Thanks so much for the help!  If you're going to VMworld, I'll buy you a beer!

TheBobkin
Champion

Hello KAhnemann,

Glad to hear that got it clustered properly.

You can use vsan.fix_renamed_vms via RVC if any VMs are showing up named as their working-directory path:

https://www.virten.net/2017/07/vsan-6-6-rvc-guide-part-6-troubleshooting/#vsan-fix_renamed_vms
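
A rough sketch of running it, assuming the VCSA is reachable again (the datacenter path and VM name here are just placeholders to adjust to your environment):

rvc administrator@vsphere.local@<VCSA address>
> vsan.fix_renamed_vms /<VCSA address>/<Datacenter>/vms/<renamed VM>

You can also pass several VMs (or a wildcard under the vms folder) in one go.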

"Thanks so much for the help!  If you're going to VMworld, I'll buy you a beer!"

Happy to help. Unfortunately I won't be attending VMworld US, but if you see a random stranger wearing any vSAN apparel, please do buy them a beer and tell them it's from Bob :smileygrin:

Bob
