I could use a little help if someone has some guidance they want to provide.
I have a 4-node hybrid cluster and the entire cluster went down hard today. The VCSA runs on the vSAN cluster, so it is down as well. I managed to bring the nodes up one at a time and 3 of them have synced up. When I run esxcli vsan cluster get on a good node, it shows 3 nodes (master, agent, and backup). That would be fine, but the 4th node won't join, and I think it holds data the cluster needs. I have about 6 VMs showing as "invalid" in the vSphere Web Client. One of those is vCenter.
Is there a way to force-add my 4th node to the cluster? I ran esxcli vsan cluster leave on the isolated node, rebooted, and then ran esxcli vsan cluster join -u with the UUID of the cluster, but it basically formed its own single-node cluster and elected itself master.
I let the 3 node cluster resync this afternoon and it looks healthy, except for the 6 invalid VMs.
I can ping all the VMkernel NICs, so there doesn't appear to be a network issue. I just need this host added back to the cluster; I expect it will then sync up the missing data...
Thoughts?
Hello KAhnemann,
Does the isolated node have a decom state that differs from its Maintenance Mode state in the vSphere Client?
# cmmds-tool find -t NODE_DECOM_STATE -f json
You can take it out of this state using:
# esxcli vsan maintenancemode cancel
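If it helps, here is a quick way to pull just the decomState values out of that JSON. This is only a sketch with a sample JSON string inlined; on a host you would pipe the live cmmds-tool output instead, and it assumes grep -o is available in the ESXi shell:

```shell
# Extract only the per-node decomState fields from cmmds-tool's JSON output.
# Sample data inlined for illustration; on a host you would run:
#   cmmds-tool find -t NODE_DECOM_STATE -f json | grep -o '"decomState": [0-9]*'
cmmds_json='{"entries":[{"uuid":"5a228402-42a6-8a0d-49dd-a0d3c1f90ec8","content":{"decomState": 0, "decomJobType": 0}}]}'
echo "$cmmds_json" | grep -o '"decomState": [0-9]*'
# → "decomState": 0   (0 should mean the node is not in a decommission state)
```

Any node reporting a non-zero decomState while the vSphere Client says it is not in Maintenance Mode is worth a closer look.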
Is this a Multicast or Unicast cluster?
If it is Unicast then start by ensuring that every node has all the other nodes' (but NOT its own) unicast information in:
# esxcli vsan cluster unicastagent list
If entries are missing then you can manually populate them:
https://kb.vmware.com/s/article/2150303
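As a rough sketch of what that KB boils down to: take the UUID/IP/port rows from a healthy node's unicastagent list and turn each peer row into an add command to run on the node whose list is empty. Everything below is a placeholder (the UUIDs, IPs, and vmk1 are sample values, not anything from a real cluster):

```shell
# Sketch: generate 'unicastagent add' commands for a node with an empty list,
# from the (uuid, ip, port) rows seen on a healthy node.
# LOCAL_UUID is the UUID of the node being fixed; skip it, because a node
# must never list its own entry. vmk1 must match your vSAN vmk.
LOCAL_UUID="aaaaaaaa-0000-0000-0000-000000000003"
list='aaaaaaaa-0000-0000-0000-000000000001 192.168.1.151 12321
aaaaaaaa-0000-0000-0000-000000000002 192.168.1.152 12321
aaaaaaaa-0000-0000-0000-000000000003 192.168.1.153 12321'

echo "$list" | while read -r uuid ip port; do
  # Never add a node's own entry to its own list.
  [ "$uuid" = "$LOCAL_UUID" ] && continue
  echo "esxcli vsan cluster unicastagent add -t node -u $uuid -U true -a $ip -p $port -i vmk1"
done
```

One caveat: a healthy node does not list itself either, so that node's own UUID/IP pair has to be added by hand on top of whatever its list shows.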
"I can ping all the VMkernel NICs and there doesn't appear to be a network issue."
Are you checking on the correct required ports though?
You can see which hosts packets are being sent to and received from on these ports using:
Multicast:
# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 12345
# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 23451
Unicast:
# tcpdump-uw -i <vSAN vmk> -n -s0 udp port 12321
https://blogs.vmware.com/vsphere/2014/09/virtual-san-networking-guidelines-multicast.html
You should see traffic on these when attempting to cluster leave/join.
Bob
Thanks for the reply!
Based on the info below, it looks like my working nodes know about the isolated node, but the isolated node doesn't have their info in its unicast agent list. Do you think I should add it on the isolated node?
ESX01 is part of working cluster
ESX03 is the isolated node
This is what I get on the isolated node
[root@ESX03:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
"uuid": "5a228402-42a6-8a0d-49dd-a0d3c1f90ec8",
"owner": "5a228402-42a6-8a0d-49dd-a0d3c1f90ec8",
"health": "Healthy",
"revision": "6",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
]
}
Cluster status on isolated node:
[root@ESX03:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2018-08-19T14:45:50Z
Local Node UUID: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52594e99-fe07-4f0b-b47b-19299c4b286d
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 5a228402-42a6-8a0d-49dd-a0d3c1f90ec8
Sub-Cluster Membership UUID: 09d7785b-67e0-dfe6-0a3c-a0d3c1f90ec8
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: None 0 0.0
Unicast info on isolated node (it's blank)
[root@ESX03:~] esxcli vsan cluster unicastagent list
Working node:
[root@ESX01:~] cmmds-tool find -t NODE_DECOM_STATE -f json
{
"entries":
[
{
"uuid": "50d36eb6-eff6-990c-a3d4-2c768a4e7f50",
"owner": "50d36eb6-eff6-990c-a3d4-2c768a4e7f50",
"health": "Healthy",
"revision": "1",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
,{
"uuid": "5a80cce5-0330-c66c-05c6-a0d3c102285c",
"owner": "5a80cce5-0330-c66c-05c6-a0d3c102285c",
"health": "Healthy",
"revision": "18",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
,{
"uuid": "5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0",
"owner": "5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0",
"health": "Healthy",
"revision": "5",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
]
}
Cluster status on working nodes:
[root@ESX01:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2018-08-19T14:51:11Z
Local Node UUID: 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0
Local Node Type: NORMAL
Local Node State: BACKUP
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 5a80cce5-0330-c66c-05c6-a0d3c102285c
Sub-Cluster Backup UUID: 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0
Sub-Cluster UUID: 52594e99-fe07-4f0b-b47b-19299c4b286d
Sub-Cluster Membership Entry Revision: 8
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 5a80cce5-0330-c66c-05c6-a0d3c102285c, 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0, 50d36eb6-eff6-990c-a3d4-2c768a4e7f50
Sub-Cluster Membership UUID: 1982795b-c83e-939a-f3c1-a0d3c102285c
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: b07a237c-c73a-4b2f-b02b-2d57e0b22c6f 38 2018-08-18T15:41:49.571
Unicast on working node
[root@ESX01:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name
------------------------------------ --------- ---------------- ------------ ----- ----------
50d36eb6-eff6-990c-a3d4-2c768a4e7f50 0 true 172.16.3.154 12321
5a228402-42a6-8a0d-49dd-a0d3c1f90ec8 0 true 172.16.3.153 12321
5a80cce5-0330-c66c-05c6-a0d3c102285c 0 true 172.16.3.152 12321
Hello KAhnemann,
Yes, do populate the list on that node.
Are all nodes connected to vCenter and sitting under the vSphere cluster object? If they are, there should be no need to stop vCenter from pushing down unicast agent lists. (IIRC an early version of 6.5 U1 vCenter did have issues with this, so as a precaution, if your manually added entries get wiped right away again, you can stop those updates with e.g. # esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates):
https://blogs.vmware.com/virtualblocks/2017/04/11/goodbye-multicast/
Here is all you should need to recreate the unicastagent list on node ESX03 (assuming linearly assigned IPs and that ESX01 is 172.16.3.151); substitute the vSAN vmk in use:
# esxcli vsan cluster unicastagent add -t node -u 50d36eb6-eff6-990c-a3d4-2c768a4e7f50 -U true -a 172.16.3.154 -p 12321 -i vmk<??>
# esxcli vsan cluster unicastagent add -t node -u 5a80cce5-0330-c66c-05c6-a0d3c102285c -U true -a 172.16.3.152 -p 12321 -i vmk<??>
# esxcli vsan cluster unicastagent add -t node -u 5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0 -U true -a 172.16.3.151 -p 12321 -i vmk<??>
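After running those, a quick sanity check is worth doing on each node: its own UUID (the Local Node UUID from esxcli vsan cluster get) must not appear in its unicastagent list, while every peer's must. A small sketch with the values from this thread inlined (on a host you would feed it the real esxcli output instead):

```shell
# Sanity-check sketch: a node must list every peer but never itself.
# Sample values stand in for 'esxcli vsan cluster get' / 'unicastagent list'.
LOCAL_UUID="5a228402-42a6-8a0d-49dd-a0d3c1f90ec8"   # ESX03's Local Node UUID
unicast_list="50d36eb6-eff6-990c-a3d4-2c768a4e7f50
5a80cce5-0330-c66c-05c6-a0d3c102285c
5a2291d6-76fd-55e0-0c33-a0d3c1f7d4c0"

if echo "$unicast_list" | grep -q "$LOCAL_UUID"; then
  echo "WARNING: node lists itself - remove that entry"
else
  echo "OK: own UUID absent, $(echo "$unicast_list" | grep -c -- -) peer entries present"
fi
```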
Bob
Update:
I added those missing UUIDs to the unicast agent list on my isolated server. It now looks like it's part of the cluster again and it's resyncing. There's a ton of data... I'll let you know how it goes. So far my invalid VMs are looking better!
Thanks so much for the help! If you're going to VMworld, I'll buy you a beer!
Hello KAhnemann,
Glad to hear that got it clustered properly.
You can use vsan.fix_renamed_vms via RVC if any VMs show up renamed to their working-directory path:
https://www.virten.net/2017/07/vsan-6-6-rvc-guide-part-6-troubleshooting/#vsan-fix_renamed_vms
"Thanks so much for the help! If you're going to VMworld, I'll buy you a beer!"
Happy to help! Unfortunately I won't be attending VMworld US, but if you see a random stranger wearing any vSAN apparel, do please buy them a beer and tell them it's from Bob :smileygrin:
Bob