Solved: Re: Recovering vCenter 7

Dr_Virt · ‎07-21-2023

Had a lab vCenter crash and am trying to figure out why.

Current symptoms:

* Starting vCenter from the shell with "service-control --start --all" fails because vPostgres couldn't start.

* Starting vPostgres manually ("service-control --start vmware-vpostgres") and then starting vCenter ("service-control --start --all") fails because vpxd-svcs failed to start.

* I logged into the vCenter database (VCDB) and verified the administrator account.

* I reset the vCenter certificates and validated them with lsdoctor.

* vpxd.log shows "Failed to connect to Authz service" and "Failed to initialize authorizeManager".
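
For anyone reproducing this, the log check can be scripted. This is only a rough sketch: scan_vpxd_log is a hypothetical helper name, and the default path assumes the usual VCSA log location (/var/log/vmware/vpxd/vpxd.log).

```shell
# Hypothetical helper: print the vpxd.log lines that match the two
# errors quoted above. Pass a different log file as $1 if needed.
scan_vpxd_log() {
    log_file="${1:-/var/log/vmware/vpxd/vpxd.log}"
    grep -E 'Failed to connect to Authz service|Failed to initialize authorizeManager' "$log_file"
}

# Typical sequence on the appliance (left commented out here):
#   service-control --start vmware-vpostgres
#   service-control --start --all
#   scan_vpxd_log
```

grep exits non-zero when neither message is present, so the helper can also be used in an if statement.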

Anyone seen something like this?

Dr_Virt · ‎08-02-2023

Well, I was able to recover. VMware sent a certificate tool (vCert) which identified some trust issues and registrations that the standard tools didn't address.

Then I found an issue with the logging setup within the Tomcat instance. I commented out the "isAccessLogCreated" and "accessLogCleaner" beans in the Tomcat config.

I also had to manually rebuild the vPostgres certificate store.

I restarted the services and got the core up and running. I got a good snapshot of the VCSA. I attempted a VCSA backup, and it failed every time, so I decided to attempt an upgrade to repair the VCSA. It took about 2 hours, but the upgrade from 7.0.3f to g completed. I continued to walk the update path all the way to the latest 7.0.3 release.

I tested the VCSA backup and it ran successfully.

I tested the Tomcat by uncommenting the previously commented out beans. It ran successfully.

In summary, there was corruption at multiple points within the VCSA. With the help from here and from VMware, I was able to recover it. Thank you all.

hirschinho · ‎07-24-2023

Hi,

This looks like a certificate issue.

Have you checked all certificates with:

for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list | grep -v TRUSTED_ROOT_CRLS); do echo "[*] Store :" $store; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $store --text | grep -ie "Alias" -ie "Not After";done;

Does lsdoctor say all is good after a trustfix?

Dr_Virt · ‎07-24-2023

@hirschinho

After running the suggested code, all certs are dated 2025 and beyond. There is a BACKUP_STORE cert for __MACHINE_CERT dated December 2022, but my understanding was that those are inactive.
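
For what it's worth, the date comparison does not have to be done by eye. Here is a minimal sketch, assuming GNU date and the "Alias" / "Not After" line format printed by the loop above. flag_expired is a hypothetical helper name, and it flags BACKUP_STORE entries too, so those still need to be judged separately.

```shell
# Hypothetical helper: read "Alias" / "Not After" lines on stdin and
# print every entry whose Not After date lies in the past.
# Assumes GNU date (date -d) and openssl-style date strings.
flag_expired() {
    now=$(date +%s)
    alias_name=""
    while IFS= read -r line; do
        case "$line" in
            *Alias*:*)
                # Keep everything after the first colon as the alias.
                alias_name=$(printf '%s\n' "$line" | sed 's/^[^:]*:[[:space:]]*//')
                ;;
            *"Not After"*:*)
                not_after=$(printf '%s\n' "$line" | sed 's/^[^:]*:[[:space:]]*//')
                # Skip lines whose date cannot be parsed.
                expiry=$(date -d "$not_after" +%s 2>/dev/null) || continue
                if [ "$expiry" -lt "$now" ]; then
                    printf 'EXPIRED: %s (%s)\n' "${alias_name:-unknown}" "$not_after"
                fi
                ;;
        esac
    done
}

# Usage on the appliance (left commented out here), for one store:
#   /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text \
#     | grep -ie "Alias" -ie "Not After" | flag_expired
```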

LSDOCTOR shows all good.

hirschinho · ‎07-24-2023

Can you verify that the hostname matches the certificate? This command returns the hostname:

/usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost
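
One way to compare that value against the certificate is sketched below. It assumes openssl is available on the appliance and that the machine SSL certificate sits in the MACHINE_SSL_CERT store under the alias __MACHINE_CERT (the usual location). san_has_name is a hypothetical helper name, not part of any VMware tool.

```shell
# Hypothetical helper: read `openssl x509 -noout -text` output on stdin
# and succeed if the given name appears as a DNS entry in the SAN list.
# The comparison is case-insensitive and matches the whole name only.
san_has_name() {
    name=$1
    grep -o 'DNS:[^,[:space:]]*' | cut -d: -f2 | grep -qixF -- "$name"
}

# Usage on the appliance (left commented out here):
#   pnid=$(/usr/lib/vmware-vmafd/bin/vmafd-cli get-pnid --server-name localhost)
#   /usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT \
#     | openssl x509 -noout -text | san_has_name "$pnid" && echo "PNID found in SAN"
```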

Dr_Virt · ‎07-24-2023

@hirschinho

Yes. It returns the FQDN of the VCSA.

hirschinho · ‎07-24-2023

Which build of vCenter are you running?

How did you reset the certificates? With

/usr/lib/vmware-vmca/bin/certificate-manager

and option 8?

If not, please do it with option 8.

Dr_Virt · ‎07-24-2023

@hirschinho

VCSA - 7.0.3.01000
BUILD - 20395099

Yes, it was the Certificate Manager with option 8.

Dr_Virt · ‎07-24-2023

Got a new error:

VPXD - Failed to read X509 cert

hirschinho · ‎07-24-2023

Try this KB

https://kb.vmware.com/s/article/76719

maksym007 · ‎07-24-2023

I would reset the certificate to the default VMware certificate and then create a new CSR.

Dr_Virt · ‎07-24-2023

@hirschinho

First, thank you for all of the assistance.

I have executed that KB. The STS was in good standing, but I replaced it anyway.

Dr_Virt · ‎07-24-2023

@maksym007

All certificates are VMware self-signed certificates.

maksym007 · ‎07-24-2023

What about this option?

https://kb.vmware.com/s/article/82332

Dr_Virt · ‎07-24-2023

@maksym007

All certs are in good standing and the STS was replaced today.

Dr_Virt · ‎07-24-2023

There is something with vPostgres and the certificates. When attempting to start vPostgres on its own, there is a long list of messages about trying to build the root_crl.pem file. It makes many requests to the auth service, but eventually fails.

Dr_Virt · ‎07-24-2023

Well, I can get most of the services up, but the vSphere-UI just won't play nice.

maksym007 · ‎07-25-2023

I wonder what is causing these issues.

Dr_Virt · ‎07-26-2023

Anyone know if we can just deploy a new vCenter and have it discover or reregister the existing cluster (vSAN, NSX, etc.)?

If not, I will have to plan a big "new deployment and migration".

1) Remove host from existing cluster

2) Clean host

3) Deploy vCenter to single host

4) Enable vSAN

5) Enable NSX

6) Begin migration of workloads (how without a working vCenter?)

7) Roll hosts between clusters

hirschinho · ‎07-26-2023

Hi @Dr_Virt

First, is this a stretched cluster with a witness host or a standard cluster? OSA or ESA?

vSAN can work without vCenter, so in my opinion it's not necessary to destroy everything.

The important thing is to install a new vCenter. Do you have a local datastore on one of your ESXi hosts, for example a boot device with about 200 GB of space? You can temporarily deploy a vCenter there.

Then follow these steps:

Create a cluster and enable vSAN on the new cluster
Check vSAN health before moving anything (esxcli vsan cluster get, ...)
Move each host
Reapply storage policies to the VMs
Re-enable stretched cluster if necessary
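
The health check in the steps above can be sketched as a small filter. This assumes the "Local Node Health State" and "Sub-Cluster Member Count" fields that esxcli vsan cluster get prints. vsan_node_summary is a hypothetical helper name.

```shell
# Hypothetical helper: summarize `esxcli vsan cluster get` output from stdin.
# Prints the local node health and member count, and exits non-zero
# unless the local node reports HEALTHY.
vsan_node_summary() {
    awk -F': *' '
        /Local Node Health State/  { health = $2 }
        /Sub-Cluster Member Count/ { members = $2 }
        END {
            printf "health=%s members=%s\n", health, members
            if (health != "HEALTHY") exit 1
        }'
}

# Usage on each ESXi host (left commented out here):
#   esxcli vsan cluster get | vsan_node_summary
```

The member count should match the number of hosts you expect in the vSAN cluster before and after each move.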

Which NSX version do you use? The nodes must be redeployed.