VMware Cloud Community
cfritze
Contributor

All sorts of strange behaviour: is my VC installation totally messed up?

Good morning!

On Friday afternoon I had two working ESX hosts in an HA/DRS cluster, running VC in a Windows XP VM plus two more VMs with Open SuSE and SLES. Yesterday I thought I should put some load onto that cluster and decided to install various downloaded appliances (everybody says that's a matter of minutes, right?).

So first I unpacked three appliances (LAMP, PostgreSQL, VKernel) to an NFS directory shared by the two ESX hosts and tried to add them to the inventory by right-clicking on the vmx files in the datastore browser. Under "Recent Tasks" the operations were marked as "in progress" and timed out after 15 minutes. I thought maybe I could register the appliances with an ESX host that wasn't clustered, so I put one of the two hosts into maintenance mode and kicked it out of the cluster. Still not able to register the appliances.

After this I tried to re-add the host to the cluster but got an error message saying that there weren't enough licenses installed to do so. Under "Configuration" -> "Licensed Features" the only add-on that showed up was "Consolidated Backup". I searched this forum up and down and tried just about everything mentioned here, from re-reading the license and switching from centralized to host-based licensing and back, to using a manually edited license file. No luck. Finally I reinstalled and configured the ESX server from scratch; after that I was able to register the appliances with it and re-add the host to its twin in the cluster.

The appliances were now registered. They wouldn't start, however: timeouts as before. Just as a test I removed one of the working VMs (SLES) from the inventory and tried to re-register it. Timeout. I then tried to register it using vmware-cmd -s register at the service console. Using a relative path to the vmx file results in "ServerFaultCode(1588) Invalid datastore format"; using the full path resulted in "VMControl Error -999" at first but now works.
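For anyone hitting the same "Invalid datastore format" error: the relative vs. full path difference above seems to be the key, and the full path is the form that eventually worked here. Below is a minimal Python sketch that only builds (does not run) the vmware-cmd command line with an absolute path. It assumes the usual ESX 3.x service console layout where every datastore, NFS ones included, is mounted under /vmfs/volumes/<datastore>; the helper names and the datastore/VM names are hypothetical, for illustration only.

```python
import posixpath
from typing import List

# Assumption: on the ESX 3.x service console, datastores (VMFS and NFS alike)
# are mounted under /vmfs/volumes/<datastore name>.
VOLUMES_ROOT = "/vmfs/volumes"


def absolute_vmx_path(datastore: str, relative_vmx: str) -> str:
    """Hypothetical helper: turn 'vm/vm.vmx' on a datastore into an absolute path."""
    if not relative_vmx.endswith(".vmx"):
        raise ValueError(f"expected a .vmx file, got {relative_vmx!r}")
    # Strip leading slashes so posixpath.join does not discard the earlier parts.
    path = posixpath.join(VOLUMES_ROOT, datastore, relative_vmx.lstrip("/"))
    return posixpath.normpath(path)


def register_command(datastore: str, relative_vmx: str) -> List[str]:
    """Hypothetical helper: build the 'vmware-cmd -s register <path>' argument list."""
    return ["vmware-cmd", "-s", "register", absolute_vmx_path(datastore, relative_vmx)]


if __name__ == "__main__":
    # Hypothetical datastore and VM names.
    print(" ".join(register_command("nfs_appliances", "sles10/sles10.vmx")))
```

This only addresses the path format; it does nothing about the timeouts or the "VMControl Error -999" seen on the first attempt.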

The 'funny' thing is that HostA by now sees a different inventory (and different machine states for virtual as well as physical machines) than VC does (VC is running on HostB in the same cluster). I do understand that changes made on HostA are not necessarily reflected in the current VC session at once. But disconnecting HostA in the VC session, exiting the VI client for VC, restarting it and reconnecting to HostA should be enough to show the VMs running on HostA, shouldn't it? Some examples of the differences: a VM removed from the inventory using the VI client connected to HostA is still shown (as powered off, but with no way to power it on) in the VI client connected to VC. The licensed features of HostA also differ depending on whether I point the VI client at HostA or at VC. Just as I'm writing this, VC seems to be updating its state at least a bit: the deregistered VM has now disappeared from the VC view as well...

Does this all sound as if I can still reach a consistent state with this installation? Was it a bad idea to set up VC in a VM (lots of people seem to do this), or was it a bad idea to add that VM to a cluster? Any further ideas or comments? Can this mess be caused by network performance problems? The hosts are two blades in the same blade center and should see each other without significant delays, but write performance on the NFS shares is very poor; the storage admin is still looking for a solution to this.

puzzled...

Christian

3 Replies
cfritze
Contributor

Interesting detail I just noticed: when I register a VM in VC using the VI client, the operation seems to get done but VC doesn't see the result! The VMs I tried to register with the cluster via VC are now registered. I can see them in the inventory (and can even start them without problems), but only when I connect the VI client to HostA directly. The VI client connected to VC reported a timeout after 15 minutes and does not show the VMs in its inventory. So I thought this might be a matter of HostA not being able to reach VC, but the Windows Firewall on the (virtual) VC host lists various VMware services as exceptions. Still not seeing more clearly...

cfritze
Contributor

After all this I decided to remove VC from that VM, stop all VMs on the two hosts in the cluster, install VC on my deskside PC here and have it manage the two ESX hosts. Now registering, starting, stopping, migrating etc. all work again.

Still no idea what caused the problem. It makes my skin crawl to think that something like this could happen in a production environment... :(

Wish you all a Merry Christmas

Christian

jhanekom
Virtuoso

I doubt your problem is caused by VC being in a VM. As you've said, many people have done so successfully in the past.

I'm busy troubleshooting something similar at a multi-server farm I'm helping out with. They were on VC 201 U2 when the problem started and have since upgraded to VC 202 U2, but that didn't help. The symptoms include the following:

- Only one host in the farm is affected

- VM inventory is not refreshed for that host

- Most tasks (including maintenance mode) initiated from VC that target that host fail or time out

- Running the same tasks directly on the host succeeds, without delays of any kind

- Evicting the node from VC and re-adding it results in a "there are not enough licenses available to perform the operation" error, despite there definitely being enough licenses. (It takes quite a bit of work to get it back in again, including deleting all the resource pools on the host as well as the vpxuser user account.)

- When multiple hosts are evicted and re-added, it's always the last host added that is affected by the problem

This leads me to believe the problem has something to do with the licensing component. My next step in troubleshooting will be to re-install the license server. I'm not ready yet to take the bold step of blasting away my VC installation and database. There's stuff in there that I want to keep!

Will try to post back here if I find anything.
