VMware Networking Community
KWKirchner
Enthusiast

Ancient NSX-V 6.4.13 Question - Agent VIB install task and scan task hang at 0%

Yep, it's ancient and we want to get off it ASAP, but we have not had much luck with NSX-T stability in our other environment, so we have been hesitant to migrate this air-gapped cluster until we have a handle on NSX-T.

6.4.13 has been quite solid for us, but one cluster has started doing weird things lately and we are not sure where to look. VMware isn't going to support this, so I thought maybe someone might remember how it works and what we might need to look at.

It's a small 4-host cluster running ESXi 7.0u3m and we don't do anything crazy with NSX-V: a pair of Edges for FW, a pair for LB, and a DLR with no VMs. There are 9 hosts, but only 4 are in the NSX domain.

The trouble started after the latest round of host reboots, which we did to move our syslog/scratch folders off a NetApp NFS share that we were replacing (syslog moved to a new NFS share, scratch moved to a folder on local SSD storage on each host). After that we started having problems with the agent VIB install. Both the scan task and the agent install task would hang at 0% and not budge even if left for over a day. After about 48 hours they would finally expire and the host would show a "Not Ready" status on the Host Prep page. All other NSX components report green: all 3 controllers are green, all of the Edges look happy, and the NSX Manager connections to vCenter and the lookup service are both green. Three of the 4 hosts were rebooted before anyone noticed that the VIB wasn't installing. We probably got lucky, because rebooting the last host would likely have killed the DLR, which we suspect was running on that last working host.

The 3 rebooted hosts still had the VIB installed and still had their VTEP interfaces, so they kept working even though they were listed as "Installing" on the Host Prep page. While they were in this state, NSX was also shown as "Installing", the Actions dropdown did not work, and there was no option to resync or resolve anything until the tasks expired 48 hours later. I then tried to resolve one host and it went straight back into the perpetual "Installing" status with the VIB task at 0%.

I removed one host from the cluster cleanly by unmounting all NFS shares, removing all NFS/vMotion VMKs, and disconnecting it from all vDS switches. Dragging it out of the cluster triggered the VIB uninstall task, which ran to 80% and then hung for a day. I canceled it and proceeded to rebuild the host. I wiped the drive completely, installed fresh, put all the VMKs/vDS/NFS shares back in place, and then dragged it back into the cluster. Same hung VIB install task and scan task. I left it over the weekend and at some point on Sunday it actually completed. The host had the VIB and the VTEP interfaces, and some test VMs migrated over to it worked fine. So that's nice: we now have 2 out of 4 showing green on the Host Prep page.

I did some more testing to try to see what was wrong with the other 2 hosts. I ran the debug connection test and it was green for 2 hosts and Critical (timed out, I think) for the other 2. I tested the URL on the NSX Manager to make sure the vxlan.zip file was available, and it was fine. I watched the EAM log and saw a "VM UpdateManager Unexpected Exception" error, but could find no information about it. So I gave up and decided to rebuild the other 2 hosts with the same process I used for the first one. Once I completed the second rebuild, I moved the host into the NSX cluster and it hung on the scan and agent VIB install again. That is where I am now, waiting for that to finally go through.
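For anyone wanting to repeat the vxlan.zip availability check from more than one vantage point (for example from a machine on the same management network as the hosts), here is a minimal Python sketch. It only times a single download of whatever URL you pass in; `probe_url` and its parameters are hypothetical names for illustration, the URL itself is whatever your NSX Manager advertises, and disabling certificate verification is an assumption that the manager uses a self-signed certificate.

```python
import ssl
import time
import urllib.error
import urllib.request


def _default_open(url, timeout):
    # Assumption: the manager presents a self-signed certificate, so
    # verification is disabled for this one-off diagnostic only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.urlopen(url, timeout=timeout, context=ctx)


def probe_url(url, timeout=10.0, opener=_default_open):
    """Fetch url once and report status, size and elapsed seconds."""
    start = time.monotonic()
    try:
        with opener(url, timeout) as resp:
            status = resp.status
            size = len(resp.read())
    except (urllib.error.URLError, OSError) as exc:
        # Covers HTTP errors, refused connections and timeouts alike.
        return {"ok": False, "error": str(exc),
                "seconds": time.monotonic() - start}
    return {"ok": status == 200, "status": status, "bytes": size,
            "seconds": time.monotonic() - start}
```

Calling `probe_url(...)` with the same URL from a few different machines and comparing the `seconds` and `ok` fields would at least show whether the download is slow or failing on a particular network path, which a single check from one place cannot tell you.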

So does anyone want to guess what's happening here? All the services are apparently working, because the install eventually does go through. We have made no DNS changes and NTP sync has been confirmed. The only change was the syslog/scratch repointing.
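Since "no DNS changes" only says nothing was changed on purpose, a quick forward/reverse lookup comparison is a cheap way to confirm the records still agree. This is a rough sketch using only the Python standard library; `check_dns` is a hypothetical helper name, and the hostnames to feed it (NSX Manager, vCenter, each ESXi host) are yours to supply.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Resolve hostname, reverse-resolve the IP, and compare the names."""
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {"hostname": hostname, "ok": False,
                "error": f"forward lookup failed: {exc}"}
    try:
        ptr_name = reverse(ip)[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        return {"hostname": hostname, "ip": ip, "ok": False,
                "error": f"reverse lookup failed: {exc}"}
    # Compare case-insensitively and ignore a trailing root dot.
    same = ptr_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return {"hostname": hostname, "ip": ip, "ptr": ptr_name, "ok": same}
```

Run it with fully qualified names from the machine whose view of DNS you care about; a result with `ok` set to `False` points at a missing or mismatched record rather than proving it is the cause of the hang.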

I have noticed on the broken hosts that the auditrecords service was unhappy about missing files, and I had to shut it down to get the syslog service to start. But even with that turned off and the host rebooted, the agent VIB install task still hangs. And even if that were the cause, why would it persist across a rebuild? We have 3 other identical NSX-V clusters in other datacenters and none of them are doing this. It's just really weird, and we are concerned about the state of this cluster when we finally go to upgrade to NSX-T v3.2.


Accepted Solutions
KWKirchner
Enthusiast

We rebuilt the hosts.  All is working now.
