I have several ESXi 5.1 hosts in a cluster, and there is one VM that I can't add to inventory. I have followed several KB articles and even removed the offending ESXi host from the cluster completely, after verifying it was the host causing the lock by finding the MAC address. I am unable to view the vmware.log file for this virtual machine; I get an "invalid argument" error when trying to cat vmware.log or the .vmx file. The lock file is vmname.vmx.lck. I've rebooted and restarted the management agents several times. I've been reading about how to resolve this for about three hours, have yet to find anything that works, and am just not sure where to go from here.

The directory contains the following files, if that helps at all:
-rw-r--r--  1 root root       162078 Sep 21 16:06 vmware.log
-rw-r--r--  1 root root           73 Sep  9 18:33 vmname-63cf5ada.hlog
-rw-------  1 root root  21474836480 Sep 21 16:51 vmname-flat.vmdk
-rw-------  1 root root         8684 Sep  9 18:34 vmname.nvram
-rw-------  1 root root          517 Sep  9 18:33 vmname.vmdk
-rw-r--r--  1 root root            0 Sep  7 17:50 vmname.vmsd
-rwxr-xr-x  1 root root         3342 Sep 13 18:18 vmname.vmx
-rw-------  1 root root            0 Sep  9 18:33 vmname.vmx.lck
-rw-r--r--  1 root root          262 Sep  7 17:50 vmname.vmxf
-rwxr-xr-x  1 root root         3341 Sep 13 18:18 vmname.vmx~
-rw-------  1 root root  37580963840 Sep 21 16:51 vmname_1-flat.vmdk
-rw-------  1 root root          519 Sep  9 18:33 vmname_1.vmdk
Suggestions?
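For anyone who finds this later, this is roughly how I traced the lock to a host's MAC address. The datastore and VM names below are placeholders, and vmkfstools only exists on ESXi, so the little parser just works on saved output:

```shell
# Sketch only: on a host that can see the datastore you would run
# something like
#   vmkfstools -D /vmfs/volumes/<datastore>/vmname/vmname.vmx > dump.txt
# The "owner" field in that output ends in the MAC of the locking
# host's management NIC.
owner_mac_from_dump() {
  # Pull the last 12 hex digits of the owner UUID and colon-separate
  # them, e.g. owner ...-001d092b0694 -> 00:1d:09:2b:06:94.
  sed -n 's/.*owner [0-9a-f-]*-\([0-9a-f]\{12\}\).*/\1/p' "$1" |
    sed 's/../&:/g; s/:$//'
}
```

I then matched that MAC against the NICs on each host to find the owner.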
Not sure if you've seen the KB below, but just in case.
That was one of the first things I tried. The ESXi host is in maintenance mode, so there are no virtual machines running on it and therefore nothing to kill. I have no clue how to resolve this, as none of the suggestions I have found seem to work.
Hi,
The vmname.vmx.lck file looks like an NFS lock. Are you using NFS storage? If so:
1. Browse to the folder where the VM is located and run ls -la; it should list the hidden files.
2. Find the files with the .lck extension and delete them, or rename vmname.vmx.lck using the command mv vmname.vmx.lck vmname.vmx.backup.
3. Add the VM back to inventory and power it on; it should work.
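Step 2 roughly looks like this as a script (the VM directory argument is whatever your path under /vmfs/volumes is; this is just a sketch):

```shell
# Sketch of step 2: rename rather than delete, so nothing is lost.
# Pass the VM directory (under /vmfs/volumes on a real host).
backup_lock_files() {
  dir="$1"
  for f in "$dir"/*.lck; do
    [ -e "$f" ] || continue        # no .lck files found
    mv "$f" "${f%.lck}.backup"     # vmname.vmx.lck -> vmname.vmx.backup
  done
}
```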
Thanks,
Avinash
We are not using NFS at all. There is only one .lck file and I can't delete it although I can rename it. Even after renaming it, I still can't register the vm. It is still grayed out.
Can you run the commands vmkfstools -D vmname.vmx and ls -la and paste the output?
Thanks,
Avinash
Here is the output.
Lock [type 10c00001 offset 4405248 v 132, hb offset 3424256
gen 25, mode 1, owner 51ffc163-c2f3c8e4-8cff-001d092b0694 mtime 126477
num 0 gblnum 0 gblgen 0 gblbrk 0]
Addr <4, 0, 39>, gen 71, links 1, type reg, flags 0, uid 0, gid 0, mode 100755
len 3342, nb 1 tbz 0, cow 0, newSinceEpoch 1, zla 2, bs 8192
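For anyone reading that output later: mode 1 is an exclusive lock, and the last 12 hex digits of the owner field (001d092b0694 here) are the MAC of the locking host's management NIC, per the VMware file-lock KB. A small reference sketch of the mode values:

```shell
# Reference sketch of the "mode" field in vmkfstools -D output,
# as described in the VMware file-lock KB.
lock_mode_desc() {
  case "$1" in
    0) echo "no lock" ;;
    1) echo "exclusive lock (file is in use)" ;;
    2) echo "read-only lock" ;;
    3) echo "multi-writer lock" ;;
    *) echo "unknown mode $1" ;;
  esac
}
```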
Hi,
It seems there is still a lock on the file from the host with MAC address 001d092b0694.
Try registering the VM on the host that owns that MAC address and power it on. If that still fails, cd into the VM folder, run rm -rf *.lck, and then power on.
If it is still a problem, power off that host and then run rm -rf *.lck again.
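As a sketch (the VM directory is a placeholder argument, and this should only be run once the owning host is down):

```shell
# Sketch of the lock cleanup above; run only after the owning host is
# powered off. The VM directory is a placeholder argument.
remove_lock_files() {
  dir="$1"
  ls -la "$dir"/*.lck 2>/dev/null    # record what is about to go
  rm -f "$dir"/*.lck                 # remove any remaining lock files
}
```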
Thanks,
Avinash
I've tried all of those. I thoroughly followed the KB article describing how to do all of this. The option to register the VM on the ESXi host that has that MAC address is grayed out. I have tried deleting the .lck file, but when trying to delete it, I get "invalid argument". Even after powering off the ESXi host with that MAC address and trying to delete the file from another ESXi host, I still get "invalid argument".
What do vmkernel.log and hostd.log show? Do you see any corruption messages?
I don't see anything in either log that looks suspicious. We replicate our LUNs off site every night for DR, and even after presenting the clone of the LUN where this server lives, to an ESXi host at a remote site, I still can't delete the lck file or even add the virtual server to inventory.
Was this ever resolved? I'm running into this exact issue as well.
Have you tried the steps outlined above?
Just like swspjcd, I followed everything in VMware KB: Investigating virtual machine file locks on ESXi/ESX and still cannot remove the lock file. It is a VMFS datastore, not NFS. I've restarted the host that has the MAC address shown when running vmkfstools -D on the lock file. I've also restarted every other host in the cluster for good measure.
We were never able to get it resolved, and from what it looks like, the problem is due to a bug in EqualLogic firmware that can, in "rare" circumstances, cause corruption in the metadata of a LUN. So far we have had three virtual servers with the exact same problem, all of which were on the same LUN. One we recovered with our backup software; the other two were recovered from the replicated LUN where they lived. According to Dell, the bad firmware is 6.0.6, and it is fixed in 6.0.6-H2.
Interesting. Thanks for replying.
We are running EqualLogic as well, but still on firmware 6.0.5. Hopefully I can salvage the VMs from our replicated volumes or snapshots.
Severe. Has Dell acknowledged this?
From the latest, 6.0.6-H2, Equallogic firmware update:
Issue Corrected in Version 6.0.6-H2
[CRITICAL]: In rare circumstances, an error handling routine was not properly executed. Currently, this has only been observed in VMware environments, where a VMFS datastore might experience heartbeat metadata corruption, impacting the ability to perform operations on virtual machines (VMs). However, there is a small risk that a similar issue could be observed in non-VMware environments as well, so Dell recommends that all customers upgrade to this release.
If you read the release notes for 6.0.6, it says it can cause "metadata heartbeat corruption" under rare circumstances, or something very similar. We were running 6.0.6 for roughly two weeks, and so far have had three virtual servers that were on the same LUN become corrupt. They all had the exact same symptoms as in my original post in this thread. It's possible that the LUN was corrupted by something else, but we have never had any problems like this before, and not all of the servers that were on that LUN were corrupted. We have since moved all of the servers to a different LUN and deleted the corrupted/faulty LUN.
We had 3 guests out of 10 that were affected.
I was just able to mount a snapshot of this datastore from earlier today before the problem started and got those guests back. My next step is to clear this datastore out and re-create this LUN. And obviously try to get the firmware upgraded.
Thanks again for replying swspjcd. It has been a frustrating morning trying to troubleshoot this problem.