I have several ESXi 5.1 hosts in a cluster, and there is one VM that I can't add to inventory. I have followed several KB articles and even removed the offending ESXi host from the cluster completely, after verifying it was the host causing the lock by finding the MAC address. I am unable to view the vmware.log file for this virtual machine; I get an "invalid argument" error when trying to cat vmware.log or the .vmx file. The lock file is vmname.vmx.lck. I've rebooted and restarted the management agents several times. I've been reading about how to resolve this for about three hours, have yet to find anything that works, and am just not sure where to go from here.

The directory contains the following files, if that helps at all:
-rw-r--r--  1 root root       162078 Sep 21 16:06 vmware.log
-rw-r--r--  1 root root           73 Sep  9 18:33 vmname-63cf5ada.hlog
-rw-------  1 root root  21474836480 Sep 21 16:51 vmname-flat.vmdk
-rw-------  1 root root         8684 Sep  9 18:34 vmname.nvram
-rw-------  1 root root          517 Sep  9 18:33 vmname.vmdk
-rw-r--r--  1 root root            0 Sep  7 17:50 vmname.vmsd
-rwxr-xr-x  1 root root         3342 Sep 13 18:18 vmname.vmx
-rw-------  1 root root            0 Sep  9 18:33 vmname.vmx.lck
-rw-r--r--  1 root root          262 Sep  7 17:50 vmname.vmxf
-rwxr-xr-x  1 root root         3341 Sep 13 18:18 vmname.vmx~
-rw-------  1 root root  37580963840 Sep 21 16:51 vmname_1-flat.vmdk
-rw-------  1 root root          519 Sep  9 18:33 vmname_1.vmdk
Suggestions?
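For anyone who finds this later, this is roughly how I traced the lock to a host's MAC address. The datastore and VM names below are placeholders, and vmkfstools only exists on ESXi, so the little parser just works on saved output:

```shell
# Sketch only: on a host that can see the datastore you would run
# something like
#   vmkfstools -D /vmfs/volumes/<datastore>/vmname/vmname.vmx > dump.txt
# The "owner" field in that output ends in the MAC of the locking
# host's management NIC.
owner_mac_from_dump() {
  # Pull the last 12 hex digits of the owner UUID and colon-separate
  # them, e.g. owner ...-001d092b0694 -> 00:1d:09:2b:06:94.
  sed -n 's/.*owner [0-9a-f-]*-\([0-9a-f]\{12\}\).*/\1/p' "$1" |
    sed 's/../&:/g; s/:$//'
}
```

I then matched that MAC against the NICs on each host to find the owner.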
Not sure if you've seen the KB below, but just in case.
That was one of the first things I tried. The ESXi host is in maintenance mode, so there are no virtual machines running on it and therefore nothing to kill. I have no clue how to resolve this, as none of the suggestions I have found seem to work.
Hi,
The vmname.vmx.lck file looks like an NFS lock. Are you using NFS storage? If so:
1. Browse to the folder where the VM is located and run ls -la; it should list the hidden files.
2. Find the files with the .lck extension and delete them, or rename vmname.vmx.lck using the command mv vmname.vmx.lck vmname.vmx.backup.
3. Add the VM back to inventory and power it on; it should work.
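Step 2 roughly looks like this as a script (the VM directory argument is whatever your path under /vmfs/volumes is; this is just a sketch):

```shell
# Sketch of step 2: rename rather than delete, so nothing is lost.
# Pass the VM directory (under /vmfs/volumes on a real host).
backup_lock_files() {
  dir="$1"
  for f in "$dir"/*.lck; do
    [ -e "$f" ] || continue        # no .lck files found
    mv "$f" "${f%.lck}.backup"     # vmname.vmx.lck -> vmname.vmx.backup
  done
}
```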
Thanks,
Avinash
We are not using NFS at all. There is only one .lck file and I can't delete it although I can rename it. Even after renaming it, I still can't register the vm. It is still grayed out.
Can you run the commands vmkfstools -D vmname.vmx and ls -la and paste the output?
Thanks,
Avinash
Here is the output.
Lock [type 10c00001 offset 4405248 v 132, hb offset 3424256
gen 25, mode 1, owner 51ffc163-c2f3c8e4-8cff-001d092b0694 mtime 126477
num 0 gblnum 0 gblgen 0 gblbrk 0]
Addr <4, 0, 39>, gen 71, links 1, type reg, flags 0, uid 0, gid 0, mode 100755
len 3342, nb 1 tbz 0, cow 0, newSinceEpoch 1, zla 2, bs 8192
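For anyone reading that output later: mode 1 is an exclusive lock, and the last 12 hex digits of the owner field (001d092b0694 here) are the MAC of the locking host's management NIC, per the VMware file-lock KB. A small reference sketch of the mode values:

```shell
# Reference sketch of the "mode" field in vmkfstools -D output,
# as described in the VMware file-lock KB.
lock_mode_desc() {
  case "$1" in
    0) echo "no lock" ;;
    1) echo "exclusive lock (file is in use)" ;;
    2) echo "read-only lock" ;;
    3) echo "multi-writer lock" ;;
    *) echo "unknown mode $1" ;;
  esac
}
```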
Hi,
It seems there is still a lock on the file from the host with MAC address 001d092b0694.
Try registering the VM on the host that owns that MAC address and power it on. If that still fails, cd into the VM folder, run rm -rf *.lck, and then power on.
If it is still a problem, power off that host and then run rm -rf *.lck again.
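As a sketch (the VM directory is a placeholder argument, and this should only be run once the owning host is down):

```shell
# Sketch of the lock cleanup above; run only after the owning host is
# powered off. The VM directory is a placeholder argument.
remove_lock_files() {
  dir="$1"
  ls -la "$dir"/*.lck 2>/dev/null    # record what is about to go
  rm -f "$dir"/*.lck                 # remove any remaining lock files
}
```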
Thanks,
Avinash
I've tried all of those. I thoroughly followed the KB article describing how to do all of this. The option to register the VM on the ESXi host that has that MAC address is grayed out. I have tried deleting the .lck file, but when trying to delete it, I get "invalid argument". Even after powering off the ESXi host with that MAC address and trying to delete the file from another ESXi host, I still get "invalid argument".
What do vmkernel.log and hostd.log show? Do you see any corruption messages?
I don't see anything in either log that looks suspicious. We replicate our LUNs off site every night for DR, and even after presenting the clone of the LUN where this server lives, to an ESXi host at a remote site, I still can't delete the lck file or even add the virtual server to inventory.
Was this ever resolved? I'm running into this exact issue as well.
Have you tried the steps outlined above?
Just like swspjcd, I followed everything in VMware KB: Investigating virtual machine file locks on ESXi/ESX and still cannot remove the lock file. It is a VMFS datastore, not NFS. I've restarted the host that has the MAC address shown when running vmkfstools -D on the lock file. I've also restarted every other host in the cluster for good measure.
We were never able to get it resolved, and from what it looks like, the problem is due to a bug in EqualLogic firmware that can, in "rare" circumstances, cause corruption in the metadata of a LUN. So far we have had three virtual servers with the exact same problem, all of which were on the same LUN. One we recovered with our backup software; the other two were recovered from the replicated LUN where they lived. According to Dell, the bad firmware is 6.0.6, and it is fixed in 6.0.6-H2.
Interesting. Thanks for replying.
We are running EqualLogic as well, but still on firmware 6.0.5. Hopefully I can salvage the VMs from our replicated volumes or snapshots.
Severe. Has Dell acknowledged this?
From the latest, 6.0.6-H2, Equallogic firmware update:
Issue Corrected in Version 6.0.6-H2
[CRITICAL]: In rare circumstances, an error handling routine was not properly executed. Currently, this has only been observed in VMware environments, where a VMFS datastore might experience heartbeat metadata corruption, impacting the ability to perform operations on virtual machines (VMs). However, there is a small risk that a similar issue could be observed in non-VMware environments as well, so Dell recommends that all customers upgrade to this release.
If you read the release notes for 6.0.6, it says it can cause "metadata heartbeat corruption" under rare circumstances, or something very similar. We were running 6.0.6 for roughly two weeks, and so far have had three virtual servers that were on the same LUN become corrupt. They all had the exact same symptoms as in my original post in this thread. It's possible that the LUN was corrupted by something else, but we have never had any problems like this before, and not all of the servers that were on that LUN were corrupted. We have since moved all of the servers to a different LUN and deleted the corrupted/faulty LUN.
We had 3 guests out of 10 that were affected.
I was just able to mount a snapshot of this datastore from earlier today before the problem started and got those guests back. My next step is to clear this datastore out and re-create this LUN. And obviously try to get the firmware upgraded.
Thanks again for replying swspjcd. It has been a frustrating morning trying to troubleshoot this problem.