Re: Virtual Machine corrupted during snapshot removal?

mattjk · ‎11-30-2008

Hi,

We just had the nasty experience of ESX (3.5 Update 2) corrupting one of our virtual machines while trying to remove a snapshot. I'm interested to see if anyone on here can shed some light on why it might have happened. There are quite a few threads on similar issues, but none seems directly equivalent to our experience.

This evening while performing some other work I noticed that the VM in question (W2K8 64-bit, if it matters) had a snapshot that we'd forgotten to remove - the snapshot was over a month old, but relatively small (only 800MB or so in the delta-disks).

I deleted the snapshot via VirtualCenter, which "worked" on it for a few seconds before erroring out with the message "Doing an online commit, cannot power off". At the same time the virtual machine stopped (it was powered on when I started the snapshot commit).

Subsequent attempts to power on the virtual machine resulted in VC showing errors like "Failed to power on xxxx on xxx in xxx: A general system error occurred: Internal error". Examining the VM's log files showed that the problem was an inconsistency between the delta disks and the main disk:

Nov 30 18:20:57.718: vmx| DISK: Cannot open disk "/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-000003.vmdk": The parent virtual disk has been modified since the child was created (18).

Nov 30 18:20:57.718: vmx| Msg_Post: Error

Nov 30 18:20:57.718: vmx| Cannot open the disk '/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-000003.vmdk' or one of the snapshot disks it depends on.

Nov 30 18:20:57.718: vmx| Reason: The parent virtual disk has been modified since the child was created.

Also, the VM's log file at the time that the snapshot deletion was running shows lots of opening & closing of the delta disk and base VMDK files, followed by:

Nov 30 18:11:29.726: vmx| DISKLIB-LINK : Attach: Content ID mismatch (9bb515b0 != 95b90c26).

Nov 30 18:11:29.736: vmx| DISKLIB-CHAIN : "/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx.vmdk" : failed to open (The parent virtual disk has been modified since the child was created).

Nov 30 18:11:29.738: vmx| DISKLIB-VMFS : "/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-000003-delta.vmdk" : closed.

Nov 30 18:11:29.738: vmx| DISKLIB-VMFS : "/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-flat.vmdk" : closed.

Nov 30 18:11:29.738: vmx| DISKLIB-LIB : Failed to open '/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-000003.vmdk' with flags 0xa (The parent virtual disk has been modified since the child was created).

Nov 30 18:11:29.738: vmx| DISK: Cannot open disk "/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-000003.vmdk": The parent virtual disk has been modified since the child was created (18).

Nov 30 18:11:29.738: vmx| Msg_Post: Error

Nov 30 18:11:29.738: vmx| [msg.disk.noBackEnd] Cannot open the disk '/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx-000003.vmdk' or one of the snapshot disks it depends on.

Nov 30 18:11:29.738: vmx| [msg.disk.configureDiskError] Reason: The parent virtual disk has been modified since the child was created.
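For anyone hunting for the same failure in their own vmware.log files, a minimal Python sketch along these lines could pull out the relevant entries. The regular expressions are modelled only on the two log lines quoted above; other ESX builds may word these messages differently, and the function name `scan_log` is just an illustrative choice.

```python
import re

# Patterns modelled on the DISKLIB-LINK and DISKLIB-CHAIN lines quoted in this post.
# Assumption: other ESX versions may format these messages differently.
CID_MISMATCH = re.compile(
    r"DISKLIB-LINK\s*:\s*Attach: Content ID mismatch "
    r"\((?P<first>[0-9a-fA-F]{8}) != (?P<second>[0-9a-fA-F]{8})\)"
)
FAILED_OPEN = re.compile(
    r'DISKLIB-CHAIN\s*:\s*"(?P<path>[^"]+)"\s*:\s*failed to open'
)


def scan_log(lines):
    """Return (cid_pairs, failed_paths) found in an iterable of log lines."""
    cid_pairs = []
    failed_paths = []
    for line in lines:
        match = CID_MISMATCH.search(line)
        if match:
            cid_pairs.append((match.group("first"), match.group("second")))
        match = FAILED_OPEN.search(line)
        if match:
            failed_paths.append(match.group("path"))
    return cid_pairs, failed_paths


# Two of the lines from the log excerpt above, used as sample input.
sample = [
    "Nov 30 18:11:29.726: vmx| DISKLIB-LINK : Attach: Content ID mismatch "
    "(9bb515b0 != 95b90c26).",
    'Nov 30 18:11:29.736: vmx| DISKLIB-CHAIN : '
    '"/vmfs/volumes/9a9d0976-a1cfd695/xxx/xxx.vmdk" : failed to open '
    "(The parent virtual disk has been modified since the child was created).",
]

pairs, failed = scan_log(sample)
print(pairs)
print(failed)
```

To run it against a real log, the same function can be fed an open file object, for example `scan_log(open("vmware.log"))`.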

Looking at the snapshot data in the VM's .vmsd file after the above occurred, there were two snapshots listed - one was the original snapshot that I'd tried to delete, and the other was named "Consolidate Helper- 0". I have no idea where the "Consolidate Helper- 0" one came from, and unfortunately I'm also not sure whether it was present before the attempt to delete snapshots or not - it certainly wasn't listed in the "Snapshot Manager" in VC though.

No attempts on my part managed to get it "working again", so we restored the most recent SAN (NetApp) snapshot (luckily taken 5 minutes before the problem), which seems to have brought the VM back to life without issues (aside from having to remove it from VC and re-add it).

Post-restore, a subsequent attempt to delete the ESX snapshot succeeded, with the VM showing no ill effects.

So - can anyone suggest why this may have occurred? While we suffered no data loss, I did lose 4 hours of my time - on a Sunday night no less - and would prefer this didn't happen again!

Cheers,

Matt Kilham

Texiwill · ‎11-30-2008

Hello,

Moved to the VI: Virtual Machine and Guest OS forum.

Looking at the snapshot data in the VM's .vmsd file after the above occurred, there were two snapshots listed - one was the original snapshot that I'd tried to delete, and the other was named "Consolidate Helper- 0". I have no idea where the "Consolidate Helper- 0" one came from, and unfortunately I'm also not sure whether it was present before the attempt to delete snapshots or not - it certainly wasn't listed in the "Snapshot Manager" in VC though.

Consolidate Helper- 0 was created by a backup tool. Usually VCB.

One of the delta files could have been locked. Not sure why that would be the case. The errors you see in the vmware.log are related to the delta deletion problem and your restore was the proper thing to do.

Best regards,

Edward L. Haletky

VMware Communities User Moderator

mattjk · ‎11-30-2008

Consolidate Helper- 0 was created by a backup tool. Usually VCB.

Thanks for the info.

I've noticed that a lot of my VMs have a Consolidate Helper snapshot in their VMSD file, even though no snapshots are listed in the VI client and no delta files exist. Some also have other snapshots I have created in the past (and since committed) in their VMSD files, again no listing in VI Client nor any delta files.

Is this normal?
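A quick way to audit this across many VMs might be a small Python sketch like the one below, which lists the snapshot names recorded in a .vmsd file and reports them as leftovers when the VM directory contains no delta files. The `snapshotN.displayName = "..."` key layout in the sample is an assumption about how the .vmsd file is structured and should be checked against a real file; the helper names are hypothetical.

```python
import re

# Assumption: .vmsd entries look like  snapshot0.displayName = "Some name"
ENTRY = re.compile(r'^(snapshot\d+)\.displayName\s*=\s*"(.*)"\s*$')


def snapshot_names(vmsd_text):
    """Map each snapshotN key to the display name recorded in the .vmsd text."""
    names = {}
    for line in vmsd_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            names[match.group(1)] = match.group(2)
    return names


def delta_files(filenames):
    """Return the delta-disk files (names ending in -delta.vmdk) in a listing."""
    return sorted(name for name in filenames if name.endswith("-delta.vmdk"))


def leftover_entries(vmsd_text, filenames):
    """Snapshot names still listed in the .vmsd although no delta files exist."""
    if delta_files(filenames):
        return []
    return sorted(snapshot_names(vmsd_text).values())


# Illustrative .vmsd content; the exact keys are assumed, not copied from a real VM.
sample_vmsd = '''snapshot.numSnapshots = "1"
snapshot0.uid = "1"
snapshot0.displayName = "Consolidate Helper- 0"
'''

print(leftover_entries(sample_vmsd, ["xxx.vmdk", "xxx-flat.vmdk"]))
print(leftover_entries(sample_vmsd, ["xxx.vmdk", "xxx-000003-delta.vmdk"]))
```

The second call returns an empty list because a delta file is present, so the .vmsd entry may correspond to a live snapshot rather than a leftover.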

One of the delta files could have been locked. Not sure why that would be the case. The errors you see in the vmware.log are related to the delta deletion problem and your restore was the proper thing to do.

Thanks for the info. Any tips on how to track down the underlying cause of the problem?

Cheers,

Matt Kilham

warrenv · ‎08-01-2009

I've run into this same problem. Since I don't see anyone from VMware providing a useful answer of any kind, I'll respond, even though this is an old thread.

The issue is that the above user didn't wait long enough for his snapshot commit to complete. Basically, it takes roughly 4 hours for a 300GB snapshot to commit to the primary vmdk. The commit time scales roughly linearly with snapshot size, assuming modern hardware. See:

http://communities.vmware.com/thread/182276
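As a rough illustration of that rule of thumb, the Python sketch below turns the 4-hours-per-300GB figure into an estimate for other sizes. The constants are only the estimate given in this post, not a VMware-published number, and real commit times will depend on storage and load.

```python
# Rule of thumb from this post: about 4 hours per 300 GB, scaling linearly.
# Assumption: these constants are one user's observation, not a vendor figure.
REFERENCE_SIZE_GB = 300.0
REFERENCE_HOURS = 4.0


def estimated_commit_hours(snapshot_size_gb):
    """Linear estimate of snapshot commit time in hours."""
    if snapshot_size_gb < 0:
        raise ValueError("snapshot size must be non-negative")
    return snapshot_size_gb * REFERENCE_HOURS / REFERENCE_SIZE_GB


for size_gb in (0.8, 50, 300):
    minutes = estimated_commit_hours(size_gb) * 60
    print(f"{size_gb} GB -> about {minutes:.1f} minutes")
```

Applied to the roughly 800MB delta described at the top of the thread, the same arithmetic gives well under a minute, so the estimate is most useful for judging large snapshots.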

If you do not wait for the commit to complete, essentially you are corrupting the primary vmdk that any later snapshot depends on. So, even though you may have a proper running snapshot file, any disruption of the primary vmdk will render it useless. Remember: snapshots are essentially based on a differential from the initial vmdk.
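The dependency between a snapshot and its parent shows up in the small text descriptor .vmdk files, which carry `CID` and `parentCID` fields; the "Content ID mismatch" error earlier in the thread is what appears when these disagree. A minimal Python sketch of that comparison follows. The descriptor snippets are illustrative, and which of the two values from the log belongs to the parent and which to the child is a guess made only for the example.

```python
import re

# Descriptor fields of interest. Values may or may not be quoted in the file.
FIELD = re.compile(
    r'^(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"\n]*)"?\s*$',
    re.MULTILINE,
)


def descriptor_fields(text):
    """Extract CID, parentCID and parentFileNameHint from descriptor text."""
    return {key: value.strip().lower() for key, value in FIELD.findall(text)}


def link_is_consistent(child_text, parent_text):
    """True when the child's parentCID equals the parent's current CID."""
    child = descriptor_fields(child_text)
    parent = descriptor_fields(parent_text)
    return "parentCID" in child and child["parentCID"] == parent.get("CID")


# Illustrative descriptors. Assumption: 95b90c26 is taken as the parent's
# current CID and 9bb515b0 as the value the child recorded; the log does not
# say which is which. The child's own CID (0a1b2c3d) is made up.
parent_descriptor = """# Disk DescriptorFile
version=1
CID=95b90c26
parentCID=ffffffff
"""
child_descriptor = """# Disk DescriptorFile
version=1
CID=0a1b2c3d
parentCID=9bb515b0
parentFileNameHint="xxx.vmdk"
"""

print(link_is_consistent(child_descriptor, parent_descriptor))
```

This only detects the mismatch. It says nothing about whether the data in the parent disk is still usable, which is why restoring from backup remains the safe route.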

Once you get the later errors observed in this post, your only solution is to restore from backup. But hopefully this message will prevent anyone from doing anything drastic like rebooting their ESX platform because of a perceived "hang" at 95%.

-W
