VMware Cloud Community
MK22
Contributor

VMFS 5 migration via sVmotion after successful vSphere 5 upgrade may be causing Windows VMs to crash

I have recently upgraded our two datacenters to vSphere 5: one utilizing the Linux vCenter Appliance and a new Oracle database (rather than upgrading, yet again, the vCenter database we had carried forward from VI3 several years ago), and the other utilizing an in-place upgrade of both the vCenter server and the MS SQL database. Our storage is two groups of EqualLogic PS6510 iSCSI "SUMO" arrays: three arrays at one datacenter and eight at the other. After successfully upgrading each of the ESXi hosts from 4.1 to 5.0, I began creating VMFS 5 datastores to get away from the 2 TB extent size limit. At the first datacenter we have not experienced a single issue after migrating all of dev, then test, then prod to the new VMFS 5 datastores; at the second datacenter we've had several machines crash. The only commonality between the crashing machines, which is also the only difference between datacenters one and two, is that in datacenter two we mount iSCSI volumes inside the VMs using the Microsoft iSCSI initiator (ASM snapshots get mounted, to be more specific). So what seems to happen is:

  1. Migrate a healthy VM that has been running for months with no problems to a VMFS 5 datastore.
  2. A scheduled task that mounts an iSCSI volume (an ASM snapshot of another volume) runs; a rough sketch of that mount step is shown after this list.
  3. The machine crashes and reboots.
  4. Windows must be reactivated ("you have 30 days to activate Windows").
  5. Subsequent "iSCSI volume snapshot mounts" cause no issue.
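
To make step 2 concrete, here is a minimal sketch of what that in-guest mount step boils down to. In reality the mount is driven by EqualLogic ASM/ME rather than a hand-written script, and the portal IP and target IQN below are placeholders, not our real values.

# Hypothetical values; the real mount is performed by ASM/ME
$portal = "10.0.0.50"                                        # EqualLogic group IP (placeholder)
$iqn    = "iqn.2001-05.com.equallogic:example-asm-snapshot"  # snapshot target IQN (placeholder)

# Make the portal known to the Microsoft iSCSI initiator, then log in;
# Windows then surfaces the mounted snapshot as a new volume.
iscsicli QAddTargetPortal $portal
iscsicli QLoginTarget $iqn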

I am finding it very difficult to believe that these are related, as the VM, the ASM program running inside it, and the MS iSCSI initiator should have no idea what type of underlying VMFS datastore they are running on.

Is anybody else experiencing anything like this?

VCP
MK22
Contributor

I've been testing this scenario all morning and have yet to reproduce it. I manually ran through the whole process on a test machine, and then also tested it on a production machine that is not active yet; both worked as expected and did not crash or ask to be reactivated due to hardware changes.

VCP
JLFG
Contributor

Hi,

Did you ever figure out this problem? I upgraded my whole datacenter to ESXi 5, and my next step is the upgrade of the datastores. I am checking for any possible issues with the VMFS upgrade.

Were you doing an in-place VMFS upgrade or creating clean VMFS 5 datastores?

Thanks

Pepe F. VCP
MK22
Contributor

Fresh VMFS 5 datastores. I have not had it occur again; I tested several times, then commenced the moves of the Dev, Test, and Prod machines to the new VMFS 5 datastores. I will update if it ever happens again. All the evidence pointed back to this being the problem, but I now feel it has to be some sort of very strange coincidence; there are so many hands on these Windows machines that anybody could have done something strange to them.

VCP
JLFG
Contributor

I didn't have any problems with datastores after the upgrade, though I created new VMFS-5 datastores for all my servers.

Pepe F. VCP
lukas_ch
Contributor

Hi

We have exactly the same issue here.

We also have an open VMware case, and on a call today VMware told us they know about similar issues, are currently investigating them internally, and will come back with feedback soon.

Our environment:

EVA4400

c7000 blade enclosure with Virtual Connect FlexFabric

BL460c Gen8

vSphere 5.0 U1

First we moved about 40 VMs from local storage (D2200sb) on VMFS 3 to an EVA4400 VMFS 3 datastore, then from there back to local storage on VMFS 5. There we had no problems at all.

Then we moved about 60 VMs from EVA4400 VMFS 3 LUNs to EVA4400 VMFS 5 LUNs, and so far we know of about 5 VMs with NTFS corruption and 2 corrupted MS SQL databases. We had to restore all of them to get things working again.

Cheers Lukas

EXO3AW
Enthusiast

We have encountered a similar issue at a customer's site and are currently investigating it with HP support, but we will probably open a new VMware case as well.

The customer's previous setup included an EVA 4400 with a bunch of datastores (VMFS 3) connected to 3 HP BL460c G6 blades using supported FC links, HBAs, and switches. Those blades were running ESXi 4.1.

During the upgrade we installed a new HP P6350 array alongside the "old" EVA 4400 and upgraded the 3 BL460c G6 blades to ESXi 5.1 (HP-branded ISO).

We sVmotioned the VMs to the new P6350 datastores, which had been set up as VMFS 5.

The sVmotion went through flawlessly and the machines have been up and running.

The daily backup started complaining about "unreadable files in the shadow copy" and some (but not all) VMs complained about disk errors.

This led to a self-scheduled chkdsk on the Windows VMs, and subsequently chkdsk "repaired" many files inside of C:\Windows by simply deleting and/or renaming them, rendering nearly half of the servers useless; we had to restore them from backup.

We've investigated this as far as possible and couldn't find a reason for it; even HP and VMware denied anything and stated that everything was working as expected. Since the VMs had been left untouched so far, we believed this might have been a bad coincidence involving live sVmotion, VMFS, outdated VMware Tools, and maybe backup-triggered snapshots. So we went on, and subsequently some of the remaining VMs crashed as well, with MFT bitmap and NTFS errors.

Our second HP support case went from L1 to L2, and apparently they have seen occurrences like this before; they believe it has something to do with sVmotion in conjunction with a VMFS upgrade, and probably with VAAI enabled as well.

We have been prepping logs for half the day and are awaiting feedback from HP.

So - if anyone has further info on this topic, I'd be glad to see updates on this.

Kind regards

Alex

lukas_ch
Contributor

Hi

This seems to be exactly what we were seeing here. We also had unreadable files in the VSS backups and then NTFS corruption on the filesystems. We also had corrupt MS SQL databases, which we had to restore!

In our case, VMware first told us to disable VAAI entirely. Now the current advice is the following:

The VAAI primitive "Block Zeroing" should be disabled

Here are the instructions:

VAAI (vStorage APIs for Array Integration) FAQ

http://kb.vmware.com/kb/1021976

HardwareAcceleratedInit

Zero Blocks/Write Same, which is used to zero-out disk regions

To disable it, set DataMover.HardwareAcceleratedInit to 0.

This should be done on all vSphere hosts.
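
A quick PowerCLI way to set it on every host should be something like this (a minimal sketch, assuming you are already connected to vCenter; not taken from the KB):

# Disable the Block Zeroing / Write Same primitive on every connected host
foreach ($vmhost in Get-VMHost) {
    Set-VMHostAdvancedConfiguration -VMHost $vmhost -Name DataMover.HardwareAcceleratedInit -Value 0
}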

We have done this now, but honestly we are not doing any Storage vMotion activity at the moment, because we could not reproduce the issue at all!

Our VMware case is managed by HP, so I can't tell you the case ID on the VMware support side, but I think if you tell them all of this, they will see the other cases.

I'll try to post updates here as soon as I get new information.

lukas_ch
Contributor

Hi

We just had a call with HP and VMware.

VMware has heard from different customers with the same issue. They also know of a customer who could reproduce the issue with the VAAI option enabled, and that same customer had no issue with the option disabled.

So they are pretty sure that the issue will not occur with the option disabled. We have disabled the option on all our hosts. VMware suggests we can now go forward with Storage vMotion from VMFS 3 to 5.

They do not know the root cause at the moment, and so they also do not have a solution for it. Possibly the root cause is on the storage array side (probably only on HP arrays).

Most of the issues are also only seen after a reboot of the virtual machines. So we decided to reboot all our virtual machines as soon as possible, and we will do a filesystem check on all filesystems and an integrity check of the databases.
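
For the in-guest checks, something like this minimal PowerShell sketch (run inside each Windows VM) is the kind of thing we have in mind; the database name is a placeholder:

# Read-only chkdsk on every local volume (no /f, so nothing is changed)
$disks = Get-WmiObject Win32_LogicalDisk -Filter "DriveType = 3"
foreach ($disk in $disks) {
    chkdsk $disk.DeviceID
}

# Integrity check of one SQL Server database ("MyDatabase" is a placeholder)
sqlcmd -E -Q "DBCC CHECKDB ('MyDatabase') WITH NO_INFOMSGS"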

We'll leave the case open, and if I hear any news I'll of course post it here.

Cheers

Lukas

EXO3AW
Enthusiast

Hi Lukas,

thanks a lot for the update, this clears things up very much. Although there is no root cause available, I am perfectly comfortable knowing that HP/VMware are fully aware of the problem and are working on it.

The fact that VMs only show problems on reboot also fits our scenario, at least.

I assume that something in the MFT/bitmap/NTFS breaks on sVmotion, leaving the NTFS broken/dirty.

Since the VM is running, there is no need for many system processes to re-read the now-broken files (they are still running from memory), so the core processes keep running. Every disk-related task might fail, though.

I've seen VMs where I've been unable to launch mmc.exe because the file was broken. In another case mmc opened, but I was unable to add specific snap-ins because the related files were broken.

In every case of a broken VM NTFS, our VSS backup encountered disk problems like this:

"Backup Runner could not backup \\ShadowCopyVolume46\?\Windows\system32\mmc.exe : Error reported by OS: "Cannot read file""

This fits perfectly with the assumption that sVmotion breaks the NTFS and all related errors are just drive-bys.

When the VM is rebooted (which is how many admins would react to a somewhat "strangely" behaving system), chkdsk fires up and "repairs" the NTFS by "cleaning" it up. Unfortunately, the VM might lose vital files during this process without telling you which ones were broken.

Fortunately, a VSS-based full backup is able to list the broken files, since they should (!) be in the session's error list, so I've been able to revive machines by restoring unchanged copies of those files from an earlier backup, YMMV. I've had one VM with only 3-4 broken, uncritical files, whilst others had more than half of their Windows folder trashed.

Thanks again Lukas, working together like this helps a lot. ;-)

Kind regards

Alex

TomBakry
Contributor

We are also experiencing related issues. We have cases open with VMware support and HP support. VMware initially pointed us at this KB:

http://kb.vmware.com/kb/1033665, indicating that HP Level 3 Engineering told them that, while they are working on the issue, disabling VAAI is the work-around.

This is a bigger-hammer approach if you have more than one storage platform. It disables the VAAI features for each ESXi server entirely, regardless of which storage platforms are being accessed. Additionally, the work is performed on each server rather than on the overall vSphere environment. VMware engineers suggested that the functionality could be turned off at the EVA end, but so far HP has been either unwilling or unable to provide a storage-side solution.

If you are going to be implementing the workaround, I highly suggest creating a simple PowerCLI script containing the required commands.

The commands that I used, from the KB, were:

Set-VMHostAdvancedConfiguration -VMHost server.domain.com -Name DataMover.HardwareAcceleratedMove -Value 0

Set-VMHostAdvancedConfiguration -VMHost server.domain.com -Name DataMover.HardwareAcceleratedInit -Value 0

Set-VMHostAdvancedConfiguration -VMHost server.domain.com -Name VMFS3.HardwareAcceleratedLocking -Value 0

Our VMware support engineer indicated that only the first two commands were required. I have additional questions regarding the impact of the issue and VAAI operations, to better assess the threat of this problem in the production environment. These features can be re-enabled by changing the value back to 1. I will post additional notes when I have answers to these questions.
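
For what it's worth, a rough sketch of what such a script might look like (the vCenter name is a placeholder; pass 1 instead of 0 as the value to re-enable the features):

param(
    [string]$vCenter = "vcenter.domain.com",   # placeholder vCenter name
    [int]$Value = 0                            # 0 = disable the VAAI primitives, 1 = re-enable
)

Connect-VIServer -Server $vCenter

# Apply the three advanced settings from the KB to every host in the inventory
foreach ($vmhost in Get-VMHost) {
    Set-VMHostAdvancedConfiguration -VMHost $vmhost -Name DataMover.HardwareAcceleratedMove -Value $Value
    Set-VMHostAdvancedConfiguration -VMHost $vmhost -Name DataMover.HardwareAcceleratedInit -Value $Value
    Set-VMHostAdvancedConfiguration -VMHost $vmhost -Name VMFS3.HardwareAcceleratedLocking -Value $Value
}

Disconnect-VIServer -Server $vCenter -Confirm:$false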

An additional note: I don't believe it is reasonable to assume the problem is limited to Windows Server 2008 R2 VMs. This problem is at the storage layer and is OS-agnostic. Any VM in the environment is at risk from this issue. As long as ESXi is making calls through the API to the storage system and getting flawed information and results, no attempt to allocate space should be considered safe. That is why the API must be disabled as a work-around.

Good luck to us all.

Tom B.

TomBakry
Contributor

One of the main concerns that I hoped to put to rest was whether just sitting on the storage and performing normal, simple operations, like snapshots in the course of automated backups and thin VMDK growth, would put the VM at risk of corruption.

HP and VMware seem to have differing opinions on the subject, with HP saying that they believe snapshots are at risk. The VMware support engineer says:

Using snapshots should not put your VMs at risk, which means you are free to continue doing virtual machine backups and other snapshot-related activities. The issue at hand is a problem with the way the HP EVA array is handling the VAAI feature called Block Zero, which essentially means that the ESX host offloads the requirement of writing zeroes and preparing LBAs for data that would have otherwise been done through our vmkernel stack. The vStorage API for Array Integration (VAAI) is a completely different set of APIs than VADP or even the basic snapshot code. By disabling the VAAI features, we do not offload anything to the array anymore and simply do all the work ourselves, much in the same way things were done in ESX 4.0 and prior versions.


Tom B.

lukas_ch
Contributor

Hi all,

We still have not heard any news about the issue from HP and VMware.

We only disabled the option DataMover.HardwareAcceleratedInit in VAAI, and not the other two. VMware told us this would be enough. If you have any reliable information that the other two options must also be disabled, please post an answer. Thanks.

Lukas

EXO3AW
Enthusiast

Hi all,

Since we did another similar switch (EVA 4400 with VMFS 3 > P6350 with VMFS 5) this week, I'd like to share some information.

Due to the lack of confirmed information in the form of security bulletins or KB entries, and the sometimes misleading information from my HP support cases, I disabled all 3 VAAI feature bits in advance, just to be on the safe side.

The migration (live sVmotion) went through totally smoothly, with no errors at all on the VM storage layer and no noticeable degradation in terms of performance.

Although this does not prove a causal relationship, since it was only one migration, my experience with this issue has made me believe it is HP-related VAAI behaviour.

For myself (and on behalf of my customers), I am disabling all the VAAI features on P6350-connected ESXi hosts until HP or VMware give us clear information.
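
To double-check what is actually set on each host before and after a migration wave, something along these lines should work (a hedged PowerCLI sketch, not something I have polished; Get-VMHostAdvancedConfiguration returns a name/value table):

# The three VAAI-related advanced settings discussed in this thread
$names = "DataMover.HardwareAcceleratedInit",
         "DataMover.HardwareAcceleratedMove",
         "VMFS3.HardwareAcceleratedLocking"

foreach ($vmhost in Get-VMHost) {
    foreach ($name in $names) {
        $setting = Get-VMHostAdvancedConfiguration -VMHost $vmhost -Name $name
        "{0}: {1} = {2}" -f $vmhost.Name, $name, $setting[$name]
    }
}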

Alex

EXO3AW
Enthusiast
Enthusiast

HP verified that there's an issue and published the following advisory:

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c03571575&lang=en&cc=us&taskI...

I have not seen the new FW 11001100 publicly available yet; maybe I've missed something.

Kind regards

Alex
