During an upgrade from ESX 4.0 to ESXi 4.1 we reinstalled a blade server (BL460c G6) by script. Our update procedure contains the step "Un-present all LUNs that are presented to the related host". This is done to make sure the installation cannot end up on a presented LUN.
And here is where the human factor kicked in: we forgot one LUN that was still presented to the host. This was a LUN presented from the Test and Development environment; it was meant to be temporary, but not temporary enough. So when the installation script ran, the installer saw that LUN as the first disk, and the script says to install on the first disk. At first we did notice that the installation failed, but after an adjustment to the script it worked, and all virtual machines just kept running. The next day all looked well until some of the virtual machines went down for a planned reboot and did not come up any more. At this point we knew we had a storage issue, because the rebooted virtual machines showed an Orphaned error. Some machines had not been rebooted; they could not be accessed through the vCenter Console option, but they could be accessed through RDP and were alive and well.
When we looked at the storage we no longer saw the folders with the virtual machine files, not even the folders of the virtual machines that were still accessible through RDP. And in the view of the LUN we saw multiple partitions, which suggests that the (in our eyes) failed installation had in fact been installed on that LUN (sorry, no picture).
When we went to the console view we saw that VMFS permission was denied on the specified LUN (see picture below).
At that moment we had a VMFS partition that had been overwritten by an ESXi 4.1 installation, with no permissions for this host or any other host, and with virtual machines still running from that LUN. At this moment I had a face as if I had just seen water burning!
The install option says:
autopart --firstdisk --overwritevmfs
and we adjusted it to:
autopart --firstdisk=hpsa,cciss,local --overwritevmfs
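To make the effect of the adjusted option easier to see, here is a minimal Python sketch of a "first disk" rule with and without a preference list. It is only a simplified model of what is described in this post, not the installer's actual algorithm. The `Disk` class, the `pick_first_disk` function, the device labels and the `qla2xxx` driver name for the SAN LUN are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Disk:
    name: str       # hypothetical device label
    driver: str     # driver that claims the device, e.g. "hpsa"
    is_local: bool  # True for disks inside the blade, False for SAN LUNs


def pick_first_disk(
    disks: Sequence[Disk], preference: Optional[List[str]] = None
) -> Optional[Disk]:
    """Return the disk a simplified 'first disk' rule would choose.

    Without a preference list the first enumerated disk wins.
    With a preference list each entry is tried in order: "local"
    matches any local disk, any other entry is compared with the
    driver name. Returns None when nothing matches.
    """
    if not preference:
        return disks[0] if disks else None
    for wanted in preference:
        for disk in disks:
            if wanted == "local" and disk.is_local:
                return disk
            if wanted == disk.driver:
                return disk
    return None


# Assumed situation from this post: the forgotten LUN is seen first.
disks = [
    Disk("san-lun-0", "qla2xxx", is_local=False),
    Disk("smart-array-0", "hpsa", is_local=True),
]

print(pick_first_disk(disks).name)                              # san-lun-0
print(pick_first_disk(disks, ["hpsa", "cciss", "local"]).name)  # smart-array-0
```

In this model the unrestricted rule lands on the forgotten LUN, while the preference list only accepts the local controller disk, even if a LUN is still presented.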
So if VMFS is really overwritten, why are my virtual machines still working?
Because vCenter was not able to do a Storage vMotion (no permission), we had an idea: let's V2V this machine. And it works!!!
We V2V'd several machines that were on the specified LUN, removed the LUN, threw it away on the storage and presented a new one to our environment. Lucky us!
But how is this possible, and are there solutions to get the data back? Does anybody have an idea?
If there is little disk activity the VMs are quite happy to run in RAM. Most OSes will cache writes until the disk becomes available again. That may not necessarily explain your situation, but I have experienced something similar.
If there is any chance to rescue the LUN I would place a support call to VMware.
Thanks for the response. There was only a little disk activity, so this could be the case.
The support call is not necessary; I know where I went wrong but was curious whether I had other options for getting the data back. But with the machines only running in memory and the new disk layout, I don't know another way.
Once we had V2V'd the running systems we deleted the LUN and presented a new one. The missing VMs we restored. It is a testing environment, so no harm done.