VMware Cloud Community
FinFreeTX
Contributor

Hung VM's - Frequently Recurring Issue

This is a repost from the original thread here: http://communities.vmware.com/message/1641669

Just as the original poster stated, the issue is that with minimal load on the host machine, all of my VMs 'hang' about once a week - usually around the same time (12:45am), but on different days. I've gone through all of the automated processes on the VMs and my network and none of them correspond to the dates and times of the issue - so I'm not really sure what is causing this. It also happens occasionally at other times throughout the day/night. Here are the symptoms when the VMs 'hang':

  • No VM responds to ping requests or any other method of remote connection

  • CAN connect to host with vSphere Client, but no indication of issue under Performance or Events - all looks 'normal' (no spikes)

  • Attempting to click 'Console' tab, or launch console with the button causes vSphere Client to 'lock up' - cannot do anything else until vSphere Client is restarted

  • Attempting to reboot host via vSphere Client starts Maintenance Mode, begins attempting to shut down guest OS'es, then kicks you off - but never comes back online

  • After attempting to reboot via vSphere Client, Host console is 'locked' - no response when pushing F12 or any other keys

  • Host machine must be manually restarted by physically powering off and back on
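Since the first visible symptom is the VMs dropping off the network, a small watchdog run from another machine can at least pin down when each hang starts, which makes it easier to match against the host logs. A minimal sketch in Python, not a diagnosis tool: it assumes each guest exposes some TCP port (3389/RDP here is an assumption, and the guest addresses in the comment are hypothetical), and it only records UP/DOWN transitions with UTC time stamps.

```python
import socket
import time
from datetime import datetime, timezone


def is_reachable(host, port=3389, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def record_transitions(previous, current, now):
    """Compare two {host: bool} snapshots and describe every state change.

    A host missing from `previous` counts as a change, so the first poll
    logs the initial state of every host.
    """
    events = []
    for host, up in current.items():
        if previous.get(host) != up:
            state = "UP" if up else "DOWN"
            events.append(f"{now.isoformat()} {host} {state}")
    return events


def watch(hosts, interval=60, probe=is_reachable):
    """Poll hosts until interrupted, printing a line whenever a state changes."""
    previous = {}
    while True:
        current = {h: probe(h) for h in hosts}
        now = datetime.now(timezone.utc)
        for line in record_transitions(previous, current, now):
            print(line, flush=True)
        previous = current
        time.sleep(interval)


# Example with hypothetical guest addresses (runs until interrupted):
# watch(["192.168.1.10", "192.168.1.11"])
```

Redirecting the output to a file on the monitoring machine gives a list of hang start times that does not depend on the host's own clock.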

I've enabled Tech Support Mode, so I am collecting logs of the issue - but I am unable to make heads or tails of them - and even worse, the time stamps don't seem to be accurate, which makes it even harder. If there is anything specific I should be looking for, please advise and I will post.

Here are the details of my environment:

Host:

  • ESXi 4.1.0 26027

  • Dell PowerEdge 2900

  • Dual quad-core Xeons

  • 16GB RAM

  • Local SATA RAID 5 array

  • 1x 1Gb NIC - no VLANs

Guests:

  • 1x Windows Server 2008 x64 - 2 vCPU - 4088MB

  • 1x Windows Server 2008 32-bit - 1 vCPU - 2048MB

  • 5x Windows XP 32-bit (only 2 used - others are always OFF unless needed) - 1 vCPU each - 256MB each

Host BIOS and firmware are on the latest versions, and no hardware issues were detected with the Dell self-diagnostics.

Please help??? Any suggestions would be greatly appreciated!

37 Replies
golddiggie
Champion

What do you currently have for a B&R (backup and recovery) solution? If nothing, and you have a zero budget (seems like you're going to be in that category), you could try using the vDR product for now... I would advise getting some funds allocated/budgeted for a solution that includes S&M (Support and Maintenance) with it... That way, IF you have an issue, you can place a call and get support right away... I favor the Vizioncore vRanger Pro solution for VM B&R... Tried the Veeam product and didn't like it even half as much... That's a whole other thread though...

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

jamesbowling
VMware Employee

I would use something like Veeam Backup and Replication or something of that type. Or I guess you could take cold copies of the VMs, or even use VMware Converter for some V2V action and put them on an external drive. You can also use VMware Data Recovery to back up to an external destination.

James Bowling

James B. | Blog: http://www.vSential.com | Twitter: @vSential --- If you found this helpful then please awards helpful or correct points accordingly. Thanks!
FinFreeTX
Contributor

Currently we have no Backup/Recovery solution in place, and if we implement one we really need to try and keep it FREE like you mentioned (the point of ESXi, right?). I tried looking up the 'vDR' product you mentioned, which I assume is 'VMware Data Recovery', but when I go to download it I get an error: 'Sorry, at the moment you are not authorized to download VMware Data Recovery 1.2'. I guess I need to request a separate license for this virtual appliance? I have not been able to find any other FREE backup solutions - anyone know of any?

It looks like james' suggestion to use the vCenter Converter and save to disk should work - so I'll do that this weekend if nothing else works, but if anyone has suggestions for a FREE virtual machine/plugin I'd like to try one of these out - especially if it has any type of automation.

Thanks for all of your suggestions and help!!!

golddiggie
Champion

You could give ghettoVCB a try too... Never used it, so you'll need to do some reading and set it up to test with...

jamesbowling
VMware Employee

Unfortunately, vDR is only available to Essentials Plus and higher licensees. I would look into the V2V method if you want to keep it free but I would highly recommend purchasing either vRanger or Veeam Backup & Replication. Both are great products in my eyes. I personally use Veeam Backup & Replication but have also used vRanger.

FinFreeTX
Contributor

OK - thanks for clarification guys. For now I'll use the vCenter Converter option and we'll see where we stand once we get this issue resolved.

I did also stumble on the GhettoVCB option while searching, so I may try this. Here is a link to a walkthrough for my own personal reference if I need to come back to this: https://dakeung.com/2009/09/28/how-to-backup-esxi-4-0-virtual-machines/

Also saw it may be possible to do with vSphere CLI using some scripts posted in this thread, but not sure if this will work for me with my free license either since the CLI is read-only? http://communities.vmware.com/thread/164134?tstart=0

I'll get these VM's backed up manually this weekend, and then proceed with troubleshooting the SCSI devices/controllers...

Thanks guys!

chadwickking
Expert

Hi Fin,

I would recommend looking at Dell's support site for any updated firmware/drivers for your ESXi host. Oftentimes there can be some critical updates that could address a variety of issues. In fact, in that class of server there could even be drive firmware that you could apply as well. I would go that route before engaging VMware. You don't use any type of VSA for iSCSI, do you? Also, as a common practice, see if your ESXi host has Update 1 for 4.0. We applied that in our environment and it resolved a number of oddities, especially when working with the vmkernel. I hope this helps and have a good weekend.

Cheers,

Chad King

VCP-410 | Server+

Twitter: http://twitter.com/cwjking

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

FinFreeTX
Contributor

Thanks for the suggestions chadwickking, but as the original post states, we have all the latest BIOS and firmware updates for the hardware. That was the first thing we checked after upgrading the host to 4.1 (which rules out your second suggestion) - but nothing has changed. In fact, the issue appears to be getting worse over time. It went from happening fairly randomly - maybe once every couple of months - to twice a week or so!!!

I'm fairly certain we have narrowed the issue down to a problem with the SCSI devices or controller, but at this point I am somewhat stumped as to how to proceed with troubleshooting to determine the exact issue, so that I can get Dell to replace whatever hardware needs to be replaced. I've already run the Dell self-diagnostics and they did not turn up any issues. I've also followed some instructions I found online to get Dell OpenManage installed and working on one of the VMs, but I can't get the Dell Online Diagnostics to work - after installing both OpenManage and Online Diagnostics, I still don't have a 'Diagnostics' tab within the OpenManage console/web UI.

The hardware is all covered by Dell, but without being able to provide them with a specific Dell self-diagnostics error number, and without being able to duplicate the issue on demand, they are going to make us jump through hoops to get it repaired/replaced.

Any other suggestions anyone???

jamesbowling
VMware Employee

Have you shown them the clips from the logs that point directly at the SCSI errors that are happening at the times this happens? That should be plenty enough for them to replace under warranty. You have a failing piece of hardware that is causing downtime. This is one of the reasons why I never buy Dell hardware, they have pulled some insane tricks on me in the past just to replace stuff. Just demand the replacement or escalate the issue with them in order to get it replaced.

James Bowling

FinFreeTX
Contributor

But that's just it - replace WHAT? The controller? The disks? Both??? At this point it could even be a motherboard issue...the last thing I want is to spend a day rebuilding my data store and VM's just to have the issue continue...

I need to troubleshoot further and be able to put my finger on exactly what the problem is before I go escalating...don't you agree???

jamesbowling
VMware Employee

Yes but they should be helping you in this situation. That is the whole point. Escalate the issue with them. That would be a point to be made with them. At least that is what I think. Have you looked at any information that could be pulled from your controller? I would imagine that they can pull information from it. Hell, get them to help diagnose the issue. The logs specifically point to a particular controller so I would start there with them and show them those logs where it gives the adapter that is having the problem.

James Bowling

chadwickking
Expert

Yes, I can understand your frustrations.  I took a look at your logs and found something worthy of note.

ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e00000000059cd61e3918dc00b" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Naa indicates the device name, which is more than likely your SCSI controller on the server. It also throws a sense data error when trying to issue a command to the controller. The controller is either not responding or there is a disk that is causing noise on the bus. We have seen this on many Dell systems in our environment, and usually we would have Dell replace the SCSI controller and a drive - this doesn't indicate a drive, because the alert is being generated from issuing a command to the controller.

I would suggest taking an outage window for the host. Run a consistency check on the RAID 1 container to see that it passes. You could also check the physical HDDs in the array by accessing the controller in the BIOS, and see if anything shows up, like medium errors and so on. Replacing the SCSI controller may involve a motherboard or riser in this case. Just make sure they send you the correct parts, and make sure their firmware is up to date as well. On a final note, have you checked the drive firmware and ensured it is up to date as well? You can do this through OM or the controller in the BIOS.

I also find it odd that you are not using the CIM component in vSphere to check your server's hardware and alert you on failures. Are you using the Dell-provided ESX/ESXi image for their servers? These come with the CIM components built in, so you don't have to do this extra work of trying to get OM to work.

I know it's been a while, but if you still have this issue, can you upload your logs to the forums? These would include the vmkernel, vmkwarning, and hostd logs. I will do my best to help you. Sorry for the late reply.
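For anyone reading along, the fields in the ScsiDeviceIO line quoted above can be pulled apart mechanically. A rough sketch in Python, not an official tool: the regular expression has only been matched against that single line, the function name decode is made up for illustration, and the host-status names are reproduced from memory of the table in VMware's KB article on interpreting SCSI sense codes, so treat them as an assumption and check the KB. The opcode and device-status values are standard SCSI ones. H: is the status reported on the host (adapter/driver) side, D: is the status returned by the device, and the naa.* string is the identifier of the device the command was sent to.

```python
import re

# Host-status names, reproduced from memory of VMware's KB table
# (an assumption; verify against the KB before relying on them).
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Standard SCSI device status codes (subset).
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
}

# A few common SCSI opcodes (subset).
OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}

LINE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) to device "([^"]+)" failed '
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode(line):
    """Return a dict describing a ScsiDeviceIO failure line, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    opcode, device, host, dev, plugin = match.groups()
    opcode, host, dev, plugin = (int(v, 16) for v in (opcode, host, dev, plugin))
    return {
        "command": OPCODES.get(opcode, hex(opcode)),       # unknown codes stay as hex
        "device": device,
        "host_status": HOST_STATUS.get(host, hex(host)),
        "device_status": DEVICE_STATUS.get(dev, hex(dev)),
        "plugin_status": hex(plugin),                      # left undecoded on purpose
    }
```

Run against the line above, this reports a WRITE(10) command with host status RESET and device status GOOD, which at least shows that the failure was flagged on the host side rather than returned by the device with sense data.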

Chad King

VCP 4

FinFreeTX
Contributor

Thanks for the reply Chad!  Better late than never!  I've been meaning to update this post myself...

I've been working with Dell on this issue since my original post, and I THINK we MAY have finally found a resolution - however, since the issue is so intermittent, it's hard to say for sure yet. After providing numerous vm-support and DSET reports, Dell decided to replace the controller, which did NOT help - the VMs locked up again within a week. Then they replaced the physical disks about a week ago, and SO FAR this seems to have resolved the issue.

Instead of trying to backup/restore the VMs or clone the disks, I simply created a new/additional storage array (RAID1), set it up as a datastore, then used the vCenter Converter to migrate the VMs from the old datastore to the new one. I fired them up and ran them for several days from the new datastore before deleting the images from the old datastore, deleting the old datastore itself, deleting the RAID container from the controller, and then finally pulling the old disks completely out of the system.

All the VMs have been running from the new datastore/array for about a week now without locking up - and for several days before that, so I am REALLY hoping the issue is resolved - but I really won't know for at least another couple of weeks, since the issue has been as infrequent as once a month at times. If this does resolve the issue, it will still not tell us exactly what the issue was, because by doing this we have not only replaced the physical disks themselves, but also rebuilt the RAID config, and even re-configured the VMs (an option I selected by mistake because it was the default when using the vCenter Converter, which made me have to re-activate Windows on one of the VMs). I'll be glad if it is fixed for good, but would still like to know what the original issue was for sure...

I will definitely look into the CIM component you mentioned - I've never heard of this, as I am a bit of a VM n00b, so thanks for the tip!  Can you link me to more info?  I found this by searching on Google, but if you have links to any other info/tutorials I would appreciate it greatly!  http://blogs.vmware.com/esxi/2010/04/hardware-health-monitoring-via-cim.html

To answer your final question, NO we are not using any type of ESXi image - we installed from scratch using the installable ISO downloaded from the VMware website.

Thanks again to all for your help!  I will check back in a few weeks and let you guys know if the issue is resolved (you will hear from me sooner if it's not)...

chadwickking
Expert

Customized binaries for Dell can be found on the VMware site here - this SHOULD have the CIM components pre-built so you can check hardware status through your vCenter Server.  I hope to hear good news in the future. Let's hope things are good.  I was curious: do you run a lot of VMs on this host?  I ask about your load because VMs generate high I/O when reading/writing to storage. What type of storage are you using? It sounds like local, but I am not sure.  Do you present this local storage through a Virtual Storage Appliance?  Anyway, best of luck and I hope this helps.

https://www.vmware.com/tryvmware/p/activate.php?p=free-esxi&lp=1

I had to browse down some to locate it but it is there.

Never mind - I see now how many VMs you run. It shouldn't pose a problem.  As for the server being a 2900, I think it should have the CIM binaries you need.

They also have Dell OpenManage offline, which can be installed as well and can be found here:
http://support.dell.com/support/downloads/download.aspx?c=us&cs=2684&l=en&s=pub&releaseid=R288955&Sy...

Regards,

Chad King

VCP4

FinFreeTX
Contributor

Thanks for the link - I'll check it out, but one more question - we aren't using a vCenter Server - just the single ESXi host and vSphere from my PC to manage.  Will I still be able to utilize CIM in this environment?

I guess you found the answer to your question about the number of VM's.  To answer your storage question, it's a local RAID1 array on a Dell Perc 6/iR controller that is installed locally on the PE2900.

Thanks again for the help and tips - I'll follow up when I know more...

][Q][

chadwickking
Expert

Yep - you are correct.  You don't use vCenter; you just use the vSphere Client to connect to the host.  I think the CIM hardware component will still show up when you connect directly to the host.  At least in our environment, when I connect to the host I can see the Hardware Status tab.

regards,

Chad King

FinFreeTX
Contributor

OK, it's been several weeks so I think it is finally safe to say that this issue is resolved (knock on wood).  I'm still unsure if the resolution was replacing the physical disks, or the fact that by replacing the physical disks the RAID config was recreated - but either way it is working now.  Thanks to all for their help and suggestions, and I hope this thread will help someone else with the same problem fix it MUCH faster than we did...

@Chad - I have still not been able to find any documentation to get the 'Hardware Status' tab to show up in the vSphere client.  I got Dell OpenManage working on one of the VM's using these instructions: http://www.excaliburtech.net/archives/92

Based on this, I assume I have the Dell CIM providers working properly - but I don't know how to proceed from here?  Any further suggestions would be greatly appreciated!!!

Thanks again, and happy new year everyone!

][Q][

DSTAVERT
Immortal

If you are still using wood to knock on you probably should consider a technology upgrade. :)

-- David -- VMware Communities Moderator