VMware Cloud Community
jimbo84
Contributor

Issues with VMs, possibly from exporting an OVA image from ESXi 6.0 and using it on ESXi 6.5?

Hello. I am looking for help getting to the root of a problem affecting a fairly large number of VMs that were deployed from an image we created. When we originally set up the VM to create the image, it was on an ESXi 6.0 server. We exported it (using vSphere) as an OVA to a deployment server. We have since used ovftool and some scripting to mass deploy it to several sites.

Everything was fine for the first couple of days, and then VMs started crashing in an odd way. The symptoms: the network becomes unresponsive, and our monitoring (which uses ICMP) shows the VM (not the host) as down. You cannot SSH to the server or open the console to view the machine. The VM performance stats look okay, but the host performance stats show half of the CPUs assigned to the VM pegged (we are using CPU affinity). Restarting the VM fails, and several attempts from the host shell to reboot the VM or kill its process failed as well. Right now, the only recovery method we have found is to reboot the physical host.

The VMkernel logs show this over and over:

2018-09-26T10:00:31.770Z cpu57:65963)lsi_mr3: fusionWaitForOutstanding:2898: megasas: [ 0]waiting for 0 commands to complete

2018-09-26T10:01:02.770Z cpu57:65963)lsi_mr3: mfi_TaskMgmt:560: Processing taskMgmt virt reset for device: vmhba2:C2:T1:L0

2018-09-26T10:01:02.770Z cpu57:65963)lsi_mr3: mfi_TaskMgmt:564: VIRT_RESET cmd # 3245337

2018-09-26T10:01:02.770Z cpu57:65963)lsi_mr3: mfi_TaskMgmt:565: Virtual Reset not implemented, calling fusion reset
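
To get a rough idea of how often these messages repeat and which device they point at, something like the following Python sketch could be run over a copy of the log. The line layout in the regular expression is inferred only from the four sample lines above, and the function name `summarize` is made up for illustration; real logs may contain other formats that this simply skips.

```python
import re
from collections import Counter

# Layout assumed from the sample lines in this post (not an official format):
# <timestamp> cpu<N>:<world>)lsi_mr3: <function>:<line>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu\d+:\d+\)lsi_mr3:\s+"
    r"(?P<func>\w+):(?P<line>\d+):\s+(?P<msg>.*)$"
)
# Device identifiers as they appear in the sample, e.g. vmhba2:C2:T1:L0
DEVICE_RE = re.compile(r"device:\s*(?P<dev>vmhba\d+:C\d+:T\d+:L\d+)")


def summarize(lines):
    """Count lsi_mr3 messages per driver function and per device."""
    funcs = Counter()
    devices = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # not an lsi_mr3 line in the assumed layout
        funcs[match.group("func")] += 1
        dev = DEVICE_RE.search(match.group("msg"))
        if dev:
            devices[dev.group("dev")] += 1
    return funcs, devices


# The four lines quoted in this post, used as sample input.
SAMPLE = [
    "2018-09-26T10:00:31.770Z cpu57:65963)lsi_mr3: fusionWaitForOutstanding:2898: megasas: [ 0]waiting for 0 commands to complete",
    "2018-09-26T10:01:02.770Z cpu57:65963)lsi_mr3: mfi_TaskMgmt:560: Processing taskMgmt virt reset for device: vmhba2:C2:T1:L0",
    "2018-09-26T10:01:02.770Z cpu57:65963)lsi_mr3: mfi_TaskMgmt:564: VIRT_RESET cmd # 3245337",
    "2018-09-26T10:01:02.770Z cpu57:65963)lsi_mr3: mfi_TaskMgmt:565: Virtual Reset not implemented, calling fusion reset",
]

if __name__ == "__main__":
    funcs, devices = summarize(SAMPLE)
    for name, count in funcs.most_common():
        print(f"{name}: {count}")
    for dev, count in devices.most_common():
        print(f"{dev}: {count}")
```

To use it on a real log, pass an open file object to `summarize` instead of `SAMPLE`.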

That led me to this KB article: VMware Knowledge Base

I know that article references ESXi 6.0 and 5.5, but our driver was 6.910.18.00-1, so I decided to upgrade the driver to 7.703.18.00-1 in the hope that it would fix the issue. A few days later (while I was out), the first server I patched with the new driver went down again. I don't have all the logs, and I am waiting for another failure to get more information, but I am wondering whether the community here has any ideas and whether this is a known issue.

I have also seen it mentioned that ESXi 6.0 uses VMFS5 and 6.5 uses VMFS6. Could that be part of this issue?

Let me know what other info you may need.

1 Reply
daphnissov
Immortal

Just have to ask the question: Why are you using CPU affinity? Unless you have an *extremely* specific use case, this is almost always a disaster in the making, and it could be contributing to your issue.
