Hi,
We have a strange problem/bug in our new VMware cluster.
Environment
BL460 gen10 with HP FlexFabric 20Gb 2-port 650FLB Adapter
HPE C7000 chassis
vSphere 6.0 (build 6775062)
ESX01, ESX02 and ESX03 are in chassi01
ESX04, ESX05 and ESX are in chassi02
VMs intermittently lose network connectivity.
When this happens the “remedy” is to migrate the specific VM to some other host in the cluster.
So far it doesn’t seem to matter whether I migrate the VM to a host in the same chassis or in the other chassis; any migration seems to solve the issue. (I can’t migrate it back to the same host, though.)
I have around 150 VMs in this cluster and so far I’ve had issues with 5-6 of them, seemingly at random.
They could be on any of my VMhosts in the cluster.
I haven’t opened a support case with VMware or HPE yet; this forum post is my first attempt at tackling the problem.
All firmware is updated to the latest from HPE.
Has anyone seen similar issues?
Regards
Johan
OK... so far, so good. No network issues on any VM since the VC upgrade. There is nothing in the release notes that can explain this!
I am cautiously optimistic but not convinced.
I have had a case open with VMware.
The support engineer could not see anything in particular in the logs and was leaning toward the external switches. But there is very little you can do in Virtual Connect when it is controlled by OneView.
In the core switch everything looked fine.
I just had the issue again. This time it was a Windows 2008 R2 server with VMXNET3.
In the past, Linux servers were the most affected.
Damn, this is beginning to frustrate me.
We have upgraded everything except the Onboard Administrators (they are not part of the communication path anyway) and we are still seeing VMs randomly lose network connectivity. Everything looks fine, but we can't ping in or out.
Most of the time a simple disconnect/reconnect of the network interface within the VM works. But sometimes we have to do a vMotion to get it working again, and that always seems to work.
I will raise the priority of the case with HP on Monday and request that they engage VMware, rather than me having to coordinate two vendors.
I will post again once I have an update.
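In case it helps anyone else narrow down when and where the drops happen, here is a minimal sketch of a watcher that logs a timestamp each time a VM stops or starts answering ping. It is only an illustration, not something from the support case: it assumes it runs on a Linux box with the standard `ping` binary, and the function names and polling interval are made up for the example.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the Linux `ping` binary; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transitions(previous, current):
    """Compare two {host: reachable} snapshots and list (host, 'UP'|'DOWN') changes.

    Hosts not seen before are assumed to have been up, so a host that is
    unreachable on the first poll is reported as DOWN.
    """
    changes = []
    for host, up in current.items():
        if previous.get(host, True) != up:
            changes.append((host, "UP" if up else "DOWN"))
    return changes


def watch(hosts, interval_s=30, probe=ping_once):
    """Poll forever, printing a timestamped line whenever a host changes state."""
    state = {}
    while True:
        snapshot = {host: probe(host) for host in hosts}
        for host, change in transitions(state, snapshot):
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} {change}")
        state = snapshot
        time.sleep(interval_s)
```

Calling `watch(["vm-a.example", "vm-b.example"])` (hypothetical names) from a machine outside the cluster gives timestamps that can be lined up against vMotion events and host logs.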
Hello,
We are experiencing the exact same problem.
We too have a case open with both HP and VMware.
It seems to occur only on Linux VMs.
Let's hope they fix it quickly. It is becoming very annoying.
Bye.
You can try updating the elxnet driver to >=11.2.1271 if you were using an 11.2.x driver, or to >=11.4.1255 if you were using an 11.4.x driver. These versions contain a workaround for an issue that may cause loss of VM connectivity. If you see a large number for "rx_pkts_on_quiesce" in the stats of the elxnet vmnics after a loss of VM connectivity, it's almost certainly this issue, and these driver updates should resolve it.
chnb, I have searched Google like crazy but I am unable to find anything related to Emulex/elxnet 11.4.1255 or 11.2.1271.
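For anyone who wants to script that check across many hosts, here is a rough sketch. It only encodes the two thresholds mentioned above and looks for the counter in text you have already collected; it does not collect anything from the host itself. The function names are made up for the example, the stats dump is assumed to contain a line like `rx_pkts_on_quiesce: <number>`, and vendor-repackaged driver builds may use a numbering that does not compare directly.

```python
import re

# Minimum versions containing the workaround, per driver branch (from the post above).
MIN_FIXED = {
    (11, 2): (11, 2, 1271),
    (11, 4): (11, 4, 1255),
}


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_workaround(version_text):
    """True or False for 11.2.x / 11.4.x drivers; None for branches not covered above."""
    version = parse_version(version_text)
    minimum = MIN_FIXED.get(version[:2])
    if minimum is None:
        return None
    return version[:3] >= minimum


def quiesce_drops(stats_text):
    """Extract the rx_pkts_on_quiesce counter from a stats dump; None if it is absent.

    Assumes a 'name: value' or 'name = value' layout, which may differ on your host.
    """
    match = re.search(r"rx_pkts_on_quiesce\s*[:=]?\s*(\d+)", stats_text)
    return int(match.group(1)) if match else None
```

A host where `has_workaround(...)` is `False` and `quiesce_drops(...)` returns a large value right after an outage would be a strong candidate for the driver update.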
Can you provide a deep link to a release note or download page?
We've seen something similar on the HP 556FLR Emulex adapter. The latest driver HP has published is 11.4.1205....
HPE has published this advisory:
HPE Support document - HPE Support Center
Advisory: VMware - HPE ProLiant Server Configured With Certain Network Adapters And Running VMware ESXi 6.0 U3 May Randomly Lose Connection to Individual Virtual Machines
Hi,
We have installed the driver released by VMware (11.2.1271) with good results. We have 1.5 weeks of stable operation and counting.
It may look like an older version than the one released by HP, but according to HP support they use different numbering (although the same basic format).
We have now upgraded all our hosts to the latest HP SPP (2018.03.0.B), the latest VMware release and the latest HP drivers (apart from the VMware-supplied elxnet driver). All our blades are BL460c Gen9 with 650FLB adapters and we run Virtual Connect 4.62. Hopefully this means we will have stable operations during the summer, but you never know.
Hello,
Please provide the details below:
1. Can the virtual machine ping another virtual machine on the same host and the same port group when the issue occurs?
2. If yes, let us know which adapter type the virtual machine uses: e1000 or vmxnet3.
3. If e1000, have you tried changing the adapter to vmxnet3? (If possible, please install the latest VMware Tools and replace the e1000 adapter with a vmxnet3 adapter.)
4. If vmxnet3, make sure there is enough buffer on the adapter ( VMware Knowledge Base )
5. Also provide the below details as
If required, we can request some advanced stats after checking the above.
Regards,
UJ
Is everything working fine after installing the updated driver?
We haven't seen any issues so far, and we've been running them for about a month.
Great! Let me know if the issue recurs.
Regards,
UJ
Are things still good, or did the problem come back??
Hey Johan
I have faced a similar issue on Dell servers.
In my case, downgrading the network driver resolved the issue, but that driver is not supported for the hardware per the VMware HCL.
I have now upgraded the driver to the latest version and disabled NetQueue on the ESXi host (how you do this depends on your driver), and so far there seem to be no issues.
Hi all,
We are facing the problem again in one of our environments.
ESXi 6.7
VC 4.62
HP FlexFabric 20Gb 2-port 650FLB Adapter
elxnet version 12.0.1115.0 with firmware 12.0.1110.11
Regards
Johan
Hi, Did you find any solution for this problem?
Regards,
Ray.