Hi,
We have a strange problem/bug in our new VMware cluster.
Environment
BL460c Gen10 with HP FlexFabric 20Gb 2-port 650FLB Adapter
HPE C7000 chassis
vSphere 6.0 (build 6775062)
ESX01, ESX02 and ESX03 are in chassis01
ESX04, ESX05 and ESX06 are in chassis02
VMs intermittently lose network connectivity.
When this happens the “remedy” is to migrate the specific VM to some other host in the cluster.
So far it seems that it doesn't matter whether I migrate the VM to a host inside the same chassis or to the other chassis; just a migration seems to solve the issue. (I can't migrate it straight back to the same host, though.)
I have around 150 VMs in this cluster and so far I've had issues with 5-6 of them, seemingly at random.
They could be on any of my VMhosts in the cluster.
I haven't created any support case with VMware or HPE yet; this forum post is my first attempt at tackling this problem.
All firmware is updated to the latest from HPE
Has anyone seen similar issues?
Regards
Johan
HPE has published this advisory:
Advisory: VMware - HPE ProLiant Server Configured With Certain Network Adapters And Running VMware ESXi 6.0 U3 May Randomly Lose Connection to Individual Virtual Machines
This is an issue with VLAN availability on the host the VM is currently running on,
or VLAN availability on one of the two or more NICs you are using for that portgroup.
Please check that the VLAN is available on all the NICs the vSwitch uses.
You can use CDP to discover this,
or the command below from an ESXi SSH session:
vim-cmd hostsvc/net/query_networkhint
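For reference, the observed-VLAN hints from that command can be cross-checked against the portgroup configuration from the same shell (a rough sketch, assuming ESXi 6.x; the first command applies to standard vSwitches, the second to a distributed switch):
esxcli network vswitch standard portgroup list   # VLAN ID configured per standard-vSwitch portgroup
esxcli network vswitch dvs vmware list           # uplinks backing the distributed switch on this host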
Hi hussainbte,
It's not a VLAN availability problem.
We use SUSes (Shared Uplink Sets) on our Virtual Connect modules, and the VLAN config is verified both on the VC/server profiles and on our Juniper switches.
A VM suddenly loses network connectivity; no vMotion has happened when this occurs.
Like I wrote before, the remedy is to migrate the VM to some other host; then, after 5-10 minutes, it's possible to vMotion the VM back to its original host.
To me it sounds like some CAM table somewhere that won't update MAC addresses, or maybe some bug in the VC modules, or maybe a GARP issue somewhere...
Regards,
Johan
Hi Johan!
I have the same problem. My setup:
ESXi 6.5, build 6765664
HP C7000 enclosure with HP VC Flex-10/10D Module
ProLiant BL460c Gen9 with HP FlexFabric 20Gb 2-port 650FLB Adapter
I updated all hosts from the latest SPP (Service Pack for ProLiant version 2017.10.1).
I've opened cases with VMware and HPE, but still no luck. Now we are trying to find the right combination of network card firmware and driver, which sounds a little weird.
Can you show the output from these commands?
esxcli software profile get
esxcli network nic get -n vmnic0
What version of VC do you have?
I have 4.61, and it looks like this could be the cause: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00029108en_us
But to downgrade to 4.50 I have to shut down the whole enclosure and set everything up from scratch, so I hope to find another solution.
Hi,
VC versions 4.60 and 4.61 seem to be the problem. (We run 4.61.)
We also have a case with HPE; they are telling us to downgrade to 4.50.
But like you say, to be able to downgrade it seems that we have to shut down the whole VC domain, which isn't an option for us right now...
I find it very strange that there isn't a simple way to downgrade the VC.
Hopefully, HPE will get back to us during the day with some guidance.
Cheers
It may or may not be related to 4.60/4.61.
I have another VMware cluster in this enclosure; it uses the same distributed switch and the same uplinks through the same Virtual Connect modules.
But those servers are ProLiant BL460c Gen8, three of them with the "HP Flex-10 10Gb 2-port 530FLB Adapter", and there are no issues with virtual machines on them.
And one has the "HP FlexFabric 20Gb 2-port 650FLB Adapter", like my problem cluster with BL460c Gen9.
And guess what? It also has this problem.
So the theory about the right combination of firmware and driver may be true.
These three combinations were bad:
Firmware 11.1.183.62 / driver 11.2.1149.0
Firmware 11.2.1263.19 / driver 11.4.1205.0
Firmware 11.2.1263.19 / driver 11.2.1149.0
And now I'm testing (as HPE support advised):
Firmware 11.1.183.23 / driver 11.1.196.3
Can you check your firmware and driver with these commands?
esxcli network nic list
esxcli network nic get -n vmnic0
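If it's easier, something like this loops over all the NICs in one go (a small sketch, assuming the BusyBox awk and grep that ship with the ESXi shell):
for nic in $(esxcli network nic list | awk '/^vmnic/ {print $1}'); do
  echo "== $nic =="
  esxcli network nic get -n "$nic" | grep -E 'Driver:|Firmware Version:|Version:'
done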
Hi,
Like you, I only see the issues on Gen9 and Gen10 servers.
Gen8 with the HP "FlexFabric 10Gb 2-Port 534FLB Adapter" > no issues (we have around 12 Gen8 servers)
Gen9 with the HP "FlexFabric 20Gb 2-port 650FLB Adapter" > some issues; we've seen VMs acting weird, packet drops etc.
Gen10 with the HP "FlexFabric 20Gb 2-port 650FLB Adapter" > big issues; VMs randomly lose network connectivity, packet drops.
Gen10
esxcli network nic list
Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  ----------------------------------------------------------
vmnic0  0000:37:00.0  elxnet  Up            Up           10000  Full    70:10:6f:43:84:48  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic1  0000:37:00.1  elxnet  Up            Up           10000  Full    70:10:6f:43:84:50  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic2  0000:37:00.2  elxnet  Up            Up           10000  Full    70:10:6f:43:84:49  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic3  0000:37:00.3  elxnet  Up            Up           10000  Full    70:10:6f:43:84:51  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic4  0000:37:00.4  elxnet  Up            Up           10000  Full    70:10:6f:43:84:4a  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
vmnic5  0000:37:00.5  elxnet  Up            Down         0      Half    70:10:6f:43:84:52  1500  Emulex Corporation HP FlexFabric 20Gb 2-port 650FLB Adapter
esxcli network nic get -n vmnic0
Advertised Auto Negotiation: true
Advertised Link Modes: 1000baseT/Full, 10000baseT/Full, 20000baseT/Full
Auto Negotiation: true
Cable Type:
Current Message Level: 4631
Driver Info:
      Bus Info: 0000:37:00:0
      Driver: elxnet
      Firmware Version: 11.2.1263.19
      Version: 11.2.1149.0
Link Detected: true
Link Status: Up
Name: vmnic0
PHYAddress: 0
Pause Autonegotiate: true
Pause RX: true
Pause TX: true
Supported Ports:
Supports Auto Negotiation: true
Supports Pause: true
Supports Wakeon: true
Transceiver: external
Virtual Address: 00:50:56:5f:66:dc
Wakeon: MagicPacket(tm)
Right now we have evacuated one chassis and are downgrading it to 4.50.
Cheers
Johan
Hi Johan, thank you for the information!
Please post your results on 4.50; I have to decide what to do next.
I've spent one day on firmware 11.1.183.23 and driver 11.1.196.3 with no errors.
Try isolating it when the VM is losing network: check esxtop (press N) and find which physical NIC the VM is using. If you have multiple NICs configured for the VM portgroup, try unchecking the VM network (click OK) and then checking it back, which switches the NIC; you can confirm that in esxtop.
If the VM network then works fine, you have isolated it to that NIC.
If you are seeing the same thing on multiple hosts, isolate the suspect NIC on each host and check the configuration of the physical switch ports those NICs connect to, comparing it against the configuration behind the NIC where the VM works.
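If esxtop is hard to catch in the moment, the same VM-to-uplink mapping can also be pulled from the ESXi shell (a sketch; the world ID below is just an example value taken from the first command's output):
esxcli network vm list                 # running VMs with their world IDs
esxcli network vm port list -w 69502   # the "Team Uplink" field shows which vmnic the VM's port is currently using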
Thanks,
MS
Hi Sergey,
We have 4 chassis
In each chassis we have:
3 gen10 servers.
10 gen9 servers.
3 gen8 servers.
All blades except 2 are ESXi 6.0u3 hosts.
In our case it feels like the problem somehow escalated when we took the Gen10 servers into production. But we are not certain...
To try to pinpoint the problem we have now done the following:
In chassis 1 and 2 we have downgraded the CNA firmware/driver (your hint) on the Gen10 ESXi hosts; VC firmware is still 4.61.
In chassis 3 and 4 we have put the Gen10 servers into maintenance mode; VC firmware is downgraded to 4.50.
After we downgraded the VC firmware in C3 and C4 we still had issues (random packet loss on VMs running in C3 and C4).
But after we downgraded the CNAs on the Gen10 blades in C1 and C2 we haven't seen any issues and our environment seems stable (only 8 hours so far, though).
It's a very strange problem, hard to troubleshoot, so intermittent.
How are things in your environment? Still good after the downgrade?
Do you have any type of load balancer? (I wonder if our F5s could have something to do with the problem.)
Cheers
Johan
Hi Johan, thank you for sharing your results with VC 4.50.
I have not seen any issue for 50+ hours with CNA firmware 11.1.183.23 and driver 11.1.196.3.
We don't have any load balancers in this configuration, and we don't have any Gen10 servers yet.
During troubleshooting I'm trying to simplify everything as much as possible. So right now our configuration looks like this:
bay 1 - HP VC Flex-10/10D Module
bay 2 - HP VC Flex-10/10D Module
Two SUSes, each with only one physical uplink (no LACP). Every uplink is a trunk, so there is a bunch of VLANs in it.
Profile attached to each ESXi server:
VLANs from SUS uplink 1 go to port 1, and from uplink 2 to port 2.
On the distributed switch we have a distributed port group called "servers"; all of the problem virtual machines are attached to it. All traffic goes through one uplink, "servers01", which points to vmnic0 on every ESXi server. So no load balancing here either.
Our issues started after we replaced our old Virtual Connect modules with new HP VC Flex-10/10D Modules (updated to 4.61 from the very beginning) and added new Gen9 servers to the enclosure (updated from the latest SPP). We did all of this as one step, which is why I'm uncertain whether to blame the VC or the CNA.
I hope that the right CNA firmware/driver will help us.
Hi Sergey,
We are pretty sure that we have pinpointed the "bug".
It has nothing to do with our new Gen10 blades, and it's not the VC firmware.
It's the CNA firmware (11.2.1263.19) from the October SPP.
We've done a lot of testing and can reproduce the problem on hosts with the 11.2.1263.19 firmware (on both VC 4.50 and 4.61).
We have now downgraded the CNA firmware to 11.1.183.62 and our environment is stable again.
I find it very strange that HPE doesn't know about this problem; there must be many customers around the world with issues like ours.
Cheers
Johan
Hello all,
Just wanted to add to the investigation here as we are seeing the same issue of VMs intermittently dropping off the network and the fix being to vMotion the VMs to another host.
We are running a similar setup:
6x BL460C Gen10 blades with 650FLB adapters
C7000 chassis with FlexFabric 20/40 F8 modules
vSphere 6.0 (build 6921384)
ESX01, 02 and 03 are in chassis01
ESX04, 05 and 06 are in chassis02
vDS version 6.0
PortGroup Settings:
Promiscuous Mode: Reject
MAC Address Changes: Accept
Forged Transmits: Accept
Load Balancing: route based on physical NIC load
Network Failover Detection: Beacon Probing
Notify Switches: Yes
Failback: Yes
dvUplink1 and dvUplink2 both Active Uplinks
The VCs are firmware version 4.50 (we previously downgraded this because 4.60 and 4.61 were revoked by HPE)
We first started seeing this issue with all 6 hosts running
Firmware Version: 11.1.183.62 (we have to use this older firmware due to an issue on newer firmware with recovering from a fibre cable loss: the host could not see its storage paths again, even after the cable was replaced, until it was rebooted)
Driver Version: 11.2.1149.0
We have since upgraded two of the hosts to firmware version 11.2.1263.19 but have had repeat issues with VMs on these hosts so this hasn't fixed the issue.
So to re-clarify some of the suggestions on here and cover them off:
VC firmware downgrade to 4.50 doesn't fix the issue
Firmware version 11.1.183.62 doesn't fix the issue (with driver 11.2.1149.0)
Firmware version 11.2.1263.19 doesn't fix the issue either (with driver 11.2.1149.0)
I'll be logging this with VMware and HPE today. Does anyone else have any other open cases with them that I could reference to improve our chances of finding a fix?
There are suggestions in other posts (with not-so-similar hardware setups) that the issue is likely to be with the MAC address tables on the physical switches. Because we have Notify Switches turned on for the vDS portgroups, when a vMotion completes the host notifies the switches to update their MAC address tables, and this fixes the issue. So possibly the physical switches are somehow losing the correct MAC address for the VM's IP address, and the vMotion fixes this by notifying the switches of the MAC address.
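One way to test that theory would be to capture the notification frames on the uplink while vMotioning an affected VM (a sketch, assuming pktcap-uw as shipped with ESXi 5.5 and later; RARP is EtherType 0x8035):
pktcap-uw --uplink vmnic0 --ethtype 0x8035 -o /tmp/rarp-vmnic0.pcap
If the RARP/notify frames are going out as expected, that would point more strongly at the physical switch side not updating its tables.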
Thanks,
glamic26
Hi,
Just a quick update.
On our gen10 servers, we have this setup which has been stable for the last 36 hours.
@glamic26
You say that you have seen problems with 11.1.183.62? (with driver 11.2.1149.0)
Cheers
Johan
Any updates from HPE on this?
I think we're experiencing the same issue: 2 C7000 enclosures, 24 blades with Virtual Connect modules and 650FLB NICs in the blades, ESXi 6. VMs will randomly drop off the network and come back when vMotioned.
Rebooting the hosts seems to make the issue go away for a while; last time we didn't see it for about 20 days after rebooting all the blades.
We are currently downgrading the firmware on the NICs to see if this helps.
We are having very similar issues in our environment now, but only after upgrading from 6.0 to 6.5. We've upgraded our Emulex drivers to 11.2.1269 and network drivers to 11.2.1149 but continue to have issues with VMs dropping communication with VMs outside of the host on the same port group (they can still communicate with VMs on the same port group on the same host). Our VC firmware is 4.45, but from the dialogue here it seems the VC isn't the problem. Additionally, these VMs cannot talk to any other port group on the same or a different host either. It's not until we vMotion the VM, or disable/enable the VM's NIC, that the RARP brings the VM online again with the upstream switches/gateway. We may have one or several VMs go down within a short time, or we may go a couple of days without an issue; we don't see any pattern to what is triggering this event.
Environment:
C7000
BL460C Gen9 blades
FlexFabric 20Gb 2-port 650FLB Adapter 11.2.1269 / 11.2.1149
Virtual Connect firmware 4.45
ESXi 6.5 with distributed switches
We have done a lot to try and stabilize the situation, including:
Initially upgraded our Emulex drivers from 10.5 to 11.2.1269
Recreated the port groups for the original (migrated) DVS
Recreated the DVS from scratch along with all of the port groups
Rebooted the upstream switches
Changed the port group load balancing method from 'NIC load' to 'route based on originating virtual port'
Created static MAC address entries on two VMs to test communication between them (failed)
Created interface IPs on the upstream switch(es) on the failed VM's subnet to test connectivity to the VM (failed)
Removed the MAC address entry from the address table on the upstream switch
The upstream switches do not show any issues with flapping during a failure event
VMware logs, Log Insight and vROps have no visibility into the issue, as no events are logged during these failures
We have had VMs fail on both sides of the chassis/VC
Any update to your own situation would be appreciated.
Thank you.
Hey,
Did anyone get an answer to this issue? Does anyone have an HPE case number I can reference with my local HPE support team?
Thanks
Hi There,
Have you tried doing this?
To reduce burst-traffic drops, increase the buffer settings in the Windows guest.
This is applicable for vmxnet3,
and most of the time this resolves the issue.
Did anyone get an answer from HPE or VMware?
We had what looks like the same issue using the FlexFabric 650M.
But the issue went away after rebooting the host a few times, or after taking the vmnic down and up with esxcli.
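For anyone wanting to try the same workaround, bouncing a vmnic from the ESXi shell looks roughly like this (a sketch; substitute the affected vmnic):
esxcli network nic down -n vmnic0
esxcli network nic up -n vmnic0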
The issue happened on the E1000 adapter.
The guest's MAC address record on the Flex-10 did not move from the old port to the new port when I did a vMotion.
I think the Flex-10 is not receiving the RARP (or similar) packets that should update its MAC address table...