Hello everyone.
For a few months now I have had a strange and serious problem with vMotion in a multi-NIC setup. The issue doesn't happen in a single-NIC configuration. Unfortunately, VMware support didn't find anything, so maybe somebody here can help me identify the source of my problem!
My environment:
Two HP ProLiant DL380 G7 servers running ESXi 5.1
Both servers have an NC360T dual-port NIC installed
vMotion setup:
Separate vMotion VLAN
Both vmknics are in the same subnet and VLAN
Both vmknics are connected to one vSwitch
The vSwitch has two pNICs, both active
Each vmknic has a different failover order on the NICs
Regarding the vMotion pNICs:
the first NIC is onboard; the second is a port on the NC360T
both NICs are connected to one Cisco Catalyst 2960 switch
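The failover part of that setup would look roughly like this in esxcli (a sketch only; the port group and vSwitch names are assumptions, while vmnic0/vmnic4 match the switch port annotations below):

```
# Two vMotion port groups on the shared vSwitch (names assumed)
esxcli network vswitch standard portgroup add --portgroup-name=vMotion-1 --vswitch-name=vSwitch1
esxcli network vswitch standard portgroup add --portgroup-name=vMotion-2 --vswitch-name=vSwitch1

# Opposite failover orders: each vmknic actively uses one pNIC,
# with the other pNIC as standby
esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-1 --active-uplinks=vmnic0 --standby-uplinks=vmnic4
esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-2 --active-uplinks=vmnic4 --standby-uplinks=vmnic0
```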
Now my problem description:
I select around 15 VMs to migrate from one host to the other. Everything goes fine for maybe a few minutes, until my whole backbone goes "crazy" and certain VMs, routers, switches, etc. are unreachable for around 10 seconds. At the same time I can see on the switch where the vMotion pNICs are connected that the links from the target host go down and come back up a few seconds later (like a restart of the adapters)!
I don't think it's related to a hardware defect, because it happens regardless of whether Host1 or Host2 is the target.
Does anybody have an idea what my problem is, or how to identify the source?
Thanks for any comment
Results for VLAN555 (vMotion)...
-> show spanning-tree vlan 555 summary
Switch is in rapid-pvst mode
Root bridge for VLAN0555 is xxxx.xxxx.xxxx.xxxx
EtherChannel misconfig guard is enabled
Extended system ID is enabled
Portfast Default is enabled
PortFast BPDU Guard Default is disabled
Portfast BPDU Filter Default is disabled
Loopguard Default is disabled
UplinkFast is disabled
Stack port is StackPort1
BackboneFast is disabled
Configured Pathcost method used is short
Name Blocking Listening Learning Forwarding STP Active
---------------------- -------- --------- -------- ---------- ----------
VLAN0555 1 0 0 6 7
-> show spanning-tree vlan 555
VLAN0555
Spanning tree enabled protocol rstp
Root ID Priority 8747
Address xxxx.xxxx.xxxx
Cost 4
Port 24 (GigabitEthernet1/0/24)
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Bridge ID Priority 33323 (priority 32768 sys-id-ext 555)
Address 68bd.abdd.1d80
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Aging Time 300 sec
Interface Role Sts Cost Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi1/0/21 Desg FWD 4 128.21 P2p Edge -> Host A - vmnic0
Gi1/0/22 Desg FWD 4 128.22 P2p Edge -> Host B - vmnic0
Gi1/0/23 Desg FWD 4 128.23 P2p -> other Trunk
Gi1/0/24 Root FWD 4 128.24 P2p -> Uplink Backbone A
Gi2/0/21 Desg FWD 4 128.77 P2p Edge -> Host A - vmnic4
Gi2/0/22 Desg FWD 4 128.78 P2p Edge -> Host B - vmnic4
Gi2/0/24 Altn BLK 4 128.80 P2p -> Uplink Backbone B
That looks good: all ports to the hosts are in forwarding mode, and only the backup link is blocked. Could you start a vMotion and re-run these commands a few times at the same time to see if anything changes?
However, since you run Cisco Rapid PVST+, even if Spanning Tree were involved convergence should be very quick, so I do not think this is the cause.
I will check this on my next attempt... At the moment I think the problem is related to saturation of the uplinks, because the unicast traffic gets flooded through my whole network...
Do you use Cisco VTP? If so, you might have to enable VLAN pruning. Possibly the vMotion traffic is carried over your uplinks if they are members of the same VLAN.
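For reference, pruning is enabled once on the VTP server and can then be verified per trunk (the interface number here is taken from your spanning-tree output above):

```
! On the VTP server, enable pruning domain-wide
Switch(config)# vtp pruning

! Verify which VLANs are actually allowed and pruned on the uplink trunk
Switch# show interfaces gigabitEthernet 1/0/24 trunk
```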
VTP is configured and pruning is active
But you still see that the uplink gets congested while doing vMotion?
I am not one hundred percent sure, and I have to analyze this point in more detail. I hope I can proceed soon, but I am waiting for a maintenance window...
In the meantime I really want to thank you for your thoughts, and I will keep you updated.
Interesting resolved issue in ESXi 5.1 U1...
Long running vMotion operations might result in unicast flooding
When using the multiple-NIC vMotion feature with vSphere 5, if vMotion operations continue for a long time, unicast flooding is observed on all interfaces of the physical switch. If the vMotion takes longer than the ageing time that is set for MAC address tables, the source and destination host start receiving high amounts of network traffic.
This issue is resolved in this release.
...could this be my problem?!
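The mechanism in that release note can be illustrated with a minimal sketch (plain Python, hypothetical MAC/port values): during a long vMotion the receiving vmknic sends almost nothing, so the switch never refreshes its MAC table entry; once the entry ages out, every further unicast frame of the vMotion stream is flooded to all ports in the VLAN.

```python
# Minimal model of a switch MAC address table with aging (illustrative only).

class MacTable:
    def __init__(self, aging_time=300):
        self.aging_time = aging_time   # seconds (Cisco default: 300)
        self.entries = {}              # mac -> (port, last_seen)

    def learn(self, mac, port, now):
        """Source-MAC learning refreshes the entry's timestamp."""
        self.entries[mac] = (port, now)

    def forward(self, dst_mac, now):
        """Return the egress port, or 'FLOOD' if the entry is unknown or aged out."""
        entry = self.entries.get(dst_mac)
        if entry is None or now - entry[1] > self.aging_time:
            self.entries.pop(dst_mac, None)
            return "FLOOD"             # unknown unicast -> flood all VLAN ports
        return entry[0]

table = MacTable(aging_time=300)

# t=0: destination vmknic sends one frame, switch learns its MAC
table.learn("68bd.abdd.aaaa", "Gi1/0/22", now=0)   # hypothetical MAC

# t=100: vMotion traffic toward it is still switched normally
print(table.forward("68bd.abdd.aaaa", now=100))    # -> Gi1/0/22

# t=400: the vmknic has only been receiving for >300 s, entry aged out:
# the rest of the vMotion stream is flooded across the whole VLAN
print(table.forward("68bd.abdd.aaaa", now=400))    # -> FLOOD
```

Which matches your symptom exactly: the saturation would hit every port in the VLAN, not just the two hosts.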