Hello everyone.
For a few months now I have had a strange and serious problem with vMotion in a multi-NIC setup. The issue doesn't happen in a single-NIC configuration. Unfortunately, VMware support didn't find anything, so maybe somebody here can help me identify the source of my problem!
My environment:
Two HP ProLiant DL380 G7 servers running ESXi 5.1
Both servers have an NC360T dual-port NIC installed
vMotion setup:
Separate vMotion VLAN
Both vmknics are in the same subnet and VLAN
Both vmknics are connected to one vSwitch
The vSwitch has two pNICs, both active
Each vmknic has a different failover order on the NICs
Regarding the vMotion pNICs:
the first NIC is onboard; the second is a port on the NC360T
both NICs are connected to one Cisco Catalyst 2960 switch
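The failover part of that setup would look roughly like this in esxcli (a sketch only; the port group and vSwitch names are assumptions, while vmnic0/vmnic4 match the switch port annotations below):

```
# Two vMotion port groups on the shared vSwitch (names assumed)
esxcli network vswitch standard portgroup add --portgroup-name=vMotion-1 --vswitch-name=vSwitch1
esxcli network vswitch standard portgroup add --portgroup-name=vMotion-2 --vswitch-name=vSwitch1

# Opposite failover orders: each vmknic actively uses one pNIC,
# with the other pNIC as standby
esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-1 --active-uplinks=vmnic0 --standby-uplinks=vmnic4
esxcli network vswitch standard portgroup policy failover set --portgroup-name=vMotion-2 --active-uplinks=vmnic4 --standby-uplinks=vmnic0
```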
Now my problem description:
I select around 15 VMs to migrate from one host to the other. Everything goes fine for maybe a few minutes, until my whole backbone goes "crazy" and certain VMs, routers, switches, etc. are unreachable for around 10 seconds. At the same time I can see on the switch where the vMotion pNICs are connected that the links from the target host go down and come back up a few seconds later (like a restart of the adapters)!
I don't think it's related to a hardware defect, because it happens regardless of whether Host1 or Host2 is the target.
Does anybody have an idea what my problem is, or how to identify the source?
Thanks for any comment
Results for VLAN555 (vMotion)...
-> show spanning-tree vlan 555 summary
Switch is in rapid-pvst mode
Root bridge for VLAN0555 is xxxx.xxxx.xxxx.xxxx
EtherChannel misconfig guard is enabled
Extended system ID is enabled
Portfast Default is enabled
PortFast BPDU Guard Default is disabled
Portfast BPDU Filter Default is disabled
Loopguard Default is disabled
UplinkFast is disabled
Stack port is StackPort1
BackboneFast is disabled
Configured Pathcost method used is short
Name Blocking Listening Learning Forwarding STP Active
---------------------- -------- --------- -------- ---------- ----------
VLAN0555 1 0 0 6 7
-> show spanning-tree vlan 555
VLAN0555
Spanning tree enabled protocol rstp
Root ID Priority 8747
Address xxxx.xxxx.xxxx
Cost 4
Port 24 (GigabitEthernet1/0/24)
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Bridge ID Priority 33323 (priority 32768 sys-id-ext 555)
Address 68bd.abdd.1d80
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Aging Time 300 sec
Interface Role Sts Cost Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Gi1/0/21 Desg FWD 4 128.21 P2p Edge -> Host A - vmnic0
Gi1/0/22 Desg FWD 4 128.22 P2p Edge -> Host B - vmnic0
Gi1/0/23 Desg FWD 4 128.23 P2p -> other Trunk
Gi1/0/24 Root FWD 4 128.24 P2p -> Uplink Backbone A
Gi2/0/21 Desg FWD 4 128.77 P2p Edge -> Host A - vmnic4
Gi2/0/22 Desg FWD 4 128.78 P2p Edge -> Host B - vmnic4
Gi2/0/24 Altn BLK 4 128.80 P2p -> Uplink Backbone B
That looks good: all ports to the hosts are in forwarding mode, and only the backup link is blocked. Could you start a vMotion and re-run these commands a few times at the same time to see if anything changes?
However, since you run Cisco Rapid PVST+, even if Spanning Tree were involved convergence should be very quick, so I do not think this is the cause.
I will check this on my next attempt... At the moment I think the problem is related to saturation of the uplinks, because the unicast traffic gets flooded through my whole network...
Do you use Cisco VTP? If so, you might have to enable VLAN pruning. Possibly the vMotion traffic is carried over your uplinks if they are members of the same VLAN.
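For reference, pruning is enabled once on the VTP server and can then be verified per trunk (the interface number here is taken from your spanning-tree output above):

```
! On the VTP server, enable pruning domain-wide
Switch(config)# vtp pruning

! Verify which VLANs are actually allowed and pruned on the uplink trunk
Switch# show interfaces gigabitEthernet 1/0/24 trunk
```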
VTP is configured and pruning is active
But you still see that the uplink gets congested while doing vMotion?
I am not one hundred percent sure, and I have to analyze this point in more detail. I hope I can proceed soon, but I am waiting for a maintenance window...
In the meantime I really want to thank you for your thoughts, and I will keep you updated.
Interesting resolved issue in ESXi 5.1 U1...
Long running vMotion operations might result in unicast flooding
When using the multiple-NIC vMotion feature with vSphere 5, if vMotion operations continue for a long time, unicast flooding is observed on all interfaces of the physical switch. If the vMotion takes longer than the ageing time that is set for MAC address tables, the source and destination host start receiving high amounts of network traffic.
This issue is resolved in this release.
...could this be my problem?!
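The mechanism in that release note can be illustrated with a minimal sketch (plain Python, hypothetical MAC/port values): during a long vMotion the receiving vmknic sends almost nothing, so the switch never refreshes its MAC table entry; once the entry ages out, every further unicast frame of the vMotion stream is flooded to all ports in the VLAN.

```python
# Minimal model of a switch MAC address table with aging (illustrative only).

class MacTable:
    def __init__(self, aging_time=300):
        self.aging_time = aging_time   # seconds (Cisco default: 300)
        self.entries = {}              # mac -> (port, last_seen)

    def learn(self, mac, port, now):
        """Source-MAC learning refreshes the entry's timestamp."""
        self.entries[mac] = (port, now)

    def forward(self, dst_mac, now):
        """Return the egress port, or 'FLOOD' if the entry is unknown or aged out."""
        entry = self.entries.get(dst_mac)
        if entry is None or now - entry[1] > self.aging_time:
            self.entries.pop(dst_mac, None)
            return "FLOOD"             # unknown unicast -> flood all VLAN ports
        return entry[0]

table = MacTable(aging_time=300)

# t=0: destination vmknic sends one frame, switch learns its MAC
table.learn("68bd.abdd.aaaa", "Gi1/0/22", now=0)   # hypothetical MAC

# t=100: vMotion traffic toward it is still switched normally
print(table.forward("68bd.abdd.aaaa", now=100))    # -> Gi1/0/22

# t=400: the vmknic has only been receiving for >300 s, entry aged out:
# the rest of the vMotion stream is flooded across the whole VLAN
print(table.forward("68bd.abdd.aaaa", now=400))    # -> FLOOD
```

Which matches your symptom exactly: the saturation would hit every port in the VLAN, not just the two hosts.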