vMotion Multi-NIC setup / network outage

KKrtz · ‎04-13-2013

Hello everyone,

For a few months I have had a strange and serious problem with vMotion in a multi-NIC setup. The issue does not happen in a single-NIC configuration. Unfortunately VMware support didn't find anything, so maybe somebody here can help me identify the source of my problem!

My environment:

Two HP ProLiant DL380 G7 servers running ESXi 5.1

Both HP servers have an NC360T dual-port NIC installed

vMotion setup:

Separate vMotion VLAN

Both vmknics are in the same subnet and VLAN

Both vmknics are connected to one vSwitch

The vSwitch has two pNICs in active/active state

The vmknics have different failover orders on the NICs

Regarding the vMotion pNICs:

The first NIC is onboard; the second NIC is on the NC360T

Both NICs are connected to one Cisco Catalyst 2960 switch

Now my problem description:

I select around 15 VMs to migrate from one host to the other. Everything goes fine for maybe a few minutes, until my whole backbone goes "crazy" and certain VMs, routers, switches, etc. are unreachable for around 10 seconds. At the same time I can see on the switch where the vMotion pNICs are connected that the links from the target host go down and come back up a few seconds later (like a restart of the adapters)!

I don't think it is related to a hardware defect, because it happens with either Host1 or Host2.

Does anybody have an idea where my problem is, or how to identify the source?

Thanks for any comments.

rickardnobel · ‎04-13-2013

KKrtz wrote:
Does anybody have an idea where my problem is, or how to identify the source?

It could be that your physical switches cannot handle the load, which causes these disconnections. With multi-NIC vMotion the amount of traffic sent between the hosts can be very high.

What kind of physical switch (or switches) do you have connected to the hosts?

My VMware blog: www.rickardnobel.se

KKrtz · ‎04-13-2013

Cisco Catalyst 2960-S ... The backplane should be able to handle the traffic!

rickardnobel · ‎04-14-2013

When you say that you lose connections to routers and other network devices, is that from inside the ESXi hosts, or do you also lose connections from other network-attached devices?

My VMware blog: www.rickardnobel.se

michaelstump · ‎04-14-2013

I wonder if using the HP Customized ESXi ISO would help? Maybe this is a driver issue that only appears under heavy load?

Data Center Virtualization with VMware - theeagerzero.blogspot.com

KKrtz · ‎04-14-2013

@rickardnobel: my network monitoring sends me several alarms, and my client PC also cannot reach some VMs in the LAN or WAN

@michaelstump: I am already using the HP image

At the moment I am following up on some hints regarding unicast flooding and ARP timeouts affecting the spanning tree infrastructure... Does anybody have experience with this, or any hints?

rickardnobel · ‎04-14-2013

KKrtz wrote:
At the moment I am following up on some hints regarding unicast flooding and ARP timeouts affecting the spanning tree infrastructure... Does anybody have experience with this, or any hints?

vMotion and ARP should not impact Spanning Tree in any way. However, if your network card is being disconnected because of some error, then Spanning Tree might make the situation worse, not least if you are running the original and obsolete STP.

My VMware blog: www.rickardnobel.se

KKrtz · ‎04-14-2013

As far as I understand, the spanning tree can get flooded if the switch loses the destination MAC from its table after a timeout... This can fill up the uplinks to the core...

a_p_ · ‎04-14-2013

A few thoughts:

Please provide the physical switch port configurations (e.g. show run int giX/X).
Is your vMotion VLAN a routed or non-routed subnet?
Can you confirm that the vSwitches and the port groups use default settings, except for the active/standby configuration?
Can you confirm that none of the other VMkernel ports have "vMotion" accidentally enabled?

André

rickardnobel · ‎04-14-2013

KKrtz wrote:
As far as I understand, the spanning tree can get flooded if the switch loses the destination MAC from its table after a timeout... This can fill up the uplinks to the core...

Spanning Tree does not really care about MAC forwarding; its only responsibility is to make sure we have a loop-free layer 2 topology.

The ordinary switch forwarding engine controls the MAC-to-port mappings and might flood frames if the table is full or the destination MAC is not (or no longer) in it. In normal cases this should last only a fraction of a second, since once a reply comes in the switch re-learns the MAC-to-port mapping.
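To illustrate the flood-then-relearn behaviour described above, here is a minimal Python sketch of a learning switch. It is a toy model only, not how the 2960 is implemented; the class name `LearningSwitch`, the port numbers, and the MAC strings "aa" and "bb" are all made up for the example.

```python
class LearningSwitch:
    """Toy layer 2 switch: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the list of ports it is sent out of."""
        # Learn (or refresh) the port behind which the sender lives.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown unicast: flood to every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        return [out_port]

    def age_out(self, mac):
        """Simulate the entry for this MAC timing out of the table."""
        self.mac_table.pop(mac, None)


switch = LearningSwitch(ports=[1, 2, 3, 4])
first = switch.receive(1, "aa", "bb")          # "bb" unknown: flooded
reply = switch.receive(2, "bb", "aa")          # reply teaches the switch where "bb" is
second = switch.receive(1, "aa", "bb")         # now forwarded to one port only
switch.age_out("bb")                           # entry times out
after_timeout = switch.receive(1, "aa", "bb")  # flooded again until the next reply
print(first, reply, second, after_timeout)
```

In this model the flooding stops as soon as a single frame from the destination arrives, which is why it normally lasts only a moment. It would only persist if the destination stayed silent or its entry kept disappearing from the table.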

My VMware blog: www.rickardnobel.se

KKrtz · ‎04-14-2013

You are right ... and I will check whether there are any debug options! ... I still don't understand why the vMotion pNIC links are going down

KKrtz · ‎04-14-2013

@a.p.:

Switchport config for all vMotion/ESXi ports:
switchport mode trunk
switchport nonegotiate
spanning-tree portfast trunk

vMotion is a non-routed VLAN, but it is available in the whole switch infrastructure
All vSwitches have default settings except for the failover order
Three vmknics are configured:
1x management / replication - vSwitch1
2x vMotion only - vSwitch0

a_p_ · ‎04-14-2013

I don't see anything wrong with the settings.

Although it is very unlikely to be the issue, to rule it out it might be worth configuring the physical ports as access ports and removing the VLAN tag from the port groups.

André

rickardnobel · ‎04-14-2013

Could you also double-check that you have "Notify Switches" enabled on the vMotion VMkernel port group?

My VMware blog: www.rickardnobel.se

KKrtz · ‎04-14-2013

The "Notify Switches" setting is enabled on all vSwitches in my environment!

rickardnobel · ‎04-15-2013

KKrtz wrote:
my network monitoring sends me several alarms, and my client PC also cannot reach some VMs in the LAN or WAN

Are any of those alarms about physical devices that could not reach other physical devices?

My VMware blog: www.rickardnobel.se

KKrtz · ‎04-15-2013

Physical-to-physical connections have the same problems as the VMs...

rickardnobel · ‎04-15-2013

KKrtz wrote:
Physical-to-physical connections have the same problems as the VMs...

That is interesting of course, since it means this is not only a logical problem inside the vSphere hosts, but something that really happens on your physical network.

Do you have access to the 2960 via CLI / Telnet / SSH? Could you check the logs to see whether anything obvious happens when you do the vMotion? Look for any Spanning Tree events. These should not happen, but if the host NIC is for some reason overloaded and disconnected, it might trigger an STP recalculation.
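If the switch log has been saved to a text file, a small script can make link flaps easier to spot. The following is only a sketch: it assumes the log contains the usual IOS `%LINK-3-UPDOWN` messages, and the sample lines and the interface name GigabitEthernet1/0/5 are invented for the example, not taken from the switch in this thread.

```python
import re
from collections import Counter

# Physical link state changes are logged by IOS as %LINK-3-UPDOWN messages.
LINK_EVENT = re.compile(
    r"%LINK-3-UPDOWN: Interface ([^\s,]+), changed state to (up|down)"
)


def count_link_transitions(log_text):
    """Count physical link up/down transitions per interface."""
    transitions = Counter()
    for line in log_text.splitlines():
        match = LINK_EVENT.search(line)
        if match:
            interface, state = match.groups()
            transitions[(interface, state)] += 1
    return transitions


# Made-up sample input; real logs normally carry a timestamp prefix as well.
SAMPLE_LOG = """\
%LINK-3-UPDOWN: Interface GigabitEthernet1/0/5, changed state to down
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/5, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet1/0/5, changed state to up
"""

counts = count_link_transitions(SAMPLE_LOG)
for (interface, state), n in sorted(counts.items()):
    print(f"{interface}: {n} transition(s) to {state}")
```

Comparing the times of these transitions with the start of the vMotion batch would show whether the link drops come before or after the wider outage. Timestamps are not parsed here.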

My VMware blog: www.rickardnobel.se

KKrtz · ‎04-15-2013

Yes it is... I have to enable debugging for spanning tree on the involved switches and run a test. Unfortunately this is not something I can do quickly during working hours 😉

rickardnobel · ‎04-15-2013

KKrtz wrote:
Yes it is... I have to enable debugging for spanning tree on the involved switches and run a test. Unfortunately this is not something I can do quickly during working hours 😉

You might not have to enable debug mode. Could you just run this command and paste the results?

show spanning-tree

My VMware blog: www.rickardnobel.se
