Hello everyone,
For a few months now I have had a strange and serious problem with vMotion in a multi-NIC setup. The issue does not happen in a single-NIC configuration. Unfortunately, VMware support didn't find anything, so maybe somebody here can help me identify the source of my problem!
My environment:
Two HP Proliant DL380G7 running ESXi 5.1
Both HP Server have an NC360T Dual-Port NIC installed
vMotion setup:
Separate vMotion VLAN
Both vmknic are in the same subnet and VLAN
Both vmknic are connected to one vSwitch
vSwitch has two pNICs in active/active state
Each vmknic has a different failover order for the pNICs
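For readers following along, the teaming layout described in the list above can be sketched as a small sanity check. This assumes the usual multi-NIC vMotion arrangement, where each vmknic has exactly one active uplink and uses the other vmknic's active uplink as standby. The names `vmk1`, `vmk2`, `vmnic0` and `vmnic1` are placeholders and not taken from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmkTeaming:
    """Failover order of one vMotion vmknic (all names are hypothetical)."""
    vmk: str
    active: tuple
    standby: tuple


def is_mirrored_failover(a, b):
    """True when each vmknic has one active uplink and the two orders are mirrored."""
    one_active_each = len(a.active) == 1 and len(b.active) == 1
    different_active = set(a.active).isdisjoint(b.active)
    mirrored = a.active == b.standby and b.active == a.standby
    return one_active_each and different_active and mirrored


# Assumed layout: vmk1 prefers vmnic0, vmk2 prefers vmnic1.
good = (VmkTeaming("vmk1", ("vmnic0",), ("vmnic1",)),
        VmkTeaming("vmk2", ("vmnic1",), ("vmnic0",)))
# Misconfiguration: both vmknics are active on the same uplink.
bad = (VmkTeaming("vmk1", ("vmnic0",), ("vmnic1",)),
       VmkTeaming("vmk2", ("vmnic0",), ("vmnic1",)))

print(is_mirrored_failover(*good))
print(is_mirrored_failover(*bad))
```

If both vmknics end up active on the same pNIC, the setup behaves like single-NIC vMotion, so it is worth ruling that out before looking at the physical network.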
Regarding the vMotion pNICs:
first NIC is onboard / second NIC is on the NC360T
both NICs are connected to one Cisco Catalyst 2960 switch
Now my problem description:
I select around 15 VMs to migrate from one host to the other. Everything goes fine for maybe a few minutes, until my whole backbone goes "crazy" and certain VMs, routers, switches, etc. are unreachable for around 10 seconds. At the same time I can see on the switch where the vMotion pNICs are connected that the links from the target host go down and come back up a few seconds later (like a restart of the adapters)!
I don't think it is a hardware defect, because it happens regardless of whether I migrate from Host1 or from Host2.
Does anybody have an idea where my problem is, or how to identify the source?
Thanks for any comments.
KKrtz wrote:
Does anybody have an idea where my problem is, or how to identify the source?
It could be that your physical switches cannot handle the load, and that this causes the disconnections. With multi-NIC vMotion the amount of traffic sent between the hosts can be very high.
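To put a rough number on "very high", here is a back-of-the-envelope sketch. It assumes 1 Gbit/s links, 4 GiB of memory per VM and 90% usable line rate. None of these figures come from the thread, and the estimate ignores the extra pre-copy passes for pages that change during the migration.

```python
def transfer_seconds(total_gib, links, link_gbit_per_s=1.0, efficiency=0.9):
    """Estimate the time to copy total_gib of VM memory over `links` parallel NICs."""
    total_bits = total_gib * 1024 ** 3 * 8
    usable_bits_per_s = links * link_gbit_per_s * 1e9 * efficiency
    return total_bits / usable_bits_per_s


# Assumption: 15 VMs with 4 GiB of memory each.
total_gib = 15 * 4
for links in (1, 2):
    seconds = transfer_seconds(total_gib, links)
    print(f"{links} link(s): about {seconds / 60:.1f} minutes at full line rate")
```

The point is less the exact duration than the fact that both uplinks of the target host receive at close to line rate for several minutes, which is the window in which the links were seen to drop.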
What kind of physical switch (or switches) do you have connected to the hosts?
Cisco Catalyst 2960-S ... the backplane should be able to handle the traffic!
When you say that you lose connections to routers and other network devices, is that from inside the ESXi hosts, or do you lose the connections from other network-attached devices as well?
I wonder if using the HP Customized ESXi ISO would help? Maybe this is a driver issue that only appears under heavy load?
@richard noble: my network monitoring sends me several alarms, and my client PC cannot reach some VMs in the LAN or WAN either
@michaelstump: I am already using the HP image
At the moment I am following up on some hints about unicast flooding and ARP timeouts affecting the spanning tree infrastructure... does anybody have experience with this, or any pointers?
KKrtz wrote:
At the moment I am following up on some hints about unicast flooding and ARP timeouts affecting the spanning tree infrastructure... does anybody have experience with this, or any pointers?
vMotion and ARP should not impact Spanning Tree in any way. However, if your network card is being disconnected because of some error, then Spanning Tree might make the situation worse, especially if you are running the original and obsolete STP.
As far as I understand, the spanning tree can get flooded if the switch loses the destination MAC from its table after a timeout... this can fill up the uplinks to the core...
A few thoughts:
André
KKrtz wrote:
As far as I understand, the spanning tree can get flooded if the switch loses the destination MAC from its table after a timeout... this can fill up the uplinks to the core...
Spanning Tree does not really care about MAC forwarding; its only responsibility is to make sure we have a loop-free layer 2 topology.
The ordinary switch forwarding engine controls the MAC-to-port mappings and might flood frames if the tables get full or an entry has aged out. In normal cases this should last only a fraction of a second, since the switch re-learns the MAC-to-port mapping as soon as a reply comes in.
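The learn, flood and re-learn behaviour described above can be sketched with a toy learning switch. This is a simplified model and not how any particular Catalyst implements it. The 300-second aging time is an assumption (a common default), and the MAC addresses "A" and "B" are placeholders.

```python
class LearningSwitch:
    """Toy layer 2 switch: learns source MACs and floods unknown destinations."""

    def __init__(self, ports, aging_s=300.0):
        self.ports = set(ports)
        self.aging_s = aging_s
        self.table = {}  # mac -> (port, time the MAC was last seen as a source)

    def receive(self, now, in_port, src, dst):
        """Learn src on in_port, then return the set of ports the frame leaves on."""
        self.table[src] = (in_port, now)
        entry = self.table.get(dst)
        if entry is None or now - entry[1] > self.aging_s:
            # Unknown (or aged-out) destination: flood to every other port.
            self.table.pop(dst, None)
            return self.ports - {in_port}
        return {entry[0]}


sw = LearningSwitch(ports=[1, 2, 3, 4])
print(sorted(sw.receive(0.0, 1, "A", "B")))     # B unknown: flooded to [2, 3, 4]
print(sorted(sw.receive(0.1, 2, "B", "A")))     # reply from B: forwarded to [1]
print(sorted(sw.receive(0.2, 1, "A", "B")))     # B now learned: forwarded to [2]
print(sorted(sw.receive(1000.0, 1, "A", "B")))  # B aged out: flooded to [2, 3, 4]
```

Flooding only becomes a sustained problem when the destination never sends anything back through the switch, so its entry is never refreshed. None of this involves Spanning Tree.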
You are right... and I will check whether there are any debug options! ... I still don't understand why the vMotion pNIC links are going down
@a.p.:
I don't see anything wrong with the settings.
Although it is very unlikely to be the issue, to rule it out it might be worth configuring the physical ports as access ports and removing the VLAN tag from the port groups.
André
Could you also double-check that you have "Notify Switches" enabled on the vMotion VMkernel port group?
The "Notify Switches" setting is enabled on all vSwitches in my environment!
KKrtz wrote:
my network monitoring sends me several alarms, and my client PC cannot reach some VMs in the LAN or WAN either
Are any of those alarms about physical devices that could not reach other physical devices?
Physical-to-physical connections have the same problems as the VMs...
KKrtz wrote:
Physical-to-physical connections have the same problems as the VMs...
That is interesting, of course, since it means this is not only a logical problem inside the vSphere hosts but something that really happens on your physical network.
Do you have access to the 2960 via CLI / Telnet / SSH? Could you check the logs to see whether anything obvious happens when you do the vMotion? Look for any Spanning Tree events. These should not happen, but if the host NIC is overloaded and disconnected for some reason, that might trigger an STP recalculation.
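If the log is long, a small filter can help pull out the relevant lines. The sketch below assumes the log has been saved as plain text (for example from `show logging`) and matches the standard IOS message mnemonics for link state changes plus anything from the SPANTREE facility. The sample lines are made up for illustration and are not from the poster's switch.

```python
import re

# Link up/down, line protocol up/down, and any Spanning Tree message.
EVENT = re.compile(r"%(LINK-3-UPDOWN|LINEPROTO-5-UPDOWN|SPANTREE-\d-[A-Z_]+):\s*(.*)")


def interesting_events(log_text):
    """Return (mnemonic, message) pairs for link-state and Spanning Tree events."""
    events = []
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if match:
            events.append((match.group(1), match.group(2).strip()))
    return events


# Illustrative sample only; interface names are placeholders.
sample = """\
%LINK-3-UPDOWN: Interface GigabitEthernet1/0/1, changed state to down
%SYS-5-CONFIG_I: Configured from console by console
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1/0/1, changed state to up
"""

for mnemonic, message in interesting_events(sample):
    print(mnemonic, "|", message)
```

Comparing the timestamps of these lines with the start of the vMotion batch should show whether the port flaps come first or follow some other event.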
Yes, it is... I have to enable Spanning Tree debugging on the switches involved and run a test. Unfortunately, this is not something I can do quickly during working hours 😉
KKrtz wrote:
Yes, it is... I have to enable Spanning Tree debugging on the switches involved and run a test. Unfortunately, this is not something I can do quickly during working hours 😉
You might not have to enable debug mode. Could you just run this command and paste the results?
show spanning-tree