VMware Cloud Community
ZFSRocks
Contributor

ESXi management not working with Cisco LAG

I am setting up a 3-host ESXi cluster with an Essentials Plus license. For switching redundancy I am using a pair of stacked Cisco SG500-28 switches. Each host has 8 NICs, 4 to each switch. I have successfully set up a 3-NIC LAG with 1 path to one switch and 2 paths to the other, and those LAGs work. But when I set up a 2-NIC LAG for management via the console, along with the associated ports on the switches, I lose management communication with the host. Before setting up the LAG in the ESXi console, I set that vSwitch's properties to use IP hash as instructed here: bit.ly/VLaTEt. I have attempted to follow those instructions as closely as possible, and it works on the other vSwitch, which is for vMotion; I can vmkping between the hosts over that LAG. But setting up a LAG on the management vSwitch causes the host to disappear from vCenter.
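
For reference, on ESXi 5.x the equivalent vSwitch-level change can also be made from the command line with esxcli; this is just a sketch and assumes the management vSwitch is vSwitch0:

# set the NIC teaming / load balancing policy on the vSwitch to IP hash
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash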

Any help would be greatly appreciated as I need to get this cluster up.

Thanks,

Stephen

20 Replies
rickardnobel
Champion

How have you set up the Link Aggregation Group on the physical switch? Is it static, i.e. no PAgP or LACP? That is typically called "mode on" on Cisco devices.
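
On the SG500 CLI a static (non-LACP) LAG member would look roughly like the following; this is only a sketch, since the exact interface names depend on your stack/unit numbering:

interface gi1/1/1
 channel-group 1 mode on
! repeat on the member port of the other stack unit, e.g. gi2/1/1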

My VMware blog: www.rickardnobel.se
ZFSRocks
Contributor

It is static; LACP is off. I did find in the manual that the SG500 series supports two load-balancing modes: src-dst-mac and src-dst-mac-ip. The default is src-dst-mac. I changed it to src-dst-mac-ip, but it didn't seem to help. The other LAGs that are working are still on src-dst-mac.
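
For reference, that is the global port-channel load-balance setting; from the SG500 CLI it would be something like this (a sketch, exact syntax may vary by firmware):

configure terminal
port-channel load-balance src-dst-mac-ip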

Stephen

rickardnobel
Champion

ZFSRocks wrote:

I did find in the manual that the SG500 series supports two load-balancing modes: src-dst-mac and src-dst-mac-ip. The default is src-dst-mac. I changed it to src-dst-mac-ip, but it didn't seem to help.

The actual load balancing mode does not matter; each side is free to select how it distributes its outgoing frames.

The other LAGs that are working are still on src-dst-mac.

One question here: are you sure that they work? You would need some VMs and quite a lot of clients with different IP addresses to be sure that traffic is distributed across all links, and depending on how you have tested this, it might actually not be working fully there either.

One more question: are you sure that these specific Cisco switches support cross-switch EtherChannel? I have not worked with the SG500, so I do not know, but many switches do not.

My VMware blog: www.rickardnobel.se
ZFSRocks
Contributor

Rickard,

I guess I define "work" as not losing the connection. With the other vSwitch, I can vmkping the vMotion interfaces over each of the three LAGs to each of the three hosts in the cluster, which means those LAGs are at least passing traffic. Whether they would actually load balance is a different question.
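
For anyone following along, that reachability test is just vmkping from one host's ESXi shell to another host's vmkernel address, for example (addresses made up):

vmkping 10.0.0.12
# on ESXi 5.1 and later you can also pin the outgoing vmkernel interface:
vmkping -I vmk1 10.0.0.12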

Now, I am setting up the LAG on the management interface via the host console and not via vCenter. The docs say to do it via vCenter, but I don't want to do anything I can't undo if I can't get them to talk. With as much trouble as I have been having, I guess I am worried I will lose the host.

Stephen

rickardnobel
Champion

ZFSRocks wrote:

Which means those LAGs are at least passing traffic. Whether they would actually load balance is a different question.

What I primarily mean is that the multi-switch LAG might not be fully functional and that you have, by luck, not seen the effects of this yet. However, that is just speculation based on how the VMware IP hash algorithm works.

Now, I am setting up the LAG on the management interface via the host console and not via vCenter. The docs say to do it via vCenter, but I don't want to do anything I can't undo if I can't get them to talk. With as much trouble as I have been having, I guess I am worried I will lose the host.

By setting up the LAG, do you mean setting the vSwitch NIC teaming policy to IP Hash and then attaching the specific vmnics?

You can always quite easily remove extra vmnics from vSwitch0 through the ESXi DCUI console, so it should be possible to work from the vSphere Client and still be able to revert. That would also make it somewhat easier to look at the settings and verify they are correct.
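
If the ESXi Shell or SSH is enabled, pulling an uplink back out of vSwitch0 is also a one-liner, so the change is easy to revert; a sketch, where vmnic1 is just an example name:

esxcli network vswitch standard uplink remove --uplink-name=vmnic1 --vswitch-name=vSwitch0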

I am still curious whether the Cisco devices support multi-switch EtherChannel or not; however, do you really need IP Hash / a LAG on the management interfaces? You could still get redundancy with multiple vmnics connected to both physical switches using the "Port ID" NIC teaming policy, which needs no physical switch configuration except VLAN tagging.
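
Port ID is the default policy, so you would mostly just need to make sure the vSwitch is not set to IP hash; a sketch of the esxcli equivalent, assuming vSwitch0:

# revert the load balancing policy to the default "route based on originating port ID"
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=portid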

My VMware blog: www.rickardnobel.se
jasonvp
Contributor

Rickard Nobel wrote:

I am still curious if the Cisco devices support multi-switch etherchannel or not

The SG500 series switches support a feature called "Stacking" which turns 2 or more switches into a single logical switch (basically).  So a LAG across 2 switches to a single server should work just fine, assuming the OP has in fact enabled the stacking feature.

jas

rickardnobel
Champion

jasonvp wrote:

The SG500 series switches support a feature called "Stacking" which turns 2 or more switches into a single logical switch (basically).  So a LAG across 2 switches to a single server should work just fine, assuming the OP has in fact enabled the stacking feature.

From a quick look at the manual, starting at page 150, I see nothing that clearly states whether it is supported or unsupported. However, the stacking might be solid enough that it does not have to be specifically mentioned at all.

http://www.cisco.com/en/US/docs/switches/lan/csbms/Sx500/administration_guide/500_Series_Admin_Guide...

My VMware blog: www.rickardnobel.se
ZFSRocks
Contributor

It is most definitely a stacked switch. The management allows for creating LAGs across units.

Here is what I am starting to think: I am wondering whether the load balancing that is set from the console's "Configure Management Network" settings is the same. If I add a NIC from that menu but don't create the LAG on the switch, vCenter shows two NICs, with one in standby mode. I did test whether disconnecting the active link caused a failover, and it didn't.

Stephen

rickardnobel
Champion

ZFSRocks wrote:

I am wondering whether the load balancing that is set from the console's "Configure Management Network" settings is the same. If I add a NIC from that menu but don't create the LAG on the switch, vCenter shows two NICs, with one in standby mode.

If possible, it would be good to open the graphical vSphere Client and verify the settings now. The network configuration options in the DCUI are really only meant as a quick way to do the initial IP configuration before using the vSphere Client for the first time, and to repair network configuration mistakes. Because of that, it is somewhat unclear what happens to the vSwitch NIC teaming policy and similar settings when making changes from the DCUI.
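
If you have shell access, you can also dump the effective teaming policy for both the vSwitch and the management port group with esxcli; a sketch, assuming vSwitch0 and the default "Management Network" port group name:

esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
esxcli network vswitch standard portgroup policy failover get --portgroup-name="Management Network"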

And again: are you sure you need IP Hash / a LAG for the management network? You could get full redundancy with the default Port ID policy and no LAG configuration on the physical switches.

My VMware blog: www.rickardnobel.se
ZFSRocks
Contributor

Rickard,

The cluster is going to be using VSA, so I am trying to replicate this layout, http://bit.ly/13ezh3v, except with a total of 8 NICs per host. Since the VSA front-end network has to be on the same VLAN as management, I was just going to put them in the same LAG. If there is an alternative or better way, I am open to it. I am not at all familiar with the default Port ID policy. Can you point me to some info on it?

Stephen

ZFSRocks
Contributor

So, I made an image of my ESXi boot disk, then went into vCenter and set up the LAG on the management vSwitch, and then set it up on the physical switches. The host came back, but it is in a very odd state. I seem to have full control of it from vCenter, but I cannot vmkping it from the other hosts. vCenter has it in an alarm state and says "HA detected a possible host failure of this host," and I cannot get the alarm to clear. So, what is going on? It seems to be operational, yet it isn't.

Stephen

ZFSRocks
Contributor

It looks like vCenter eventually lost track of it. Odd, because it took a very long time; in fact, it wasn't until I tried to add two more NICs that it actually failed.

Stephen

rickardnobel
Champion

ZFSRocks wrote:

I am not at all familiar with the default Port ID policy. Can you point me to some info on it?

The VMware Port ID policy gives you a simple kind of load balancing together with failover. However, a single VM cannot get more bandwidth than a single physical NIC port can provide. The only "extra" with IP Hash is that it can give a VM the bandwidth of several physical NIC ports at the same time, if there are multiple external clients.

Here is a good overview: http://kensvirtualreality.wordpress.com/2009/04/05/the-great-vswitch-debate%E2%80%93part-3/

ZFSRocks wrote:

I seem to have full control of it from vCenter, but I cannot vmkping it from the other hosts. vCenter has it in an alarm state and says "HA detected a possible host failure of this host," and I cannot get the alarm to clear. So, what is going on? It seems to be operational, yet it isn't.

The possible host failure detection from HA comes from the loss of management connectivity, so the network is not stable in this condition.

I see two possible reasons at the moment:

1. The configuration from the DCUI might do something "strange" with the vSwitch IP Hash configuration. For IP Hash you must have that NIC teaming policy set on both the vSwitch and on all port groups, including the vmkernel. If the vmkernel (management) port group for some reason has a different NIC teaming policy, it will not work (see the esxcli sketch below the list).

2. Another possibility is that multi-switch EtherChannel (LAG) either does not work on your switch type, or has in some way not been configured correctly.
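
For point 1, making sure IP hash is set at both levels could look like this from the ESXi shell; a sketch that assumes vSwitch0 and the default "Management Network" port group name:

esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash
esxcli network vswitch standard portgroup policy failover set --portgroup-name="Management Network" --load-balancing=iphash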

The symptoms you describe indicate a non-working link aggregation, in the sense that the two sides do not agree on which ports are part of the team, so one side sends frames over a link that the other side believes is a non-member, and the frames are thrown away.

Do you have any logs on the physical switches? Do you see any MAC flapping errors or anything else suspicious?
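
On the SG500 side, something like the following from the CLI should show the log buffer and whether host MAC addresses are bouncing between ports (a sketch; availability may vary by firmware):

show logging
show mac address-table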

My VMware blog: www.rickardnobel.se
jwhitehv
Enthusiast

Before you do too much work on getting static link aggregation working, you might want to consider that VSA won't do a brownfield installation onto vSwitches with anything except NIC teaming with failover. The failover order of the VSA-Front End port group has to be the opposite of the VM Network and Management Network port groups. And it won't work both ways; it has to be specific. For example, VSA-Front End might work with active vmnic4 and standby vmnic0, but not the other way around. The VM Network and Management port groups then need to be set up the opposite of whichever way works.
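
For example, that kind of opposite active/standby ordering could be expressed with esxcli roughly like this (a sketch only; the vmnic numbers are just the ones from the example above, and your port group names may differ):

esxcli network vswitch standard portgroup policy failover set --portgroup-name="VSA-Front End" --active-uplinks=vmnic4 --standby-uplinks=vmnic0
esxcli network vswitch standard portgroup policy failover set --portgroup-name="Management Network" --active-uplinks=vmnic0 --standby-uplinks=vmnic4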

I lost a few hours today nudging a LAG into just the right shape, only to have the brownfield installer point out that the instructions had very specific configuration requirements listed.

Considerations for Brownfield Installation

I blog at vJourneyman | http://vjourneyman.com/
ZFSRocks
Contributor

John,

Okay, so I was just reading through the setup on VSA and realized it was going to do a bunch of configuration for me. This isn't a brownfield install because it is a fresh cluster; however, I wanted to get the LAGs all set up first. I guess maybe the best way to do this is to let the VSA installer set things up and then adjust the network afterward?

Thanks,
Stephen

darkbgr123
Contributor

Hi Stephen,

Were you able to get LAG working?

I am using Cisco SG200-26 switches, and I had the exact same problem when the second vmnic was being added by the installer: it would lose connectivity. I didn't see anything suspicious in the switch logs. I had to log on to my iLO and manually disable the standby adapter through the ESXi console (while the installer was configuring networking) in order to make it work.

Now I have just manually configured networking: one vmnic on each port group (VSA-Front End, VSA-Back End, vMotion, and VM Network).

I only have 4 NICs on each host, but it's working OK so far. No redundancy at the moment other than the network RAID 1.

Did you try posting on the Cisco forum?

ZFSRocks
Contributor

darkbgr123,

John White gave me the answer. You can find it here http://virtual-journeyman.john-refactored.com/

Stephen

jwhitehv
Enthusiast

ZFSRocks wrote:

Okay, so I was just reading through the setup on VSA and realized it was going to do a bunch of configuration for me. This isn't a brownfield install because it is a fresh cluster; however, I wanted to get the LAGs all set up first. I guess maybe the best way to do this is to let the VSA installer set things up and then adjust the network afterward?

It's the only way if you want it to work.

Did you get VSA working on LAGs?  It wasn't stable for me.  And with two nodes, it's hard to argue that it's an advantage over Active/Standby NIC Teaming.

I blog at vJourneyman | http://vjourneyman.com/
darkbgr123
Contributor

Thank you both.

I gave up on the LAGs and will keep things simple for now. It's not supported, as per the following VMware FAQ for v1 of the VSA:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200138...

Q. Can you configure VSA to do NIC teaming with more than two physical NICs (whether you do from VSA or from vCenter Server/ESX)?

A. Yes, from the vSphere Client on each host.

The VSA installer does not utilize more than 4 NIC ports configured as active/standby uplinks across the two VSA virtual switches. However, the administrator can manually configure additional active uplinks for either of the vSwitches or their component port groups via vCenter Server.

Note: This can be used to add redundancy, but not to increase network bandwidth between any two ESXi hosts. None of the vSphere NIC teaming load-sharing policies load balance/share network IO across multiple active teamed uplinks for the same TCP connection. In a three-node VSA cluster, the IP Hash load-balancing policy can be used to distribute network traffic among multiple uplinks, such that each pair of ESXi hosts communicates over a different channel.

For more information, see VSA Cluster Network Architecture in the VSA documentation.
