Hello there,
I see a lot of best practice examples out there for how to configure teaming policies and port groups for A/A designs, but almost none for A/S designs. Can I get a sanity check?
As you can see, I have an Edge VM on an ESXi host with two physical NICs. The Edge VM's interfaces are attached to port groups on an Edge VDS.
First question: Is it bad to have external traffic fail over to NIC #2 in the event that NIC #1 dies? This would be accomplished by configuring the trunked port groups with one NIC as active and the other as standby. I would assume that this kind of failover would cause ARP chaos for the physical router that could not be corrected in a timely manner. Is this assumption correct? I notice that in A/A designs with two ToR VLANs, people do not trunk both VLANs onto both port groups; they let that VLAN die if a NIC dies. Thoughts?
Second question: What is considered best practice for traffic steering in A/S designs? One option would be to pin VLAN and overlay traffic to specific NICs, preventing the scenario described above. A second option might be to let overlay traffic load-balance across both NICs and have ToR VLAN traffic pinned to one.
In my environment, I am partial to the first option to avoid ARP problems in the event of failure, and also because my NICs are low-bandwidth (10 GbE each); I would rather the edge node just die and fail over to a healthy one if something goes wrong with a NIC than try to shove both ToR and overlay traffic down one link. My only concern is what might happen if the NIC pinned to the TEP died and the ToR NIC stayed up. Would NSX be smart enough to figure out the edge was unusable, despite having an active uplink on the external VLAN? I'm going around in circles here with "what-ifs" and not finding many resources to inform me.
Any insight would be appreciated.
---------
The failover diagram contains an error. The external traffic would go up e1, through fp-eth0, hit the left trunk portgroup and then exit through uplink-2. In other words, the failover would be handled by the portgroup.
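To make the two teaming options concrete, here is a toy Python model of a port group with an explicit failover order. It is only a sketch: the `PortGroup` class and the `egress_uplink` method are hypothetical names invented for illustration, not part of any vSphere or NSX API, and the uplink names follow the ones used in the post.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PortGroup:
    """Toy model of a VDS port group with an explicit failover order (hypothetical)."""
    name: str
    active: List[str]
    standby: List[str] = field(default_factory=list)

    def egress_uplink(self, link_up: Dict[str, bool]) -> Optional[str]:
        """Return the first healthy uplink, preferring active over standby."""
        for uplink in self.active + self.standby:
            if link_up.get(uplink, False):
                return uplink
        return None  # no healthy uplink: traffic is dropped at the port group


# Option from the first question: the trunk fails over to the other NIC.
failover_trunk = PortGroup("trunk-left", active=["uplink-1"], standby=["uplink-2"])
# Pinned option: the second NIC is unused for this port group.
pinned_trunk = PortGroup("trunk-left", active=["uplink-1"])

nic1_dead = {"uplink-1": False, "uplink-2": True}
print(failover_trunk.egress_uplink(nic1_dead))  # uplink-2
print(pinned_trunk.egress_uplink(nic1_dead))    # None
```

With the standby uplink configured, the port group silently moves traffic to uplink-2; with pinning, the VLAN simply dies on that host. In both cases the decision is made below the Edge VM, which is why the Edge's own interfaces stay up either way.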
Hi,
I've gone through your questions, and have a few observations.
"I'm not 100% sure which scenario you're describing, but I think you're saying that the edge has little awareness of what is happening upstream of its own virtual nics (i.e., the fastpath interfaces). Therefore, in my scenario where the external traffic is pinned to vmnic#1 and TEP traffic to vmnic#2, if vmnic#2 dies, the Edge doesn't know and blackholes the traffic because his fastpath interfaces are still up. Is that correct?"
For that failure scenario, that is correct, especially if the port groups attached to the Edge's vNICs are set to Active/Unused (no redundancy).
"How then, do VM Edges know that they are in a down state?"
Various mechanisms are used to determine a failure condition. Without going too deep into the architecture of the Edge node: it requires management connectivity and at least one TEP tunnel to be considered up. There is additional logic to determine the various states of the Edge and to decide when to fail over the active SR.
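As a rough illustration of that rule, the check can be written as a single predicate. This is a deliberate simplification of what is described above, not the actual NSX logic, and the function name `edge_considered_up` is made up for the example.

```python
from typing import Iterable


def edge_considered_up(mgmt_reachable: bool, tep_tunnels_up: Iterable[bool]) -> bool:
    """Simplified rule: management connectivity AND at least one TEP tunnel up.

    Note that the link state of the Edge's own vNICs is not an input here.
    """
    return mgmt_reachable and any(tep_tunnels_up)


# The NIC carrying TEP traffic has died: every tunnel is down, even though
# the Edge's fastpath interfaces and the external VLAN uplink are still up.
print(edge_considered_up(True, [False, False]))  # False
# One surviving tunnel is enough for the Edge to count as up.
print(edge_considered_up(True, [False, True]))   # True
```

Under this simplified rule, the scenario from the question (TEP NIC dead, ToR NIC alive) ends with the Edge being treated as down, because the decision hinges on tunnel state rather than on interface state.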
Keep in mind, within the NSX-T fabric there are GENEVE tunnels between each endpoint and BFD sessions to determine tunnel and endpoint availability. This is the 'heartbeat' functionality that you are referring to.
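For a feel of how a BFD-style heartbeat declares a tunnel down, here is a minimal sketch of the detection timer. The `BfdSession` class is hypothetical, and the 1000 ms interval and multiplier of 3 are example values only, not confirmed NSX defaults; the only thing taken from the BFD protocol itself is that the detection time is the receive interval times the detect multiplier.

```python
from dataclasses import dataclass


@dataclass
class BfdSession:
    """Toy BFD-style liveness tracker for one tunnel (illustrative only)."""
    rx_interval_ms: int      # expected interval between received packets
    detect_multiplier: int   # missed intervals tolerated before declaring down
    last_rx_ms: int = 0      # timestamp of the last packet received

    @property
    def detection_time_ms(self) -> int:
        return self.rx_interval_ms * self.detect_multiplier

    def on_packet(self, now_ms: int) -> None:
        """Record the arrival of a BFD control packet from the peer."""
        self.last_rx_ms = now_ms

    def is_up(self, now_ms: int) -> bool:
        """Up only while a packet has arrived within the detection time."""
        return (now_ms - self.last_rx_ms) < self.detection_time_ms


session = BfdSession(rx_interval_ms=1000, detect_multiplier=3)
session.on_packet(now_ms=0)
print(session.is_up(now_ms=2500))  # True: still inside the 3000 ms window
print(session.is_up(now_ms=3000))  # False: detection time expired
```

If the NIC carrying TEP traffic dies, packets stop arriving on every tunnel, each session times out after its detection time, and the Edge ends up with no tunnel up.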
And no probs, happy to help!
Hi again @shank89, thanks for replying.
<quote> If the vmnic fails, the status goes to failback, and the TEP interface on the Edge is not aware of it and is still online</quote>
I'm not 100% sure which scenario you're describing, but I think you're saying that the edge has little awareness of what is happening upstream of its own virtual nics (i.e., the fastpath interfaces). Therefore, in my scenario where the external traffic is pinned to vmnic#1 and TEP traffic to vmnic#2, if vmnic#2 dies, the Edge doesn't know and blackholes the traffic because his fastpath interfaces are still up. Is that correct?
How then, do VM Edges know that they are in a down state? A "down" state could be because its paths to the ToR router are down or all the paths to its TEP interfaces are down. In the case of ToR traffic, I can see how this would be handled by either a dynamic routing protocol or static + BFD in an A/A scenario, but I have no clue in the case of Active-Standby. For checking the TEP interface, there would have to be some sort of heartbeat mechanism. Any idea how this works?
And thank you for your insights thus far.