Solved: Re: Active-Standby + Static Routing - Teaming Policies & Deterministic Traffic Best Practice

PhoenixVM · ‎08-26-2021

Hello there,

I see a lot of best practice examples out there for how to configure teaming policies and port groups for A/A designs, but almost none for A/S designs. Can I get a sanity check?

As you can see, I have an Edge VM on an ESXi host with two physical nics. The edge VM's interfaces are attached to port groups on an Edge VDS.

First question: Is it bad to have external traffic fail over to nic#2 in the event that nic#1 should die? This would be accomplished by setting the trunked port groups with one nic as active, and the other as standby. I would assume that this kind of failover would cause ARP chaos for the physical router that could not be corrected in a timely manner. Is this assumption correct? I notice that in A/A designs where people have two ToR VLANs, people do not trunk both VLANs onto both portgroups; they let that VLAN die if a NIC dies. Thoughts?

Second question: What is considered best practice for traffic steering in A/S designs? One option would be to pin VLAN and overlay traffic to specific NICs, preventing the scenario described above. A second option might be to let overlay traffic load-balance across both NICs, and have ToR vlan traffic pinned to one.

In my environment, I am partial to the first option to avoid ARP problems in the event of failure, and also because my NICs are low-bandwidth (10GbE each); I would rather the edge node just die and fail over to a healthy one if something goes wrong with a nic than try to shove both ToR and overlay traffic down one link. The only consideration I have is what might happen if the NIC pinned to the TEP died and the ToR nic stayed up. Would NSX be smart enough to figure out the edge was unusable, despite having an active uplink on the external VLAN? I'm going around in circles here with "what-ifs" and not finding tons of resources to inform me.

Any insight would be appreciated.

---------

The failover diagram contains an error. The external traffic would go up e1, through fp-eth0, hit the left trunk portgroup and then exit through uplink-2. In other words, the failover would be handled by the portgroup.

shank89 · ‎08-26-2021

Hi,

I've gone through your questions, and have a few observations.

You want to be careful about limiting vmnic failover on the Edge uplink port groups; it is good to have them in active/standby. Remember your TEP interfaces are also plumbed through there, and if they do not fail over, you will end up with traffic that is blackholed. If the vmnic fails, the status goes to failback, and the TEP interface on the Edge is not aware of it and is still online. As long as the Active SR is on the Edge and traffic is still being forwarded to it from the host transport nodes, you will have packet loss. Hopefully this point makes sense.
Traffic steering with A/S works similarly to A/A: use named teaming policies, dictating VLANs, and attach each policy to the uplink segments that are subsequently attached to your T0 interfaces.
I have not seen ARP issues relating to the problem you described. That's not to say there haven't ever been, but they were related to bugs.
You could let TEP traffic be balanced with A/A uplink portgroups, but that would make the traffic egress less deterministic.
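
To make the blackhole point concrete, here is a minimal sketch of failover-order teaming as a toy Python model. It is not an NSX or vSphere API; the class name `TeamingPolicy`, the method `egress_uplink`, and the uplink names are all assumptions made up for illustration. It only shows why an active/unused port group drops TEP traffic when its vmnic fails, while an active/standby one moves it to the surviving uplink.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TeamingPolicy:
    """Toy failover-order teaming policy (hypothetical, not an NSX object)."""
    active: List[str]
    standby: List[str] = field(default_factory=list)

    def egress_uplink(self, link_up: Dict[str, bool]) -> Optional[str]:
        """Return the first healthy uplink in failover order.

        None means no usable uplink is left, so traffic is blackholed.
        """
        for uplink in self.active + self.standby:
            if link_up.get(uplink, False):
                return uplink
        return None


# TEP port group pinned to one uplink with no standby (active/unused).
tep_pinned = TeamingPolicy(active=["uplink-2"])
# Same port group with the other uplink configured as standby.
tep_redundant = TeamingPolicy(active=["uplink-2"], standby=["uplink-1"])

# The vmnic behind uplink-2 has failed; uplink-1 is still healthy.
links = {"uplink-1": True, "uplink-2": False}

print(tep_pinned.egress_uplink(links))     # None
print(tep_redundant.egress_uplink(links))  # uplink-1
```

In the pinned case the Edge's own virtual NIC stays up, so nothing inside the Edge notices the failure; only the port group knows the vmnic is gone.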

Hopefully this has helped in some way!

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3

PhoenixVM · ‎08-27-2021

Hi again @shank89, thanks for replying.

<quote> If the vmnic fails, the status goes to failback, and the TEP interface on the Edge is not aware of it and is still online</quote>

I'm not 100% sure which scenario you're describing, but I think you're saying that the edge has little awareness of what is happening upstream of its own virtual nics (i.e., the fastpath interfaces). Therefore, in my scenario where the external traffic is pinned to vmnic#1 and TEP traffic to vmnic#2, if vmnic#2 dies, the Edge doesn't know and blackholes the traffic because his fastpath interfaces are still up. Is that correct?

How then, do VM Edges know that they are in a down state? A "down" state could be because its paths to the ToR router are down or all the paths to its TEP interfaces are down. In the case of ToR traffic, I can see how this would be handled by either a dynamic routing protocol or static + BFD in an A/A scenario, but I have no clue in the case of Active-Standby. For checking the TEP interface, there would have to be some sort of heartbeat mechanism. Any idea how this works?

And thank you for your insights thus far.

shank89 · ‎08-27-2021

"I'm not 100% sure which scenario you're describing, but I think you're saying that the edge has little awareness of what is happening upstream of its own virtual nics (i.e., the fastpath interfaces). Therefore, in my scenario where the external traffic is pinned to vmnic#1 and TEP traffic to vmnic#2, if vmnic#2 dies, the Edge doesn't know and blackholes the traffic because his fastpath interfaces are still up. Is that correct?"

For the failure scenario, that is correct, especially if you have the portgroups attached to the vnic of the Edge in Active/unused (no redundancy).

"How then, do VM Edges know that they are in a down state?"

There are various mechanisms used to determine a failure condition. Without going too deep into the architecture of the Edge node: it requires management connectivity and at least one TEP tunnel to be considered up. There are smarts to determine various states of the Edge and strategic active SR failover.

Keep in mind, within the NSX-T fabric there are GENEVE tunnels between each endpoint and BFD sessions to determine tunnel and endpoint availability. This is the 'heartbeat' functionality that you are referring to.
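
As a rough illustration of the rule above (management connectivity plus at least one TEP tunnel up), here is a toy Python model. It is a sketch under stated assumptions, not how the Edge is actually implemented: `EdgeNode`, `is_usable`, and `pick_active` are hypothetical names, the per-tunnel BFD state is simply passed in as a dictionary, and the "first usable Edge wins" selection is a simplification of the real SR failover logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EdgeNode:
    """Toy Edge node health model (hypothetical, not an NSX object)."""
    name: str
    mgmt_reachable: bool
    # Remote TEP address -> whether the BFD session over that tunnel is up.
    tunnel_bfd_up: Dict[str, bool]

    def is_usable(self) -> bool:
        """Usable only with management connectivity and at least one tunnel up."""
        return self.mgmt_reachable and any(self.tunnel_bfd_up.values())


def pick_active(edges: List[EdgeNode]) -> Optional[str]:
    """Assumed simplification: the first usable Edge in the list hosts the active SR."""
    for edge in edges:
        if edge.is_usable():
            return edge.name
    return None


# edge-1 lost the vmnic carrying TEP traffic: its virtual NICs are still up,
# but every BFD session over its tunnels has gone down.
edge1 = EdgeNode("edge-1", True, {"10.0.0.11": False, "10.0.0.12": False})
edge2 = EdgeNode("edge-2", True, {"10.0.0.11": True, "10.0.0.12": True})

print(edge1.is_usable())            # False
print(pick_active([edge1, edge2]))  # edge-2
```

The point of the sketch is that the decision hangs on tunnel state as seen by BFD, not on the link state of the Edge's own fastpath interfaces.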

And no probs, happy to help!
