VMware Networking Community
menckend
Contributor

Implicit MTU mismatch in NSX architecture? (No way to set per-"segment" interface MTU on DRs?)

TLDR: Is there a way to set per-segment interface MTUs on the distributed routers' "segment" (downlink) interfaces, i.e., on the NSX-backed segments that they and the workload VMs are attached to? If not, how does this not break path MTU discovery whenever we have any variation in MTU values across segments? And, to a lesser extent, even when we use a consistent MTU on all workload VMs?

 

Long version:  I've been working on pulling together some explicit standards in my organization around MTU configuration, and I'm having a hard time wrapping my head around how we're not signing up for at least some level of MTU mismatch and associated PMTUD blackholes when we use NSX virtual-routing.

The crux of my concern is that the only places we can explicitly configure layer-3 interface MTU on the virtual-router path seem to be the Tier-0 SR's uplink interfaces and an extremely poorly documented setting called "Gateway interface MTU" in the global network configuration. We've been told by support that this one setting determines the interface MTU on *all* of the internal tunnel-interfaces between T0SR<>T0DR, T0SR<>T1SR, T0DR<>T1SR, T1SR<>T1DR and... presumably(?) the T0/T1 gateways' interfaces on the NSX-backed segments that we attach our VMs' vNICs to.

Where this seems problematic to me is when we want to deploy workload VMs with different MTUs in the same NSX domain: for instance, a small handful of VMs that we want to run with an 8800-byte MTU. In that case, the T0 SR uplink interfaces get configured with a 9000-byte MTU, the global "gateway interface MTU" gets set to 8800, and our handful of jumbo-frame VMs get their vNICs configured with an 8800-byte MTU.
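To make the relationships concrete, here's a rough Python sketch of those knobs, plus the per-segment downlink MTU I wish existed. The variable names and values are mine for illustration, not anything NSX exposes:

```python
# The knobs described above, plus the per-segment knob that doesn't exist.
# Names and numbers are mine (from the example scenario), not an NSX API.
T0_SR_UPLINK_MTU      = 9000   # explicitly configurable on the T0 SR uplinks
GLOBAL_GATEWAY_IF_MTU = 8800   # the single global "gateway interface MTU"
SEGMENT_WORKLOAD_MTUS = {"jumbo-segment": 8800, "standard-segment": 1500}

# What holds today: one MTU for every internal router interface.
assert GLOBAL_GATEWAY_IF_MTU <= T0_SR_UPLINK_MTU

# What I'd want to be able to express: a downlink MTU per segment, matching
# the workload MTU, so the DR can drop + send an ICMP PTB at the right size.
for segment, workload_mtu in SEGMENT_WORKLOAD_MTUS.items():
    effective_dr_downlink_mtu = GLOBAL_GATEWAY_IF_MTU   # what NSX gives us
    desired_dr_downlink_mtu = workload_mtu              # what PMTUD needs
    if effective_dr_downlink_mtu != desired_dr_downlink_mtu:
        print(f"{segment}: DR downlink thinks {effective_dr_downlink_mtu}, "
              f"workload expects {workload_mtu} -> PMTUD gap")
```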

But that leaves all of our other VMs with their standard 1500-byte MTU, and it leaves me waking up at night in a cold sweat wondering: "what happens when someone sends a >1500-byte packet to one of those 1500-byte-MTU VMs?"

Best-case scenario: the 1500-byte-MTU VM's vNIC is "gracious" enough to accept the oversized frame and hand it off to the guest OS (we've seen it go both ways, depending on which vNIC adapter type was selected on the VM), and the guest OS itself has an implicit MRU large enough to handle the oversized packet. Those VMs only receive the oversized frames in the first place because PMTUD is broken: the DR should be the device that drops the packet and generates an ICMP PTB message, but it won't do that, because it's stuck with the globally defined 8800-byte MTU 😞

Worst-case scenario: the oversized packets get dropped, either by the vNIC adapter at L2 or by the guest OS at L3, the PMTUD blackhole prevents the transport layer from recovering, and we get stuck troubleshooting.
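For anyone who wants to see which of those two scenarios they're in, here's a minimal Linux-side probe sketch in plain Python (UDP with the DF bit set). The destination address is just a placeholder, and it only reports what the sender's kernel learns, which in the blackhole case is exactly nothing, because the PTB never comes back:

```python
import errno
import socket

# Linux socket-option constants, with fallbacks in case this Python build
# doesn't export them (the numeric values are the stable Linux ones).
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
IP_MTU = getattr(socket, "IP_MTU", 14)


def probe(dest_ip, udp_payload_len, port=33434):
    """Send one UDP datagram with DF set and report what the local stack knows."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # DF bit on, no local fragmentation: exactly what PMTUD relies on.
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect((dest_ip, port))
        s.send(b"\x00" * udp_payload_len)
        return f"{udp_payload_len + 28}-byte IP packet sent with DF set"
    except OSError as exc:
        if exc.errno == errno.EMSGSIZE:
            # The kernel already thinks the path MTU is smaller, either from
            # the egress interface MTU or from an ICMP PTB it received.
            cached = s.getsockopt(socket.IPPROTO_IP, IP_MTU)
            return f"refused locally; cached path MTU is {cached}"
        raise
    finally:
        s.close()


# Walk probe sizes across the suspect range (+28 bytes of IPv4/UDP headers).
# In the blackhole case the big probes are "sent" but never answered, and the
# cached path MTU never shrinks, because no ICMP PTB ever comes back.
for payload in (1472, 1600, 4000, 8772):
    print(probe("198.51.100.10", payload))  # placeholder address for a 1500-byte-MTU VM
```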

 

Aside from that, I think the same problem applies even if we use a consistent MTU (say, 1500 bytes) across all workload VMs and set the global gateway-interface MTU and the T0 uplink-interface MTUs to 1700 bytes. VMware's design recommendation to include 200 bytes of MTU headroom instead of 100 (to allow for potential additional encapsulation overhead) creates the same problem.

If the inbound routers all think they have a 1700-byte MTU on their interfaces (up to and including the NSX-backed segments), and the GENEVE header only uses ~100 bytes today, then someone on the underlay can send a 1600-byte packet to the VMs and it should get through all the way to the VM. (When the T0 SR receives it, it just adds the ~100-byte GENEVE encap and hands it over to the T0 DR.)
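Here's the same reasoning as back-of-the-envelope Python. The ~100 bytes of actual GENEVE overhead and the 1700-byte underlay MTU are my assumptions for this scenario, and the names are mine, not anything NSX exposes:

```python
# Back-of-the-envelope version of the paragraph above (my assumed numbers).
GATEWAY_INTERFACE_MTU = 1700   # global gateway-interface MTU and T0 uplink MTU
UNDERLAY_MTU = 1700            # per the "workload MTU + 200 bytes" recommendation
GENEVE_OVERHEAD = 100          # what the encapsulation actually adds today
VM_VNIC_MTU = 1500             # MTU configured on every workload VM

inner_packet = 1600            # sent from the underlay toward a workload VM

# Every L3 hop in the virtual-router path compares the packet against *its own*
# 1700-byte interface MTU, so nothing drops it and nothing sends an ICMP PTB:
print("forwarded by T0 SR / T0 DR / T1 DR:", inner_packet <= GATEWAY_INTERFACE_MTU)  # True

# The overlay hop carries it as a 1700-byte GENEVE frame, which still fits:
print("fits on the wire between TEPs:",
      inner_packet + GENEVE_OVERHEAD <= UNDERLAY_MTU)                                # True

# ...and it lands on a vNIC whose MTU is 1500, with PMTUD none the wiser:
print("oversized when it reaches the VM:", inner_packet > VM_VNIC_MTU)               # True
```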

The only way I can see out of this is if the NSX virtual routers take a very SDN-ish approach: something like importing the L3 MTU configured on each VM's vNIC and populating the vswitch with flow rules telling it to drop any packet larger than the MTU of the vNIC (MAC address) it's destined to, while concurrently sending a control-plane message back to the NSX Manager telling it to craft an ICMP PTB message back to the source of the oversized packet, so that PMTUD works correctly. (That would actually be pretty nice!)
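Purely as illustration of what I mean (this is invented pseudocode, not any NSX datapath API), something along these lines:

```python
# Hypothetical per-vNIC MTU enforcement -- names and structures are invented.
from dataclasses import dataclass


@dataclass
class VnicFlowRule:
    dest_mac: str
    max_mtu: int          # learned from the vNIC / guest L3 MTU


def send_icmp_ptb(to, mtu):
    # Stand-in for the control plane originating an ICMP "Packet Too Big" /
    # "Fragmentation Needed" back to the sender, advertising the vNIC's MTU.
    print(f"control plane: ICMP PTB to {to}, next-hop MTU {mtu}")


def handle_packet(rule, inner_ip_len, src_ip):
    """What a per-vNIC MTU-aware vswitch datapath could do."""
    if inner_ip_len <= rule.max_mtu:
        return "deliver to vNIC"
    # Drop in the datapath and punt to the control plane, keeping PMTUD honest.
    send_icmp_ptb(to=src_ip, mtu=rule.max_mtu)
    return "drop + ICMP PTB"


# Example: a 1600-byte packet aimed at a 1500-byte-MTU vNIC.
rule = VnicFlowRule(dest_mac="00:50:56:aa:bb:cc", max_mtu=1500)
print(handle_packet(rule, inner_ip_len=1600, src_ip="192.0.2.10"))
```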

 

Am I just missing something simple?  Is anyone else out there running virtual-routing and trying to run a mix of standard/jumbo MTU workload VMs in a single NSX domain?

 

2 Replies
menckend
Contributor

I opened a support case, and the support engineer on the VMware side confirmed that the *expected* behavior of the NSX virtual routers is what I "feared"/suspected.

They will forward packets toward VMs on GENEVE-backed segments based on the MTU value assigned to the NSX global "gateway MTU" configuration parameter.

Even if you configure completely homogeneous MTUs on the VMs' vNICs (say, 8800 bytes) and that value is 200 bytes lower than a gateway MTU of 9000 bytes, you're still exposed: if an underlay host sends an 8900-byte packet to your VM, the NSX VRs will deliver it (because post-GENEVE-encap it's still only 9000 bytes, which doesn't exceed the router MTU). At least in that scenario you only have a 100-byte window of packet sizes (8800-8900) where the PMTUD blackhole exists.

That's before we even get to running heterogeneous workload MTUs on different subnets/segments. If we standardized on 1500- and 8800-byte MTUs for our workload VMs/segments, all of our 1500-byte-MTU workloads would sit behind a PMTUD blackhole for any packet between 1500 and 8800 bytes.
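Rough arithmetic for both scenarios, using the same ~100-byte GENEVE-overhead assumption as in the original post (my numbers, not anything from VMware):

```python
# Inner-packet sizes the virtual routers forward (no drop, no ICMP PTB) even
# though they exceed what the destination vNIC is configured for.
GENEVE_OVERHEAD = 100   # assumed actual per-packet encap overhead


def blackhole_window(vnic_mtu, gateway_mtu, uplink_mtu):
    largest_forwarded = min(gateway_mtu, uplink_mtu - GENEVE_OVERHEAD)
    return (vnic_mtu + 1, largest_forwarded)


# Homogeneous 8800-byte vNICs, gateway MTU 9000, 9000-byte uplinks:
print(blackhole_window(8800, 9000, 9000))   # (8801, 8900)
# Mixed estate from the original post: 1500-byte segments, gateway MTU 8800, 9000-byte uplinks:
print(blackhole_window(1500, 8800, 9000))   # (1501, 8800)
```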

Dropping an oversized packet and sending the ICMP PTB is table-stakes behavior for a router (virtual or otherwise). VMware needs to do better on its virtual-router functionality. (Again, unless I'm missing something; I would be extremely happy to be wrong here.)

menckend
Contributor

I've raised a feature request ("idea") for this here:  https://vsphere.ideas.aha.io/ideas/VSP-I-1252 
