Solved: Re: NSX-T and 2 Sites Standalone with same vCenter

MinoDC · ‎05-25-2021

Hello to all...

I need to design an NSX-T integration for an existing infrastructure.

The existing infrastructure consists of 2 sites (Site-A and Site-B) in Active-Active mode, with no shared storage; only an L2-stretched network and an L3 HA network connect the sites.

Site-A is primary site; Site-B is secondary site.

Each site has its own Cluster and both clusters are managed by the same vCenter.

Similar to this one:
NSX-T Active Active deploy

I have an NSX-T Professional license...

Workload VMs are replicated between the sites with Veeam, and the VCSA is in HA on the secondary site.

Is it possible to integrate NSX-T in this architecture, so that in case of Site-A Failure, everything works on Site-B and vice versa?

I've read some documentation on the Internet, but have not found a solution...

Can you help me with this? It is hard work for me.

Thanks for any suggestions.

p0wertje · ‎05-25-2021

Hi,

Have you looked at the NSX-T 3.1 Multi-Location Design Guide (Federation ...) on VMware Technology Network (VMTN)?
This is a good guide for multi-site.

Cheers,
p0wertje | VCIX6-NV | JNCIS-ENT | vExpert
Please kudo helpful posts and mark the thread as solved if solved

MinoDC · ‎05-25-2021

Hi,

Yes, I read it, but my issue is that I have a Professional license, which does not include the Multisite and Federation features.

For this reason I am looking for an alternative solution that would work.

For example, I was thinking about replicating the Edge VMs with Veeam and restoring NSX-T, but does this solution work, and is it supported?

Or other solutions ...
I accept suggestions 😁

Thanks.

p0wertje · ‎05-25-2021

Hi,

Just an idea:

Have 4 edge VMs (2 in each DC for local redundancy).
Create a T0 in active-active with ECMP to your core.
Deploy a T1 A-S for DC1, with active on DC1 and standby on DC2 (use failure domains).
And for DC2 vice versa.
Use a stretched L2 for the VTEP network (you could do it routed, but then you have to add some routing for it somewhere).

In this case you will have all your segments in both DCs and still benefit from having the T1 in the correct datacenter.
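
The placement idea in this suggestion can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `EdgeNode` class, the `place_t1_sr` function and the labels `FD-DC1`/`FD-DC2` are hypothetical names for illustration and are not NSX-T API objects. NSX-T performs the real placement itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeNode:
    name: str
    failure_domain: str  # hypothetical label, one per datacenter


def place_t1_sr(edges, preferred_domain):
    """Pick (active, standby) edge names for an active-standby T1.

    Toy rule: the active instance goes to the first edge in the preferred
    failure domain, the standby to the first edge in any other domain.
    """
    active = next((e for e in edges if e.failure_domain == preferred_domain), None)
    standby = next((e for e in edges if e.failure_domain != preferred_domain), None)
    if active is None or standby is None:
        raise ValueError("need one edge inside and one outside the preferred domain")
    return active.name, standby.name


EDGES = [
    EdgeNode("edge-1", "FD-DC1"),
    EdgeNode("edge-2", "FD-DC1"),
    EdgeNode("edge-3", "FD-DC2"),
    EdgeNode("edge-4", "FD-DC2"),
]

t1_dc1 = place_t1_sr(EDGES, "FD-DC1")  # active in DC1, standby in DC2
t1_dc2 = place_t1_sr(EDGES, "FD-DC2")  # the reverse for DC2
```

The point of the sketch is only that active and standby never share a datacenter, so losing one DC always leaves one instance of each T1.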

Cheers,

MinoDC · ‎05-25-2021

Hi @p0wertje ,

thanks for your reply...

I am not very familiar with how failure domains work, so I'll try to explain what I understand.

I install and configure a 3-node NSX-T Manager cluster.

I create 2 Edges for each site and then create an Edge Cluster with all Edges.

I create a T0-Gw (Act/Act) with 8 Uplinks (two for each Edge), enable ECMP and configure Route Maps for correct routing in case of site failure.

I create one T1-Gw (Act/Stb) for each site (with DR only, or also an SR?)

For each Edge, I configure a Failure Domain (Edge1 and Edge2 in Failure Domain-A; Edge3 and Edge4 in Failure Domain-B). This way the T1-A standby will be placed on the Edges of Site-B and the T1-B standby on those of Site-A, right?

Now, I've some questions...

Is it possible to create 2 NSX-T Manager nodes in Site A and 1 in Site B if the hosts are in two different clusters? (This way I can avoid restoring NSX-T in the event of a site failure)

Does the T0-Gw use the Failure Domain function like the T1-Gw when I implement it in Act/Act mode? If not, how will the T1 traffic be forwarded to the edge if the T0 is not present in the event of a site failure?

Sorry and Thanks again...🙏

shank89 · ‎05-25-2021

You can split your NSX-T manager cluster as long as they meet the RTT and other requirements.
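
A quick way to sanity-check measured latency against such requirements is sketched below. The thread does not state the actual limits, so the two values in `LIMITS_MS` (10 ms between manager nodes, 150 ms between managers and transport nodes) are assumptions based on commonly quoted NSX-T 3.x figures; check the installation guide for your version before relying on them. The function and link names are hypothetical.

```python
def rtt_violations(measured_ms, limits_ms):
    """Return the names of links whose measured RTT exceeds its limit."""
    return sorted(
        link for link, rtt in measured_ms.items()
        if rtt > limits_ms[link]
    )


# ASSUMED limits in milliseconds; verify against the documentation.
LIMITS_MS = {
    "manager-to-manager": 10,
    "manager-to-transport-node": 150,
}

# Example measurement: an inter-site RTT of 25 ms applies to both link types
# once the manager cluster is split across the two sites.
measured = {
    "manager-to-manager": 25.0,
    "manager-to-transport-node": 25.0,
}

problems = rtt_violations(measured, LIMITS_MS)
```

If the manager-to-manager limit really is much tighter than the manager-to-host limit, an inter-site link can be fine for the hosts and still too slow for a split manager cluster, which is why the two are checked separately here.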

The T0 gateways do not support failure domains.

The T1DR (if you have no SR component) uses ECMP paths to the T0DR; each edge acts as a path to a prefix. https://communities.vmware.com/t5/VMware-NSX-Documents/NSX-T-and-ECMP/ta-p/2840738

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3

p0wertje · ‎05-25-2021

Hi,

The 'downside' of this design is that, because of the T0 ECMP to the outside world, you have 4 incoming paths, and they are spread over two datacenters.
I don't know if that is acceptable for you. You might be able to steer it with route maps; I have not tested that, so I don't know the result.

If you really need incoming and outgoing traffic to stay in one datacenter, you could go with
4 edge nodes, but in two edge clusters: one T0 active-standby with active on the node in DC1 and standby on the node in DC2, and vice versa.
You can only run one T0 per edge node.
The downside of this is the upgrading of your edge nodes, because the traffic goes over the standby node, and thus the other DC, when you upgrade the active node.

Cheers,

shank89 · ‎05-25-2021

You can steer traffic to the T0 or edges, but T1SR to T0DR uses 2-tuple load balancing across the paths it has available to active SRs.
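
To illustrate what 2-tuple load balancing means, here is a minimal sketch. It assumes the two tuple fields are source and destination IP, and it uses SHA-256 purely as a stand-in hash; the thread does not say which hash NSX-T uses, and `pick_path` and the path names are hypothetical.

```python
import hashlib
import ipaddress


def pick_path(src_ip, dst_ip, paths):
    """Choose one path for a flow from its (source IP, destination IP) pair.

    The same pair always maps to the same path; different pairs spread
    across all available paths.
    """
    key = ipaddress.ip_address(src_ip).packed + ipaddress.ip_address(dst_ip).packed
    digest = hashlib.sha256(key).digest()  # stand-in hash, not the NSX-T one
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]


PATHS = ["edge-1 (DC1)", "edge-2 (DC1)", "edge-3 (DC2)", "edge-4 (DC2)"]

chosen = pick_path("10.0.1.10", "192.0.2.50", PATHS)
```

Because the choice depends only on the address pair and not on where the VM runs, a flow from a VM in DC1 can just as well be hashed to an edge in DC2. That is the traffic-flow behaviour to keep in mind with one edge cluster across both sites.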

MinoDC · ‎05-26-2021

Thanks @shank89 & @p0wertje ...

I don't want to complicate the design with the T1SR ... the Professional license does not have the LB feature, so for now let's leave the T1SR out ... thanks, and sorry that I brought it up; it was just to understand better.

Both sites are active, so both receive N/S traffic.

Honestly, what I haven't been able to understand is how to configure the T0 for when a site fails. (The critical point of this design seems to be precisely the routing of traffic from the T1DR to the T0 in case of a failure.)

We said that:

I can distribute the NSX-T Manager nodes across both sites, even if they are in different clusters on the same vCenter. (This solves the NSX-T problem in case of DR)
I can use Failure Domains on the T1DR Act/Stb (This solves the T1DR problem in case of DR)

We said that the T0 doesn't support Failure Domains. So how can the T0 of Site-A be deployed on the Edges of Site-B if I create 2 Edge Clusters?

For the T0 and Route Map I saw this link: https://www.lab2prod.com.au/2020/09/nsx-t-active-active-multisite-part2.html

Is there a specific configuration on the T0 side that I have to do in order not to have problems in case of DR?

Thanks a lot , again 🙂

shank89 · ‎05-26-2021

To clarify, failure domains are used to place the SR component of the T1 gateways predictably. The DR component is distributed and does not have an active and a standby instance. Here are a couple of links for that;

For the T0, you can have it in Active/Active or Active/Standby; that is up to you. If you have 4 edges, whether or not they are split across the sites, you steer the traffic using prepends and local preference, as shown in the link you added in your previous response. If you would like faster failover and are using BGP, consider using BFD.

As with anything, test failover, ensure the behaviour is what you expect and predicted.
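
How prepends and local preference steer traffic can be sketched with a heavily simplified BGP best-path comparison. Only two steps of the real decision process are modelled (highest local preference, then shortest AS path); the class, function names, peer names and AS numbers are hypothetical and come from neither the thread nor NSX-T.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpPath:
    via: str           # hypothetical next-hop label
    local_pref: int
    as_path: tuple     # AS numbers, nearest first


def prepend(path, asn, times):
    """Return a copy of the path with `asn` prepended `times` times."""
    return BgpPath(path.via, path.local_pref, (asn,) * times + path.as_path)


def best_path(paths):
    """Simplified selection: highest local_pref, then shortest AS path."""
    return min(paths, key=lambda p: (-p.local_pref, len(p.as_path)))


site_a = BgpPath(via="edge-site-a", local_pref=100, as_path=(65001,))
site_b = BgpPath(via="edge-site-b", local_pref=100, as_path=(65001,))

# Make Site-B less attractive for this prefix by prepending twice.
site_b_prepended = prepend(site_b, asn=65001, times=2)

preferred = best_path([site_a, site_b_prepended])  # both sites up
after_failure = best_path([site_b_prepended])      # Site-A paths withdrawn
```

With both sites up the shorter AS path through Site-A wins. Once Site-A's path disappears, the prepended Site-B path is the only candidate and takes over, which is the failover behaviour worth verifying in a test.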

MinoDC · ‎05-26-2021

Thanks @shank89 for clarifying the T1SR Failure Domain.

As I understand it, the T0DR and T1DR are deployed on each ESXi host of the cluster that hosts the Edge VM to which the T0 (and consequently the T1) is connected, right?
If so, the T0DR of Site-A is not present on the hosts of Site-B.
So how will the traffic work in the event of a failure?

If that is not the case, and a T0DR and T1DR are distributed to all hosts prepared with NSX-T when I create them, then in the event of a failure I will have no problems, because the T0DR and T1DR are already present on the hosts of the other site.

How exactly does it work?

shank89 · ‎05-26-2021

This will come down to how you prep the environment. If you need segments etc. available in the second site, you will need all the transport nodes to be part of the same overlay transport zone. If they are not, the transport nodes in Site-2 will not see the networks you want them to have.
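
The relationship between transport zones and segment visibility can be modelled as simple set membership. This is a sketch with hypothetical host, segment and transport zone names, not output from NSX-T.

```python
def segments_visible_on(node, node_tzs, segment_tz):
    """List the segments a transport node can see.

    A segment is visible only if the node is attached to the transport
    zone the segment belongs to.
    """
    attached = node_tzs.get(node, set())
    return sorted(seg for seg, tz in segment_tz.items() if tz in attached)


# Hypothetical layout: esxi-b2 was prepared with a different overlay TZ.
NODE_TZS = {
    "esxi-a1": {"tz-overlay"},
    "esxi-b1": {"tz-overlay"},
    "esxi-b2": {"tz-overlay-site-b"},
}
SEGMENT_TZ = {
    "seg-web": "tz-overlay",
    "seg-db": "tz-overlay",
}

on_b1 = segments_visible_on("esxi-b1", NODE_TZS, SEGMENT_TZ)
on_b2 = segments_visible_on("esxi-b2", NODE_TZS, SEGMENT_TZ)
```

Here `esxi-b1` sees both segments while `esxi-b2` sees none, so VMs restored onto `esxi-b2` after a failure would have no network to attach to.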

There may be very manual DR methods to get around this, or scripted ones if you prefer (connect the T0 to an edge cluster on the failover site once Site A goes down), but my general recommendation is to simplify DR to avoid human error. If it is a true DR, there's enough going on anyway.

You should find what you need from slide 26 onwards. https://www.dropbox.com/s/tvwqhjhbwd7hy4j/Multisite_NSX-T_3.1-v1.0.pptx?dl=0.

MinoDC · ‎05-26-2021

Of course, I will create segments to connect to T1DR.
All segments of the two sites will be in the same overlay TZ.
All hosts and Edges from the two sites will be in the same TZs.

Excuse me if I insist, but what I don't understand is whether the T0DR and T1DR are distributed only to the hosts of the cluster that contains the Edge to which the T0 (and consequently the T1) is connected, or to all hosts prepared with NSX-T, regardless of the cluster where that Edge is placed.

Because based on how the T0DR and T1DR are distributed there will be different considerations for the DR ... right?

I saw the ppt on the DR and NSX-T Multisite, thanks.

shank89 · ‎05-26-2021

T1DRs and T0DRs exist on all transport nodes that are prepared for NSX-T.

MinoDC · ‎05-26-2021

Ah, OK ... sorry, I didn't understand / read correctly ...

Excuse me again ... I try to summarize everything to see if I understand correctly ....

I have two sites like in the drawing above:

servers and storage are dedicated for each site
each site has its own vmware cluster
both the vmware clusters are managed by the same vCenter
between the two sites there is an L2-Stretched network
the connectivity between the sites is 10 Gb/s and RTT is < 150 ms

I create:

2 Manager Node in Site-A and 1 Node in Site-B
All TZs are the same for both sites
All ESXi and Edge node are in the same TZs
2 Edge in each Site but all in the same Edge Cluster
1 T0 in Act/Act mode, with ECMP/Route Map/BGP/BFD , at each site
1 T1 in Act/Stb mode at each site
n Segment Overlay connected to T1 for each site

I don't use Failure Domains because I don't have a T1SR (if I had a T1SR then I would also use FDs)

In the event of Site failure, everything works (or should 😅) , because:

NSX-T Manager Node (2 or 1) is active in active site
T0DR, T1DR and Segments works in active site, because Host and Edge are in the same TZ

I hope I understood correctly...

Thanks again @shank89 🙏 for your patience 😇

shank89 · ‎05-26-2021

The first 5 dot points look good. What is the network you are stretching?

2 Manager Node in Site-A and 1 Node in Site-B --- keep in mind that while only 1 manager is up and running, NSX-T will be in read-only mode. You need at least 2 up for write access, but it is best to get all nodes up and running ASAP.
For the segments to exist at both sites, yes, use the same TZ for the easiest DR process.
All Edges will be in the same TZ so they can route traffic for the segments in those TZs.
2 Edges can be in each site. Make sure you understand the traffic flow from hosts to edges: it will be balanced using all paths available to the T0 SRs (I linked you to this earlier).
Correct: route maps, prepends, peering, BFD as required.
A/S is your choice for T1. What is your reason for this?
Segment in Overlay TZ: this TZ will be linked to hosts and edges.
T0DR, T1DR and Segments works in active site, because Host and Edge are in the same TZ -- not exactly, see my note above about datapaths and ECMP.
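
The manager point above (1 node up means read-only, at least 2 of 3 for write access) is a majority rule, and it can be checked against the proposed 2 + 1 split with a small sketch. The function names and the "unavailable" state for zero surviving nodes are assumptions of this simplified model.

```python
MANAGERS_PER_SITE = {"Site-A": 2, "Site-B": 1}


def manager_mode(total_nodes, nodes_up):
    """Return the management-plane state for a manager cluster.

    Simplified model: a majority of nodes gives read-write access, fewer
    than a majority gives read-only, and no nodes means unavailable.
    """
    if not 0 <= nodes_up <= total_nodes:
        raise ValueError("nodes_up must be between 0 and total_nodes")
    if nodes_up == 0:
        return "unavailable"
    majority = total_nodes // 2 + 1
    return "read-write" if nodes_up >= majority else "read-only"


def mode_after_site_failure(failed_site):
    """State of the manager cluster once all managers in one site are lost."""
    total = sum(MANAGERS_PER_SITE.values())
    up = total - MANAGERS_PER_SITE[failed_site]
    return manager_mode(total, up)


if_site_a_fails = mode_after_site_failure("Site-A")
if_site_b_fails = mode_after_site_failure("Site-B")
```

With this split, losing Site-B leaves the managers writable, while losing Site-A leaves a single read-only node until a second manager is brought back.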

p0wertje · ‎05-26-2021

Hi,

Sounds correct.
And if you do not need a T1-SR, you do not have to deploy it. You then just have a DR-only T1.
And keep the points @shank89 mentions in mind.

Cheers,

MinoDC · ‎05-26-2021

what is the network you are stretching?
- I can extend all the L2 networks needed.

2 Manager Node in Site-A and 1 Node in Site-B --- keep in mind while you only have 1 manager up and running, NSX-T will be in read-only mode. You need at least 2 up for write access, but best to get all nodes up and running ASAP.
- OK, but if NSX-T is in read-only mode, the network traffic still works and I just can't change the configuration, right?
A/S is your choice for T1, what is your reason for this?
- No specific reason, but if the T1DR is in A/S I can manage ECMP traffic in ESXi (2-tuple) better, no?
Segment in Overlay TZ, this TZ will be linked to hosts and edges
- Yes... the same overlay TZ for segments, hosts and Edges.
T0DR, T1DR and Segments works in active site, because Host and Edge are in the same TZ -- not exactly, see my note above about datapaths and ECMP.
- If a site is down, the only paths that work are those of the active site. The segment should then route traffic through the remaining active T0-T1-Edge, right? What is it that I did not understand? I'm sorry.

Just a clarification ... all this works with the Professional license (no Multisite/Federation feature), right?

shank89 · ‎05-26-2021

The dataplane still works if the management plane is down or in read-only mode.

The choice of A/S is up to you, you will have to work out what is best for your scenario.

Yes, if there is active workload on a segment in the remaining site, and only those hosts and edges exist, the traffic will egress that site.

I would say so, as this just comes down to cluster design within a single instance of NSX-T.

MinoDC · ‎05-26-2021

Perfect ...

Thank you very much for your time and patience