Keep your passwords, I’m installing SSH keys to my NSX-T appliances

I’m not a fan of passwords. Too easy to mistype, too complex to remember without sticky notes (or a password manager). Too much room for error.

But SSH keys? I like ‘em a lot. Create something secure once, use ssh-copy-id to get it on the target system, and blammo! No more passwords.

Except that I can’t use ssh-copy-id to install my SSH key to an NSX-T Manager appliance. Or an Edge node. So what’s a person to do when you don’t have a complete shell to work with on the target system?

I don’t know for all systems, that’s for sure, but in the NSX CLI shell we’re presented in NSX-T, it’s pretty straightforward. All you need is your SSH public key and credentials to get to your NSX appliance. I’ve even produced a video to show you how!

It’s pretty simple, though – use the following command in nsxcli.  Piece of cake!

set user <username> ssh-keys label <label> type <type> value <SSH public key>
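
If you’d rather not copy and paste the key pieces by hand, here’s a minimal Python sketch that splits a public key line into the parts that command wants. The function name and the sample values are my own placeholders, not anything NSX-T ships. It assumes the usual OpenSSH public key layout of `<type> <base64 key> [comment]`, and the command syntax is just the one shown above.

```python
def build_nsxcli_ssh_key_command(username, label, public_key_line):
    """Build the nsxcli 'set user ... ssh-keys' command from a public key line.

    Assumes the usual OpenSSH layout: "<type> <base64-key> [comment]".
    This helper is a hypothetical convenience, not part of NSX-T.
    """
    fields = public_key_line.strip().split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <key> [comment]'")
    key_type, key_value = fields[0], fields[1]  # the trailing comment is dropped
    return (
        f"set user {username} ssh-keys label {label} "
        f"type {key_type} value {key_value}"
    )


if __name__ == "__main__":
    # Placeholder key material, not a real key.
    sample = "ssh-rsa AAAAB3NzaC1yc2EXAMPLE user@laptop"
    print(build_nsxcli_ssh_key_command("admin", "laptop", sample))
```

Paste the resulting line into an nsxcli session on the appliance.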

My Edges are too small!

This weekend has been all about getting PKS deployed in my lab, since I just attended the beta of the VMware PKS: Install, Configure, Manage workshop. That was 4 intense days in Atlanta! And it’s helped – I’m getting different errors deploying this time around, all about my NSX-T Edges. They’re just too small.

I had Medium Edge nodes deployed in the lab, see, because it was a lab, and I needed to bump from small to medium to tinker with the load balancer. But since PKS uses load balancing as more or less a core function (gotta have connectivity to all of those pods you’ve deployed), it needs to be able to deploy more load balancers. The medium LB in NSX-T 2.3 supports all of 4 virtual servers – not nearly enough. So PKS requires (and actually checks while you’re applying config changes to the PKS tile) that you have a large sized load balancer.

Fortunately, I’ve got hypervisors in my lab that support VMs with 8 vCPUs and 16 GB of RAM, but that’s starting to push some limits.

You don’t have to build a new Edge Cluster to get bigger Edges, you can simply replace what’s there.

First, go to Fabric > Nodes > Edges in the NSX-T admin UI. Then deploy your net new Edges, however many you need (I only have a 2-node Edge cluster). I went ahead and gave the new Edges unique FQDNs and management IP addresses. Select the Large size radio button, finish the wizard and let things happen. This will likely take a few minutes, so be patient.

Then go to the Transport Nodes tab, and promote your shiny new Edges to transport nodes, however you need to configure them in your environment. Configure your transport zones and N-VDSs to match your existing Edges.

Once they’re done, move one tab to the right to Edge Clusters. You probably want to place your Edges (one at a time as they’re being replaced) in Maintenance Mode, to keep things clean. You’ve got to get your API client out for this one, however. More info about that in the documentation. You know a node is in maintenance mode when the Configuration State column in the Transport Nodes tab shows the yellow warning icon with the text “Maintenance Mode”.

Once an Edge is in Maintenance Mode, go to the Edge Clusters tab, then check the box next to the cluster that contains the Edges you want to replace, pull down the Actions menu, and select Replace Edge Cluster Member. Specify the old and new Edge transport nodes and click Save. This is pretty quick. Rinse and repeat for each other Edge. Note that Maintenance Mode will trigger a failover of any services running on the Edge, so this should be considered an outage-inducing event (even if the failover is only going to introduce the tiniest blip), and should likely be done during a maintenance window where such things are more acceptable.
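
For the Maintenance Mode API step, here’s a hedged sketch. As far as I recall, the NSX-T 2.x API exposes maintenance mode as an action on the transport node (`POST /api/v1/transport-nodes/<node-id>?action=enter_maintenance_mode`). Treat that path, the manager hostname, and the node ID below as assumptions, and confirm them against the API guide for your version. The code only builds the request; it doesn’t send anything.

```python
import urllib.request
from base64 import b64encode


def maintenance_mode_request(manager, node_id, user, password, enter=True):
    """Build (but do not send) a maintenance mode request for a transport node.

    The endpoint path and action names are assumptions based on my memory of
    the NSX-T 2.x API; verify them in the API guide for your release.
    """
    action = "enter_maintenance_mode" if enter else "exit_maintenance_mode"
    url = f"https://{manager}/api/v1/transport-nodes/{node_id}?action={action}"
    token = b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


if __name__ == "__main__":
    # Hypothetical manager FQDN and node ID, for illustration only.
    req = maintenance_mode_request("nsxmgr.lab.local", "edge-node-uuid", "admin", "secret")
    print(req.get_method(), req.full_url)
```

Sending it is a `urllib.request.urlopen(req)` away, though in a lab with self-signed certificates you’d also have to decide how to handle TLS verification.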

After that, you should clean up your old Edge nodes, and you’re done. It’s that easy.

Nested ESXi on an NSX-T Logical Switch

TIL (Today I Learned) (ok, full disclosure – it wasn’t actually today – this post has been sitting in my drafts for a couple of weeks now) that MAC Learning on a vSwitch (distributed or NSX) doesn’t like that ESXi automagically inherits the physical MAC address of vmnic0. If you’re curious about this, prepare for a wall of text.


Here’s the scenario (there are two options here, but both behave the same)

1a. You have a lab running ESXi 6.7 and are using the native MAC Learning function introduced to the DVS (see here for some great info on getting that set up).

1b. You have a lab running NSX-T and are using a non-default MAC Management switching profile that has MAC Learning enabled.  Generally, nested ESXi is the use case for this switching profile.

2. You have deployed ESXi VMs, and attached them to a MAC Learning enabled port group

Symptom: single-MAC VMs (think a nested vCenter server, or NSX Manager, or just a Linux or Windows VM) function just fine – they can communicate with all kinds of things. Except your ESXi management interfaces.

Extra fun bonus configuration: You have a separate VMkernel port dedicated to vMotion, with a separate subnet address, though still using the same uplink. (Example: the virtual switch has 2 port groups, Management and vMotion. vmk0 is attached to the Management PG, and vmk1 is attached to the vMotion PG. Uplink vmnic0 is attached to the virtual switch.) vMotion interfaces can communicate with each other just fine (ping ++netstack=vmotion -I vmk1 <other host vmk1 IP>).

So, this exact scenario has been driving me bonkers. When I upgraded my lab to vSphere 6.7, I tried the native MAC Learning on my port groups. That was a mistake, as all 4 of my nested environments just evaporated. So I scrambled to undo the MAC Learning config, and went back to enabling Promiscuous Mode, Forged Transmits, and MAC Address changes on the port groups.  So much for the new features.

In this configuration, with NSX-V logical switches, my nested environments continued to work just fine, so long as I remembered to set the security options on any newly-created logical switch port groups to Accept.  So I just ran with it.

Fast forward to a couple of weeks ago. I ripped NSX-V off of the physical environment, and rebuilt it with NSX-T. That’s another adventure. Anyway, MAC Learning switching profiles are the option – there’s no editable dvPortGroup for NSX-T logical switches. So I had to figure this one out.

As you may know, when you install ESXi, it takes the MAC address of vmnic0 and assigns it to vmk0. Normally, that’s not a big deal. But when you enable MAC Learning, something goofy happens and Ethernet frames don’t get forwarded from vmnic0 all the way up to vmk0. So, here’s what I did to work around that little challenge.

I set the /Net/FollowHardwareMac flag in esxcfg-advcfg to 1, from the default 0. After a reboot, this didn’t change the MAC address of vmk0, as the purpose of the FollowHardwareMac setting is to define whether the VMkernel MAC address should change when the underlying vmnic is replaced.  I didn’t replace the vmnic, so nothing should have changed.  This is not a necessary change, but I wanted to try it for completeness’ sake.  However, it is also useful if you install ESXi to a USB stick, and then move that USB stick to a different physical host, or clone your ESXi VM.

So I had to wipe out and recreate vmk0 from each of my hosts. I was a bit concerned about this, as my VMkernel ports were on a distributed switch. The only gotcha there is that you have to specify the dvPort ID for both deleting AND creating the new VMkernel port with esxcfg-vmknic.  That’s easily identified with esxcfg-vswitch -l.
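
To make the delete-and-recreate step a little more concrete, here’s a small Python sketch that just prints the ESXi shell commands in order. The switch name, dvPort ID, and addresses are placeholders for my lab, and the `esxcfg-vmknic` flags (`-d`, `-a`, `-i`, `-n`, `-s` for the DVS name, `-v` for the dvPort ID) are as I remember them from the tool’s help output, so check them on your own host first. Run the real commands from the console rather than over SSH, since deleting vmk0 takes your management connection with it.

```python
def recreate_vmk0_commands(dvs_name, dvport_id, ip, netmask):
    """Return the ESXi shell commands to delete and recreate a VMkernel port
    on a distributed switch port.

    Flags are assumptions from memory of esxcfg-vmknic; confirm on your host.
    """
    return [
        # Look up the DVS name and the dvPort ID that vmk0 currently uses.
        "esxcfg-vswitch -l",
        # Delete the existing VMkernel port on that dvPort.
        f"esxcfg-vmknic -d -s {dvs_name} -v {dvport_id}",
        # Recreate it on the same dvPort with the same addressing.
        f"esxcfg-vmknic -a -i {ip} -n {netmask} -s {dvs_name} -v {dvport_id}",
    ]


if __name__ == "__main__":
    # Placeholder values for a lab host.
    for command in recreate_vmk0_commands("dvs-lab", 12, "192.168.1.11", "255.255.255.0"):
        print(command)
```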

And now, that nested environment works again. On an NSX-T logical switch.

Hopefully, those of you with home labs won’t have to spin your wheels so much if you run into this.

One Does Not Simply Walk Into Mordor. A Migrationary Tale.

So let’s talk about the migration of my homelab from NSX-V to NSX-T. For the faint of heart, this is a daunting tale.  Brought to you by the NSX vSwitch, and a limited number of physical NICs in my ESXi hosts.

Here’s the deal – a physical NIC (vmnic) in ESXi can be owned by exactly one vSwitch. That vSwitch can be a vSphere Standard Switch, a vSphere Distributed Switch, or an NSX vSwitch. But only by one of them at any given time.

So your host has to have at least one unused vmnic to be assigned to the NSX vSwitch during implementation. If you’ve got more than two attached to any given vSwitch, you’re fine, since you can move devices around without creating a single point of failure.

But what if your hosts were built with 4 pNICs, and you have 2 for infrastructure traffic (management, vMotion, vSAN, etc) and 2 for virtual machine traffic? That’s still not so bad, as you can deploy NSX during a maintenance window, take a NIC away from VM traffic, and be prepared for the jeopardy state you’ve put yourself in.

How about a worst-case scenario – hypervisors with only a pair of pNICs installed. Maybe you have blade servers. Maybe you have rack servers with a pair of 25 Gbps or 40 Gbps NICs. That’s an easy implementation for NSX-V, since it simply leverages the vSphere Distributed Switch for its logical switches. It’s not so simple on NSX-T, since we’d have to pull a NIC to assign to the NSX vSwitch, and then what? With only two NICs, there’s no redundancy. This, in the modern data center, is sort of a party foul.
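
The three scenarios above boil down to one question: how many uplinks can a host give to the NSX vSwitch while every existing vSwitch keeps its redundancy? Here’s a toy Python model of that. The function and switch names are made up for illustration, and “redundant” is assumed to mean two uplinks per vSwitch.

```python
def spare_uplinks(switch_uplinks, keep_per_switch=2):
    """Return the vmnics that could move to the NSX vSwitch while each
    existing vSwitch keeps at least keep_per_switch uplinks.

    A vmnic belongs to exactly one vSwitch, so each one appears once.
    """
    spare = []
    for nics in switch_uplinks.values():
        # Anything beyond the uplinks we insist on keeping is free to move.
        spare.extend(nics[keep_per_switch:])
    return spare


if __name__ == "__main__":
    scenarios = {
        "three uplinks on one vSwitch": {"vds-infra": ["vmnic0", "vmnic1", "vmnic2"]},
        "4 pNICs split 2 and 2": {
            "vds-infra": ["vmnic0", "vmnic1"],
            "vds-vm": ["vmnic2", "vmnic3"],
        },
        "only a pair of pNICs": {"vds-all": ["vmnic0", "vmnic1"]},
    }
    for name, layout in scenarios.items():
        print(name, "->", spare_uplinks(layout) or "nothing spare without losing redundancy")
```

Only the first layout has a NIC to spare. The other two mean either accepting the jeopardy state or moving VMkernel ports onto the NSX vSwitch.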

This is the situation I find myself in. Only one of my ESXi hosts has more than two available NICs. So I had to read the NSX-T docs to find out more about the ability to move VMkernel ports to Logical Switches.  This feature was introduced as API-only in NSX-T 2.1.  2.2 brings a UI element to the table.

VMkernel ports on Logical Switches? That’s crazy talk! Isn’t that going to create some kind of horrible circular dependency? Well, yes, it would if those Logical Switches were overlay switches backed by Geneve.

But you have to remember that NSX-T can support both Geneve-backed Logical Switches and VLAN-backed Logical Switches. Think about it like this: on any other vSwitch, Port Groups are created, and one of the port group policies available is VLAN Tagging. All a VLAN-backed Logical Switch really is, is a Port Group on the NSX vSwitch with the VLAN Tagging policy set. It’s no more complicated than that.

So how did I accomplish this crazy migration? Being my home lab, I had some flexibility in shutting down all of my VMs, so I powered everything down. I had to think about how to do that well, though, as my nested environments use storage from a DellEMC Unity VSA, and my DNS server is a virtual machine. As is vCenter Server, and NSX Manager for the physical infrastructure.  I temporarily dumped all of the NICs for the VMs on some distributed port groups to get everything off the logical switches.  After juggling DNS, vCenter, and NSX Manager around a bit, I unprepared my management cluster, then unprepared the compute cluster, and finally the old T5610 that I’ve kept around for some extra resources.  I then followed these instructions to make sure everything was gone, including the vSphere Web Client plugin.  There may still be some vSphere Client plugin baggage hanging around, but I didn’t worry too much about that.

I was about to cross the point of no return.  I took a few minutes to breathe, and to review Migrating Network Interfaces from a VSS Switch to an N-VDS Switch in the NSX-T documentation, only to realize that I needed the titular vSphere Standard Switch, and not the vSphere Distributed Switch I had my entire environment running on.  So I jumped in, and built out a vSphere Standard Switch infrastructure to support my lab environment.  I’d like to say that I was awesome and whipped out a quick PowerShell or Python script to knock that out for me, but I’ll admit that I did it the old-fashioned, brute force way of clicking my way, host by host, through the vSphere Client.  Good thing I only have 5 hosts and 4 networks that live outside of NSX.  Once everything was migrated to a single standard switch (with a single uplink attached), I tore down the vDS in my lab, deleted my old NSX manager, and started deploying NSX-T 2.2.

I’d like to say the story is over there, but it’s not.  I prepared all of my hosts as Transport Nodes (can you see where this is going yet?), then migrated all of my management VMkernel ports over to a VLAN Logical Switch.  No sweat, until I went ahead and migrated the remaining uplinks to the N-VDS.  That may seem innocent enough, but I hadn’t moved my NSX Manager over to the N-VDS.  So there I am, with NSX Manager hanging off what is now an internal virtual switch, just chilling out with no one but the Controllers to talk to.  Realizing my conundrum, I tried to move the virtual machine over to the logical switch hosting my management network, only to realize that it doesn’t show up in the list of networks to which I can connect the vNIC.  So I scrambled to move the NSX Manager to a host outside my management cluster (good thing I attacked that cluster first, I suppose), and proceeded to reverse the migration process.  At this point, the two hosts in my management cluster are happily chugging along with their two uplinks attached to a vSphere Standard Switch.

Which is actually better, because my management cluster also hosts my edges, and those need to connect to both my transport network, and a VLAN network, both of which are configured on my vSwitch.

Looking back at the trouble I’d gotten myself into, it makes a ton of sense to not allow NSX Manager on an N-VDS, because who wants to be in circular dependency hell?  Some things you just have to learn the hard way.

With all that said, everything else has gone smoothly.  Mostly.  I still need to do some digging to figure out why MAC Learning isn’t doing me any good for my nested ESXi hosts on my logical switches. I’m sure I’m missing a quick API-only configuration switch or something. But routing works in and out of the nested environment (I can at least get to my nested vCenter and other virtual machines that aren’t ESXi hosts).

So, smooth migration?  Not exactly.

Lessons learned? Lots.  For example, planning is crucial.  I know this.  Hell, I preach this in my classes.  But, as my RN wife keeps telling me about her time as an ER nurse, “bad things only happen to other people”.  I still tell her to be careful, because they do happen to other people, until you _are_ other people.  In my lab, I was “other people”.  Fortunately, it’s a lab.  The most I would have been out is about 2 days of manually rebuilding stuff from vCenter up if things had gone any more sideways.  But what if this was a production environment?  That was way too big a risk to take without writing up the migration plan, test plan, and testing in a lab first.  I just jumped in, because I’m an instructor, and I get to live in my technical ivory tower devoid of maintenance windows and serious consequences.

Also learned – leave your management cluster alone.  All too often, we forget that management is actually a pretty important workload category.

All in all, it’s done.  I’m glad I did it, if nothing else for the experience of doing it.  And I got to tear down an entire nested NSX-T environment, so I freed up a ton of resources that I don’t have (my hosts live in a perpetual state of red memory alerts).  NSX-T is a big workload, even in a small lab environment.

And, my favorite part of the whole migration?  I can now manage my entire lab from my Linux box.  No more Flash for me. Well, mostly.  The vSphere Client in 6.7 does enough that I don’t have to fire up the Flash-based Web Client on the Mac very often at all.  I haven’t in at least 3 weeks.  And I’m ok with that. Maybe when the next version of vSphere drops, I’ll be able to uninstall Flash Player from my Mac.

Tooling and Operations

NSX-T comes with some pretty nice tooling to support the abstractions of software-defined networking.

First up is the Port Connection tool. I have to ask – where was this in NSX-V? This is a spectacular tool that lets you choose a source and destination virtual machine or logical port, and get a topology diagram. What TN is the source on? How about the destination? What is the network layout between point A and point B? Even better? Everything is hyperlinked, so you can click on a logical port in the diagram and get the name, UUID, and status of the thing. Oh, did I mention that it also shows the connection state between TEPs on different transport nodes? Yeah, it’s pretty cool.

Then there’s the time-tested Traceflow. Excellent, as always, but better than ever. In NSX-V, for example, if you define a source VM and want to trace to an external endpoint, you just can’t. NSX-T will allow it, but naturally can only trace out to Edge Uplink. If you’re tracing from one VM to another inside the NSX domain, you get the trace results, as well as a Port Connection diagram (for unicast flows). This is the first place I go when someone tells me that two VMs can’t communicate with one another.

Port mirroring is available, and you can create Local, Remote, Remote L3 and Logical SPAN sessions.

IPFIX is still around, and allows for multiple IPFIX collectors, and different Switch or Firewall profiles to forward data out to your preferred collector. The cool thing about this is that you could, for example, define different Observation Domain IDs for different tenants to keep everyone’s flow data separate.

Like I said at the beginning of this series, we’re not looking at anything comprehensive here, nor was this intended to be a deep technical look at NSX-T. This was just an overview. Shameless plug – you should come to one of our NSX-T: Install, Configure, Manage classes for the deeper look.

~$ history
Introduction: From NSX-V to NSX-T. An Adventure
NSX-T: The Manager of All Things NSX
The Hall of the Mountain King. or “What Loot do We Find in nsxcli?”
Three Controllers to Rule Them All (that just doesn’t have the same ring to it, does it?)
Beyond Centralization: The Local Control Plane
Transport Zones, Logical Switches, and Overlays! Oh, My!
Which Way Do We Go? Let’s ask the Logical Router!
If You’re Not Living on the Edge, You’re Taking Up Too Much Room
Welcome to the Edge, I’m at Your Service

Welcome to the Edge, I’m at Your Service

I’ve mentioned, more than once by now, that the Edge is where the services routers exist. They do their thing, whether tied into a Tier-0 or Tier-1 router.

So what services do we have? I mentioned them in the last post, so let’s run through them again pretty quickly:

NAT
How much can we actually do with NAT? I mean, Source NAT and Destination NAT are pretty straightforward functions, I think. Both have been supported in NSX for, well, quite a while now.
In NSX-T, stateful NAT is supported if the Edge is configured for Active-Standby HA.

We also support reflexive, or stateless NAT. This is useful when I’m running the Edge in Active-Active HA.

Edge Firewall
The Edge also has a firewall, much like we’re used to in NSX-V. This is a pretty straightforward L3 firewall, and is used for the same things we use the ESG firewall for in NSX-V – controlling north/south traffic at the software-defined network perimeter.
But we get to add a new twist – the Edge firewall is available on both Tier 0 and Tier 1 routers, so tenants can have a little more control over what they allow into the tenant perimeter.

VPN Services
We also have L2 and L3 VPNs available in the Edge, but they’re completely built and managed through the APIs at the moment. Sorry folks, no real VPN love in this series, just an honorable mention. No API-fu with this software-defined joker yet.

Load Balancer
Just remember, if you’re using a VM form factor Edge, it has to be Medium or Large to configure Load Balancing. Learn from my fail, where I deployed a Small Edge, and wondered for an hour or so why I couldn’t attach the LB to the Edge.

Ultimately, we recommend Large if you’re going to use Load Balancing. Medium supports the LB function, but really only for proof of concept use cases.

The Load Balancer in NSX-T is still ultimately a function of the Edge. No real news here. The configuration is pretty straightforward, where a Load Balancer instance is created, at least one Virtual Server is attached to the Load Balancer, and the LB is attached to a Tier 1 Logical Router.
Server Pools need to be created – the Load Balancing Algorithm is defined here, with Round Robin and Least Connections settings (along with weighted versions for those), as well as a simple IP Hash algorithm. You decide whether you’re going to do TCP multiplexing.
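
If those algorithm names are new to you, here’s a rough Python sketch of what each one means in general terms. This is textbook behavior, not how the NSX-T load balancer implements it internally, and the member names and weights are invented for the example.

```python
import hashlib
import itertools


def round_robin(members):
    """Hand out pool members in order, forever."""
    return itertools.cycle(members)


def weighted_round_robin(weights):
    """Like round robin, but a member with weight 2 gets two turns per cycle."""
    expanded = [member for member, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active_connections):
    """Pick the member currently serving the fewest connections."""
    return min(active_connections, key=active_connections.get)


def ip_hash(members, client_ip):
    """Map a client IP to a member so the same client lands on the same server."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return members[int.from_bytes(digest[:4], "big") % len(members)]


if __name__ == "__main__":
    pool = ["web-01", "web-02", "web-03"]  # hypothetical pool members
    rr = round_robin(pool)
    print([next(rr) for _ in range(4)])
    print(least_connections({"web-01": 12, "web-02": 3, "web-03": 7}))
    print(ip_hash(pool, "10.0.0.25"))
```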

Then you decide whether you’re doing a Transparent load balancer, or statically or dynamically mapping the NAT. Transparent load balancing is only supported if the LB is inline with the servers in the pool.
Then you statically or dynamically choose pool members. Dynamic pools use NSGroups, where static is just that – predefined and static.

Finally, you decide what health monitors you want for pool members – these can be active or passive (or both, if you really want). NSX-T ships with default Active Health Monitors for HTTP, HTTPS, ICMP, and TCP. You can create more if you want, and decide how you want them to check for health.

When you create Virtual Servers, you get to decide whether you’re load balancing L4 or L7, TCP or UDP, what Application Profile you’re going to use, its IP address, associated Server Pools, and what kind of persistence you want.

Then you just toss all that together with the LB itself, and off you go! The biggest differences here (that I see) are that the things we need to create to make a load balancer are somewhat consolidated, and that we need to make sure we have large enough Edges to support the Load Balancer. Also, Application Rules are gone, replaced with other rewrite and redirection policies called LB Rules, managed in the Virtual Server.

L2 Bridging
We can’t forget the venerable bridging capabilities. They haven’t been mentioned anywhere else yet. In NSX-T 2.1, we had to define a Bridge Cluster and populate it with ESXi Transport Nodes to get L2 bridging. That’s still around, but in NSX-T 2.2, a far better option is available (relatively speaking, of course – it’s 2018, bridging should be a niche use case these days, but it’s here if you need it).

Today, we can create Bridge Profiles and attach them to the Edge, which does a number of things for us:
* we can use Edges rather than ESXi hosts for bridging – that sounds less expensive already!
* Edges use DPDK for forwarding – so now bridging is (potentially) faster
* Edges have a firewall, which means we can limit traffic going to or from the bridged network

I think we have a winner here, for those rare occasions that L2 adjacency is necessary, especially for things like migrations from physical networks to logical.

So, there we have it, a quick rundown of things that need Services Routers on the Edge.


~$ history
Introduction: From NSX-V to NSX-T. An Adventure
NSX-T: The Manager of All Things NSX
The Hall of the Mountain King. or “What Loot do We Find in nsxcli?”
Three Controllers to Rule Them All (that just doesn’t have the same ring to it, does it?)
Beyond Centralization: The Local Control Plane
Transport Zones, Logical Switches, and Overlays! Oh, My!
Which Way Do We Go? Let’s ask the Logical Router!
If You’re Not Living on the Edge, You’re Taking Up Too Much Room

If You’re Not Living on the Edge, You’re Taking Up Too Much Room

We talked about the Edge last time.  But we really didn’t get into the differences that we might expect.  I just said “we’ll get to that later”.  I suppose it’s later by now.

The Edge in NSX-T is little more than a container.  No, not “container” in a Docker kind of sense, but in a “pool of resources for network services” kind of sense.

See, the Edge is still a virtual machine, except when it isn’t.  In NSX-T, we have our choice of form factors – 3 sizes of virtual machine, and then bare metal.  Need a really, really big Edge?  Get out that server, we’re making a big network device!

So what are the form factors for an Edge these days?

  • Small – 2 vCPU, 4 GB RAM, 120 GB disk
  • Medium – 4 vCPU, 8 GB RAM, 120 GB disk
  • Large – 8 vCPU, 16 GB RAM, 120 GB disk
  • Bare Metal – 8 CPU, 32 GB RAM, 200 GB disk (minimums, naturally)
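
As a quick illustration of how I think about picking a size, here’s a small Python sketch built from the numbers in the list above, plus the rule mentioned elsewhere in this series that load balancing needs a Medium or Large VM Edge. The function name and the “return None and go bare metal” convention are my own.

```python
# VM form factors as listed above (disk is 120 GB for all three).
EDGE_VM_FORM_FACTORS = {
    "small": {"vcpu": 2, "ram_gb": 4},
    "medium": {"vcpu": 4, "ram_gb": 8},
    "large": {"vcpu": 8, "ram_gb": 16},
}

# VM sizes that can host a load balancer.
LB_CAPABLE = ("medium", "large")


def smallest_edge_for(vcpu, ram_gb, load_balancer=False):
    """Return the smallest VM form factor that meets the requirements,
    or None if no VM size fits (time to look at bare metal)."""
    for name, spec in EDGE_VM_FORM_FACTORS.items():  # ordered small to large
        if load_balancer and name not in LB_CAPABLE:
            continue
        if spec["vcpu"] >= vcpu and spec["ram_gb"] >= ram_gb:
            return name
    return None


if __name__ == "__main__":
    print(smallest_edge_for(2, 4))                      # small
    print(smallest_edge_for(2, 4, load_balancer=True))  # medium
    print(smallest_edge_for(16, 32))                    # None
```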

The virtual machine Edges come with 4 vmxnet3 vNICs installed.  Bare metal, well, we’ve got to watch out.  Chances are, your Intel X520, X540, X550, or X710s will work.  Check the docs for specifics – it’s all spelled out there in the System Requirements section of the Installation Guide.

If you give a mouse an Edge, he’ll probably realize that it’s a single point of failure, and he’ll ask for another.

We can’t use Edges right after they’ve been deployed.  They’re really just Fabric Nodes at that point, joined to the Management Plane and just taking up resources.  They need to be promoted to Transport Nodes before they’ll be useful.

Because, while an Edge will still act as the North/South perimeter of our Software-Defined Network, it’s more than that.  An Edge is actually an active participant in the network – NSX host switches and TEPs will be installed on the Edge as part of this process.  And it runs distributed routing processes so that traffic coming through the Edge destined to a workload on a logical switch has an efficient routing path to get there. But TEPs and distributed routers are not all that are deployed to the Edge.

If you recall, from the logical routing post, the services router (SR) was mentioned.  The SR always lives on an Edge, whether the SR belongs to a Tier 0 or a Tier 1 router.  You certainly can influence on _which_ Edge a SR is running, but it’ll always be on one.

Let’s recap the kinds of things that might be deployed as a SR:

  • NAT
  • BGP (Tier 0 only)
  • Firewall
  • Load Balancer

That’s an awful lot of stuff in just 4 bullet points.  We’ll get into the specifics of those later.  These services are generally stateful, and generally need to be centralized.  So we did.

Because some things are centralized, there should probably be some measures taken for high availability (thus my nod to Laura Numeroff and Felicia Bond a couple of paragraphs ago, in case you missed the reference and thought I was simply going mad).  For HA of SRs, we need an Edge Cluster.

An Edge Cluster is simply a grouping of 1 or more identical form factor Edge transport nodes (maximum of 8) put together as a larger logical construct.  A container for your network services containers, if you will.  It’s pretty straightforward to create an Edge Cluster: simply add one in the Edge Clusters tab of Fabric > Nodes, add the Edge Transport Nodes you want participating, define the cluster profile, and off you go.

There’s not much to an Edge Cluster, really.  And not much to the cluster profile, either.  The profile simply defines the BFD probe interval, how many hops are allowed, and how many probes have to be lost before we declare an Edge officially dead.
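
Those profile settings translate into a rough failure detection time: probe interval multiplied by the number of lost probes. Here’s a tiny Python sketch of that arithmetic. The class and field names are mine, and the numbers are illustrative values rather than anything I’m claiming as the product defaults.

```python
from dataclasses import dataclass


@dataclass
class EdgeClusterProfile:
    """Hypothetical stand-in for the three cluster profile settings."""
    probe_interval_ms: int      # how often a BFD probe is sent
    allowed_hops: int           # how many hops a probe may cross
    declare_dead_multiple: int  # lost probes before the Edge is declared dead

    def detection_time_ms(self):
        """Approximate time to notice a dead Edge."""
        return self.probe_interval_ms * self.declare_dead_multiple


if __name__ == "__main__":
    # Illustrative values only.
    profile = EdgeClusterProfile(probe_interval_ms=1000, allowed_hops=255, declare_dead_multiple=3)
    print(profile.detection_time_ms(), "ms")
```

Shortening the interval or the multiple detects failures faster, at the cost of more probe traffic and a higher chance of declaring a busy Edge dead.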

Most services operate statefully, and support only Active/Standby failover.  

We can configure routers as Active/Active or Active/Standby, and that will define what you can do with that router.  It’s worth noting that, if you configure a router as Active/Active, it is essentially a stateless device.  NAT is only available as Reflexive (or Stateless) NAT. The Edge Firewall is also stateless.

In Active/Standby, however, you can do stateful NAT and firewalling.  You can attach a Load Balancer.  But you have to choose your failover mode. If you ask me, you’re really choosing your fallback mode, but that’s not how it’s called out in the UI.  You have two failover mode choices:  Preemptive and Non-Preemptive.  What does that even mean?

It’s actually pretty simple.  In preemptive mode, you define a preferred member of your Edge cluster. This is the Edge we want to use.  If it fails, we’ve got others, so we’re not out for long.  What preemptive means, however, is that when our preferred Edge returns to service, NSX will preempt the service to move it back to the preferred node.  An automatic fallback, if you will.

In a non-preemptive configuration, we do not define a preferred node, so there’s no drive by NSX to move a service back to a preferred location.  No automatic fallback of the service.
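
The difference between the two modes is easiest to see by walking through a fail and recover sequence. Here’s a toy Python simulation, assuming a simple “first healthy Edge in list order takes over” rule that I made up for the example. It is not the actual NSX-T election logic, and the Edge names are placeholders.

```python
def active_edge_timeline(edges, events, preferred=None):
    """Return which Edge is active after each event.

    edges:     Edge names in failover order.
    events:    list of ("fail" | "recover", edge_name) tuples.
    preferred: set it for preemptive mode, leave as None for non-preemptive.
    """
    up = set(edges)
    active = preferred if preferred else edges[0]
    timeline = []
    for kind, edge in events:
        if kind == "fail":
            up.discard(edge)
        else:
            up.add(edge)
        if active not in up:
            # Fail over to the first healthy Edge, if any is left.
            active = next((e for e in edges if e in up), None)
        elif preferred in up and active != preferred:
            # Preemptive mode: move the service back to the preferred Edge.
            active = preferred
        timeline.append(active)
    return timeline


if __name__ == "__main__":
    events = [("fail", "edge-01"), ("recover", "edge-01")]
    print("preemptive:    ", active_edge_timeline(["edge-01", "edge-02"], events, preferred="edge-01"))
    print("non-preemptive:", active_edge_timeline(["edge-01", "edge-02"], events))
```

In the preemptive run the service returns to edge-01 once it recovers. In the non-preemptive run it stays on edge-02.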

Is there more to talk about? Of course there is.  We’ve still got services to talk about, and tooling, and all kinds of good stuff.  Stay tuned!


~$ history
Introduction: From NSX-V to NSX-T. An Adventure
NSX-T: The Manager of All Things NSX
The Hall of the Mountain King. or “What Loot do We Find in nsxcli?”
Three Controllers to Rule Them All (that just doesn’t have the same ring to it, does it?)
Beyond Centralization: The Local Control Plane
Transport Zones, Logical Switches, and Overlays! Oh, My!
Which Way Do We Go? Let’s ask the Logical Router!

Which Way Do We Go? Let’s ask the Logical Router!

Moving around the data plane, we need to think about how, exactly, we’re going to get packets off of one logical switch and onto another.  Or out to the physical infrastructure, even.  

Logical routing in NSX-T is so very similar to what we have in NSX-V, but it’s entirely new at the same time.  

We still have the concept of a distributed router, embedded in the kernel to make routing decisions.  Pretty similar so far.  

Logical routers have interfaces, called Downlinks,  connected to the logical switches to enable routing between L2 domains. Still pretty much the same as we’re used to.

Logical routers no longer have a DLR Control VM. Well, there’s a bit of a departure from NSX-V.

Let’s reel things back into similarities.  We can still maintain two tiers of routing to keep tenants separated.  But we don’t refer to the tiers as the Distributed Logical Router and Edge Services Gateway any longer.  Now, it’s Tier-1 and Tier-0 routing, respectively.  This is where we have to start unlearning NSX-V things and relearning NSX-T things.

The NSX-T routers are all distributed, meaning the Tier-0 and Tier-1 routers are programmed on all the transport nodes throughout the transport zone.

So think about this logical topology:


Sample 2-Tier Routing Topology

Everything in the diagram, save for the physical router, is distributed across the transport zone.  

Except when they’re not.  Without diving too deep here, each of these routers, Tier 0, Tenant A Tier 1, and Tenant B Tier 1, are actually comprised of two objects: the distributed router (DR), and the services router (SR).

So let’s talk about these for a minute.  The distributed router component is the part that lives on each transport node.  This is the part that makes the routing decisions.  If I had two workloads: tenantA-Web and tenantB-Web, and those workloads were instantiated on the same hypervisor, traffic between them would not have to leave the host.  If tenantA-Web sent a ping to tenantB-Web, the traffic would go VM -> Tenant A Tier 1 -> Tier 0 -> Tenant B Tier 1 -> VM.
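
That hop sequence can be sketched in a few lines of Python. The workload-to-router mapping below mirrors the sample topology, with one extra workload (tenantA-App) that I invented to show the same-tenant case. It models the logical path only, not anything about how the kernel modules forward packets.

```python
# Which Tier 1 router each workload's logical switch hangs off of.
# tenantA-App is a made-up extra workload for the same-tenant example.
TIER1_OF = {
    "tenantA-Web": "Tenant A Tier 1",
    "tenantA-App": "Tenant A Tier 1",
    "tenantB-Web": "Tenant B Tier 1",
}


def logical_path(src, dst, tier1_of=TIER1_OF):
    """Return the logical routing hops between two workloads."""
    src_t1, dst_t1 = tier1_of[src], tier1_of[dst]
    if src_t1 == dst_t1:
        # Same tenant: the Tier 1 router handles it without touching Tier 0.
        return [src, src_t1, dst]
    # Different tenants: up through Tier 0 and back down.
    return [src, src_t1, "Tier 0", dst_t1, dst]


if __name__ == "__main__":
    print(" -> ".join(logical_path("tenantA-Web", "tenantB-Web")))
    print(" -> ".join(logical_path("tenantA-Web", "tenantA-App")))
```

Because the distributed router components exist on every transport node, all of those hops can happen on the source hypervisor when both VMs live there.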

Perhaps breaking the environment up into tenants was a poor choice, as I likely wouldn’t allow traffic between them like that.  But that’s where other capabilities come in – the distributed firewall, for example.  We’ll talk about more of that kind of stuff as we continue through this series.

Anyway, we have our DR, but what’s this services router thing?  Simply put, it’s the component that we use for centralized or non-distributed services, such as dynamic routing, NAT, Firewalling, Load Balancing, L2 Bridging, etc.  

I know, you’re wondering, at this point, just what we’re thinking with a whole new routing structure when it sounds a lot like we’ve taken the model from NSX-V, and just distributed the ESG routing. Which is sort of true.  But it’s not.  Because the Tier 1 routers can run services just as well as the Tier 0 routers, which means everyone gets a services router component, assuming you’ve turned up one of these services.

That’s right, I can do NAT at Tier 0 or Tier 1, I can firewall at either tier, and on and on and on.  

But we’re not getting into the specifics of all of that yet.  

So where do these services routers live?  Since they’re centralized, we need someplace to put them.  How about on the Edge? We still have Edges in NSX-T, though they’re definitely no longer “Edge Services Gateways” – just Edges.  We’ll talk more about the specifics of them in the next installment.  For right now, just remember that all of the stateful or centralized services will live on an Edge node.

Another thing of note is that I don’t need to deploy a two-tier routing topology.  I can attach a Tier 0 logical router directly to a logical switch.  If I don’t need a complex topology, I don’t have to build one.  Just cut Tier 1 completely out of the picture, and you’ll be fine.

This is a lot of information.  It may not seem like it, but it really is.  We’ll dive into the Edge next in an effort to complete the picture here.  


~$ history
Introduction: From NSX-V to NSX-T. An Adventure
NSX-T: The Manager of All Things NSX
The Hall of the Mountain King. or “What Loot do We Find in nsxcli?”
Three Controllers to Rule Them All (that just doesn’t have the same ring to it, does it?)
Beyond Centralization: The Local Control Plane
Transport Zones, Logical Switches, and Overlays! Oh, My!

Transport Zones, Logical Switches, and Overlays! Oh, My!

Logical Switching

Everyone loves Logical Switching, right?! The ability to spin up a Layer 2 network whenever you need is pretty darned cool. Arguably, VXLAN is groovy, too, taking an L2 frame from a VM and wrapping it up in VXLAN goodness to shoot over the underlay network. So VMware left that alone, right?

Wrong. VXLAN is yesterday’s overlay protocol. The future is here with GENEVE.

GENEVE, you say? What in the world is GENEVE? GENEVE stands for GEneric NEtwork Virtualization Encapsulation, and is still being standardised. Wanna know more? Check it out here.

Let’s summarize the draft, shall we? Every other network virtualization overlay out there (VXLAN, NVGRE, etc.) is fixed. They all do one thing. Not to disparage any of the others – they do what they do very well. But the data center is not a fixed thing. It needs flexibility. It changes, it grows, it has to support new use cases. And that’s why GENEVE was born. See, GENEVE supports all the same stuff as everything else for network virtualization – taking that important L2 frame and wrapping it up in a new set of headers to send across the underlay “backplane” to the destination TEP (Tunnel End Point) so that delivery to the destination VM can occur. What makes GENEVE powerful, though, is extensibility through a proposed set of TLV options that can be set. Did I mention that the Options field in the GENEVE header is variable-length? So all manner of things could possibly be done.

From a day-to-day perspective, the difference between GENEVE and VXLAN is negligible. GENEVE uses UDP/6081, where we’re used to VXLAN using UDP/4789. The Tunnel Endpoints are referred to as “TEPs” or “Tunnel Endpoints” rather than “VTEPs” or “VXLAN Tunnel Endpoints”. Wireshark, as a packet analyzer example, already recognizes and understands GENEVE. So you just keep doing things the way you’ve been doing them. Except that now your logical networks can span both ESXi and KVM hypervisors. See, we’re growing.
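If you want to see what that variable-length Options field means on the wire, here’s a minimal Python sketch that unpacks a GENEVE header from a UDP payload. The field layout follows my reading of the GENEVE specification (an 8-byte base header, then options counted in 4-byte words). The function name, the sample VNI, and the option class in the usage lines are invented for illustration.

```python
import struct

GENEVE_UDP_PORT = 6081  # destination port GENEVE uses, as noted above


def parse_geneve(payload: bytes) -> dict:
    """Split a UDP payload into GENEVE base header fields, options, and inner frame."""
    if len(payload) < 8:
        raise ValueError("GENEVE base header is 8 bytes")
    first, flags, protocol, vni_and_reserved = struct.unpack("!BBHI", payload[:8])
    options_len = (first & 0x3F) * 4  # Opt Len is counted in 4-byte words
    end = 8 + options_len
    if len(payload) < end:
        raise ValueError("payload is shorter than the declared options length")

    options = []
    offset = 8
    while offset < end:
        opt_class, opt_type, len_byte = struct.unpack("!HBB", payload[offset:offset + 4])
        data_len = (len_byte & 0x1F) * 4  # low 5 bits: option data length in words
        data_end = offset + 4 + data_len
        if data_end > end:
            raise ValueError("option overruns the options field")
        options.append(
            {"class": opt_class, "type": opt_type, "data": payload[offset + 4:data_end]}
        )
        offset = data_end

    return {
        "version": first >> 6,
        "oam": bool(flags & 0x80),
        "critical": bool(flags & 0x40),
        "protocol": protocol,           # 0x6558 means an Ethernet frame follows
        "vni": vni_and_reserved >> 8,   # top 24 bits; the low 8 bits are reserved
        "options": options,
        "inner": payload[end:],
    }


# Build a sample packet: one 4-byte option (hypothetical class 0x0104), VNI 10007.
header = struct.pack("!BBHI", 2, 0, 0x6558, 10007 << 8)
option = struct.pack("!HBB", 0x0104, 0x01, 1) + b"\xde\xad\xbe\xef"
parsed = parse_geneve(header + option + b"inner-frame")
print(parsed["vni"], len(parsed["options"]), parsed["inner"])
```

The base header is fixed, but the inner frame only starts after however many option bytes the sender declared, and that is the extensibility hook VXLAN doesn’t have.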

In addition to the change in encapsulation protocol, NSX-T no longer has a dependency on the vSphere Distributed Switch. This does mean a couple of things, though.

First, we no longer have to worry about those crazy long vxw-dvs-83-virtualwire-7-sid-10007-transitNetwork kinda port group names. Logical Networks show up as simply the network name, though they do have a pretty new icon signifying that they’re opaque objects to vCenter (meaning that vCenter really can’t do anything with them, but it knows they exist).

Second, KVM virtual machine attachment to a logical switch isn’t quite as easy as in vSphere. For those, we have to identify the UUID of the virtual NIC (hint: look at the VM’s XML file – “virsh dumpxml <VM domain>”, and look for “interfaceId”), and then create the Logical Port by attaching the VIF (Virtual Interface) in NSX Manager > Switching > Ports > Add, where we specify the VIF UUID we discovered. Not difficult, just a little tedious, and probably worth scripting if you’re adding a bunch of KVM virtual machines to a Logical Switch.
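As a starting point for that scripting, here’s a hedged Python sketch that pulls the interface IDs out of a domain’s XML. It assumes the ID shows up as an interfaceid attribute on a parameters element inside each interface, which is how libvirt writes Open vSwitch virtual ports. The attribute match is case-insensitive in case your output spells it differently. The sample XML, bridge name, MAC addresses, and UUID are made up, and the function names are mine.

```python
import subprocess
import xml.etree.ElementTree as ET


def dump_domain_xml(domain: str) -> str:
    """Return a domain's libvirt XML by shelling out to `virsh dumpxml`."""
    result = subprocess.run(
        ["virsh", "dumpxml", domain],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_vif_ids(domain_xml: str) -> list:
    """Return (MAC address, interface ID) pairs for each NIC that carries an ID."""
    root = ET.fromstring(domain_xml)
    found = []
    for nic in root.iter("interface"):
        mac_el = nic.find("mac")
        mac = mac_el.get("address") if mac_el is not None else None
        for params in nic.iter("parameters"):
            for key, value in params.attrib.items():
                if key.lower() == "interfaceid":
                    found.append((mac, value))
    return found


# Illustrative sample only; on a real host use dump_domain_xml("<VM domain>").
SAMPLE_XML = """<domain type='kvm'>
  <name>web-01</name>
  <devices>
    <interface type='bridge'>
      <mac address='52:54:00:12:34:56'/>
      <source bridge='nsx-managed'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='1f0c8a3e-5d2b-4c7a-9e61-0a1b2c3d4e5f'/>
      </virtualport>
    </interface>
    <interface type='network'>
      <mac address='52:54:00:aa:bb:cc'/>
      <source network='default'/>
    </interface>
  </devices>
</domain>"""

vifs = find_vif_ids(SAMPLE_XML)
for mac, vif_id in vifs:
    print(mac, vif_id)
```

Each ID it prints is the value you’d paste into the VIF field when adding the Logical Port in NSX Manager.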

Finally, don’t chew up all of your physical uplinks – NSX needs a couple of them for the NSX vSwitch that’s installed when you promote the host to a Transport Node.  I have a story I’ll tell about that later in this series.

If you recall from NSX-V, a logical switch exists in the concept of a transport zone.  NSX-T has transport zones as well, and they do precisely the same thing – define a compute scope for our logical networks.  But they’re not exactly the same all around.

In NSX-T we have two different types of transport zone.  One for overlay networks, and another for VLAN-backed networks. In other words, I have the option to build VLAN-backed logical switches.  That sounds a bit crazy, if you ask me.  Why in the world would we want to do that?! Well, the big reason is northbound connectivity to physical routers from the Edge.  I know, we haven’t talked about the Edge yet (that’s coming really soon, now that we’re talking about things in the data plane), so I’ll keep this brief.  The Edge will be configured with one or more VLAN transport zones to connect to upstream VLANs.  I’ll use this analogy again, but it’s essentially like creating VLAN-backed port groups on an NSX vSwitch.

So there you have it, a quick summary of what’s different in Logical Switching.

Beyond Centralization: The Local Control Plane

We tend to place our focus on the centralized management and control planes, but both are actually distributed, existing in both a centralized component (the NSX Manager for the management plane, the multi-node controller cluster for the Central Control Plane) and a distributed component: the Management Plane Agent (MPA) and Local Control Plane (LCP) agents installed on the nodes.

The MPA communicates with NSX Manager over a RabbitMQ channel, and has a couple of purposes:
• Bootstrapping the Transport Node
• Forwarding statistics to NSX Manager

It also works with a service, nsxa, that brokers any communications to and from the kernel.

The LCP is simply the host-local control plane agent – netcpa. We should be familiar with netcpa from NSX-V. We use a proprietary protocol to communicate between netcpa and the controller nodes that uses TCP/1234. On KVM hosts, netcpa is paired with the nsx-agent service to cover the local control plane duties, including programming OVS and ConnTrack for L2, L3, and DFW services.

The LCP is bootstrapped by the Management Plane Agent (MPA), and is responsible for programming the data plane, as we need a user space broker to communicate to the kernel modules installed on the hypervisor.

The LCP is installed as part of the host preparation process when we add Fabric Nodes to NSX Manager, and is responsible for L2 and L3 control data – things like VNIs, VTEP, MAC, and ARP tables, etc. The LCP programs the DFW as well, and also contains VIF (Virtual Interface) status and other such information.

These are critical to the healthy operation of NSX-T.


~$ history
Introduction: From NSX-V to NSX-T. An Adventure
NSX-T: The Manager of All Things NSX
The Hall of the Mountain King. or “What Loot do We Find in nsxcli?”
Three Controllers to Rule Them All (that just doesn’t have the same ring to it, does it?)