Editor's note: In part one of our two-part series on the challenges of container networking at scale, expert Jeff Loughridge discussed issues related to the default network address translation (NAT) configuration. Part two examines ways to mitigate those and other challenges. The material is not specific to any one software container technology and applies equally to Linux containers (LXC), Docker, Canonical LXD and others.
Given software container networking's reliance on NAT -- and given NAT's own limitations -- network engineers face some significant challenges when deploying containers within their infrastructure. But understanding how container hosts support the NAT model will help us avoid these issues.
First, let's take a look at how the host creates a new network namespace -- conceptually similar to a virtual routing and forwarding instance in the MPLS/VPN model -- and a special network interface called a virtual Ethernet (vEth). A vEth interface is a pair of endpoints used to connect namespaces. The host puts one end of the vEth in the default namespace for communication to the outside world and the other in the newly created namespace.
The vEth in the default namespace is tied to a bridge with names such as docker0 for Docker and lxcbr0 in LXC. The host uses iptables in Linux to configure NAT and a lightweight Dynamic Host Configuration Protocol (DHCP) server such as dnsmasq to assign addresses.
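The sequence described above can be sketched with standard iproute2 and iptables commands. The Python snippet below only builds and prints the commands; it does not run them. The namespace name, the interface names and the 172.17.0.0/16 subnet are illustrative assumptions, and the bridge is assumed to exist already. Container runtimes perform equivalent steps through their own code paths, so treat this as a rough model rather than what any particular runtime executes.

```python
def veth_nat_commands(ns="demo", bridge="docker0", host_if="veth-host",
                      ns_if="veth-ns", subnet="172.17.0.0/16"):
    """Return argv lists that wire a new namespace to a NAT bridge.

    All names and the subnet are hypothetical defaults for illustration.
    """
    return [
        # Create the new network namespace.
        ["ip", "netns", "add", ns],
        # Create the vEth pair: two connected endpoints.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        # Move one endpoint into the new namespace.
        ["ip", "link", "set", ns_if, "netns", ns],
        # Attach the other endpoint to the bridge in the default namespace.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # Source-NAT traffic that leaves the bridge subnet for the outside world.
        ["iptables", "-t", "nat", "-A", "POSTROUTING", "-s", subnet,
         "!", "-o", bridge, "-j", "MASQUERADE"],
    ]


if __name__ == "__main__":
    for argv in veth_nat_commands():
        print(" ".join(argv))
```

Running the printed commands requires root privileges. Address assignment inside the namespace (by DHCP or statically) is left out of the sketch.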
Working around NAT
Fortunately, we have techniques to obviate NAT. Chris Swan, CTO of Cohesive Networks, aptly summarized his philosophy of container networking as making containers "first-class citizens of the network" in his talk on Docker networking at Container Camp in 2014.
We can do that by attaching containers directly to hosts' network interfaces. The containers share the local area network (LAN) with the host. They obtain an IPv4 address from the LAN's DHCP server or use static addresses. All Layer 4 ports are fully exposed. While this direct exposure is less messy than managing mapped ports, maintaining a strong security posture requires discipline.
One method for directly connecting to a physical interface is to bridge one vEth endpoint with the Internet-facing physical interface. This approach requires modification of the physical interface, however, by removing the IP address and assigning it to the bridge interface.
Instead of using the vEth network type to connect to a physical interface, system administrators can elect to use the confusingly named "macvlan" network type. The macvlan type has nothing to do with IEEE 802.1Q VLANs; instead, it can be considered a method to multiplex multiple MAC addresses onto a single network interface. The macvlan type is typically deployed in bridge mode, resulting in a much simpler bridge than a traditional learning bridge. No learning occurs and the spanning tree protocol is unnecessary.
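As a rough illustration, a macvlan interface in bridge mode can be created on top of a physical interface and handed to a container namespace with iproute2. The snippet below builds the commands without executing them; the parent interface eth0, the interface name mv0 and the namespace demo are assumptions, and the namespace is assumed to exist.

```python
def macvlan_commands(parent="eth0", name="mv0", ns="demo"):
    """Return argv lists that give a namespace a macvlan interface.

    The interface and namespace names are hypothetical.
    """
    return [
        # Create a macvlan sub-interface of the physical NIC in bridge mode.
        ["ip", "link", "add", "link", parent, "name", name,
         "type", "macvlan", "mode", "bridge"],
        # Move it into the container's namespace.
        ["ip", "link", "set", name, "netns", ns],
        # Bring it up from inside that namespace.
        ["ip", "netns", "exec", ns, "ip", "link", "set", name, "up"],
    ]


if __name__ == "__main__":
    for argv in macvlan_commands():
        print(" ".join(argv))
```

Each macvlan interface carries its own MAC address, so the container appears on the LAN as a separate station and can obtain an address from the LAN's DHCP server.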
A second way to eliminate NAT would be to turn the host into a full-fledged router -- perhaps even one that speaks Border Gateway Protocol (gasp!). The host would route a prefix to the containers that live on the host. Each container would use a globally unique IP address. Providing unique addresses with IPv4 clearly will not scale in this age of a nearly depleted IPv4 address space. IPv6 makes the host-as-a-router technique much cleaner; the protocol's sparse addressing model allows for a giant, easily managed address space.
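To make the sparse addressing point concrete, the sketch below uses Python's standard ipaddress module to carve one /64 per container host out of a site prefix and then pick container addresses inside it. The 2001:db8::/48 documentation prefix, the one-/64-per-host plan and the helper names are assumptions chosen for illustration, not a recommended addressing design.

```python
import ipaddress
from itertools import islice


def host_prefixes(site_prefix, count):
    """Return the first `count` /64 prefixes inside the site prefix,
    one per container host (hypothetical allocation plan)."""
    site = ipaddress.ip_network(site_prefix)
    return list(islice(site.subnets(new_prefix=64), count))


def container_address(host_prefix, index):
    """Return the address at offset `index` inside a host's /64."""
    return host_prefix[index]


if __name__ == "__main__":
    for prefix in host_prefixes("2001:db8::/48", 3):
        print(prefix, "first container:", container_address(prefix, 1))
```

A /48 contains 65,536 such /64 prefixes, and each /64 holds far more addresses than any host will run containers, so the upstream network needs only one route per host.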
Because Layer 3 is not subject to flooding, its use eliminates the large fault domains created in Layer 2 broadcast domains. Ever experienced a data center-wide network meltdown? The cause was almost certainly a spanning tree failure or other flooding-related event in a single Ethernet broadcast domain.
Challenge: Proliferation of MAC addresses
Attaching container interfaces to the outside network introduces a new danger, however: an exploding number of MAC addresses visible in the data center network. The number of MAC addresses that a top-of-rack (ToR) switch can support varies. Switches that can handle a greater number of MAC addresses -- without resorting to flooding all frames -- cost more. Physical network interface cards on the hosts can also switch into promiscuous mode -- thus degrading performance -- as the MAC address limit is surpassed.
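A back-of-the-envelope calculation shows how quickly the numbers grow. The host count, containers per host and switch table size below are hypothetical figures picked only to illustrate the arithmetic; check the real limits in your switch's data sheet.

```python
def macs_in_domain(hosts, containers_per_host):
    """MAC addresses visible in one Layer 2 domain when every container
    exposes its own MAC: one per host NIC plus one per container."""
    return hosts * (1 + containers_per_host)


def fits_in_table(table_size, hosts, containers_per_host):
    """True if the switch can learn every MAC without overflowing."""
    return macs_in_domain(hosts, containers_per_host) <= table_size


if __name__ == "__main__":
    hosts, per_host, table = 1000, 50, 32000  # illustrative values only
    print("MAC addresses:", macs_in_domain(hosts, per_host))
    print("fits in table:", fits_in_table(table, hosts, per_host))
```

With these assumed values, 1,000 hosts that would need 1,000 table entries on their own need 51,000 once each runs 50 directly attached containers, which exceeds the assumed 32,000-entry table.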
How can we have end-to-end reachability between containers directly attached to host physical interfaces without melting down the Layer 2 forwarding tables on ToR switches? Enter the ipvlan feature, introduced in the Linux kernel in late 2014. Rather than using the MAC address as a demultiplexer, as the macvlan driver does, the ipvlan driver uses the Layer 3 (IP) address. When the ipvlan driver is deployed in L3 mode, the container MAC addresses are not exposed to the network. Only the host MAC address on the physical interface is visible in the network.
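Creating an ipvlan interface looks much like the other iproute2 link types. The sketch below builds, without running, the commands for an ipvlan interface in L3 mode; the parent interface eth0, the name ipvl0 and the namespace demo are assumptions, and it presumes a kernel and an iproute2 build that include ipvlan support.

```python
VALID_MODES = ("l2", "l3")  # the modes assumed to be available here


def ipvlan_commands(parent="eth0", name="ipvl0", ns="demo", mode="l3"):
    """Return argv lists that give a namespace an ipvlan interface.

    The interface and namespace names are hypothetical.
    """
    if mode not in VALID_MODES:
        raise ValueError("unsupported ipvlan mode: %r" % (mode,))
    return [
        # Create an ipvlan sub-interface; it shares the parent's MAC address.
        ["ip", "link", "add", "link", parent, "name", name,
         "type", "ipvlan", "mode", mode],
        # Move it into the container's namespace.
        ["ip", "link", "set", name, "netns", ns],
        # Bring it up from inside that namespace.
        ["ip", "netns", "exec", ns, "ip", "link", "set", name, "up"],
    ]


if __name__ == "__main__":
    for argv in ipvlan_commands():
        print(" ".join(argv))
```

Because every ipvlan interface reuses the parent's MAC address, the number of entries the ToR switch must learn stays at one per host regardless of how many containers are attached.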
The ipvlan feature is available in Linux kernel 3.19, although a more stable implementation exists in 4.0-rc7 and later. At the time of this writing, you'll need to compile a custom kernel to experiment with ipvlan, because most Linux distributions ship older, more stable kernels.
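A quick way to see whether a host's kernel is new enough is to compare its release string with 3.19. The helper below is a minimal sketch; the function name is hypothetical, and a sufficient version number does not guarantee that the distribution actually built the ipvlan module.

```python
import platform
import re


def kernel_at_least(release, minimum=(3, 19)):
    """Return True if a kernel release string such as '4.0.0-rc7'
    is at or above the given (major, minor) version."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False  # unrecognized format: treat as not new enough
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    release = platform.release()
    print(release, "meets the 3.19 minimum:", kernel_at_least(release))
```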
Challenge: vEth network type at scale
Finally, while probably not noticeable in most container environments, using the vEth network type for large-scale deployments can hurt container network performance. In his talk, Networking in Containers and Container Clusters, Victor Marmol of Google described how the company recorded a 50% reduction in performance when using the vEth network type, compared with performance in the default namespace.
The macvlan and ipvlan features eliminate some of the processing required for the vEth network type. Kernel developer Eric W. Biederman described the macvlan driver as "simple, stupid and fast." The performance is maintained in the ipvlan driver. Expect to see performance enhancements in both the macvlan and ipvlan drivers in the future.