Netns Docker For Mac


Foreword: this post is part of the author's Docker series. It is the author's original work; please credit the source when reposting. Overview: this chapter describes Docker's network namespaces. Docker's isolated network environments are implemented with Linux network namespaces. Docker networking comes in four modes: bridge, host, container, and none. Docker is a platform for developers and sysadmins to build, run, and share applications with containers. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

Introduction

In part 1 of this blog post we saw how Docker creates a dedicated namespace for the overlay and connects the containers to this namespace. In part 2 we looked in detail at how Docker uses VXLAN to tunnel traffic between the hosts in the overlay. In this third post, we will see how we can create our own overlay with standard Linux commands.

Manual overlay creation

If you have tried the commands from the first two posts, you need to clean up your Docker hosts by removing all the containers and the overlay network:

The first thing we are going to do now is to create a network namespace called “overns”:


Now we are going to create a bridge in this namespace, give it an IP address and bring the interface up:

The next step is to create a VXLAN interface and attach it to the bridge:

The most important command so far is the creation of the VXLAN interface. We configured it to use VXLAN id 42 and to tunnel traffic on the standard VXLAN port. The proxy option allows the vxlan interface to answer ARP queries (we have seen it in part 2). We will discuss the learning option later in this post. Notice that we did not create the VXLAN interface inside the namespace but on the host and then moved it to the namespace. This is necessary so the VXLAN interface can keep a link with our main host interface and send traffic over the network. If we had created the interface inside the namespace (like we did for br0) we would not have been able to send traffic outside the namespace.
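The steps described so far can be sketched as a small Python helper that only builds the iproute2 command lines and runs nothing. The names overns and br0, the VXLAN id 42 and the proxy and learning options come from the text; the interface name vxlan1 and the bridge address 192.168.0.1/24 are assumptions made for illustration, and 4789 is the IANA-assigned VXLAN port.

```python
import shlex

OVERLAY_NS = "overns"            # namespace name used in this post
BRIDGE = "br0"                   # bridge name used in this post
VXLAN_DEV = "vxlan1"             # assumed interface name
BRIDGE_ADDR = "192.168.0.1/24"   # assumed bridge address
VXLAN_ID = 42
VXLAN_PORT = 4789                # IANA-assigned VXLAN UDP port


def in_ns(command):
    """Prefix a command so that it runs inside the overlay namespace."""
    return f"ip netns exec {OVERLAY_NS} {command}"


def overlay_commands():
    """Return the command lines, in order, as lists of arguments."""
    lines = [
        f"ip netns add {OVERLAY_NS}",
        in_ns(f"ip link add dev {BRIDGE} type bridge"),
        in_ns(f"ip addr add dev {BRIDGE} {BRIDGE_ADDR}"),
        in_ns(f"ip link set {BRIDGE} up"),
        # Created on the host, then moved into the namespace, so that the
        # interface keeps its link with the main host interface.
        f"ip link add dev {VXLAN_DEV} type vxlan id {VXLAN_ID} "
        f"proxy learning dstport {VXLAN_PORT}",
        f"ip link set {VXLAN_DEV} netns {OVERLAY_NS}",
        in_ns(f"ip link set {VXLAN_DEV} master {BRIDGE}"),
        in_ns(f"ip link set {VXLAN_DEV} up"),
    ]
    return [shlex.split(line) for line in lines]


for argv in overlay_commands():
    print(" ".join(argv))
```

Each printed line would need root privileges to run for real, for example through sudo.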

Once we have run these commands on both docker0 and docker1, here is what we have:

Now we will create containers and connect them to our bridge. Let’s start with docker0. First, we create a container:

We will need the path of the network namespace for this container. We can find it by inspecting the container.
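One way to obtain that path is the SandboxKey field reported by docker inspect. The sketch below builds the inspect command line and derives the name under which ip netns can later refer to the namespace. The container name demo and the sample path are made-up values for illustration.

```python
import posixpath

NETNS_DIR = "/var/run/netns"   # directory scanned by "ip netns"


def inspect_command(container):
    """Command line that prints the container's network namespace path."""
    return ["docker", "inspect", "--format",
            "{{ .NetworkSettings.SandboxKey }}", container]


def netns_link(sandbox_key):
    """Return (namespace name, link path under /var/run/netns)."""
    name = posixpath.basename(sandbox_key)
    return name, posixpath.join(NETNS_DIR, name)


print(" ".join(inspect_command("demo")))
# Hypothetical path; real values depend on the Docker installation.
print(netns_link("/var/run/docker/netns/0123456789ab"))
```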

Our container has no network connectivity because of the --net=none option. We now create a veth and move one of its endpoints (veth1) to our overlay network namespace, attach it to the bridge and bring it up.

The first command uses an MTU of 1450 which is necessary due to the overhead added by the VXLAN header.
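The 1450 figure follows from the encapsulation overhead. As a quick check, assuming an underlay MTU of 1500 bytes and an IPv4 underlay without VLAN tags:

```python
UNDERLAY_MTU = 1500   # assumed MTU of the host network

# Bytes that VXLAN over IPv4 places in front of the inner IP packet
OVERHEAD = {
    "outer IPv4 header": 20,
    "outer UDP header": 8,
    "VXLAN header": 8,
    "inner Ethernet header": 14,
}


def overlay_mtu(underlay_mtu=UNDERLAY_MTU):
    """Largest inner IP packet that fits in one underlay packet."""
    return underlay_mtu - sum(OVERHEAD.values())


print(overlay_mtu())   # 1450
```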

The last step is to configure veth2: send it to our container network namespace and configure it with a MAC address (02:42:c0:a8:00:02) and an IP address (192.168.0.2):

The symbolic link in /var/run/netns is required so we can use the native ip netns commands (to move the interface to the container network namespace). We used the same addressing scheme as Docker: the last 4 bytes of the MAC address match the IP address of the container and the second byte (42) is the VXLAN id.
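The addressing scheme is easy to express in code. This is a minimal sketch; the function name mac_for_ip is made up for illustration, and the 02:42 prefix is simply the one used by the addresses in this post.

```python
import ipaddress

MAC_PREFIX = bytes([0x02, 0x42])   # first two bytes of the MACs in this post


def mac_for_ip(ip):
    """Build the MAC address whose last 4 bytes are the IPv4 address."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{byte:02x}" for byte in MAC_PREFIX + octets)


print(mac_for_ip("192.168.0.2"))   # 02:42:c0:a8:00:02
print(mac_for_ip("192.168.0.3"))   # 02:42:c0:a8:00:03
```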


We have to do the same on docker1 with different MAC and IP addresses (02:42:c0:a8:00:03 and 192.168.0.3). If you use the Terraform stack from the GitHub repository, there is a helper shell script to attach the container to the overlay. We can use it on docker1:

The first parameter is the name of the container to attach and the second one the final digit of the MAC/IP addresses.

Here is the setup we have gotten to:

Now that our containers are configured, we can test connectivity:

We are not able to ping yet. Let’s try to understand why by looking at the ARP entries in the container and in the overlay namespace:

Neither command returns any result: neither the container nor the overlay namespace knows which MAC address is associated with IP 192.168.0.3. We can verify that our ping is generating an ARP query by running tcpdump in the overlay namespace:

If we rerun the ping command from another terminal, here is the tcpdump output we get:

The ARP query is broadcast and received by our overlay namespace, but it does not receive any answer. We have seen in part 2 that the Docker daemon populates the ARP and FDB tables and makes use of the proxy option of the VXLAN interface to answer these queries. We configured our interface with this option so we can do the same by simply populating the ARP and FDB entries in the overlay namespace:

The first command creates the ARP entry for 192.168.0.3 and the second one configures the forwarding table by telling it the MAC address is accessible using the VXLAN interface, with VXLAN id 42 and on host 10.0.0.11.
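As an illustration, the two entries could be expressed with the ip neighbor and bridge fdb commands. The sketch below only builds the command lines and does not run them. The interface name vxlan1 is an assumption; the addresses, the VXLAN id and the host 10.0.0.11 are the ones quoted above, and 4789 is the standard VXLAN port.

```python
import shlex

OVERLAY_NS = "overns"
VXLAN_DEV = "vxlan1"   # assumed interface name
VXLAN_ID = 42
VXLAN_PORT = 4789


def neighbor_commands(ip, mac, host):
    """Return the ARP command, then the FDB command, for a remote container."""
    prefix = f"ip netns exec {OVERLAY_NS}"
    arp = f"{prefix} ip neighbor add {ip} lladdr {mac} dev {VXLAN_DEV}"
    fdb = (f"{prefix} bridge fdb add {mac} dev {VXLAN_DEV} self "
           f"dst {host} vni {VXLAN_ID} port {VXLAN_PORT}")
    return [shlex.split(arp), shlex.split(fdb)]


for argv in neighbor_commands("192.168.0.3", "02:42:c0:a8:00:03", "10.0.0.11"):
    print(" ".join(argv))
```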

Do we have connectivity?

Not yet, which makes sense because we have not configured docker1: the ICMP request is received by the container on docker1 but it does not know how to answer. We can verify this on docker1:

The first command shows, as expected, that we do not have any ARP information on 192.168.0.3. The output of the second command is more surprising because we can see the entry in the forwarding database for our container on docker0. What happened is the following: when the ICMP request reached the interface, the entry was “learned” and added to the database. This behavior is made possible by the “learning” option of the VXLAN interface. Let’s add the ARP information on docker1 and verify that we can now ping:

We have successfully built an overlay with standard Linux commands:

Dynamic container discovery

We have just created an overlay from scratch. However, we need to manually create ARP and FDB entries for containers to talk to each other. We will now look at how this discovery process can be automated.

Let us first clean up our setup to start from scratch:

Catching network events: NETLINK

Netlink is used to transfer information between the kernel and user-space processes: https://en.wikipedia.org/wiki/Netlink. iproute2, which we used earlier to configure interfaces, relies on Netlink to get/send configuration information to the kernel. Netlink consists of multiple protocols (“families”) to communicate with different kernel components. The most common protocol is NETLINK_ROUTE, which is the interface for routing and link configuration.

For each protocol, Netlink messages are organized by groups, for example for NETLINK_ROUTE you have:

  • RTMGRP_LINK: link related messages
  • RTMGRP_NEIGH: neighbor related messages
  • many others

For each group, you then have multiple notifications, for example:

  • RTMGRP_LINK:
    • RTM_NEWLINK: A link was created
    • RTM_DELLINK: A link was deleted
  • RTMGRP_NEIGH:
    • RTM_NEWNEIGH: A neighbor was added
    • RTM_DELNEIGH: A neighbor was deleted
    • RTM_GETNEIGH: The kernel is looking for a neighbor

I described the messages received in userspace when the kernel is sending notifications for these events, but similar messages can be sent to the kernel to configure links or neighbors.
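For reference, here are the numeric values behind the names listed above, as defined in the Linux header linux/rtnetlink.h. The lookup helper describe is made up for illustration.

```python
# Multicast groups (bit masks used when binding the socket)
RTMGRP_LINK = 0x1
RTMGRP_NEIGH = 0x4

# Message types
RTM_NEWLINK, RTM_DELLINK = 16, 17
RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH = 28, 29, 30

MESSAGES = {
    RTM_NEWLINK: ("RTMGRP_LINK", "A link was created"),
    RTM_DELLINK: ("RTMGRP_LINK", "A link was deleted"),
    RTM_NEWNEIGH: ("RTMGRP_NEIGH", "A neighbor was added"),
    RTM_DELNEIGH: ("RTMGRP_NEIGH", "A neighbor was deleted"),
    RTM_GETNEIGH: ("RTMGRP_NEIGH", "The kernel is looking for a neighbor"),
}


def describe(msg_type):
    """Return a readable description of a message type listed above."""
    group, text = MESSAGES.get(msg_type, ("unknown", "not covered here"))
    return f"{group}: {text}"


print(describe(RTM_GETNEIGH))
```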

iproute2 allows us to listen to Netlink events using the monitor subcommand. If we want to monitor for link information for instance:

In another terminal on docker0, we can create a link and then delete it:

On the first terminal we can see some output.

When we created the interfaces:


When we removed them:

We can use this command to monitor other events:

In another terminal:


We get the following output:

In our case we are interested in neighbor events, in particular RTM_GETNEIGH, which is generated when the kernel does not have neighbor information and notifies userspace so that an application can create it. By default, this event is not sent to userspace, but we can enable it and monitor neighbor notifications:

This setting will not be necessary afterwards because the l2miss and l3miss options of our vxlan interface will generate the RTM_GETNEIGH events.
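A sketch of how such a setting can be inspected from Python, assuming it is the per-interface app_solicit neighbor sysctl (the number of probes handed to a user-space daemon over Netlink, 0 by default) and that the interface is called eth0. Both are assumptions.

```python
from pathlib import Path


def app_solicit_path(dev, family="ipv4"):
    """Location of the app_solicit setting for one interface."""
    return Path("/proc/sys/net") / family / "neigh" / dev / "app_solicit"


def read_app_solicit(dev):
    """Return the current value, or None if it cannot be read."""
    try:
        return int(app_solicit_path(dev).read_text())
    except (OSError, ValueError):
        return None


print(app_solicit_path("eth0"))
print(read_app_solicit("eth0"))   # None when eth0 does not exist
```

Writing 1 to that file as root would be the equivalent of enabling the notifications.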

In a second terminal, we can now trigger the generation of the GETNEIGH event:

Here is the output we get:


We can use the same command in containers attached to our overlay. Let’s create an overlay and attach a container to it.

The two shell scripts are available on the GitHub repo.


create-overlay creates an overlay called overns using the commands presented earlier:

attach-ctn attaches a container to the overlay. The first parameter is the name of the container and the second one the last byte of its IP address:

We can now run ip monitor in the container:

In a second terminal, we can ping an unknown host to generate GETNEIGH events:

In the first terminal we can see the neighbor events:


We can also look in the network namespace of the overlay:

This event is slightly different because it is generated by the vxlan interface (because we created the interface with the l2miss and l3miss options). Let’s add the neighbor entry to the overlay namespace:

If we run the ip monitor neigh command and try to ping from the other terminal, here is what we get:

Now that we have the ARP information, we are getting an L2 miss because we do not know where the MAC address is located in the overlay. Let’s add this information:

If we run the ip monitor neigh command again and try to ping we will not see neighbor events anymore.

The ip monitor command is very useful to see what is happening but in our case we want to catch these events to populate L2 and L3 information so we need to interact with them programmatically.

Here is a simple Python script that subscribes to Netlink messages and decodes GETNEIGH events:

This script only contains the interesting lines; the full one is available on the GitHub repository. Let’s go quickly through the most important parts of the script. First, we create the NETLINK socket, configure it for the NETLINK_ROUTE protocol and subscribe to the neighbor event group (RTMGRP_NEIGH):
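A minimal sketch of that socket setup with the standard library (Linux only). The function name open_neigh_socket is made up for illustration, and the port id 0, which lets the kernel choose one, is an assumption rather than what the full script necessarily does.

```python
import socket

RTMGRP_NEIGH = 0x4   # neighbor multicast group, from linux/rtnetlink.h


def open_neigh_socket():
    """Open a NETLINK_ROUTE socket subscribed to neighbor events."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                         socket.NETLINK_ROUTE)
    # A Netlink address is a (port id, multicast groups) pair;
    # port id 0 lets the kernel assign one.
    sock.bind((0, RTMGRP_NEIGH))
    return sock
```

A caller would then read raw messages with sock.recv(65535) in a loop and hand each one to the decoding step.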

Then we decode the message and filter to only process GETNEIGH messages:

To understand how the message is decoded, here is a representation of the message. The Netlink header is represented in orange:

Once we have a GETNEIGH message we can decode the ndmsg header (in blue):

This header is followed by an rtattr structure, which contains the data we are interested in. First we decode the header of the structure (purple):

We can receive two different types of messages:

  • NDA_DST: L3 miss, the kernel is looking for the MAC address associated with the IP in the data field (4 data bytes after the rta header)
  • NDA_LLADDR: L2 miss, the kernel is looking for the VXLAN host for the MAC address in the data field (6 data bytes after the rta header)
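The decoding steps above can be sketched with the struct module. This is an illustration rather than the script from the repository: the function name decode_miss is made up, the struct layouts follow the kernel's nlmsghdr, ndmsg and rtattr structures, and only the first attribute is examined, as in the description above.

```python
import socket
import struct

RTM_GETNEIGH = 30
NDA_DST, NDA_LLADDR = 1, 2

NLMSGHDR = struct.Struct("=LHHLL")   # length, type, flags, sequence, port id
NDMSG = struct.Struct("=BBHiHBB")    # family, pad, pad, ifindex, state, flags, type
RTATTR = struct.Struct("=HH")        # length (header included), type


def decode_miss(data):
    """Return ("L3", ip) or ("L2", mac) for a GETNEIGH message, else None."""
    if len(data) < NLMSGHDR.size + NDMSG.size + RTATTR.size:
        return None
    _length, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(data, 0)
    if msg_type != RTM_GETNEIGH:
        return None
    # The rtattr header follows the Netlink header and the ndmsg header.
    offset = NLMSGHDR.size + NDMSG.size
    rta_len, rta_type = RTATTR.unpack_from(data, offset)
    payload = data[offset + RTATTR.size:offset + rta_len]
    if rta_type == NDA_DST and len(payload) == 4:
        return "L3", socket.inet_ntoa(payload)
    if rta_type == NDA_LLADDR and len(payload) == 6:
        return "L2", ":".join(f"{byte:02x}" for byte in payload)
    return None
```

Feeding it the bytes returned by recv on a socket subscribed to RTMGRP_NEIGH gives one tuple per miss, which is enough to decide whether ARP or forwarding data is needed.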

We can try this script in our overlay (we recreate everything to start with a clean environment):

If we try to ping from another terminal:

Here is the output we get:

If we add the neighbor information and ping again:

We now get an L2 miss because we have added the L3 information.

Dynamic discovery with Consul

Now that we have seen how we can be notified of L2 and L3 misses and catch these events in Python, we will store all L2 and L3 data in Consul and add the entries in the overlay namespace when we get a neighbor event.

First, we are going to create the entries in Consul. We can do this using the web interface or curl:

We create two types of entries:

  • ARP: using the keys demo/arp/{IP address} with the MAC address as the value
  • FIB: using the keys demo/fib/{MAC address} with the IP address of the server in the overlay hosting this MAC address
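As an illustration, the entries for a set of containers can be generated and stored through the Consul KV HTTP API (PUT /v1/kv/<key>). The Consul address http://consul:8500, the fib key prefix used to keep forwarding entries apart from ARP entries, and the helper names are assumptions.

```python
import urllib.request

CONSUL_URL = "http://consul:8500"   # assumed Consul address
PREFIX = "demo"


def consul_entries(containers):
    """Map Consul keys to values for (ip, mac, host) tuples."""
    entries = {}
    for ip, mac, host in containers:
        entries[f"{PREFIX}/arp/{ip}"] = mac     # L3 data: IP -> MAC
        entries[f"{PREFIX}/fib/{mac}"] = host   # L2 data: MAC -> host (assumed prefix)
    return entries


def put(key, value):
    """Store one entry in Consul; returns True on HTTP 200."""
    request = urllib.request.Request(
        f"{CONSUL_URL}/v1/kv/{key}", data=value.encode(), method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.status == 200


entries = consul_entries([("192.168.0.3", "02:42:c0:a8:00:03", "10.0.0.11")])
for key, value in entries.items():
    print(key, "=", value)
```

Calling put(key, value) for each pair would create the keys, provided a Consul agent is reachable at the assumed address.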

In the web interface, we get this for ARP keys:

Now we just need to look up data when we receive a GETNEIGH event and populate the ARP or FIB tables using Consul data. Here is a (slightly simplified) Python script which does this:

The full version of this script is also available on the GitHub repository mentioned earlier. Here is a quick explanation of what it does:

Instead of processing Netlink messages manually, we use the pyroute2 library. This library will parse Netlink messages and allow us to send Netlink messages to configure ARP/FIB entries. In addition, we bind the Netlink socket in the overlay namespace. We could use the ip netns command to start the script in the namespace, but we also need to access Consul from the script to get configuration data. To achieve this, we will run the script in the host network namespace and bind the Netlink socket in the overlay namespace:

We will now wait for GETNEIGH events:

We retrieve the index of the interface and its name (for logging purposes):

Now, if the message is an L3 miss, we get the IP address from the Netlink message payload and try to look up the associated ARP entry from Consul. If we find it, we add the neighbor entry to the overlay namespace by sending a Netlink message to the kernel with the relevant information.

If the message is an L2 miss, we do the same with the FIB data.
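The decision logic for both kinds of miss can be sketched on its own, with a plain dictionary standing in for Consul and no Netlink calls. The function name handle_miss, the returned tuples and the fib key prefix are assumptions made for illustration.

```python
def handle_miss(kind, key, kv, prefix="demo"):
    """Return the entry to program for a miss, or None if nothing is known.

    kind is "L3" (key is an IP address) or "L2" (key is a MAC address);
    kv is any mapping that stands in for the Consul key/value store.
    """
    if kind == "L3":
        mac = kv.get(f"{prefix}/arp/{key}")
        return None if mac is None else ("neigh", key, mac)
    if kind == "L2":
        host = kv.get(f"{prefix}/fib/{key}")
        return None if host is None else ("fdb", key, host)
    raise ValueError(f"unknown miss type: {kind}")


store = {
    "demo/arp/192.168.0.3": "02:42:c0:a8:00:03",
    "demo/fib/02:42:c0:a8:00:03": "10.0.0.11",
}
print(handle_miss("L3", "192.168.0.3", store))
print(handle_miss("L2", "02:42:c0:a8:00:03", store))
```

In a complete script, a "neigh" result would become a neighbor entry and an "fdb" result a forwarding entry, both sent to the kernel over Netlink.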

Let’s now try this script. First, we will clean up everything and recreate the overlay namespace and containers:

If we try to ping the container on docker1 from docker0, it will not work because we have no ARP/FIB data yet:

We will now start our script on both hosts:

And try pinging again (from another terminal on docker0):

Here is the output we get from the Python script on docker0:

First, we get an L3 miss (no ARP data for 192.168.0.3); we query Consul to find the MAC address and populate the neighbor table. Then we receive an L2 miss (no FIB information for 02:42:c0:a8:00:03); we look up this MAC address in Consul and populate the forwarding database.

On docker1, we see a similar output but we only get the L3 miss because the L2 forwarding data is learned by the overlay namespace when the ICMP request packet gets to the overlay.

Here is an overview of what we built:

Conclusion

This concludes our three-part blog post on the Docker overlay. Do not hesitate to ping me (on Twitter for instance) if you see any mistakes or inaccuracies, or if some parts of the posts are not clear. I will do my best to amend these posts quickly.

May 11th, 2016
DEBU[0265] Calling POST /v1.23/containers/create?name=c1
DEBU[0265] form data: {'AttachStderr':false,'AttachStdin':false,'AttachStdout':false,'Cmd':['sh'],'Domainname':','Entrypoint':null,'Env':[],'HostConfig':{'AutoRemove':false,'Binds':null,'BlkioBps':0,'BlkioDeviceReadBps':null,'BlkioDeviceReadIOps':null,'BlkioDeviceWriteBps':null,'BlkioDeviceWriteIOps':null,'BlkioIOps':0,'BlkioWeight':0,'BlkioWeightDevice':null,'CapAdd':null,'CapDrop':null,'Cgroup':','CgroupParent':','ConsoleSize':[0,0],'ContainerIDFile':','CpuCount':0,'CpuPercent':0,'CpuPeriod':0,'CpuQuota':0,'CpuShares':0,'CpusetCpus':','CpusetMems':','Devices':[],'DiskQuota':0,'Dns':[],'DnsOptions':[],'DnsSearch':[],'ExtraHosts':null,'GroupAdd':null,'IpcMode':','Isolation':','KernelMemory':0,'Links':null,'LogConfig':{'Config':{},'Type':'},'Memory':0,'MemoryReservation':0,'MemorySwap':0,'MemorySwappiness':-1,'NetworkMode':'demo','OomKillDisable':false,'OomScoreAdj':0,'PidMode':','PidsLimit':0,'PortBindings':{},'Privileged':false,'PublishAllPorts':false,'ReadonlyRootfs':false,'RestartPolicy':{'MaximumRetryCount':0,'Name':'no'},'SandboxSize':0,'SecurityOpt':null,'ShmSize':0,'StorageOpt':null,'UTSMode':','Ulimits':null,'UsernsMode':','VolumeDriver':','VolumesFrom':null},'Hostname':','Image':'alpine','Labels':{},'NetworkingConfig':{'EndpointsConfig':{}},'OnBuild':null,'OpenStdin':true,'StdinOnce':false,'Tty':true,'User':','Volumes':{},'WorkingDir':'}
DEBU[0265] container mounted via layerStore: /var/lib/docker/296608.296608/aufs/mnt/90df06068fac304793bdaa57204e7b7b1da70de500023788ac61c46db71946e0
DEBU[0265] Calling POST /v1.23/containers/35600e201ecbd508b254b42b701ca96386182e274894d62478f14819f71aea38/start
DEBU[0265] container mounted via layerStore: /var/lib/docker/296608.296608/aufs/mnt/90df06068fac304793bdaa57204e7b7b1da70de500023788ac61c46db71946e0
DEBU[0265] Assigning addresses for endpoint c1's interface on network demo
DEBU[0265] RequestAddress(GlobalDefault/10.0.0.0/24, <nil>, map[])
DEBU[0265] Assigning addresses for endpoint c1's interface on network demo
DEBU[0265] Allocating IPv4 pools for network docker_gwbridge (1d3362ae7a767e90652d7feb46f79d61fdfbfe002457a19fe9391ab8bbade309)
DEBU[0265] RequestPool(LocalDefault, , , map[], false)
DEBU[0265] RequestAddress(LocalDefault/172.18.0.0/16, <nil>, map[RequestAddressType:com.docker.network.gateway])
DEBU[0265] Received user event name:jl 20.20.20.1 752d6d03b04fd1a10626cfbc663080034efca3786af8cc679964a43bcfeb28b3 81a0074432fa02a51a190c91f5144167dd124e82b02e69a8cc90ab5c00385f07, payload:join 10.0.0.2 255.255.255.0 02:42:0a:00:00:02
DEBU[0265] Parsed data = 752d6d03b04fd1a10626cfbc663080034efca3786af8cc679964a43bcfeb28b3/81a0074432fa02a51a190c91f5144167dd124e82b02e69a8cc90ab5c00385f07/20.20.20.1/10.0.0.2/255.255.255.0/02:42:0a:00:00:02
DEBU[0265] Setting bridge mac address to 02:42:8f:50:4b:fc
DEBU[0265] Assigning address to bridge interface docker_gwbridge: 172.18.0.1/16
DEBU[0265] /sbin/iptables, [--wait -t nat -C POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE]
DEBU[0265] /sbin/iptables, [--wait -t nat -I POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE]
DEBU[0265] /sbin/iptables, [--wait -t nat -C DOCKER -i docker_gwbridge -j RETURN]
DEBU[0265] /sbin/iptables, [--wait -t nat -I DOCKER -i docker_gwbridge -j RETURN]
DEBU[0265] /sbin/iptables, [--wait -D FORWARD -i docker_gwbridge -o docker_gwbridge -j ACCEPT]
DEBU[0265] /sbin/iptables, [--wait -t filter -C FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP]
DEBU[0265] /sbin/iptables, [--wait -A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP]
DEBU[0265] /sbin/iptables, [--wait -t filter -C FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT]
DEBU[0265] /sbin/iptables, [--wait -I FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT]
DEBU[0265] /sbin/iptables, [--wait -t filter -C FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]
DEBU[0265] /sbin/iptables, [--wait -I FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT]
DEBU[0265] /sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]
DEBU[0265] /sbin/iptables, [--wait -t nat -C PREROUTING -m addrtype --dst-type LOCAL -j DOCKER]
DEBU[0265] /sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER ! --dst 127.0.0.0/8]
DEBU[0265] /sbin/iptables, [--wait -t nat -C OUTPUT -m addrtype --dst-type LOCAL -j DOCKER ! --dst 127.0.0.0/8]
DEBU[0265] /sbin/iptables, [--wait -t filter -C FORWARD -o docker_gwbridge -j DOCKER]
DEBU[0265] /sbin/iptables, [--wait -I FORWARD -o docker_gwbridge -j DOCKER]
DEBU[0265] /sbin/iptables, [--wait -t filter -C FORWARD -j DOCKER-ISOLATION]
DEBU[0265] /sbin/iptables, [--wait -D FORWARD -j DOCKER-ISOLATION]
DEBU[0265] /sbin/iptables, [--wait -I FORWARD -j DOCKER-ISOLATION]
DEBU[0265] /sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]
DEBU[0265] /sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker_gwbridge -o docker0 -j DROP]
DEBU[0265] /sbin/iptables, [--wait -t filter -C DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]
DEBU[0265] /sbin/iptables, [--wait -I DOCKER-ISOLATION -i docker0 -o docker_gwbridge -j DROP]
DEBU[0265] releasing IPv4 pools from network docker_gwbridge (1d3362ae7a767e90652d7feb46f79d61fdfbfe002457a19fe9391ab8bbade309)
DEBU[0265] ReleaseAddress(LocalDefault/172.18.0.0/16, 172.18.0.1)
DEBU[0265] ReleasePool(LocalDefault/172.18.0.0/16)
WARN[0265] Could not rollback container connection to network demo
DEBU[0265] Received user event name:jl 20.20.20.1 752d6d03b04fd1a10626cfbc663080034efca3786af8cc679964a43bcfeb28b3 81a0074432fa02a51a190c91f5144167dd124e82b02e69a8cc90ab5c00385f07, payload:leave 10.0.0.2 255.255.255.0 02:42:0a:00:00:02
DEBU[0265] Parsed data = 752d6d03b04fd1a10626cfbc663080034efca3786af8cc679964a43bcfeb28b3/81a0074432fa02a51a190c91f5144167dd124e82b02e69a8cc90ab5c00385f07/20.20.20.1/10.0.0.2/255.255.255.0/02:42:0a:00:00:02
DEBU[0265] Releasing addresses for endpoint c1's interface on network demo
DEBU[0265] ReleaseAddress(GlobalDefault/10.0.0.0/24, 10.0.0.2)
failed to umount /var/lib/docker/296608.296608/containers/35600e201ecbd508b254b42b701ca96386182e274894d62478f14819f71aea38/shm: no such file or directory
ERRO[0265] Handler for POST /v1.23/containers/35600e201ecbd508b254b42b701ca96386182e274894d62478f14819f71aea38/start returned error: error creating external connectivity network: cannot restrict inter-container communication: please ensure that br_netfilter kernel module is loaded