Overlay Networks (MTU considerations) #
When working with overlay networks across clusters, sometimes we have to manually set MTUs on both ends of the connection.
Take WireGuard, which determines the MTU to use automatically: it estimates the MTU of the underlying network (MTU_host) and sets the MTU of the WireGuard interface to MTU_host - 80. For example, if the underlying network’s MTU is 1500, the WireGuard interface will get MTU = 1420.
However, a WireGuard tunnel is often established between hosts that reside in different networks, in which case the peers may choose different MTU values. This happens, for example, when one of the peers is an AWS EC2 instance with an MTU of 9001: the EC2 peer’s WireGuard MTU would be automatically set to 8921, which can lead to unstable communication in some applications.
Additionally, WireGuard does not always determine the uplink MTU correctly, so it can still set MTU = 1420 on wg0 even when the uplink MTU is below 1500.
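A minimal sketch of inspecting and overriding the auto-detected value; the interface name wg0 is from the example above, and 1392 is an assumed value for a 1472-byte uplink:
# inspect the MTU that WireGuard/wg-quick picked
ip link show wg0 | grep -o 'mtu [0-9]*'
# override it manually (1392 = 1472 - 80; adjust to your uplink)
ip link set dev wg0 mtu 1392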
Checking Packet Size Limit (per host) #
The maximum DF (don’t fragment) packet size between two hosts depends on the MTU values of all networks along the path between them. Therefore, the appropriate MTU can differ for communication with local nodes, WireGuard peers, or public servers.
The ping command can be used to determine the maximum DF packet size that can be transmitted between two hosts.
tl;dr #
# IPv4
ping -v -M do -c 8 -s 1000 1.1.1.1
# IPv6
ping -v -M do -c 8 -s 1000 2606:4700:4700::1111
where:
- -v - verbose output
- -M do - determine MTU by setting the DF (don’t fragment) flag
- -c 8 - send 8 packets
- -s 1000 - size of the payload
Final (actual) size of ping packets (IPv4 and IPv6)
In the previous example, the actual size of IPv4 packets will be
1028 = 1000 (payload) + 20 (IPv4 header) + 8 (ICMP header)
and for IPv6
1048 = 1000 (payload) + 40 (IPv6 header) + 8 (ICMP header)
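To find the limit quickly without manual trial and error, one can sweep candidate payload sizes downward until a DF probe passes. A minimal IPv4-only sketch; the candidate sizes and the 1.1.1.1 target are assumptions:
# decrease the payload until a DF ping passes
for s in 1472 1444 1424 1392 1280; do
  if ping -q -M do -c 1 -W 2 -s "$s" 1.1.1.1 >/dev/null 2>&1; then
    echo "max passing payload: $s (IPv4 packet size: $((s + 28)))"
    break
  fi
done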
Docomo Cellular MTU #
On a Docomo cellular router, as of 2024-02, the real MTU is 1472 on both the IPv4 and IPv6 links.
Docomo’s router does not advertise this value to the connected machines, but we can determine it with the ping command.
It shows that the maximum payload sizes that pass when pinging Cloudflare’s DNS servers are -s 1444 over IPv4 and -s 1424 over IPv6, both corresponding to a 1472-byte packet.
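These are the probes behind that observation, mirroring the tl;dr commands above:
# IPv4: 1444 + 20 + 8 = 1472
ping -v -M do -c 8 -s 1444 1.1.1.1
# IPv6: 1424 + 40 + 8 = 1472
ping -v -M do -c 8 -s 1424 2606:4700:4700::1111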
Note: Since WireGuard does not reliably detect the uplink MTU, it is better to set the WireGuard MTU manually to <= 1392 (= 1472 - 80). Anything above that leads to intermittent packet loss on the WireGuard interface.
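With wg-quick, the value can be pinned in the interface config so it survives restarts. A minimal sketch; the interface name, address, and key placeholder are assumptions:
# /etc/wireguard/wg0.conf
[Interface]
Address = 10.0.0.1/24            # example overlay address
PrivateKey = <your-private-key>
MTU = 1392                       # 1472 (uplink) - 80 (WireGuard overhead)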
As for why the real MTU is slightly smaller than 1500, one guess is that packets between my Docomo router and the upstream CGNAT traverse an IPv6-only network, and all IPv4 packets get encapsulated in a 4-in-6 tunnel, possibly Dual-Stack Lite (see DS-Lite on Wikipedia).
This should happen transparently for the downstream clients. That is mostly true, but if the rest of the IPv6-only infrastructure has to use standard-sized packets and the carrier has not implemented datagram fragmentation and reassembly, the effective MTU for downstream clients has to be slightly reduced.
An additional clue pointing to DS-Lite encapsulation is that the router’s IPv4 address on the WAN port is usually 192.0.0.N, which belongs to the special CIDR block reserved for DS-Lite (see IETF RFC 6333).
WireGuard passing through Tailscale #
Tailscale creates an overlay network with MTU 1280. This means that a WireGuard link that passes through a Tailscale tunnel (full zero-trust) must have MTU <= 1200 (= 1280 - 80).
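To confirm the limit, a DF ping across the Tailscale network should pass at exactly the 1280-byte packet size; 100.101.102.103 stands in for a peer’s Tailscale address:
# 1252 + 20 + 8 = 1280, the Tailscale interface MTU
ping -v -M do -c 8 -s 1252 100.101.102.103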
Refs:
- See the “MTU operational considerations” section in https://en.wikipedia.org/wiki/WireGuard
GRE #
Generic Routing Encapsulation (GRE) allows transmitting data packets within regular IP packets when bridging nodes in a multi-site subnet.
Since GRE encapsulation adds 24 bytes of overhead (a 20-byte outer IP header plus the 4-byte GRE header), when GRE is used within other overlay networks, e.g. within WireGuard, the effective MTU of such communication is reduced by an additional 24 bytes.
GRE is convenient when the underlay network is trusted.
Note: Unlike WireGuard, GRE does not encrypt data. Thus, when traffic is sent between sites over the internet, encryption should be implemented at the level of the underlying protocol bridging the sites.
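A minimal sketch of a GRE tunnel on Linux riding inside a WireGuard overlay; all interface names and addresses are assumptions:
# endpoints are the two sites' WireGuard overlay addresses
ip tunnel add gre1 mode gre local 10.0.0.1 remote 10.0.0.2 ttl 64
ip link set gre1 mtu 1368          # 1392 (WireGuard MTU) - 24 (GRE overhead)
ip addr add 172.16.0.1/30 dev gre1
ip link set gre1 up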
Refs:
- https://www.geeksforgeeks.org/computer-networks/generic-routing-encapsulation-gre-tunnel/
- https://www.cloudflare.com/learning/network-layer/what-is-gre-tunneling/
Layer 2 Overlay Networks #
This subsection is mostly of academic/research interest, since I don’t think there are many real-world use cases where systems running modern apps operate at Layer 2 or require specific network configuration at this level.
GRE, WireGuard, and Tailscale (which uses WireGuard underneath) overlay networks provide Layer 3 connectivity. However, occasionally we may want to experiment with a virtual Layer 2 interface. Various protocols can be used to implement one: VXLAN (Virtual Extensible LAN), NVGRE (Network Virtualization using Generic Routing Encapsulation), and others.
VXLAN is supported by modern Linux kernels and is easy to configure with ip link ... (Google instructions for Debian; see also the sketch at the end of the next subsection).
In general, the standards for Layer 2 and Layer 3 overlay networks are being discussed within the Network Virtualization Overlays Working Group (nvo3).
Refs:
- https://datatracker.ietf.org/wg/nvo3/about/
VXLAN and MTU #
Notice that VXLAN encapsulation adds 50 bytes of overhead (outer Ethernet, IP, UDP, and VXLAN headers), so when creating an overlay interface of type VXLAN, we should set its MTU 50 bytes lower than that of the underlying network.
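A minimal ip link sketch for a point-to-point VXLAN over a 1500-byte underlay; the interface names, VNI, and addresses are assumptions:
ip link add vx0 type vxlan id 42 dev eth0 remote 192.0.2.2 dstport 4789
ip link set vx0 mtu 1450           # 1500 (underlay) - 50 (VXLAN overhead)
ip addr add 192.168.100.1/24 dev vx0
ip link set vx0 up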