Theo Julienne
Software & Infrastructure Engineer

Ethernet MTU and TCP MSS: Why connections stall

Aug 21, 2020

MTU and MSS are two terms that are easily confused with each other, and their misconfiguration is often the cause of networking problems. Spending enough time working on production systems that interface with large networks of computers or the Internet almost guarantees coming across situations where an interface was configured with the wrong MTU, or a firewall was filtering ICMP. The result is a client that is unable to transfer large amounts of data even though smaller transfers work fine. This post walks through MTU, MSS and packet size negotiation for TCP connections, and the common situations where it breaks down. It was inspired by multiple discussions during the course of investigating errors on production systems as part of my role at GitHub.

If you want to take away a simple snippet from this post, the summary is:

The MTU of the interfaces on either side of a physical or logical link must be equal. Don't block ICMP.

The examples mentioned in this blog post will be reproducible in the lab from 
theojulienne/blog-lab-mtu - clone this repository and bring up the lab, then poke around at
these examples in a real system:

$ git clone https://github.com/theojulienne/blog-lab-mtu.git

$ cd blog-lab-mtu

$ vagrant up

$ vagrant ssh

Ethernet MTU: Maximum Transmission Unit


The MTU (Maximum Transmission Unit) on an Ethernet network specifies the maximum payload size that can be transmitted alongside an Ethernet header in a single frame. Typically this payload will be an IP packet, in which case the MTU specifies the maximum combined size of the IP header and IP data.

The MTU is specified at the interface level, as it is a link-level setting, and is typically propagated down to the underlying network card driver. Packets that appear on the wire larger than this configured size are treated as invalid or corrupt and are dropped. In a valid configuration, hosts connected together via a link will have the same MTU specified:

If interfaces on either side of a link have mismatching MTU configurations, then the smaller
side will treat packets larger than the local MTU as invalid and drop the packets before any
software has the chance to see them.

Streams of data that are larger than the MTU will be broken up into packets that completely
fill an Ethernet frame, up to the MTU in each. If the remote end has a smaller MTU
configured for the same link, those larger packets will be dropped. MTU should be configured
the same on both interfaces on either side of a link, and so the MTU should be considered a
bidirectional maximum.
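
On Linux, the MTU can be inspected and changed with the ip tool. A minimal sketch, assuming an interface named eth0 (substitute the interface on your own system) rather than anything from the lab below:

# show the interface, including its current MTU (look for "mtu 1500" in the output)
$ ip link show eth0

# change the MTU - the same change must be made on the interface at the other end of the link
$ sudo ip link set dev eth0 mtu 9000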

MTU in the lab

The lab in this blog post can be used to observe this in an example system. In one terminal,
bring up the lab hosts inside the Vagrant machine:

$ vagrant ssh -- /vagrant/bin/run-lab

In another terminal, vagrant ssh then enable the first scenario from above with matching
MTU of 1500 on client and server:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario direct_1500

Log in to the client and server hosts and observe that we can send a packet with 1400 bytes
of payload as expected, since both hosts have an MTU of 1500. The -s 1400 argument to
ping sets the payload size, and the -M do argument instructs ping to set the DF (Don’t
Fragment) bit, ensuring that the whole IP packet must arrive in one piece or not at all.

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client

root@client:/# ping -c 2 -M do -s 1400 server-direct

PING server-direct (172.28.0.40) 1400(1428) bytes of data.

1408 bytes from server-direct (172.28.0.40): icmp_seq=1 ttl=64 time=0.096 ms

1408 bytes from server-direct (172.28.0.40): icmp_seq=2 ttl=64 time=0.080 ms

--- server-direct ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 2ms

rtt min/avg/max/mdev = 0.080/0.088/0.096/0.008 ms

root@client:/# exit

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# ping -c 2 -M do -s 1400 client-direct

PING client-direct (172.28.0.10) 1400(1428) bytes of data.

1408 bytes from client-direct (172.28.0.10): icmp_seq=1 ttl=64 time=0.079 ms

1408 bytes from client-direct (172.28.0.10): icmp_seq=2 ttl=64 time=0.071 ms

--- client-direct ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 32ms

rtt min/avg/max/mdev = 0.071/0.075/0.079/0.004 ms

root@server:/#

Now switch to the second scenario with mismatching MTU, and observe that 1400 byte
payloads no longer succeed:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario direct_mismatch

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client

root@client:/# ping -c 2 -M do -s 1400 server-direct

PING server-direct (172.28.0.40) 1400(1428) bytes of data.

ping: local error: Message too long, mtu=1200

ping: local error: Message too long, mtu=1200

--- server-direct ping statistics ---

2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 10ms

root@client:/# exit

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# ping -c 2 -M do -s 1400 client-direct

PING client-direct (172.28.0.10) 1400(1428) bytes of data.

--- client-direct ping statistics ---

2 packets transmitted, 0 received, 100% packet loss, time 19ms

root@server:/#

Notice that the client host is immediately able to observe that it cannot send a packet this large, since the MTU on its interface is 1200. The server host, however, believes the MTU of the link is 1500, so it sends the packet, which the client is unable to receive. This occurs at such a low level that neither host is aware of the failure - the packet just disappears.

TCP MSS: Maximum Segment Size


The TCP MSS (Maximum Segment Size) sounds very similar to MTU, and since it relates to
the maximum size of network packets, they are easy to conflate even though they are quite
different. A TCP segment is the TCP header and TCP data that forms part of a single packet.
The MSS specifies the expected maximum size of the data component of this segment that a
host expects it would be able to receive without the IP packet being fragmented. IP
fragmentation is typically disabled for TCP packets on modern networking stacks due to the
added complexity and overhead, so the MSS represents the maximum size that the host
expects to be able to receive in any given packet.

Rather than being an interface-level configuration like MTU, the MSS advertisement forms a
part of the typical TCP handshake and is calculated based on the underlying MTU of the
interface that a local host will use to communicate with a remote host. MSS can be thought of
as a TCP hint around how much data can be included in a single TCP packet, given the
current MTU. Each host calculates the MSS it will advertise by taking the local MTU and
subtracting the size of the IP and TCP headers, then includes that MSS in the TCP options of
the SYN or SYN-ACK packet as part of the TCP three-way handshake.

This is not a negotiation of a single MSS; rather, each host is giving the remote host an indication of the maximum size of a single packet it expects to be able to receive. This number can be no larger than the local MTU minus the IP/TCP headers, since no larger packet could arrive given the local MTU. Each host will use the remote host's advertised MSS as a hint for how large individual outgoing packets should be. Since this is a configurable hint, it is also only unidirectional: although a host may advertise a lower MSS than it can otherwise handle, that doesn't in any way restrict it from sending packets larger than the MSS it advertised (provided the remote host allowed for it).
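
On Linux there are a couple of quick ways to see MSS values in practice without tracing a full connection. A sketch, assuming an interface named eth0; the exact output fields vary by kernel and iproute2 version:

# show per-connection TCP internals for established sockets - the "mss" (and, on newer
# kernels, "pmtu") fields show what each connection is actually using
$ ss -ti

# or capture only SYN and SYN-ACK packets to see the MSS option each side advertises
# during the handshake
$ tcpdump -n -i eth0 'tcp[tcpflags] & tcp-syn != 0'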

The simple MSS exchange happens to work around small misconfigurations of MTU, such as
the trivial example described above:

In this case, the client would advertise an MSS of 1200 (MTU) - 20 (IP hdr) - 20 (TCP hdr) = 1160 , which would cause the server to refrain from sending packets containing more than 1160 bytes of TCP payload. This ensures each packet still fits within the MTU of 1200 once those headers are added back on.

However, the above network is still misconfigured, since even though TCP happens to work
around it, other protocols will fail since they don’t exchange MSS values. MSS is actually
intended to allow hosts to work around valid configurations where their own local networks
have different MTU, such as the following:

In this example, if the server with a valid MTU of 9000 attempted to send an Ethernet frame containing more than 1500 bytes without fragmentation being allowed, that packet would not be able to make it to the client. The intermediary router, the first host aware of this problem since it knows the MTU of both links, would send an ICMPv4 “Fragmentation required, but DF set” message or an ICMPv6 “Packet Too Big” message back to the sender, informing it that the packet cannot be forwarded without being broken up (and that the IP header had the DF, or Don’t Fragment, bit set).

However, TCP will succeed in unrestricted communications between these hosts due to the
MSS advertisements. The server in this configuration will receive an MSS from the client that
will ensure no Ethernet frames with a payload larger than 1500 bytes are generated, so they
will be received successfully.

MSS in the lab

Select the scenario from above with the client on a network with 1500 MTU and the server on
a network with 9000 MTU:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario client_net_smaller

Running a ping from the side with the larger MTU, we can observe that packets larger than
the client’s MTU cause the intermediary router to return an ICMP message since it is unable
to forward the packet:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# ping -c 2 -M do -s 3000 client

PING client (172.29.0.10) 3000(3028) bytes of data.

From 172.30.0.20 icmp_seq=1 Frag needed and DF set (mtu = 1500)

ping: local error: Message too long, mtu=1500

--- client ping statistics ---

2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 18ms

pipe 2

root@server:/#

In a slightly more complex example, open up a few terminals and spin up a simple HTTP
server that sends a large payload and observe in a tcpdump that the MSS advertisements
allow the connection to succeed despite the differing MTU:

# reset everything so Linux doesn't remember that ICMP frag message from above

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# sample-http-server

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# tcpdump -i any icmp or port 80

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client

root@client:/# curl http://server/

The tcpdump should return something like the following - note the MSS advertised by each side in the first 2 SYN packets is the MTU minus the IP and TCP header size of 40 bytes - mss 1460 and mss 8960 . The packets with an HTTP payload are broken into smaller packets with a TCP segment of just 1448 bytes - small enough to fit inside an MTU of 1500 alongside the IP and TCP headers plus the 12 additional bytes of TCP options (you can observe those options where it says [nop,nop,TS val 3808245569 ecr 3455879517] ).

IP client.51424 > server.80: Flags [S], seq 4195639166, win 64240, options [mss 1460,sackOK
IP server.80 > client.51424: Flags [S.], seq 3403777541, ack 4195639167, win 62636, options
IP client.51424 > server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3456553939
IP client.51424 > server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val
IP server.80 > client.51424: Flags [.], ack 71, win 978, options [nop,nop,TS val 3808919991
IP server.80 > client.51424: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS va
IP client.51424 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 34565539
IP server.80 > client.51424: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51424: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51424: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51424: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,T
IP client.51424 > server.80: Flags [.], ack 1562, win 1002, options [nop,nop,TS val 3456553
IP client.51424 > server.80: Flags [.], ack 3010, win 995, options [nop,nop,TS val 34565539
IP client.51424 > server.80: Flags [.], ack 4458, win 984, options [nop,nop,TS val 34565539
IP client.51424 > server.80: Flags [.], ack 4915, win 980, options [nop,nop,TS val 34565539
IP client.51424 > server.80: Flags [F.], seq 71, ack 4915, win 1002, options [nop,nop,TS va
IP server.80 > client.51424: Flags [F.], seq 4915, ack 72, win 978, options [nop,nop,TS val
IP client.51424 > server.80: Flags [.], ack 4916, win 1002, options [nop,nop,TS val 3456553

One interesting note is that on many modern network devices, running a packet capture may result in tcpdump and similar tools observing packets that appear larger than the configured MTU, due to Large Receive Offload, Large Send Offload and other technologies which coalesce multiple packets that are part of the same flow into a single pseudo-packet. On receive, the network card will coalesce subsequent packets from a stream together before passing them to the kernel as a single packet for faster processing. On send, the kernel will provide one larger packet that the network card will split appropriately as it sends over the wire, based on the configured MSS.

This packet coalescing has been intentionally disabled in the lab to make it simpler to observe when packets are being split up on the (virtual) wire. If the same example were run on a real server, the HTTP payload would likely appear to tcpdump as a single larger packet, though it would still be broken up the same way on the wire.
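
If you want to see on-the-wire sized packets in a capture on a real machine, these offloads can be inspected and temporarily toggled with ethtool. A sketch, assuming an interface named eth0; disabling offloads has a real performance cost and some NICs don't allow every feature to be changed, so treat this purely as a debugging aid:

# list the offload features currently enabled on the interface
$ ethtool -k eth0

# temporarily disable send- and receive-side coalescing so tcpdump sees individual packets
$ sudo ethtool -K eth0 tso off gso off gro off lro off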

Path MTU: Hidden bottlenecks


Although in the above example, TCP MSS was able to work around a simple configuration
where hosts had valid but differing MTUs on their links, this is still not a complete solution as
there may be additional intermediary links involved with an MTU that is lower than either the
client or server link.

In this example, Ethernet payloads larger than 1200 bytes from either side cannot be forwarded past the first hop (if IP fragmentation is disabled). However, both the client and the server will advertise an MSS that allows Ethernet payloads larger than 1200 bytes to be sent.

With full visibility of the network, using a diagram like we have here, we can see that packets
can only make it between client and server if they are no more than 1200 bytes including
headers. This is the Path MTU, or the minimum MTU of all links on the path between
communicating hosts. In practice, where hosts are communicating arbitrarily over the Internet and multiple paths could be available between them, we don't have visibility into the full system and therefore cannot put a specific number on the Path MTU up front. Instead, hosts must be able to discover the Path MTU during existing communications, as the need arises.

Path MTU Discovery

Path MTU Discovery is the process of hosts working from the local MTU and the remote
initial MSS advertisement as hints, and arriving at the actual Path MTU of the (current) full
path between those hosts in each direction.

The process starts by assuming that the advertised MSS is correct for the full path, after
reducing it if the local link’s MTU minus IP/TCP header size is smaller (since we couldn’t
send a larger packet regardless of the MSS). When a packet is sent that is larger than the
smallest link along the path, it will at least make it one hop to the first router, since we know
the local link MTU is large enough to fit it.

When a router receives a packet on one interface and needs to forward it to another interface
that the packet cannot fit on, the router sends back an ICMPv4 “Fragmentation required”
message or an ICMPv6 “Packet Too Big” message. The router includes the MTU of the next
(smaller) hop in that message, since it knows it. Upon receipt of that message, the originating
host is able to reduce the calculated Path MTU for communications with that remote host,
and resend the data as multiple smaller packets. From then on, packet size is correctly
limited by the size of the MTU of the smallest link in the path observed so far.
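
On Linux, the discovered Path MTU is cached per destination for a period of time. A quick sketch of how to inspect and reset it, using a placeholder destination address:

# after a "Frag needed" ICMP message has been processed, the kernel records a route
# exception for that destination - look for an "mtu" value on the cache line
$ ip route get 172.31.0.40

# clear cached route exceptions to force Path MTU Discovery to start again
$ sudo ip route flush cache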

A full example is below, though note that in practice there may not be complete symmetry in
the path in each direction, multiple hops may progressively have smaller MTU values along
the way, and the path may even change throughout the lifetime of a single connection:

This example shows how critical it is for TCP that ICMP messages of this type are forwarded correctly. This exchange is where most MTU-related problems occur in production systems: firewalls along the path block or throttle ICMP traffic in a way that inhibits Path MTU Discovery. Don't block ICMP - doing so breaks Path MTU Discovery, and with it any TCP connection transferring enough data that the initial MSS advertisement alone isn't enough to limit packets to the Path MTU. At the very least, don't block ICMPv4 “Fragmentation required” or ICMPv6 “Packet Too Big”, even if you block other ICMP messages.
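
If a firewall does need to filter ICMP broadly, the critical message types can still be allowed through explicitly. A minimal sketch using iptables/ip6tables; the chain (FORWARD here) should be adjusted to match your firewall layout, and these rules must come before any rule that drops ICMP:

# allow ICMPv4 "Fragmentation required, but DF set" (destination unreachable, code 4)
$ sudo iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT

# allow ICMPv6 "Packet Too Big" (type 2)
$ sudo ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT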

The common traceroute utility discovers the hops between hosts by increasing the TTL and observing the TTL Exceeded messages that come back; the tracepath utility extends this technique to also show the Path MTU along the way (as well as the hops). tracepath sends large packets, starting at the maximum sendable on the local link, to a remote host and shows any ICMP messages and the adjusted Path MTU as it gradually increases the TTL and decreases the packet size. tracepath is a good first place to start when diagnosing issues observed between 2 hosts where MTU misconfiguration or ICMP filtering is suspected.

Path MTU Discovery in the lab

Select the scenario from above with the client on a network with 1500 MTU, the server on a
network with 9000 MTU, and an additional intermediary network with 1200 MTU that packets
must traverse:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/scenario hidden_smaller

Observe that neither side can immediately ascertain the correct Path MTU and must see an
ICMP message from the intermediary router before they become aware of the smaller link:

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell client

root@client:/# ping -c 2 -M do -s 1400 server

PING server (172.31.0.40) 1400(1428) bytes of data.

From vagrant_router-a_1.vagrant_client_router_a (172.29.0.20) icmp_seq=1 Frag needed and DF set (mtu = 1200)

ping: local error: Message too long, mtu=1200

--- server ping statistics ---

2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2ms

root@client:/# exit

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# ping -c 2 -M do -s 1400 client

PING client (172.29.0.10) 1400(1428) bytes of data.

From vagrant_router-b_1.vagrant_router_b_server (172.31.0.30) icmp_seq=1 Frag needed and DF set (mtu = 1200)

ping: local error: Message too long, mtu=1200

--- client ping statistics ---

2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms

root@server:/#

Bringing up the example HTTP server from earlier, we can also observe the full process of Path MTU Discovery. In this case, note that we take the tcpdump from router-b on its interface towards the server, since it has a better vantage point for observing retransmits.

# reset everything so Linux doesn't remember that ICMP frag message from above

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell server

root@server:/# sample-http-server

vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/shell router-b # better vantage point

root@router-b:/# tcpdump -i eth1 icmp or port 80

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client

root@client:/# curl http://server/

The tcpdump will return something like the following:

IP client.51428 > server.80: Flags [S], seq 644598568, win 64240, options [mss 1460,sackOK,
IP server.80 > client.51428: Flags [S.], seq 1840446146, ack 644598569, win 62636, options
IP client.51428 > server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3457600578
IP client.51428 > server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val
IP server.80 > client.51428: Flags [.], ack 71, win 978, options [nop,nop,TS val 3809966630
IP server.80 > client.51428: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS va
IP client.51428 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 34576005
IP server.80 > client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

IP server.80 > client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

IP server.80 > client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

IP server.80 > client.51428: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,T
IP client.51428 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 34576005
IP server.80 > client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS
IP client.51428 > server.80: Flags [.], ack 1262, win 993, options [nop,nop,TS val 34576005
IP client.51428 > server.80: Flags [.], ack 2410, win 976, options [nop,nop,TS val 34576005
IP client.51428 > server.80: Flags [.], ack 3558, win 967, options [nop,nop,TS val 34576005
IP client.51428 > server.80: Flags [.], ack 4915, win 970, options [nop,nop,TS val 34576005
IP server.80 > client.51428: Flags [F.], seq 4915, ack 71, win 978, options [nop,nop,TS val
IP client.51428 > server.80: Flags [F.], seq 71, ack 4916, win 1002, options [nop,nop,TS va
IP server.80 > client.51428: Flags [.], ack 72, win 978, options [nop,nop,TS val 3809966652

Note that the advertised MSS values are the same as in the earlier example; they do not reflect the Path MTU since it is not yet known. Each of the initial large packets sent from the server to the client causes an ICMP fragmentation message:

IP server.80 > client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

IP server.80 > client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

IP server.80 > client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556

The server then resends the failing packets, this time respecting the newly calculated Path
MTU of 1200:

IP server.80 > client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS

We can also use tracepath to perform Path MTU Discovery and observe which routers are
responding - the tool starts with the local network’s MTU then discovers the reduced MTU
link as it progresses:

# reset everything so Linux doesn't remember that ICMP frag message from above

vagrant@blog-lab-mtu:~$ /vagrant/bin/flush-all-route-caches

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell client

root@client:/# tracepath -n server


1?: [LOCALHOST] pmtu 1500

1: 172.29.0.20 0.088ms

1: 172.29.0.20 0.032ms

2: 172.29.0.20 0.030ms pmtu 1200

2: 172.30.0.30 0.046ms

3: 172.31.0.40 0.058ms reached

Resume: pmtu 1200 hops 3 back 3

root@client:/# exit

vagrant@blog-lab-mtu:~$ /vagrant/bin/shell server

root@server:/# tracepath -n client


1?: [LOCALHOST] pmtu 9000

1: 172.31.0.30 0.159ms

1: 172.31.0.30 0.034ms

2: 172.31.0.30 0.027ms pmtu 1200

2: 172.30.0.20 0.092ms

3: 172.29.0.10 0.060ms reached

Resume: pmtu 1200 hops 3 back 3

root@server:/#

The tracepath tool is extremely useful for determining whether a connection-stalling failure is indeed a Path MTU blackhole caused by a router or firewall blocking ICMP packets.
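
If tracepath isn't available, the same ping flags used throughout this post can be used to probe for a blackhole manually; a sketch against a hypothetical remote host:

# 1472 bytes of payload plus 8 bytes of ICMP header and 20 bytes of IP header exactly
# fills a 1500-byte MTU; if small pings work but this one gets no reply at all
# (not even a "Frag needed" error), suspect a Path MTU blackhole
$ ping -c 2 -M do -s 1472 remote-host

# step the payload size down until replies return, to bracket the real Path MTU
$ ping -c 2 -M do -s 1172 remote-host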

Path MTU Discovery and Anycast


One final complexity occurs when routers have multiple equal-cost paths (ECMP) to multiple
hosts that share the same IP address, a common situation with deployments of Anycast. In
this case, routers hash packets across the different available paths and attempt to be
consistent so that packets from the same connection arrive on the same remote host (and/or
travel via the same path).

However, the hash input typically does not capture the fact that an ICMP fragmentation or Packet Too Big message is related to the TCP connection that triggered it, since the IP source differs from a normal returning packet: it is a router along the way, not the expected remote host. This leads to a situation where one host receives the TCP packets for a connection, while another, unrelated host receives the ICMP packet relating to that connection and disregards it. This introduces a Path MTU blackhole, just as if ICMP were being filtered.

In practice, there are ways to work around this issue. One way is to broadcast those ICMP messages to all hosts. An alternative approach is used by GLB Director, which lets the routers perform the ICMP-unaware ECMP hashing but then re-hashes correctly at the first software load balancer layer. GLB inspects inside ICMP messages, since they contain part of the triggering packet, and hashes them the same way it would hash the original TCP packet that triggered them, ensuring ICMP messages land on the same host as the related TCP connection. In general, it's important that any system
involving hashing or otherwise manipulating TCP packets ensures that ICMP messages
relating to the stream are sent to the appropriate host, as they are a crucial part of the way
that TCP operates.

Wrapping up
It is often possible to ignore the details of MTU, MSS advertisement and Path MTU Discovery and have things continue to work to a certain extent. However, when these systems fail, connections stall entirely, in a way that is very visible to users. This is often seen only on large transfers, since smaller data transfers don't trigger the issue - the packets remain small. It's also often intermittent in cases where only one path between hosts has a reduced Path MTU, or just one path has a router blocking ICMP packets.

Thankfully, the rule for keeping networks functioning correctly with regard to MTU can be summarised simply as:

The MTU of the interfaces on either side of a physical or logical link must be equal. Don't block ICMP.

Asking whether this rule holds true both internally and externally, in any trouble ticket with the pattern of “Why is my connection stalling when (action that transfers large data) but not when (action that transfers small data)?”, will almost always surface an MTU misconfiguration or ICMP filtering as the root cause.
