Ethernet MTU and TCP MSS - Why Connections Stall
Theo Julienne
Software & Infrastructure Engineer
MTU and MSS are two terms that are easily confused, and their misconfiguration is often the cause of networking problems. Spending enough time working on production systems that interface with large networks of computers or the Internet almost guarantees coming across situations where an interface was configured with the wrong MTU, or a firewall was filtering ICMP, resulting in a client being unable to transfer large amounts of data even though smaller transfers work fine. This post walks through MTU, MSS, and packet size negotiation for TCP connections, and the common situations where it breaks down. It was inspired by multiple discussions during the course of investigating errors on production systems as part of my role at GitHub.
If you want to take away a simple rule from this post, it is this: configure the same MTU on both sides of every link, and never block ICMPv4 "Fragmentation required" or ICMPv6 "Packet Too Big" messages.
The examples mentioned in this blog post are reproducible in the lab from theojulienne/blog-lab-mtu - clone this repository and bring up the lab, then poke around:
$ git clone https://github.com/theojulienne/blog-lab-mtu.git
$ cd blog-lab-mtu
$ vagrant up
$ vagrant ssh
MTU
The MTU is specified at the interface level, as it is a link-level setting, and is typically propagated down to the underlying network card driver. Packets that appear on the wire larger than this configured size are expected to be invalid or corrupt, and are dropped. In a valid configuration, hosts connected together via a link will have the same MTU specified:
If the interfaces on either side of a link have mismatched MTU configurations, then the smaller side will treat packets larger than its local MTU as invalid and drop them before any software has a chance to see them.
Streams of data larger than the MTU will be broken up into packets that each fill an Ethernet frame, up to the MTU. If the remote end has a smaller MTU configured for the same link, those larger packets will be dropped. The MTU should be configured the same on both interfaces on either side of a link, and so should be considered a bidirectional maximum.
The lab in this blog post can be used to observe this in an example system. In one terminal,
bring up the lab hosts inside the Vagrant machine:
In another terminal, vagrant ssh into the machine, then enable the first scenario from above, with a matching MTU of 1500 on client and server:
Log in to the client and server hosts and observe that we can send a packet with 1400 bytes
of payload as expected, since both hosts have an MTU of 1500. The -s 1400 argument to
ping sets the payload size, and the -M do argument instructs ping to set the DF (Don’t
Fragment) bit, ensuring that the whole IP packet must arrive in one piece or not at all.
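As a quick sanity check, the arithmetic behind those ping flags can be sketched in a few lines. This is a simplified model of my own (assuming IPv4 without options and an 8-byte ICMP echo header), not part of the lab:

```python
ICMP_HEADER = 8   # ICMPv4 echo request header
IPV4_HEADER = 20  # IPv4 header without options

def fits_in_mtu(payload: int, mtu: int) -> bool:
    """Would `ping -s <payload> -M do` succeed on a link with this MTU?

    With DF set, the whole IP packet (payload + ICMP + IP headers) must
    fit within the MTU, or it is dropped rather than fragmented.
    """
    return payload + ICMP_HEADER + IPV4_HEADER <= mtu

print(fits_in_mtu(1400, 1500))  # True: a 1428-byte packet fits in 1500
print(fits_in_mtu(1400, 1200))  # False: too large for a 1200 MTU
print(fits_in_mtu(1472, 1500))  # True: exactly fills the MTU
print(fits_in_mtu(1473, 1500))  # False: one byte over
```

This is also why the largest `-s` value that works on a standard 1500 MTU Ethernet link is 1472.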
Now switch to the second scenario with mismatching MTU, and observe that 1400 byte
payloads no longer succeed:
Notice that the client host immediately observes that it cannot send a packet this large, since the MTU on its interface is 1200. The server host, however, believes the MTU of the link is 1500 and sends the packet anyway, which the client is unable to receive. This occurs at such a low level that neither host is aware of the failure - the packet just disappears.
MSS
Rather than being an interface-level configuration like MTU, the MSS advertisement forms part of the typical TCP handshake and is calculated from the MTU of the interface a local host will use to communicate with a remote host. MSS can be thought of as a TCP hint about how much data can be included in a single TCP packet, given the current MTU. Each host calculates the MSS it will advertise by taking the local MTU and subtracting the size of the IP and TCP headers, then includes that MSS in the TCP options of the SYN or SYN-ACK packet as part of the TCP three-way handshake.
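That calculation is simple enough to sketch, assuming IPv4 and TCP headers of 20 bytes each with no options:

```python
IPV4_HEADER = 20  # IPv4 header without options
TCP_HEADER = 20   # base TCP header, before options

def advertised_mss(local_mtu: int) -> int:
    """MSS a host includes in its SYN/SYN-ACK: the local MTU minus
    the IP and TCP header sizes."""
    return local_mtu - IPV4_HEADER - TCP_HEADER

print(advertised_mss(1500))  # 1460
print(advertised_mss(9000))  # 8960
print(advertised_mss(1200))  # 1160
```

These are exactly the mss 1460 and mss 8960 values that show up in the tcpdump captures later in this post.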
This is not a negotiation of a single MSS; rather, each host gives the remote host an indication of the maximum size of a single packet it expects can be sent back. This number must be no more than the MTU minus the IP/TCP headers, since no larger packet could arrive given the local MTU. Each host will use the remote host's advertised MSS as a hint for how large individual outgoing packets should be. Since this is a hint, it is also only unidirectional: although a host may advertise a lower MSS than it can otherwise handle, nothing restricts it from sending packets larger than the MSS it advertised (provided the remote host allows for it).
The simple MSS exchange happens to work around small misconfigurations of MTU, such as
the trivial example described above:
In this case, the client would advertise an MSS of 1200 (MTU) - 20 (IP hdr) - 20 (TCP hdr) = 1160, which would cause the server to refrain from sending packets containing more than 1160 bytes of TCP payload, ensuring each packet fits within the MTU of 1200 once those headers are added back on.
However, the above network is still misconfigured: even though TCP happens to work around it, other protocols that don't exchange MSS values will fail. MSS is really intended to handle valid configurations where hosts' own local networks have different MTUs, such as the following:
In this example, if the server, with a valid MTU of 9000, attempted to send an Ethernet frame containing more than 1500 bytes without fragmentation being allowed, that packet would not be able to reach the client. The intermediary router, the first host aware of the problem since it knows the MTU of both links, would send an ICMPv4 "Fragmentation required, but DF set" message or an ICMPv6 "Packet Too Big" message back to the sender, informing it that the packet cannot be forwarded without being broken up (and that its IP header had the DF, or Don't Fragment, bit set).
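For the curious, the key information in the ICMPv4 message is easy to pick out of the wire format: "Fragmentation required" is type 3, code 4, with the next-hop MTU in bytes 6-7 of the ICMP header (per RFC 1191). A minimal sketch, where the message bytes are hypothetical ones I've constructed as such a router might:

```python
import struct

def parse_frag_needed(icmp: bytes):
    """Parse an ICMPv4 header and return the next-hop MTU if this is a
    'Fragmentation required, but DF set' message (type 3, code 4),
    else None. Layout: type, code, checksum, unused, next-hop MTU."""
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp[:8]
    )
    if icmp_type == 3 and code == 4:
        return next_hop_mtu
    return None

# A hypothetical message as the intermediary router might send it:
# type 3, code 4, zeroed checksum/unused, next-hop MTU of 1500.
msg = struct.pack("!BBHHH", 3, 4, 0, 0, 1500)
print(parse_frag_needed(msg))  # 1500
```

The remainder of a real message carries the IP header and first bytes of the packet that triggered it, which becomes important later in this post.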
However, TCP will succeed in unrestricted communications between these hosts due to the
MSS advertisements. The server in this configuration will receive an MSS from the client that
will ensure no Ethernet frames with a payload larger than 1500 bytes are generated, so they
will be received successfully.
Select the scenario from above with the client on a network with 1500 MTU and the server on
a network with 9000 MTU:
Running a ping from the side with the larger MTU, we can observe that packets larger than
the client’s MTU cause the intermediary router to return an ICMP message since it is unable
to forward the packet:
In a slightly more complex example, open up a few terminals, spin up a simple HTTP server that sends a large payload, and observe the MSS advertisements in a tcpdump:
# reset everything so Linux doesn't remember that ICMP frag message from above
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches
root@server:/# sample-http-server
The tcpdump should return something like the following - note the MSS advertised by each side in the first 2 SYN packets is the MTU minus the IP and TCP header size of 40 bytes - mss 1460 and mss 8960. The packets with an HTTP payload are broken into smaller packets with a TCP segment of just 1448 bytes - small enough to fit inside an MTU of 1500 alongside the IP and TCP headers plus 12 additional bytes of TCP options (you can observe those options where it says [nop,nop,TS val 3808245569 ecr 3455879517]).
IP client.51424 > server.80: Flags [S], seq 4195639166, win 64240, options [mss 1460,sackOK
IP server.80 > client.51424: Flags [S.], seq 3403777541, ack 4195639167, win 62636, options
IP client.51424 > server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3456553939
IP client.51424 > server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val
IP server.80 > client.51424: Flags [.], ack 71, win 978, options [nop,nop,TS val 3808919991
IP server.80 > client.51424: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS va
IP client.51424 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 34565539
IP server.80 > client.51424: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51424: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51424: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51424: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,T
IP client.51424 > server.80: Flags [.], ack 1562, win 1002, options [nop,nop,TS val 3456553
IP client.51424 > server.80: Flags [.], ack 3010, win 995, options [nop,nop,TS val 34565539
IP client.51424 > server.80: Flags [.], ack 4458, win 984, options [nop,nop,TS val 34565539
IP client.51424 > server.80: Flags [.], ack 4915, win 980, options [nop,nop,TS val 34565539
IP client.51424 > server.80: Flags [F.], seq 71, ack 4915, win 1002, options [nop,nop,TS va
IP server.80 > client.51424: Flags [F.], seq 4915, ack 72, win 978, options [nop,nop,TS val
IP client.51424 > server.80: Flags [.], ack 4916, win 1002, options [nop,nop,TS val 3456553
One interesting note is that on many modern network devices, running a packet capture may result in tcpdump and similar tools observing packets that appear larger than the configured MTU, due to Large Receive Offload, Large Send Offload, and other technologies that coalesce multiple packets from the same flow into a single pseudo-packet. On receive, the network card will coalesce subsequent packets from a stream together before passing them to the kernel as a single packet for faster processing. On send, the kernel will provide one larger packet that the network card will split appropriately, based on the configured MSS, as it sends over the wire.
This packet coalescing has been intentionally disabled in the lab to make it simpler to observe when packets are split up on the (virtual) wire. If the same example were run on a real server, the HTTP payload would likely appear to tcpdump as a single larger packet, though it would still be broken up the same way on the wire.
In the next example, packets between client and server must traverse an intermediary network with an MTU of 1200. All Ethernet payloads larger than 1200 bytes from either side will not be able to be forwarded past the first hop (if IP fragmentation is disabled). However, both client and server will advertise an MSS that allows Ethernet payloads larger than 1200 bytes to be sent.
With full visibility of the network, using a diagram like we have here, we can see that packets can only make it between client and server if they are no more than 1200 bytes including headers. This is the Path MTU: the minimum MTU of all links on the path between communicating hosts. In practice, where hosts communicate arbitrarily over the Internet and multiple paths could be available between them, we don't have visibility into the full system, and so we cannot put a specific number on the Path MTU up front. Instead, hosts must be able to discover the Path MTU during existing communications, as the need arises.
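In other words, given full knowledge of the links (which real hosts don't have), the Path MTU is just a minimum:

```python
def path_mtu(link_mtus: list) -> int:
    """The Path MTU is the smallest MTU of any link along the path."""
    return min(link_mtus)

# The topology above: client link (1500), intermediary link (1200),
# server link (9000).
print(path_mtu([1500, 1200, 9000]))  # 1200
```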
Path MTU Discovery
Path MTU Discovery is the process by which hosts, using the local MTU and the remote host's initial MSS advertisement as hints, arrive at the actual Path MTU of the (current) full path between them in each direction.
The process starts by assuming that the advertised MSS is correct for the full path, after
reducing it if the local link’s MTU minus IP/TCP header size is smaller (since we couldn’t
send a larger packet regardless of the MSS). When a packet is sent that is larger than the
smallest link along the path, it will at least make it one hop to the first router, since we know
the local link MTU is large enough to fit it.
When a router receives a packet on one interface and needs to forward it out another interface that the packet cannot fit on, the router sends back an ICMPv4 "Fragmentation required" message or an ICMPv6 "Packet Too Big" message, including the MTU of the next (smaller) hop, which it knows. Upon receipt of that message, the originating host reduces its calculated Path MTU for communications with that remote host, and resends the data as multiple smaller packets. From then on, packet size is correctly limited by the MTU of the smallest link observed on the path so far.
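The whole process can be sketched as a simplified simulation of the sender's side. This is my own model, not lab code: it collapses the retransmit rounds into a single walk along the path, and assumes IPv4/TCP headers totalling 40 bytes:

```python
def discover_path_mtu(local_mtu: int, peer_mss: int, link_mtus: list) -> int:
    """Simulate PMTUD: start from the peer's advertised MSS (capped by
    the local MTU), then shrink the estimate each time a router on the
    path reports a smaller next-hop MTU via ICMP."""
    ip_tcp_headers = 40  # IPv4 + TCP, without options
    # Initial estimate: the smaller of our local MTU and the largest
    # packet the peer's MSS advertisement implies.
    pmtu = min(local_mtu, peer_mss + ip_tcp_headers)
    for mtu in link_mtus:  # walk hop by hop, as a full-sized packet would
        if mtu < pmtu:
            # This router returns "Fragmentation required" with its
            # next-hop MTU; the sender lowers its estimate and resends.
            pmtu = mtu
    return pmtu

# The server's view in the lab: local MTU 9000, client advertised
# mss 1460, and the path crosses a 1200 MTU link on the way to the
# client's 1500 MTU network.
print(discover_path_mtu(9000, 1460, [9000, 1200, 1500]))  # 1200
```

Note how the client's MSS already pulls the estimate down to 1500 before any packet is sent; only the ICMP message from the intermediary router gets it the rest of the way to 1200.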
A full example is below, though note that in practice there may not be complete symmetry in
the path in each direction, multiple hops may progressively have smaller MTU values along
the way, and the path may even change throughout the lifetime of a single connection:
This example shows how critical it is for TCP that ICMP messages of this type are forwarded correctly, and it is where most MTU problems occur in production systems: firewalls along the path block or throttle ICMP traffic in a way that inhibits Path MTU Discovery. Don't block ICMP; it will break Path MTU Discovery, and with it any TCP connection transferring enough data that the initial MSS advertisement alone does not keep packets within the Path MTU. At the very least, don't block ICMPv4 "Fragmentation required" or ICMPv6 "Packet Too Big", even if you block other ICMP messages.
The common traceroute utility maps the hops between hosts by increasing the TTL and observing TTL Exceeded messages. This can be extended to show the Path MTU (as well as the hops along the way), which is the functionality that the tracepath utility provides. tracepath sends large packets, starting at the maximum sendable on the local link, to a remote host, and shows any ICMP messages and the adjusted Path MTU along the way as it gradually increases the TTL and decreases the packet size. tracepath is a good place to start when diagnosing issues between two hosts where MTU misconfiguration or ICMP filtering is suspected.
Select the scenario from above with the client on a network with 1500 MTU, the server on a
network with 9000 MTU, and an additional intermediary network with 1200 MTU that packets
must traverse:
Observe that neither side can immediately ascertain the correct Path MTU and must see an
ICMP message from the intermediary router before they become aware of the smaller link:
Bringing up the example HTTP server from earlier, we can also observe the full process of Path MTU Discovery. In this case, note that we run the tcpdump on router-b, on its interface towards server, since it has a better vantage point for observing retransmits.
# reset everything so Linux doesn't remember that ICMP frag message from above
vagrant@blog-lab-mtu:/vagrant$ /vagrant/bin/flush-all-route-caches
root@server:/# sample-http-server
IP client.51428 > server.80: Flags [S], seq 644598568, win 64240, options [mss 1460,sackOK,
IP server.80 > client.51428: Flags [S.], seq 1840446146, ack 644598569, win 62636, options
IP client.51428 > server.80: Flags [.], ack 1, win 1004, options [nop,nop,TS val 3457600578
IP client.51428 > server.80: Flags [P.], seq 1:71, ack 1, win 1004, options [nop,nop,TS val
IP server.80 > client.51428: Flags [.], ack 71, win 978, options [nop,nop,TS val 3809966630
IP server.80 > client.51428: Flags [P.], seq 1:114, ack 71, win 978, options [nop,nop,TS va
IP client.51428 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 34576005
IP server.80 > client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [P.], seq 4458:4915, ack 71, win 978, options [nop,nop,T
IP client.51428 > server.80: Flags [.], ack 114, win 1003, options [nop,nop,TS val 34576005
IP server.80 > client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS
IP client.51428 > server.80: Flags [.], ack 1262, win 993, options [nop,nop,TS val 34576005
IP client.51428 > server.80: Flags [.], ack 2410, win 976, options [nop,nop,TS val 34576005
IP client.51428 > server.80: Flags [.], ack 3558, win 967, options [nop,nop,TS val 34576005
IP client.51428 > server.80: Flags [.], ack 4915, win 970, options [nop,nop,TS val 34576005
IP server.80 > client.51428: Flags [F.], seq 4915, ack 71, win 978, options [nop,nop,TS val
IP client.51428 > server.80: Flags [F.], seq 71, ack 4916, win 1002, options [nop,nop,TS va
IP server.80 > client.51428: Flags [.], ack 72, win 978, options [nop,nop,TS val 3809966652
Note that the advertised MSS values are the same as in the earlier example; they do not reflect the Path MTU, since it is not yet known. Each of the initial large packets sent from the server to the client triggers an ICMP fragmentation message:
IP server.80 > client.51428: Flags [.], seq 114:1562, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 1562:3010, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
IP server.80 > client.51428: Flags [.], seq 3010:4458, ack 71, win 978, options [nop,nop,TS
IP router-b > server: ICMP client unreachable - need to frag (mtu 1200), length 556
The server then resends the failing packets, this time respecting the newly calculated Path
MTU of 1200:
IP server.80 > client.51428: Flags [.], seq 114:1262, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 1262:2410, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 2410:3558, ack 71, win 978, options [nop,nop,TS
IP server.80 > client.51428: Flags [.], seq 3558:4458, ack 71, win 978, options [nop,nop,TS
We can also use tracepath to perform Path MTU Discovery and observe which routers are
responding - the tool starts with the local network’s MTU then discovers the reduced MTU
link as it progresses:
# reset everything so Linux doesn't remember that ICMP frag message from above
vagrant@blog-lab-mtu:~$ /vagrant/bin/flush-all-route-caches
1: 172.29.0.20 0.088ms
1: 172.29.0.20 0.032ms
2: 172.30.0.30 0.046ms
root@client:/# exit
1: 172.31.0.30 0.159ms
1: 172.31.0.30 0.034ms
2: 172.30.0.20 0.092ms
root@server:/#
The tracepath tool is extremely useful in determining whether a stalling connection is indeed caused by a Path MTU blackhole, where a router or firewall is blocking ICMP packets.
There is another way Path MTU Discovery commonly breaks: routers and load balancers that spread traffic across multiple hosts by hashing packet headers (such as with ECMP routing). The hash may not (and typically does not) account for the fact that an ICMP fragmentation or packet-too-big message is related to the TCP connection that triggered it: its IP source is a router along the way, not the expected remote host, so it hashes differently to a normal returning packet. This leads to a situation where one host receives the TCP packets for a connection while another, unrelated host receives the ICMP messages relating to that connection, and disregards them. This introduces a Path MTU blackhole, just as if ICMP were being filtered.
In practice, there are ways to work around this issue. One way is to broadcast those ICMP
messages to all hosts. An alternative approach is used by GLB Director which allows the
routers to perform the ICMP-unaware ECMP hashing, but then re-hashes it correctly at the
first software load balancer layer. GLB inspects inside ICMP messages, since they contain
part of the triggering packet, and hashes those packets the same way they would be hashed
if they were the original TCP packet that triggered them, ensuring ICMP messages land on
the same host as the related TCP connection. In general, it’s important that any system
involving hashing or otherwise manipulating TCP packets ensures that ICMP messages
relating to the stream are sent to the appropriate host, as they are a crucial part of the way
that TCP operates.
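A toy sketch of that idea follows. The hash function, bucket count, and addresses here are all hypothetical illustrations of mine, not GLB's actual implementation: inbound TCP packets are hashed on the client side of the connection, while ICMP errors are hashed on the inner destination of the quoted packet, which is that same client.

```python
import zlib

def flow_hash(client_ip: str, client_port: int, buckets: int = 4) -> int:
    """Toy ECMP-style hash, keyed on the client side of a connection."""
    key = f"{client_ip}:{client_port}".encode()
    return zlib.crc32(key) % buckets

def hash_tcp_packet(src_ip, src_port, dst_ip, dst_port):
    # For inbound TCP packets, the client is the source.
    return flow_hash(src_ip, src_port)

def hash_icmp_error(inner_src_ip, inner_src_port, inner_dst_ip, inner_dst_port):
    # The ICMP payload quotes our *outbound* packet, so the client
    # appears as the inner destination - hash that side instead.
    return flow_hash(inner_dst_ip, inner_dst_port)

# A client SYN and the ICMP "need to frag" message its connection later
# triggers now land on the same bucket:
syn_bucket = hash_tcp_packet("198.51.100.7", 51428, "203.0.113.1", 80)
icmp_bucket = hash_icmp_error("203.0.113.1", 80, "198.51.100.7", 51428)
print(syn_bucket == icmp_bucket)  # True
```

The key point is that both hashes are computed over the same (client IP, client port) pair, even though it appears in different positions in the two packet types.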
Wrapping up
It is often possible to ignore the details of MTU, MSS advertisement, and Path MTU Discovery and have things continue to work, to a certain extent. However, when these systems fail, connections stall entirely, in a way that is highly visible to users. This is often seen only on large transfers, since smaller transfers keep packets small and don't trigger the issue. It can also be intermittent, in cases where only one path between hosts has a reduced Path MTU, or only one path has a router blocking ICMP packets.
Thankfully, the rule for keeping networks functioning correctly with regards to MTU can be summarised simply: configure the same MTU on both interfaces of every link, and never block ICMPv4 "Fragmentation required" or ICMPv6 "Packet Too Big" messages.
Asking whether this rule holds, both internally and externally, in any trouble ticket with the pattern of "Why is my connection stalling when (action that transfers large data) but not when (action that transfers small data)?" will almost always yield an MTU misconfiguration or ICMP filtering as the root cause.
https://theojulienne.io/2020/08/21/mtu-mss-why-connections-stall.html