
cp intermittent IPsec MTU issue
Closed, Resolved · Public

Description

Symptoms: spikes of local ICMP destination unreachable messages on cp servers.

See for example Icmp_OutDestUnreachs in https://grafana.wikimedia.org/dashboard/db/network-performances?orgId=1&var-server=cp1074&var-datasource=eqiad%20prometheus%2Fops&panelId=14&fullscreen
(The Icmp_InDestUnreachs are unrelated, see T167691 )
And more overall (per site/cluster) https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&panelId=20&fullscreen&edit&tab=metrics&from=now-24h&to=now
The issue is happening on all the cache clusters, at a steadier rate for text and with larger spikes for upload.

Example of one of those packets. All packets during a spike are for a single destination IP.

cp1074$ tcpdump -nn -i lo 'icmp[0] = 3'
Internet Protocol Version 4
    Source: 10.64.48.108                       <----  From/to eth0 (ICMP packet stays local)
    Destination: 10.64.48.108                  <---- 
Internet Control Message Protocol
    Type: 3 (Destination unreachable)
    Code: 4 (Fragmentation needed)
    MTU of next hop: 1500                      <---- eth0 has a MTU of 1500
    Internet Protocol Version 4                <---- Packet triggering the ICMP (header + truncated payload)
        Total Length: 1516                     <---- 1516>1500
        Flags: 0x02 (Don't Fragment)           <---- Don't Fragment bit set
        Protocol: Encap Security Payload (50)  <---- ESP packet              
        Source: 10.64.48.108                   <---- cp1074:eth0
        Destination: 10.192.32.113             <---- cp2014:eth0
    Encapsulating Security Payload

Impact: Most likely performance degradation during those spikes, as packets are either lost or retransmitted.

Trigger: still to be determined. So far the spikes seem to happen at "random" times.
According to ipsec statusall, they do NOT seem to match IPsec establishment or re-keying.
There are no matching events in syslog.
They don't match spikes of TCP or UDP traffic either.

An unencrypted packet leaving the host is encrypted transparently by the kernel (as defined in ip xfrm policy list), and receives additional data (padding, IV, trailer, etc.) which increases its size. The kernel takes that overhead into account automatically when reporting the MTU for a specific destination:

title=bast to cp (no ESP)
bast1002:~$ ping -s 2000 10.20.0.170 -M do
PING 10.20.0.170 (10.20.0.170) 2000(2028) bytes of data.
ping: local error: Message too long, mtu=1500
title=cp to cp (ESP)
cp1074:~$ ping -s 2000 10.20.0.170 -M do
PING 10.20.0.170 (10.20.0.170) 2000(2028) bytes of data.
ping: local error: Message too long, mtu=1466

During a spike, I see the following getting populated (variable dest IP):

ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0  src 10.64.48.108 
    cache  expires 102sec users 47 age 44sec mtu 1500

The expiry timer gets reset to 600 if another spike of errors happens with the same dest IP.
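For reference, a minimal way to catch these entries as they appear (a sketch; 10.20.0.170 is just the example destination from above, any ipsec peer IP works):

# Poll the kernel's cached route exception for one peer once per second;
# the "mtu ..." field only shows up while a (possibly bogus) PMTU entry exists.
while true; do date +%T; ip -s route get 10.20.0.170 | grep -E 'expires|mtu' || echo "no cached mtu"; sleep 1; done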

So my current hypothesis is that a wrong MTU gets set (by PMTUD, or something else) and overrides the IPsec-aware one computed by the kernel.
But the errors stop well before the end of the expiry timer, and I can't find any other location where the MTU could be stored (or displayed).

For comparison, the same output during a quiet time:

ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0  src 10.64.48.108 
    cache  users 853 age 132sec

It would be possible to test this by temporarily forcing the MTU to a lower value on both sides with:
sudo ip route add 10.20.0.170 via 10.64.48.1 mtu lock 1400
And monitoring 1/ whether the errors happen again, and 2/ whether the MTU announced in the ICMP messages is 1400 or 1500.
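For point 2/, something like the following should do (a sketch, matching the filter style used above; with -v tcpdump prints the next-hop MTU carried in the ICMP header):

# icmp[0] = type 3 (destination unreachable), icmp[1] = code 4 (fragmentation needed)
tcpdump -nn -v -i lo 'icmp[0] = 3 and icmp[1] = 4'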

In addition, turning on more verbose IPsec logging might yield useful data.

Is there a way to see traffic before (or after) it gets encrypted?
Help is also welcome, especially from anyone who knows more about the IPsec/kernel side, or who has suggestions, other tests to run, etc.
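Regarding seeing traffic before it gets encrypted: the stackoverflow link in the doc list below describes one technique that might work here (untested on our setup, and it needs libpcap built with nflog support): copy packets matching the IPsec output policy to an NFLOG group and capture there, e.g.:

# Mirror outbound packets that match an ipsec policy (i.e. will be ESP-encapsulated) to nflog group 5...
iptables -t mangle -I POSTROUTING -m policy --pol ipsec --dir out -j NFLOG --nflog-group 5
# ...then capture them in cleartext from the nflog pseudo-interface.
tcpdump -nn -i nflog:5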

I see multiple possible fixes, some cleaner/more permanent, some with more impact. Which one to pick also depends on the results of the tests above.
1/ Increase the interface's MTU - preferred option overall
As all of our network supports an MTU >1500, we can increase the interface MTU (eg. to 3000) so that the original packet + overhead doesn't hit 1500.
There is still a risk that the MSS advertised during the TCP handshake of sessions going over IPsec reflects the 3000 MTU, in which case we hit the same issue again.
Increasing the interface MTU should improve performance overall (less fragmentation and overhead).
We need to ensure UDP traffic doesn't get blackholed (eg. a host with an MTU of 3000 sending UDP packets >1500 to a host with an MTU of 1500): this should be quickly visible through ICMP dest unreach and PMTUD.

2/ MSS clamping
Use an iptables rule to rewrite the MSS value in the TCP handshake so it never exceeds a safe value (eg. 1328); see the sketch after this list.

3/ Per destination fixed MTU (ignore any kind of discovery for selected paths).
sudo ip route add 10.20.0.170 via 10.64.48.1 mtu lock 1400
Can be combined with 1/, for example increase the MTU of the interface but keep an MTU of 1500 for some/all hosts/subnets.
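For 2/, a minimal sketch of what the clamping rule could look like (chain and value are illustrative; since cp traffic is locally originated, it would go in the mangle OUTPUT chain rather than the usual FORWARD, and would be applied on both ends of each ipsec pair):

# Rewrite the MSS option of our outgoing SYN/SYN-ACKs so the peer never sends us segments larger than 1328 bytes of payload.
iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1328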

Some doc:
http://packetpushers.net/ipsec-bandwidth-overhead-using-aes/
https://lists.strongswan.org/pipermail/users/2017-January/010341.html
https://www.zeitgeist.se/2013/11/26/mtu-woes-in-ipsec-tunnels-how-to-fix/
http://lartc.org/howto/lartc.cookbook.mtu-discovery.html
http://lartc.org/manpages/ip.html
https://stackoverflow.com/questions/21931614/how-to-see-outgoing-esp-packets-in-tcpdump-before-they-get-encrypted#22085477

Event Timeline

ayounsi created this task.
ayounsi renamed this task from cp intermitent IPsec MTU issue to cp intermittent IPsec MTU issue. May 23 2018, 11:36 AM

I don't have complete thoughts, but keep in mind in general it's complicated to go changing our actual host interface MTUs to anything larger than 1500 ("jumbo frames"), for a few reasons:

  1. At the local network level / real ethernet interface level, if we turn on jumbo frames for just some hosts and not all, we'll probably have problems. e.g. if we turn on jumbo for cp10xx but not other eqiad hosts, we'll run into problems with cp10xx<->others and the rejection of oversized packets (for protocols that won't discover path MTU), or perf issues with re-discovering it all the time for protocols that do. In general, PMTUD is kinda-broken and it's not wise to rely on it more than you have to.
  2. Even if we tried to turn it on for all hosts on the local VLANs, the next problem is there might be some hosts we can't control (hardware devices are the main example, like some console things or PDUs, etc... but maybe all such devices we can't control are on the mgmt network, which can be left at 1500?).
  3. Even if it's all ok within our networks, the next problem is these hosts *also* talk to the public internet, where >1500 almost never works with real clients. Luckily many end-users get a full 1500 these days, but we do rely on PMTUD and/or the client's correct MSS to fix up the rest of them. Having our natural MTU be even higher than 1500 just exacerbates the situation by requiring PMTUD in many more cases.
  4. You could argue for making this split on public-vs-private VLANs (e.g. private vlans, where ipsec etc. is used, get jumbo frames, but public vlans, which might talk directly to the internet, keep a 1500 MTU), but then with how our LVS forwards public IPs to the loopback of private-vlan hosts, the two worlds still blend together...
  5. Regardless of all of this, it's hard to get jumbo frames working with installer-time stuff too (e.g. debian installation slows down or fails because other hosts are sending it mtu>1500 packets and it doesn't know the larger MTU at install time).

All of this is an issue in general, btw. Even if we ignore the ipsec problems, it would be nice to have more-ideal solutions for user-facing PMTUD down from 1500 as well (e.g. clamp it down to ~1460 or some other magic value which slightly hurts the 1500 end-users but fixes up common misconfiguration cases for clients behind PPPoE and such).

For the ipsec case, it's probably simpler to just fix the MTU explicitly for the ipsec-wrapped traffic. Clearly that part works automagically most of the time; the ??? is what's creating these temporary mtu 1500 entries for the routes.

Note one of your links mentions a strongswan setting:

> Set charon.plugins.kernel-netlink.mtu to 1400 or lower.

... Which might be enough to fix this, at some magic value?

Note the docs at https://wiki.strongswan.org/projects/strongswan/wiki/ForwardingAndSplitTunneling#MTUMSS-issues say:

> The charon.plugins.kernel-netlink.mss and charon.plugins.kernel-netlink.mtu may be used, too, but the values set there apply
> to the routes that kernel-netlink installs and the impact of them onto the traffic and the behavior of the kernel is currently quite unclear.

But it's entirely possible we're in a situation where they do work, since we're actually using the kernel netlink route-entry stuff. Perhaps it's the lack of one or both of these settings that causes the temporary mtu=1500 entries?
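For reference, that setting lives in strongswan.conf (or a drop-in under strongswan.d); a sketch of what it would look like, with a purely illustrative value and the caveat from the docs above that its effect on our setup is unclear:

charon {
    plugins {
        kernel-netlink {
            # MTU applied to routes installed by the kernel-netlink plugin (illustrative value)
            mtu = 1400
        }
    }
}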

I looked into charon.plugins.kernel-netlink.mtu, but from what I read it is only applied to routes added by ipsec in tunnel mode, while we use it in transport (transparent) mode.

Raising the MTU above standard everywhere is indeed another can of worms and out of scope here.
With careful testing, raising it on some hosts (with well-identified flows) might be more doable, especially for internal flows where we expect the MSS to be respected for TCP (/session-based) traffic, and UDP (/similar) traffic to stay below 1500 by convention (and sometimes configurable).

The next test I want to do to narrow down the issue is to force the MTU down to 1400 on specific routes (/32) between 2 hosts (eg. cp3035 and cp1074) and see if the errors happen again.

> Raising the MTU above standard everywhere is indeed another can of worms and out of scope here.
> With careful testing, raising it on some hosts (with well-identified flows) might be more doable, especially for internal flows where we expect the MSS to be respected for TCP (/session-based) traffic, and UDP (/similar) traffic to stay below 1500 by convention (and sometimes configurable).

If you still mean raising the interface MTU (e.g. the actual MTU setting of eth0), I don't think we can sanely change that in isolation (as in, without changing it on all other hosts), even for this very specific case of cp servers for ipsec. The key issue is they still have flows with the public users on the Internet, and setting MTU>1500 for the public connections is likely to exacerbate various PMTU-related issues with end-users.

But even if we avoided impacting user-facing flows with some clever hacks: the other problem is even in the simplest case there are a whole lot of flows to different kinds of servers in our infra (think: puppet agent conns, monitoring conns, prometheus, kafka analytics stuff, ssh from bastions, etc, etc). We can't assume we can configure explicit mss (or UDP packet sizes) on all of these things in the general long-term case (or that we want to manage that complexity), so we'd have to rely on some form of PMTUD to fix up all the TCP situations, and then hope there are no critical unconfigurable UDP cases.

Exacerbating this situation: traditional ICMP-based PMTUD relies on the fact that a router is sitting on the MTU-changing boundary to send the ICMP PTB message. When you mix MTUs at Layer 2 in the same VLAN (our per-row VLANs), that assumption gets broken. The oversized packets just get dropped without an ICMP PTB, looking very similar to an ICMP Blackhole situation. So in the cases where some of the random flow peers happen to live on the same Layer 2 network (same row in a core DC) as the cp server, traditional PMTUD would fail.

PLPMTUD from RFC 4821 can work around this, but isn't a perfect tradeoff either. Currently we have it set just for the caches, as a fallback when ICMP Blackholes are detected for end-user cases. This incurs a performance penalty (detecting the blackhole first). The other way to configure PLPMTUD is to use it from the outset (assume ICMP blackhole and only rely on PLPMTUD), but then you have to set the lower-bound threshold and start the connection transmitting at that size before probing upwards, so this adds extra packetization overhead (fewer bytes/packet on average) during the initial phase of a connection and has a perf cost (and it's a global setting, so we can't really change PLPMTUD behaviors depending on destination networks at the OS level).
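For reference, both PLPMTUD behaviors described above map to a single global sysctl (a sketch; the exact values currently set on the caches may differ):

# net.ipv4.tcp_mtu_probing: 0 = off, 1 = only after an ICMP black hole is detected (the fallback mode described above),
# 2 = always on, starting connections at net.ipv4.tcp_base_mss and probing upwards.
sysctl net.ipv4.tcp_mtu_probing net.ipv4.tcp_base_mss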

> The next test I want to do to narrow down the issue is to force the MTU down to 1400 on specific routes (/32) between 2 hosts (eg. cp3035 and cp1074) and see if the errors happen again.

This might be a viable workaround in the general case, for the pairs of hosts expected to have ipsec associations. It really depends on the nature of the bug here. The core issue is what is creating these faulty route cache entries like:

ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0  src 10.64.48.108 
    cache  expires 102sec users 47 age 44sec mtu 1500

I don't think the strongswan stuff is in any way explicitly creating these. It's possible the kernel is creating them due to some misbehavior. Perhaps it's that the kernel notices some tiny bit of TCP loss somewhere and starts a PMTUD process with the peer host, which initially tries the underlying interface MTU (1500) when it should be using the effective MTU of the xfrm involved in this traffic. Or it could be that something is going wrong related to this while the xfrm is being replaced for ipsec association refreshes, but (a) those are supposed to overlap and not leave unencrypted traffic flowing in the normal case, and (b) you said earlier there didn't seem to be an association between the timing of the spikes and ipsec re-keying/re-associating.

What will matter here is where the triggering code/event that creates these "mtu 1500" routing entries gets the "1500" from. If it's copying it directly from the hardware interface MTU (which is probably a bug in this case!), it probably ignores any static mtu policy rule we try to inject. Perhaps it's smarter than that, whatever it is. Or it could be dumber and just be using a hardcoded 1500 as some kind of generic fallback or starting point for PMTU issues.

Another thing to worry about here, even in testing: even though we're just trying to influence path-mtu by setting these static route entries for a pair of peer IPs, I'm not 100% sure how that interacts with the ipsec xfrm stuff. It's plausible that creating such a static /32 <-> /32 route for the MTU stuff effectively causes xfrm to be bypassed and routes the traffic in the clear (!!!), for all I know.

Mentioned in SAL (#wikimedia-operations) [2018-05-25T10:37:25Z] <XioNoX> test force mtu 1400 between cp1074 and cp3039 - T195365

No more ICMP mentioning cp3039, which helps narrow down the possible causes.
Note that adding the static /32 does not bypass xfrm; traffic stays encrypted.
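(For reference, one way to double-check the xfrm point is to watch for any non-ESP traffic to an ipsec peer on the wire; the peer IP below is the cp2014 address from the dissection example in the description:)

# Should stay (almost) silent if everything to that peer is ESP-encapsulated (IP protocol 50);
# any cleartext TCP showing up here would indicate an xfrm bypass.
tcpdump -nn -i eth0 'host 10.192.32.113 and not ip proto 50'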


Change 437784 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Add static routes with MTU 1450 to ipsec destinations

https://gerrit.wikimedia.org/r/437784

Mentioned in SAL (#wikimedia-operations) [2018-07-30T22:29:57Z] <XioNoX> - puppet disabled on cp40* hosts - T195365

Change 437784 merged by Ayounsi:
[operations/puppet@production] Add static routes with MTU 1450 to ipsec destinations

https://gerrit.wikimedia.org/r/437784

Mentioned in SAL (#wikimedia-operations) [2018-07-30T22:35:30Z] <XioNoX> applying static route + fixed MTU to cp4025 - T195365

Change 449371 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] ip -6 route show, don't try to find the IP, but keyword instead

https://gerrit.wikimedia.org/r/449371

Change 449371 merged by Ayounsi:
[operations/puppet@production] ip -6 route show, don't try to find the IP, but keyword instead

https://gerrit.wikimedia.org/r/449371

Mentioned in SAL (#wikimedia-operations) [2018-07-30T23:35:10Z] <XioNoX> re-enabling puppet on cp40* - T195365

Change 449526 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Extend cp to cp ipsec MTU 1450 to codfw

https://gerrit.wikimedia.org/r/449526

Mentioned in SAL (#wikimedia-operations) [2018-08-01T15:16:26Z] <XioNoX> disable puppet on all codfw cp* servers - T195365

Change 449526 merged by Ayounsi:
[operations/puppet@production] Extend cp to cp ipsec MTU 1450 to codfw

https://gerrit.wikimedia.org/r/449526

Change 449787 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] cp to cp ipsec MTU change everywhere except eqiad

https://gerrit.wikimedia.org/r/449787

Change 449787 merged by Ayounsi:
[operations/puppet@production] cp to cp ipsec MTU change everywhere except eqiad

https://gerrit.wikimedia.org/r/449787

Change 449886 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] cp to cp ipsec MTU set to 1450 for all cp servers

https://gerrit.wikimedia.org/r/449886

Change 449886 merged by Ayounsi:
[operations/puppet@production] cp to cp ipsec MTU set to 1450 for all cp servers

https://gerrit.wikimedia.org/r/449886

This is done, the static routes with mtu lock did the trick, as expected.
No more ICMP spikes, as confirmed on https://grafana.wikimedia.org/dashboard/db/network-performances-global?panelId=20&fullscreen&orgId=1&from=now-24h&to=now