Symptoms: spikes of local ICMP destination unreachable messages on cp servers.
See for example Icmp_OutDestUnreachs in https://grafana.wikimedia.org/dashboard/db/network-performances?orgId=1&var-server=cp1074&var-datasource=eqiad%20prometheus%2Fops&panelId=14&fullscreen
(The Icmp_InDestUnreachs are unrelated, see T167691 )
And a more global view (per site/cluster): https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&panelId=20&fullscreen&edit&tab=metrics&from=now-24h&to=now
The issue is happening on all the cache clusters: at a steadier rate on text, and with larger spikes on upload.
Example of one of those packets. All packets during a spike are for a single destination IP.
Internet Protocol Version 4
    Source: 10.64.48.108          <---- From/to eth0 (ICMP packet stays local)
    Destination: 10.64.48.108     <----
Internet Control Message Protocol
    Type: 3 (Destination unreachable)
    Code: 4 (Fragmentation needed)
    MTU of next hop: 1500         <---- eth0 has an MTU of 1500
    Internet Protocol Version 4   <---- Packet triggering the ICMP (header + truncated payload)
        Total Length: 1516                     <---- 1516 > 1500
        Flags: 0x02 (Don't Fragment)           <---- Don't Fragment bit set
        Protocol: Encap Security Payload (50)  <---- ESP packet
        Source: 10.64.48.108                   <---- cp1074:eth0
        Destination: 10.192.32.113             <---- cp2014:eth0
        Encapsulating Security Payload
Impact: Most likely performance degradation during those spikes, as packets are either lost or retransmitted.
Trigger: yet to be determined. So far the spikes seem to happen at "random" times.
Based on ipsec statusall, they do NOT seem to match IPsec establishment or re-keying.
No matching events in syslog.
They don't match spikes of TCP or UDP traffic either.
An unencrypted packet leaving the host is encrypted transparently by the kernel (as defined in ip xfrm policy list), and receives additional data (padding, IV, trailer, etc.) which increases its size. The kernel takes that overhead into account automatically when reporting the MTU for a specific destination:
title=bast to cp (no ESP)
bast1002:~$ ping -s 2000 10.20.0.170 -M do
PING 10.20.0.170 (10.20.0.170) 2000(2028) bytes of data.
ping: local error: Message too long, mtu=1500
title=cp to cp (ESP)
cp1074:~$ ping -s 2000 10.20.0.170 -M do
PING 10.20.0.170 (10.20.0.170) 2000(2028) bytes of data.
ping: local error: Message too long, mtu=1466
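The per-destination overhead (here 1500 - 1466 = 34 bytes) depends on the SA's mode and cipher/integrity algorithms. Both can be inspected from the kernel's xfrm state (a sketch; output omitted, as it varies per SA):

# Policies: which flows get transparently encrypted
sudo ip xfrm policy list
# SAs: mode (transport/tunnel) and algorithms, which determine the ESP overhead
sudo ip xfrm state list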
During a spike, I see the following cache entry getting populated (the destination IP varies):
ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0 src 10.64.48.108
    cache expires 102sec users 47 age 44sec mtu 1500
The expiry timer gets reset to 600s if another spike of errors happens for the same destination IP.
So my current hypothesis is that a wrong MTU gets cached (via PMTUD, or something else) and overrides the IPsec-aware value computed by the kernel.
But the errors stop well before the expiry timer runs out, and I can't find any other place where the MTU could be stored (or displayed).
For comparison, the same output during a quiet time:
ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0 src 10.64.48.108
    cache users 853 age 132sec
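Another quick check (assuming flushing the cache is operationally safe on a production cp host) would be to flush the route cache during a spike, and see whether the errors stop immediately once the cached entry is gone:

sudo ip route flush cache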
It would be possible to test this by temporarily forcing the MTU to a lower value on both sides with:
sudo ip route add 10.20.0.170 via 10.64.48.1 mtu lock 1400
Then monitor 1/ whether the errors happen again, and 2/ whether the max MTU announced in the ICMP errors is 1400 or 1500.
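For 2/, a capture like the following should work; -v makes tcpdump decode the announced MTU, and the interface choice is an assumption (since the ICMP errors stay local, 'any' seems safest):

# Match ICMP type 3 (destination unreachable), code 4 (fragmentation needed)
sudo tcpdump -n -v -i any 'icmp[0] == 3 and icmp[1] == 4'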
In addition, turning on more verbose IPsec logging might yield useful data.
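Assuming strongSwan, something like this in ipsec.conf should raise the IKE, kernel-interface and networking log levels (which subsystems/levels are actually worth enabling is a guess):

config setup
    charondebug="ike 2, knl 2, net 2"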
Is there a way to see traffic before (or after) it gets encrypted?
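The last link below suggests one way (untested here, and the nflog group number is arbitrary): have iptables copy the not-yet-encrypted packets that match an IPsec policy to an nflog pseudo-interface, and capture there:

# Mirror outgoing packets that will be ESP-encapsulated to netlink log group 5
sudo iptables -t mangle -A POSTROUTING -m policy --pol ipsec --dir out -j NFLOG --nflog-group 5
# Capture them in cleartext, before encryption
sudo tcpdump -n -i nflog:5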
Help is also welcome, especially if you know more about the IPsec/kernel side, or have suggestions, other tests, etc.
I see multiple possible fixes, some cleaner/more permanent, some with more impact. The choice also depends on the results of the tests above.
1/ Increase the interface's MTU - preferred option overall
As all of our network supports an MTU > 1500, we can increase the interface MTU (eg. to 3000) so that a standard 1500-byte packet + ESP overhead no longer exceeds the interface MTU.
There is still a risk that the MSS advertised during the TCP handshake of sessions going over IPsec would be based on the new 3000-byte MTU, and we'd hit the same issue again.
Increasing the interface MTU should improve performance overall (less fragmentation and per-packet overhead).
We need to ensure UDP traffic doesn't get blackholed (eg. a host with an MTU of 3000 sending UDP packets > 1500 bytes to a host with an MTU of 1500); that should be quickly visible with ICMP dest unreachable and PMTUD. A sketch of the change follows.
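For reference, testing this on a single host pair could look like the following (interface name and values are examples; the switch ports must also accept jumbo frames):

# Raise the interface MTU on both ends
sudo ip link set dev eth0 mtu 3000
# Verify with an oversized, non-fragmentable ping towards the remote cp host
ping -s 2500 -M do 10.192.32.113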
2/ MSS clamping
Use an iptables rule to rewrite the MSS value in TCP handshakes so it never exceeds a safe value (eg. 1328), as sketched below.
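A sketch of such a rule (the chain is an assumption; on cp hosts the traffic is locally originated, hence OUTPUT rather than FORWARD):

# Rewrite the MSS option on outgoing SYNs so the TCP payload stays within the ESP overhead budget
sudo iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1328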
3/ Per-destination fixed MTU (ignore any kind of discovery for selected paths).
sudo ip route add 10.20.0.170 via 10.64.48.1 mtu lock 1400
Can be combined with 1/: for example, increase the interface MTU but keep an MTU of 1500 towards some/all hosts/subnets, as sketched below.
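For example (the subnet is purely illustrative), assuming the interface MTU was raised to 3000 per 1/:

# Keep a locked 1500 MTU towards the remote cache subnet, ignoring any discovery
sudo ip route add 10.192.32.0/24 via 10.64.48.1 mtu lock 1500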
Some documentation:
http://packetpushers.net/ipsec-bandwidth-overhead-using-aes/
https://lists.strongswan.org/pipermail/users/2017-January/010341.html
https://www.zeitgeist.se/2013/11/26/mtu-woes-in-ipsec-tunnels-how-to-fix/
http://lartc.org/howto/lartc.cookbook.mtu-discovery.html
http://lartc.org/manpages/ip.html
https://stackoverflow.com/questions/21931614/how-to-see-outgoing-esp-packets-in-tcpdump-before-they-get-encrypted#22085477