Symptoms: spikes of local ICMP destination unreachable messages on cp servers.
See for example `Icmp_OutDestUnreachs` in https://grafana-admin.wikimedia.org/dashboard/db/network-performances?orgId=1&var-server=cp1074&var-datasource=eqiad%20prometheus%2Fops&panelId=14&fullscreen
(The `Icmp_InDestUnreachs` metrics are unrelated, see T167691.)
And a more global view (per site/cluster): https://grafana.wikimedia.org/dashboard/db/network-performances-global?orgId=1&panelId=20&fullscreen&edit&tab=metrics&from=now-24h&to=now
The issue happens on all the cache clusters: at a steadier rate on text, and with larger spikes on upload.
Example of one of those packets. **All packets during a spike are for a single destination IP.**
```name=cp1074$ tcpdump -nn -i lo "icmp[0] = 3",lines=10
Internet Protocol Version 4
Source: 10.64.48.108 <---- From/to eth0 (ICMP packet stays local)
Destination: 10.64.48.108 <----
Internet Control Message Protocol
Type: 3 (Destination unreachable)
Code: 4 (Fragmentation needed)
MTU of next hop: 1500 <---- eth0 has a MTU of 1500
Internet Protocol Version 4 <---- Packet triggering the ICMP (header + truncated payload)
Total Length: 1516 <---- 1516>1500
Flags: 0x02 (Don't Fragment) <---- Don't Fragment bit set
Protocol: Encap Security Payload (50) <---- ESP packet
Source: 10.64.48.108 <---- cp1074:eth0
Destination: 10.192.32.113 <---- cp2014:eth0
Encapsulating Security Payload
```
Impact: Most likely performance degradation during those spikes, as packets are either lost or retransmitted.
Trigger: still to be determined. So far the spikes seem to happen at "random" times:
- Using `ipsec statusall`, they do NOT seem to match IPsec establishment or re-keying.
- No matching events in syslog.
- They don't match spikes of TCP or UDP traffic.
An unencrypted packet leaving the host is transparently encrypted by the kernel (as defined in `ip xfrm policy list`) and receives additional data (padding, IV, trailer, etc.), which increases its size. The kernel automatically takes that overhead into account when reporting the MTU for a specific destination:
```title=bast to cp (no ESP)
bast1002:~$ ping -s 2000 10.20.0.170 -M do
PING 10.20.0.170 (10.20.0.170) 2000(2028) bytes of data.
ping: local error: Message too long, mtu=1500
```
```title=cp to cp (ESP)
cp1074:~$ ping -s 2000 10.20.0.170 -M do
PING 10.20.0.170 (10.20.0.170) 2000(2028) bytes of data.
ping: local error: Message too long, mtu=1466
```
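The 34-byte difference between the two pings (1500 - 1466) is consistent with the minimum ESP overhead of an AES-GCM transport-mode SA. A small sketch of that arithmetic, assuming AES-GCM is in use (an assumption; check `ip xfrm state` on the hosts for the actual algorithms):

```python
# Hypothetical breakdown of ESP transport-mode overhead for AES-GCM.
# These field sizes come from the ESP packet format (RFC 4303) and
# AES-GCM for ESP (RFC 4106); the actual SA config is an assumption.

ESP_HEADER  = 8   # SPI (4 bytes) + sequence number (4 bytes)
GCM_IV      = 8   # per-packet IV for AES-GCM
ESP_TRAILER = 2   # pad length (1 byte) + next header (1 byte)
GCM_ICV     = 16  # integrity check value (authentication tag)

overhead = ESP_HEADER + GCM_IV + ESP_TRAILER + GCM_ICV
print(overhead)         # 34
print(1500 - overhead)  # 1466, matching the kernel-reported MTU above
```

Padding can add a few more bytes depending on payload length, so 1466 is the best case, not a constant.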
During a spike, I see the following getting populated (variable dest IP):
```
ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0 src 10.64.48.108
cache expires 102sec users 47 age 44sec mtu 1500
```
The expiry timer gets reset to 600 seconds if another spike of errors happens for the same destination IP.
So my current hypothesis is that a wrong MTU (from PMTUD or elsewhere) gets cached and overrides the IPsec-aware value computed by the kernel.
But the errors stop well before the expiry timer runs out, and I can't find any other location where the MTU could be stored (or displayed).
For comparison, the same output during a quiet time:
```
ip -s route get 10.20.0.170
10.20.0.170 via 10.64.48.1 dev eth0 src 10.64.48.108
cache users 853 age 132sec
```
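One way to poke at the hypothesis manually (a sketch, assuming the learned PMTU lives in the route exception cache that `ip route get` displays):

```shell
# Show the cached route, including any learned MTU, for a peer:
ip -s route get 10.20.0.170

# Drop all cached route exceptions, including learned PMTU values.
# If the errors stop immediately after a flush, that points at a
# bogus cached MTU rather than the xfrm/IPsec state itself.
sudo ip route flush cache
```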
It would be possible to test this by temporarily forcing the MTU to a lower value on both sides with:
`sudo ip route add 10.20.0.170 via 10.64.48.1 mtu lock 1400`
and then monitoring 1/ whether the errors happen again, and 2/ whether the max MTU announced in the ICMP messages is 1400 or 1500.
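For step 2/, the announced MTU sits in the last two bytes of the ICMP header of a fragmentation-needed message (RFC 1191). A small helper for decoding it from captured bytes, in case tcpdump output is ambiguous (the function name is mine, not from any existing tooling):

```python
import struct

def next_hop_mtu(icmp_header: bytes) -> int:
    """Extract the next-hop MTU from an ICMP type 3 code 4
    (destination unreachable / fragmentation needed) header,
    per RFC 1191: type (1) | code (1) | checksum (2) | unused (2) | MTU (2)."""
    msg_type, code, _checksum, _unused, mtu = struct.unpack(
        "!BBHHH", icmp_header[:8])
    assert msg_type == 3 and code == 4, "not a fragmentation-needed message"
    return mtu

# Example: a frag-needed header announcing MTU 1500, like the capture above.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1500)
print(next_hop_mtu(sample))  # 1500
```

If the route is locked to 1400 and the ICMP still announces 1500, the MTU is coming from somewhere other than the route cache.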
In addition, turning on more verbose IPsec logging might yield useful data.
Is there a way to see the traffic before (or after) it gets encrypted?
Help is also welcome, especially if you know more about the IPsec/kernel side, or have suggestions, other tests, etc.
I see multiple possible fixes, some cleaner/more permanent than others, some with more impact. The choice also depends on the results of the tests above.
1/ Increase the interface's MTU - preferred option overall
As all of our network supports an MTU >1500, we can increase the interface MTU (e.g. to 3000) so that a 1500-byte packet plus ESP overhead no longer exceeds the interface MTU.
There is still a risk that the MSS advertised during the TCP handshake of sessions going over IPsec gets derived from the new 3000-byte MTU, and we hit the same issue again.
Increasing the interface MTU should improve performance overall (less fragmentation and overhead).
We need to ensure UDP traffic doesn't get blackholed (e.g. a host with MTU 3000 sending UDP packets >1500 bytes to a host with MTU 1500): this should be quickly visible via ICMP destination unreachable and PMTUD.
2/ MSS clamping
Use an iptables rule to rewrite the MSS value in TCP handshakes so it never exceeds a safe value (eg. 1328).
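A minimal sketch of such a rule (the chain choice and MSS value are assumptions: traffic on the cp hosts is locally originated, so OUTPUT rather than FORWARD, and in practice the rule should probably be scoped to the IPsec peer subnets):

```shell
# Clamp the MSS of outgoing SYNs so that peers never send segments
# which, once ESP overhead is added, exceed the 1500-byte path MTU.
sudo iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --set-mss 1328

# Alternative: derive the MSS from the current path MTU instead of
# using a fixed value.
# sudo iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN \
#     -j TCPMSS --clamp-mss-to-pmtu
```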
3/ Per-destination fixed MTU (ignore any kind of discovery for selected paths).
`sudo ip route add 10.20.0.170 via 10.64.48.1 mtu lock 1400`
Can be combined with 1/: for example, increase the interface MTU but keep an MTU of 1500 for some/all hosts/subnets.
Some doc:
http://packetpushers.net/ipsec-bandwidth-overhead-using-aes/
https://lists.strongswan.org/pipermail/users/2017-January/010341.html
https://www.zeitgeist.se/2013/11/26/mtu-woes-in-ipsec-tunnels-how-to-fix/
http://lartc.org/howto/lartc.cookbook.mtu-discovery.html
http://lartc.org/manpages/ip.html
https://stackoverflow.com/questions/21931614/how-to-see-outgoing-esp-packets-in-tcpdump-before-they-get-encrypted#22085477