mx1001.wikimedia.org mail delivery timeouts
Closed, Resolved · Public

Description

After upgrading to kernel linux-image-5.10.0-10-amd64 (5.10.84-1), mail delivery often timed out. The problem appears identical to what occurred in T297127. The immediate resolution was to downgrade the kernel and temporarily disable puppet. During the incident I was not able to determine the underlying cause.

Event Timeline

Interesting, thanks for reverting quickly! So the mail issues on 2021-12-03 weren't just a Heisenbug after all. We'll probably need a less production-impacting way to reproduce this, so that we can dig out more details to report.

I think I'll initially file a bug in Debian to see if others have also seen this bug and will do another pass of changes between 5.10.46 and 5.10.70 to see if anything stands out.

To report this upstream against the 5.10.x LTS tree we'd need some further technical evidence. Did you by chance see whether the "Check size of conntrack table" Icinga check alerted? One option might be to set up an additional mx1002 or mx2002 VM (which doesn't get added to our MX records), run it with 5.10.84, and trigger some artificial mail traffic to see whether we can provoke this error without production impact.
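The artificial traffic suggested here could be generated with something like the following. This is only a rough sketch: the swaks SMTP test tool, the host name mx1002.example.org, and the addresses are all assumptions for illustration, not the actual setup.

```shell
# Send a burst of test messages at a candidate test MX (placeholder host and
# addresses; swaks is assumed to be installed). Each delivery opens a fresh
# TCP connection, which is where the timeouts were observed.
for i in $(seq 1 50); do
  timeout 30 swaks --server mx1002.example.org \
      --from probe@example.org --to test@example.org \
      --header "Subject: probe $i" --body "artificial traffic $i" \
      >/dev/null 2>&1 || echo "delivery $i failed or timed out"
done
echo "burst complete"
```

Running this from a host outside the mail path would let the timeouts show up (as `swaks` failures) without touching production traffic.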

@MoritzMuehlenhoff my initial thought is that since we know reverting the kernel solves the issue, we could do some short reboots into the new kernel to gather more data for a proper bug report. Specifically I would like to see whether conntrack -E provides any more detailed information about why the packets are being destroyed. I did check the size of the conntrack table during yesterday's incident, but the number of entries was well below the sysctl max. If this sounds reasonable to you, I can perform the reboot tests today or tomorrow.

Sure thing! If we do it in a controlled manner and e.g. revert right after 15-30 mins, the only real impact would be a small delay for some chunk of mail. We might also be able to see the effect on mx2001 (although it receives only 1/100th of mx1001's mail volume), but using mx1001 also seems fine to narrow this down.

Great, I'll report back what I find.

Did you by chance see whether the "Check size of conntrack table" Icinga check alerted?

I checked Icinga: there is nothing in the alert or notification history for that check, and it reports that the status has not changed in 108 days. So it does not look like it alerted, no.

@MoritzMuehlenhoff, did you see https://www.spinics.net/lists/stable/msg509296.html ?
Apparently upstream identified the issue as 09e856d54bda5f288ef8437a90ab2b9b3eab83d1 and reverted it on all stable trees (Debian might not have picked up the revert though).

@Platonides that revert made it into 5.10.78, https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.78, so I don't believe that is the issue, since we were testing on 5.10.84.

That's a different issue (which actually caused an outage on the cloudgw servers), see https://phabricator.wikimedia.org/T294853. cloudgw* runs buster-backports, which is still at 5.10.70-1~bpo10+1; it'll move to a new kernel whenever 5.10.78 reaches buster-backports.

Change 759344 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl

https://gerrit.wikimedia.org/r/759344

Change 759344 merged by JHathaway:

[operations/puppet@production] mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl

https://gerrit.wikimedia.org/r/759344

We are no longer seeing the timeouts after setting the net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl to 3600, which restores the setting to the value it had prior to kernel version 5.10.54.
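For reference, a minimal sketch of how that sysctl could be persisted (the file path below is illustrative; the actual change was rolled out via the operations/puppet patch above):

```
# /etc/sysctl.d/70-tcp-fastopen.conf  (illustrative path)
# 3600 restores the value this sysctl had before kernel 5.10.54.
net.ipv4.tcp_fastopen_blackhole_timeout_sec = 3600
```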