mx1001.wikimedia.org mail delivery timeouts
Closed, Resolved · Public

Description

After upgrading to kernel linux-image-5.10.0-10-amd64 (5.10.84-1), mail delivery often timed out. The problem appears identical to what occurred in T297127. The immediate resolution was to downgrade the kernel and temporarily disable puppet. During the incident I was not able to determine the underlying cause.

Event Timeline

Interesting, thanks for reverting quickly! So the mail issues on 2021-12-03 weren't just a Heisenbug after all. We'll probably need a less production-impacting way to reproduce this, so that we can dig out more details to report.

I think I'll initially file a bug in Debian to see if others have also seen this bug and will do another pass of changes between 5.10.46 and 5.10.70 to see if anything stands out.

To report this upstream against the 5.10.x LTS tree we'd need some further technical evidence. Did you by chance see whether the "Check size of conntrack table" Icinga check alerted? One option might be to set up an additional mx1002 or mx2002 VM (which doesn't get added to our MX records), run it with 5.10.84, and trigger some artificial mail traffic to see whether we can provoke this error without production impact.
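The artificial traffic suggested here could be generated with something like the following. This is only a rough sketch: the swaks SMTP test tool, the host name mx1002.example.org, and the addresses are all assumptions for illustration, not the actual setup.

```shell
# Send a burst of test messages at a candidate test MX (placeholder host and
# addresses; swaks is assumed to be installed). Each delivery opens a fresh
# TCP connection, which is where the timeouts were observed.
for i in $(seq 1 50); do
  timeout 30 swaks --server mx1002.example.org \
      --from probe@example.org --to test@example.org \
      --header "Subject: probe $i" --body "artificial traffic $i" \
      >/dev/null 2>&1 || echo "delivery $i failed or timed out"
done
echo "burst complete"
```

Running this from a host outside the mail path would let the timeouts show up (as `swaks` failures) without touching production traffic.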

@MoritzMuehlenhoff my initial thought is that since we know reverting the kernel solves the issue, we could do some short reboots into the new kernel to gather more data for a proper bug report. Specifically I would like to see whether conntrack -E provides any more detailed information about why the packets are being destroyed. I did check the size of the conntrack table during yesterday's incident, but the number of entries was well below the sysctl max. If this sounds reasonable to you, I can perform the reboot tests today or tomorrow.

Sure thing! If we do it in a controlled manner and e.g. revert right after 15-30 mins, the only real impact would be a small delay for some chunk of mail. We might also be able to see the effect on mx2001 (although it receives only 1/100th of mx1001's mail volume), but using mx1001 also seems fine to narrow this down.

Great, I'll report back what I find.

Did you by chance see whether the "Check size of conntrack table" Icinga check alerted?

I checked Icinga: there is nothing in the alert or notification history for that check, and it reports that the status has not changed in 108 days. So it does not look like it alerted, no.

@MoritzMuehlenhoff, did you see https://www.spinics.net/lists/stable/msg509296.html ?
Apparently upstream identified the issue as 09e856d54bda5f288ef8437a90ab2b9b3eab83d1 and reverted it on all stable trees (Debian might not have picked up the revert though).

@Platonides that revert made it into 5.10.78, https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.78, so I don't believe that is the issue, since we were testing on 5.10.84.

That's a different issue (which actually caused an outage on the cloudgw servers), see https://phabricator.wikimedia.org/T294853. cloudgw* runs buster-backports, which is still at 5.10.70-1~bpo10+1; it'll move to a new kernel whenever 5.10.78 reaches buster-backports.

Change 759344 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl

https://gerrit.wikimedia.org/r/759344

Change 759344 merged by JHathaway:

[operations/puppet@production] mx: set net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl

https://gerrit.wikimedia.org/r/759344

We are no longer seeing the timeouts after setting the net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl to 3600, which restores the setting to the value it had prior to kernel version 5.10.54.
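For reference, a minimal sketch of how that sysctl could be persisted (the file path below is illustrative; the actual change was rolled out via the operations/puppet patch above):

```
# /etc/sysctl.d/70-tcp-fastopen.conf  (illustrative path)
# 3600 restores the value this sysctl had before kernel 5.10.54.
net.ipv4.tcp_fastopen_blackhole_timeout_sec = 3600
```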