
MX record issue on mx2001.wikimedia.org
Closed, Resolved (Public)

Description

Hey folks. Is there something up with server mx1001.wikimedia.org? Several staff members are getting intermittent Google Mail bounce backs. It's my understanding that SRE manages this server. Thanks!

Event Timeline

Hi @bcampbell, do you have any examples of the bounce messages, including the full raw headers? With those we could trace the messages through the mail logs. A private paste would be great. Thanks in advance!

Hey @herron thanks. I think I uploaded the eml file privately and added you as a subscriber, but let me know if you don't see it.

Thanks, looking into this I see in the private message:

Subject: Warning: message 1msZEc-006308-GG delayed 24 hours

This message was created automatically by mail delivery software.
A message that you sent has not yet been delivered to one or more of its
recipients after more than 24 hours on the queue on mx2001.wikimedia.org.

The message identifier is:     1msZEc-006308-GG
The date of the message is:    Wed, 1 Dec 2021 15:41:19 -0800

In other words, mx2001 is saying that it has been unable to deliver this message to Gmail for 24 hours, so the mail is delayed.

From the most recent delivery attempt:

2021-12-03 08:38:43 1msZEc-006308-GG H=aspmx.l.google.com [142.250.114.27]: SMTP timeout after end of data (186657 bytes written): Connection timed out
2021-12-03 08:38:43 1msZEc-006308-GG == redacted@wikimedia.org R=gsuite_account T=remote_smtp defer (110): Connection timed out H=aspmx.l.google.com [142.250.114.27] DT=10m: SMTP timeout after end of data (186657 bytes written)

I am not sure offhand why the connection to Google is timing out. Ideally this would be escalated to the Google postmasters for investigation on their end.

Also, to clarify, this message hasn't been lost or bounced yet. It is currently in the deferred queue on mx2001, where delivery to google will continue to be retried.
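As an aside for anyone tracing similar deferrals: below is a minimal sketch of pulling the useful fields out of a deferral line like the second log line quoted above. The function name `parse_deferral` and the regular expressions are hypothetical and written only against that one sample line, not against Exim's full log format, so treat it as a starting point rather than a general parser.

```python
import re

# Assumption: patterns are derived from the single sample line in this task,
# not from Exim's complete log grammar.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) == (\S+) (.*)$"
)
HOST_RE = re.compile(r"H=(\S+) \[([^\]]+)\]")
DEFER_RE = re.compile(r"defer \((-?\d+)\): (.*?)(?= H=|$)")
ROUTER_RE = re.compile(r"\bR=(\S+)")
TRANSPORT_RE = re.compile(r"\bT=(\S+)")


def parse_deferral(line):
    """Return a dict of fields for a '==' (deferred) log line, else None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # not a deferral line in the shape we expect
    timestamp, msgid, recipient, rest = m.groups()
    fields = {"timestamp": timestamp, "msgid": msgid, "recipient": recipient}

    host = HOST_RE.search(rest)
    if host:
        fields["host"], fields["ip"] = host.groups()

    defer = DEFER_RE.search(rest)
    if defer:
        # 110 is ETIMEDOUT ("Connection timed out") on Linux.
        fields["errno"] = int(defer.group(1))
        fields["reason"] = defer.group(2)

    router = ROUTER_RE.search(rest)
    if router:
        fields["router"] = router.group(1)
    transport = TRANSPORT_RE.search(rest)
    if transport:
        fields["transport"] = transport.group(1)
    return fields


sample = (
    "2021-12-03 08:38:43 1msZEc-006308-GG == redacted@wikimedia.org "
    "R=gsuite_account T=remote_smtp defer (110): Connection timed out "
    "H=aspmx.l.google.com [142.250.114.27] DT=10m: "
    "SMTP timeout after end of data (186657 bytes written)"
)

if __name__ == "__main__":
    print(parse_deferral(sample))
```

Grouping the parsed results by `host` or `errno` across a whole log would show whether the timeouts are concentrated on one destination or one sending server.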

Hi Herron

What are the next steps? ITS is receiving more tickets related to this issue.

Thanks
Eliza

Dzahn changed the task status from Open to In Progress. Dec 3 2021, 11:57 PM
Dzahn triaged this task as High priority.

@eliza we're looking into this. Next update in 15 minutes.

Hi @bcampbell and @eliza, thanks for the heads up.

Based on your notification, SRE investigated and found a firewall issue (potentially related to a kernel bug) that prevented mx2001 (one of our two mail servers) from reaching Google, causing emails to back up in the outgoing queue. This affected a fraction of outgoing email since November 24, and a larger fraction since a load-balancing change on December 1. [1] All other outgoing mail would have been routed through the other mail server (mx1001) and sent successfully.

We're still investigating the root cause of the config bug, but in the meantime we've applied a fix in production and the affected server is churning through its backlog of emails now. [2] It should be finished within the next few hours, at which point all the delayed email will have been delivered. No mail was permanently lost as a result of this incident.

We'll monitor for further trouble but let us know if you see anything; feel free to use https://klaxon.wikimedia.org to page us if needed over the weekend. We'll also let you handle Foundation-wide communication as needed here, but happy to provide input. Next week we'll follow up with more information including an incident report. Sorry for the inconvenience and thanks for letting us know.

[1] Incoming mail per host: https://grafana.wikimedia.org/d/000000451/mail?viewPanel=37&orgId=1&from=1635984865681&to=1638576865681
[2] Outgoing mail per host: https://grafana.wikimedia.org/d/000000451/mail?viewPanel=3&orgId=1&from=1638566026656&to=1638576826656

RLazarus renamed this task from MX record issue on mx1001.wikimedia.org to MX record issue on mx2001.wikimedia.org. Dec 4 2021, 12:22 AM

Update: The mail queue length on mx2001 is back to normal, so we're substantially caught up on the delayed emails. We'll continue to keep an eye on things and you can expect more details next week.

Dzahn claimed this task.
Dzahn added a subscriber: Dzahn.

This became incident T297127, for which we will shortly release a public incident report (as part of the incident ticket, but we will also link it here).

After that is done, we need to follow up by bringing the server back into service in T297128.

The incident itself is closed.

@bcampbell This actually turned out to be a firewall dropping packets due to a kernel bug. I shared a doc with you if you are curious.

@Dzahn Thanks for sharing the doc, that's helpful. Are there any outstanding emails left in the queue?

@bcampbell No more mail is in the queue, and exim is still disabled on the server that was affected. Mail is currently handled by the other server.