Hey folks. Is there something up with server mx1001.wikimedia.org? Several staff members are getting intermittent Google Mail bounce backs. It's my understanding that SRE manages this server. Thanks!
|Resolved||Dzahn||T297017 MX record issue on mx2001.wikimedia.org|
|Resolved||herron||T297127 Incident: 2021-12-03 mx2001->Gmail delivery issues|
|Resolved||herron||T297128 Bringing mx2001 back into service|
|Resolved||herron||T297144 large MX queues should page|
|Open||None||T275867 Add exim queue size to grafana graph|
|Open||None||T294166 Alert that should have paged via VictorOps was delayed because of partial networking outage|
|Resolved||jhathaway||T299107 mx1001.wikimedia.org mail delivery timeouts|
Hi @bcampbell do you have any examples of the bounce messages including the full raw headers? With that we could trace the messages through the mail logs. A private paste would be great. Thanks in advance!
Hey @herron thanks. I think I uploaded the eml file privately and added you as a subscriber, but let me know if you don't see it.
Thanks, looking into this I see in the private message:
Subject: Warning: message 1msZEc-006308-GG delayed 24 hours
This message was created automatically by mail delivery software.
A message that you sent has not yet been delivered to one or more of its recipients after more than 24 hours on the queue on mx2001.wikimedia.org.
The message identifier is: 1msZEc-006308-GG
The date of the message is: Wed, 1 Dec 2021 15:41:19 -0800
In other words, mx2001 is saying that it hasn't been able to deliver this to Gmail for 24 hours, and the mail is delayed.
From the most recent delivery attempt:
2021-12-03 08:38:43 1msZEc-006308-GG H=aspmx.l.google.com [220.127.116.11]: SMTP timeout after end of data (186657 bytes written): Connection timed out
2021-12-03 08:38:43 1msZEc-006308-GG == email@example.com R=gsuite_account T=remote_smtp defer (110): Connection timed out H=aspmx.l.google.com [18.104.22.168] DT=10m: SMTP timeout after end of data (186657 bytes written)
I am not sure offhand why the connection with Google is timing out. Ideally this could be escalated to the Google postmasters for investigation on their end.
Also, to clarify, this message hasn't been lost or bounced yet. It is currently in the deferred queue on mx2001, where delivery to google will continue to be retried.
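For anyone who wants to check how widespread the deferrals are, here is a rough Python sketch that picks the `==` (deferred) lines out of an exim main log and counts them per remote host. It assumes the log lines look like the two quoted above; the function names `parse_defer` and `deferrals_by_host` are made up for illustration, and the regex has only been checked against that one sample line, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Assumed layout, based only on the deferral line quoted in this task:
#   <date> <time> <message id> == <recipient> R=... T=... defer (<errno>): <reason>
DEFER_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<msg_id>\S+) == (?P<rcpt>\S+) "
    r".*?\bdefer \((?P<errno>-?\d+)\): (?P<reason>.*)$"
)
# Remote host as it appears in the reason text, e.g. H=aspmx.l.google.com
HOST_RE = re.compile(r"\bH=(\S+)")


def parse_defer(line):
    """Return a dict describing a deferral line, or None for any other line."""
    m = DEFER_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["errno"] = int(rec["errno"])
    host = HOST_RE.search(rec["reason"])
    rec["host"] = host.group(1) if host else None
    return rec


def deferrals_by_host(lines):
    """Count deferral lines per remote host across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        rec = parse_defer(line)
        if rec and rec["host"]:
            counts[rec["host"]] += 1
    return counts


# The two log lines quoted above; only the second is a deferral ("==") line.
SAMPLE_LINES = [
    "2021-12-03 08:38:43 1msZEc-006308-GG H=aspmx.l.google.com [220.127.116.11]: "
    "SMTP timeout after end of data (186657 bytes written): Connection timed out",
    "2021-12-03 08:38:43 1msZEc-006308-GG == email@example.com R=gsuite_account "
    "T=remote_smtp defer (110): Connection timed out H=aspmx.l.google.com "
    "[18.104.22.168] DT=10m: SMTP timeout after end of data (186657 bytes written)",
]

if __name__ == "__main__":
    for host, count in deferrals_by_host(SAMPLE_LINES).most_common():
        print(host, count)
```

Feeding it the real log file instead of `SAMPLE_LINES` (for example, the lines of an open file handle) would show whether the timeouts are limited to Google's MX hosts or affect other destinations too.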
What are the next steps? ITS is receiving more tickets related to this.
Hi @bcampbell and @eliza, thanks for the heads up.
Based on your notification, SRE investigated and found a firewall issue (potentially related to a kernel bug) that prevented mx2001 (one of our two mail servers) from reaching Google, causing emails to back up in the outgoing queue. This affected a fraction of outgoing email since November 24, and a larger fraction since a load-balancing change on December 1. All other outgoing mail would have been routed through the other mail server (mx1001) and sent successfully.
We're still investigating the root cause of the config bug, but in the meantime we've applied a fix in production and the affected server is churning through its backlog of emails now. It should be finished within the next few hours, at which point all the delayed email will have been delivered. No mail was permanently lost as a result of this incident.
We'll monitor for further trouble but let us know if you see anything; feel free to use https://klaxon.wikimedia.org to page us if needed over the weekend. We'll also let you handle Foundation-wide communication as needed here, but happy to provide input. Next week we'll follow up with more information including an incident report. Sorry for the inconvenience and thanks for letting us know.
 Incoming mail per host: https://grafana.wikimedia.org/d/000000451/mail?viewPanel=37&orgId=1&from=1635984865681&to=1638576865681
 Outgoing mail per host: https://grafana.wikimedia.org/d/000000451/mail?viewPanel=3&orgId=1&from=1638566026656&to=1638576826656
Update: The mail queue length on mx2001 is back to normal, so we're substantially caught up on the delayed emails. We'll continue to keep an eye on things and you can expect more details next week.
@bcampbell This actually turned out to be a firewall dropping packets due to a kernel bug. I shared a doc with you if you are curious.
@Dzahn Thanks for sharing the doc, that's helpful. Are there any outstanding emails left in the queue?
@bcampbell No more mail in the queue, and exim is still disabled on the server that was affected. Mail is currently handled by the other server.