Page MenuHomePhabricator

[mitigated] Google returning 503 error when delivering to mx1001 and mx2001
Closed, ResolvedPublic

Description

Hey folks. ITS is receiving reports from multiple users the last few days that Gmail emails are intermittently failing to deliver. When I review the Google Workspace logs, I see the following error on one of the reported email bounces:

Google tried to deliver your message, but it was rejected by the server for the recipient domain <a href="http://wikimedia.org" target="_blank">wikimedia.org</a> by <a href="http://mx1001.wikimedia.org" target="_blank">mx1001.wikimedia.org</a>. [2620:0:861:3:208:80:154:76]. The error that the other server returned was: 503 BDAT command used when CHUNKING not advertised

Can you please check the status of mx1001 and mx2001? I think the issue may be on the mail server side, but I'm unsure. Let me know if you need any more information. I've attached a couple of screenshots of the log details for two of the bounces.

Screen Shot 2022-05-08 at 9.57.51 AM.png (450×2 px, 145 KB)

Screen Shot 2022-05-08 at 10.06.13 AM.png (642×1 px, 170 KB)

Event Timeline

Reedy triaged this task as High priority.May 8 2022, 5:28 PM
Reedy added projects: Mail, SRE.

@jhathaway wonder if anything may have changed recently

I have looked in our logs and the following is an example of what we see on our side

2022-05-06 01:57:31 H=mail-lf1-x12b.google.com [2a00:1450:4864:20::12b]:42712 I=[2620:0:861:3:208:80:154:76]:25 X=TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256 CV=no F=<translations+caf_=redacted=wikimedia.org@wikimedia.org> rejected RCPT <redacted@wikimedia.org>: Previous (cached) callout verification failure
2022-05-06 01:57:31 SMTP protocol error in "BDAT 18944 LAST" H=mail-lf1-x12b.google.com [2a00:1450:4864:20::12b]:42712 I=[2620:0:861:3:208:80:154:76]:25 BDAT command used when CHUNKING not advertised
2022-05-06 01:57:31 SMTP syntax error in "        for <redacted@wikimedia.org>; Thu, 05 May 2022 18:57:31 -0700 (PDT)" H=mail-lf1-x12b.google.com [2a00:1450:4864:20::12b]:42712 I=[2620:0:861:3:208:80:154:76]:25 unrecognized command
2022-05-06 01:57:31 SMTP syntax error in "X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;" H=mail-lf1-x12b.google.com [2a00:1450:4864:20::12b]:42712 I=[2620:0:861:3:208:80:154:76]:25 unrecognized command
2022-05-06 01:57:31 SMTP call from mail-lf1-x12b.google.com [2a00:1450:4864:20::12b]:42712 I=[2620:0:861:3:208:80:154:76]:25 dropped: too many syntax or protocol errors (last command was "X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;",  C=EHLO,STARTTLS,EHLO,MAIL,RCPT,BDAT,RSET,NOOP,MAIL,RCPT,BDAT)

@jbond I can't think of any recent changes that would have introduced this behavior. The boxes were rebooted on Friday to catch the latest kernel update. I'll start investigating.

Looking at count of log lines matching "BDAT command used when CHUNKING not advertised" on mx1001 this appears to have began on the 5th, which was also the date that exim was restarted after the host was rebooted for a kernel update.

2466 2022-05-08
3863 2022-05-07
2523 2022-05-06
1727 2022-05-05

From https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2022-05-05

21:21 jhathaway@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: new kernel
21:21 jhathaway@cumin1001: START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: new kernel
21:21 jhathaway: reboot mx1001

Hi @bcampbell, while SRE is investigating could ITS please open a case with the google postmasters about this issue as well?

We have noted that these errors are currently isolated to connections from google, and are curious if there had been any changes to their infrastructure around the time this began, which according to our logs was 2022-05-05.

SRE is continuing to investigate as well.

demonstrating the we support chunking

$ telnet -4 mx1001.wikimedia.org 25                                                      [8:50:08]  
Trying 208.80.154.76...
Connected to mx1001.wikimedia.org.
Escape character is '^]'.
220 mx1001.wikimedia.org ESMTP Exim 4.94.2 Mon, 09 May 2022 08:50:13 +0000
ehlo bob
250-mx1001.wikimedia.org Hello loki.johnbond.org [206.189.25.239]
250-SIZE 52428800
250-8BITMIME
250-PIPELINING
250-PIPE_CONNECT
250-CHUNKING
250-STARTTLS
250-SMTPUTF8
250 HELP
421 mx1001.wikimedia.org: SMTP command timeout - closing connection
Connection closed by foreign host.

Thank you @herron and @jbond I'll open a ticket with Google now and keep you updated.

My reading of https://seclists.org/oss-sec/2017/q4/324 suggests that if a BDAT command is issued after the mail or RCPT command then exim will respond with this error message. Looking at the log line above we see the commands issued where C=EHLO,STARTTLS,EHLO,MAIL,RCPT,BDAT,RSET,NOOP,MAIL,RCPT,BDAT) i.e. we do see a BDAT after RCPT

My reading of https://seclists.org/oss-sec/2017/q4/324 suggests that if a BDAT command is issued after the mail or RCPT command then exim will respond with this error message. Looking at the log line above we see the commands issued where C=EHLO,STARTTLS,EHLO,MAIL,RCPT,BDAT,RSET,NOOP,MAIL,RCPT,BDAT) i.e. we do see a BDAT after RCPT

I managed to recreate this locally but not every time

telnet -4 mx1001.wikimedia.org 25                                                                    [15:15:14]
Trying 208.80.154.76...
Connected to mx1001.wikimedia.org.
Escape character is '^]'.
220 mx1001.wikimedia.org ESMTP Exim 4.94.2 Mon, 09 May 2022 15:15:15 +0000
MAIL FROM: jbond@johnbond.org
250 OK
RCPT to: jbond@wikimedia.org
250 Accepted
BDAT 1 LAST
503 BDAT command used when CHUNKING not advertised

however i dont get the error if i start the connection eith ehlo

telnet -4 mx1001.wikimedia.org 25                                                                    [15:17:21]
Trying 208.80.154.76...
Connected to mx1001.wikimedia.org.
Escape character is '^]'.
220 mx1001.wikimedia.org ESMTP Exim 4.94.2 Mon, 09 May 2022 15:17:23 +0000
ehlo bob
250-mx1001.wikimedia.org Hello loki.johnbond.org [206.189.25.239]
250-SIZE 52428800
250-8BITMIME
250-PIPELINING
250-PIPE_CONNECT
250-CHUNKING
250-STARTTLS
250-SMTPUTF8
250 HELP
MAIL FROM: jbond@johnbond.org
250 OK
RCPT to: jbond@wikimedia.org
250 Accepted
BDAT 1 LAST
1
1
250- 1 byte chunk, total 5

think this was a read hearing it dose work when started with ehlo which seems to be the correct behaviour

$ telnet -4 mx1001.wikimedia.org 25                                                                    [15:17:21]
Trying 208.80.154.76...
Connected to mx1001.wikimedia.org.
Escape character is '^]'.
220 mx1001.wikimedia.org ESMTP Exim 4.94.2 Mon, 09 May 2022 15:17:23 +0000
ehlo bob
250-mx1001.wikimedia.org Hello loki.johnbond.org [206.189.25.239]
250-SIZE 52428800
250-8BITMIME
250-PIPELINING
250-PIPE_CONNECT
250-CHUNKING
250-STARTTLS
250-SMTPUTF8
250 HELP
MAIL FROM: jbond@johnbond.org
250 OK
RCPT to: jbond@wikimedia.org
250 Accepted
BDAT 1 LAST
1
1
250- 1 byte chunk, total 5

First messages in the logs appeared on May 4th:

$ zgrep "BDAT command used when CHUNKING not advertised" /var/log/exim4/mainlog.5.gz|awk '{split($2, t, ":" ); print $1,t[1],t[2]}'|uniq -c|head
      1 2022-05-04 01 28
      1 2022-05-04 01 29
      2 2022-05-04 01 33
      3 2022-05-04 01 34
      3 2022-05-04 01 36
      2 2022-05-04 01 37
      1 2022-05-04 01 38
      3 2022-05-04 01 39
      2 2022-05-04 01 40
      1 2022-05-04 01 41

My suggestions:

  • Please first remove the google servers from the callout cache, and you may also consider examining what caused callout failure on that host. It defintely shouldn't be there. An unexpected rejection may mess up google's SMTP side and may invalidate already seen chunking status, which would in turn invalidate BDATs. (In fact it might be our own server involved in the callout, if it was a wikimedia.org source!)
  • There is some reason if/when CHUNKING is not advertised to google, since otherwise it seems to be enabled. May also worths a look. (It may be the result of the above, though.)
  • While google isn't perfect I don't remember seeing any chunking related problems with them (and all of my servers do chunking), so I'd say to look locally first.
herron lowered the priority of this task from High to Medium.EditedMay 9 2022, 3:57 PM

'chunking_advertise_hosts =' (disabling chunking) has been applied to both MXes and we have not seen this error recur since that change was made.

Decreasing priority as this appears to have mitigated the immediate issue, however leaving this task open while continuing to try and understand the root cause.

I heard back from SADA, our Google vendor.

"Hope you're doing well! We are not currently aware of any changes to how Google would be delivering messages. Considering how messages are sent and the nature of the error in question, this would constitute a fairly major change to Google's SMTP workflow. I would presume the issue is more so tied to the server that's actually kicking back the message."

herron renamed this task from [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 to [mitigated] Google returning 503 error when delivering to mx1001 and mx2001.May 9 2022, 5:14 PM

Change 790419 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] mx: disable chunking

https://gerrit.wikimedia.org/r/790419

  • Please first remove the google servers from the callout cache, and you may also consider examining what caused callout failure on that host. It defintely shouldn't be there. An unexpected rejection may mess up google's SMTP side and may invalidate already seen chunking status, which would in turn invalidate BDATs. (In fact it might be our own server involved in the callout, if it was a wikimedia.org source!)

Thanks @grin for the input. I agree that callout verification may be the cause and perhaps disabling callout verification will solve the problem. However, the callout verification logic has existed for many years, so I am confused why it has just recently started to cause issues.

Change 790419 merged by JHathaway:

[operations/puppet@production] mx: disable chunking

https://gerrit.wikimedia.org/r/790419

I sent a message to the Exim mailing list, https://www.mail-archive.com/exim-users@exim.org/msg57216.html.

Jeremy Harris suggested trying the latest version of Exim which allows for acl controlled debug messages. I was able
to successfully capture debug messages with the following changes.

# install 4.96~RC0
$ sudo apt-get install ./exim4{,-config}_4.96~RC0-1~bullseye0_all.deb ./exim4-{base,daemon-heavy}_4.96~RC0-1~bullseye0_amd64.deb
# add acl debug rules
<snip>
acl_smtp_helo = acl_check_ehlo
<snip>
begin acl

acl_check_ehlo:
        warn condition = ${if match{$sender_helo_name}{.*\\.google\\.com}{true}{false}}
                log_message = XXX CONTROL DEBUGGING XXX
                control = debug/tag=.$sender_host_address
        accept
<snip>

This allowed me capture a debug session which I sent over to Jeremy. The debug session appeared helpful to Jeremy and he is going to see if he can determine the cause of the error.

Change 803601 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] exim: update comment on BDAT issue

https://gerrit.wikimedia.org/r/803601

Change 803601 merged by JHathaway:

[operations/puppet@production] exim: update comment on BDAT issue

https://gerrit.wikimedia.org/r/803601

This has now been fixed upstream, https://git.exim.org/exim.git/commit/462e2cd30. We will keep chunking disabled until we upgrade to 4.96 which will include the fix.