
Emails sent from @wikimedia.beta.wmflabs.org do not reach intended recipients
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:
The e-mail never reaches the user. The logs on the recipient's server show this rejection reason:

Aug  9 16:38:45 mx01 postfix/smtpd[11788]: NOQUEUE: reject: RCPT from instance-mx-out03.cloudinfra.wmflabs.org[185.15.56.18]: 450 4.1.8 <wiki@wikimedia.beta.wmflabs.org>: Sender address rejected: Domain not found; from=<wiki@wikimedia.beta.wmflabs.org> to=<REDACTED@voyager.hr> proto=ESMTP helo=<mx-out03.wmcloud.org>

This seems to be caused by a sanity check (in this case, the reject_unknown_sender_domain restriction of the Postfix MTA), which rejects e-mail when the envelope sender domain has neither an MX nor an A DNS record. That is the case for MAIL FROM: <wiki@wikimedia.beta.wmflabs.org>:

mx01% host -t mx wikimedia.beta.wmflabs.org
wikimedia.beta.wmflabs.org has no MX record

mx01% host -t a wikimedia.beta.wmflabs.org
wikimedia.beta.wmflabs.org has no A record
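
To make the check concrete, here is a minimal Python sketch of the rule as described above (reject when the envelope sender domain has neither MX nor A records). It is not Postfix's implementation: the `dns` table is a hypothetical in-memory stand-in for real lookups, the function names are made up for illustration, and 192.0.2.10 is a documentation address, not the real record.

```python
def sender_domain(envelope_from: str) -> str:
    """Extract the domain from an envelope sender such as '<user@example.org>'."""
    return envelope_from.strip().strip("<>").rpartition("@")[2].lower()


def check_sender(envelope_from: str, dns: dict):
    """Return None to accept, or the SMTP rejection text.

    `dns` is a hypothetical table {domain: {"MX": [...], "A": [...]}}
    that replaces real DNS queries so the sketch stays self-contained.
    """
    domain = sender_domain(envelope_from)
    records = dns.get(domain, {})
    if records.get("MX") or records.get("A"):
        return None  # the domain resolves, so this check lets the mail through
    address = envelope_from.strip().strip("<>")
    return f"450 4.1.8 <{address}>: Sender address rejected: Domain not found"


# Zone state as reported by the `host` output above: no MX, no A.
before = {"wikimedia.beta.wmflabs.org": {"MX": [], "A": []}}
print(check_sender("<wiki@wikimedia.beta.wmflabs.org>", before))

# Hypothetical zone state with an A record (documentation address).
after = {"wikimedia.beta.wmflabs.org": {"MX": [], "A": ["192.0.2.10"]}}
print(check_sender("<wiki@wikimedia.beta.wmflabs.org>", after))
```

Either record type is enough to satisfy this rule, so publishing an A record alone would clear this particular rejection.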

What should have happened instead?:
The mail should arrive. It would if the beta site used valid envelope sender addresses.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Other e-mails (account creation, etc.) also never arrived, most likely due to the same issue, but they were not critical for my use of the site.

Event Timeline

I looked a bit into this prompted by @kostajh on IRC. Essentially, the situation in beta is like this:

  • There is a beta-specific mail server, deployment-mx03. That instance does not have a floating IP so it would be using the general Cloud VPS NAT address for outbound connectivity (which is not ideal for SPF and similar).
  • Wikimedia production systems have a manual mail route for MediaWiki-originated mail to be sent out by a specific mailer separate from the rest of the mail (wikimail_smarthost). The beta cluster has configuration to use deployment-mx03 for this.
  • However, the mail relay Puppetization is needlessly complicated. Instead of provisioning that specific mail route on the servers configured to need it (which would be the logical and modern way of doing things), the entire exim4 config file varies between realms, and only the production config file contains the route needed.
  • The wikimedia.beta.wmflabs.org domain is also missing an SPF record that would permit any host to send email for it. As a bonus, beta.wmflabs.org has a clearly outdated policy permitting a toolsbeta host to send mail on behalf of that domain.

My recommendation for fixing this would be:

  • Assign a floating IP to the deployment-prep mail server.
  • Configure an SPF record for the wikimedia.beta.wmflabs.org domain that permits the floating IP from the previous step to send mail for that domain and fails all others.
  • Fix the mail relay puppetization to do the right thing and use the mediawiki-specific mail route when needed even on beta.

I assigned floating IP 185.15.56.115 to deployment-mx03.

I think I set up the SPF record:

$dig wikimedia.beta.wmflabs.org txt

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> wikimedia.beta.wmflabs.org txt
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47601
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;wikimedia.beta.wmflabs.org.	IN	TXT

;; ANSWER SECTION:
wikimedia.beta.wmflabs.org. 3600 IN	TXT	"v=spf1 ip4:185.15.56.115 -all"
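
For reference, a rough Python sketch of how a receiver would evaluate that record against a connecting IP. This is a simplification under stated assumptions, not a full SPF evaluator: it only understands the `ip4` and `all` mechanisms (which is all this record uses) and silently skips anything else; `check_spf` is a made-up name.

```python
import ipaddress

# Qualifier prefixes and the result each one yields when its mechanism matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(record: str, client_ip: str) -> str:
    """Evaluate a tiny subset of SPF: only 'ip4:' and 'all' mechanisms."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record at all
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # the default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        mechanism, _, value = term.partition(":")
        mechanism = mechanism.lower()
        if mechanism == "ip4" and ip.version == 4:
            if ip in ipaddress.ip_network(value, strict=False):
                return QUALIFIERS[qualifier]
        elif mechanism == "all":
            return QUALIFIERS[qualifier]
        # Other mechanisms (a, mx, include, ...) are ignored in this sketch.
    return "neutral"


record = "v=spf1 ip4:185.15.56.115 -all"
print(check_spf(record, "185.15.56.115"))  # the floating IP of deployment-mx03
print(check_spf(record, "185.15.56.18"))   # mx-out03, from the original rejection log
```

With this record, only mail leaving from the floating IP passes; mail that still goes out through the general WMCS relays fails the `-all` term.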

I forgot to mention: I also removed the beta.wmflabs.org SPF record.

On enwiki, I was just able to update my email to a Gmail address and receive the verification email. So, that's promising.

EDIT: False info. The email came to my @fastmail.com account (the previously set address on the account) but the confirmation to the @gmail.com address never arrived.

I haven't looked at the exim4 configs yet to see what needs to be done regarding routing (maybe it's already handled via ad-hoc cherry-picked puppet). exim4 configs are notoriously complicated, give me a bit.

Change 1003420 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mail: Add wikimail routing to wmcs as well

https://gerrit.wikimedia.org/r/1003420

It's getting routed to WMCS's main mailservers:

2024-02-14 13:01:57 1raEth-0006mk-Lf <= wiki-enwiki-s9-s8uk79-se/dKyYlA6mLoB6r@beta.wmflabs.org U=www-data P=local S=1644 id=enwiki.65ccb9c59d2e20.08051219@en.wikipedia.beta.wmflabs.org
2024-02-14 13:01:57 1raEth-0006mk-Lf => ladsgroup@gmail.com R=smart_route T=remote_smtp S=1680 H=mx-out03.wmcloud.org [172.16.6.237] I=[172.16.2.65] K C="250- 1680 byte chunk, total 1680\\n250 OK id=1raEth-0005rF-Ml" DT=0s

That patch should fix it. It's not perfect but meh. I'll cherry-pick it on beta's puppetmaster and see how it goes.

Notice: /Stage[main]/Exim4/File[/etc/exim4/exim4.conf]/content: 
--- /etc/exim4/exim4.conf	2023-12-11 16:45:46.519337352 +0000
+++ /tmp/puppet-file20240214-10749-1yae50w	2024-02-14 14:03:14.872509648 +0000
@@ -45,6 +45,12 @@
 	allow_defer
 	forbid_file
 
+wiki_mail:
+	driver = manualroute
+	condition = ${if eqi{$header_X-Mailer:}{MediaWiki mailer}}
+	transport = remote_smtp
+	route_list = *  deployment-mx03.deployment-prep.eqiad1.wikimedia.cloud
+
 # Send all mail via a set of mail relays ("smart hosts")
 
 smart_route:

Info: Computing checksum on file /etc/exim4/exim4.conf
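
As a rough illustration of what the new router does, here is a Python sketch of the routing decision. It assumes routers are tried in the order they appear and that `eqi` is a case-insensitive comparison; everything else in the real exim4 config (other routers, any extra conditions on smart_route) is left out, and `pick_router` is a made-up name.

```python
def pick_router(headers: dict) -> str:
    """Return the name of the router that would take the message.

    `headers` maps header names to values. Header names are compared
    case-insensitively, and the X-Mailer value is compared the way the
    `eqi` condition in the diff does (ignoring case).
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    x_mailer = normalized.get("x-mailer", "").strip()
    if x_mailer.lower() == "mediawiki mailer":
        return "wiki_mail"  # manualroute to the beta mail server
    return "smart_route"    # everything else still goes to the smart hosts


print(pick_router({"X-Mailer": "MediaWiki mailer"}))
print(pick_router({"X-Mailer": "some other mailer"}))
print(pick_router({}))
```

Because wiki_mail sits before smart_route in the file, only MediaWiki-originated mail is diverted; other mail from the host keeps its existing path.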

Progress?

2024-02-14 14:09:13 1raFwn-0005fc-Me <= wiki@wikimedia.beta.wmflabs.org U=www-data P=local S=2898 id=enwiki.65ccc989a56369.31715278@en.wikipedia.beta.wmflabs.org
2024-02-14 14:09:13 1raFwn-0005fc-Me no IP address found for host  deployment-mx03.deployment-prep.eqiad1.wikimedia.cloud
2024-02-14 14:09:13 1raFwn-0005fc-Me == ladsgroup+beta@gmail.com R=wiki_mail defer (-32): lookup of host "\302\240deployment-mx03.deployment-prep.eqiad1.wikimedia.cloud" failed in wiki_mail router
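
The `\302\240` in that last line is the giveaway: those two octal bytes are the UTF-8 encoding of U+00A0 (no-break space), so the hostname in route_list starts with an invisible character. A small Python sketch for spotting this kind of thing in a config line; `odd_whitespace` is a made-up helper and the sample line is reconstructed from the diff and the log, not copied from the actual file.

```python
import unicodedata


def odd_whitespace(text: str):
    """List (index, codepoint, name) for whitespace that is not plain ASCII."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isspace() and ch not in " \t\r\n"
    ]


# The two bytes exim logged in octal decode to a single no-break space.
nbsp = b"\302\240".decode("utf-8")
print(repr(nbsp))

# Reconstructed sample: a normal space followed by a no-break space.
line = "route_list = * " + nbsp + "deployment-mx03.deployment-prep.eqiad1.wikimedia.cloud"
print(odd_whitespace(line))
print(odd_whitespace(line.replace(nbsp, " ")))
```

Running a check like this over the template before deploying would catch the character that the eye (and most diff viewers) cannot see.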

I replaced the hostname with the IP, and now I'm getting this:

2024-02-14 14:21:08 1raG8J-0002l3-Tc ** ladsgroup+beta@gmail.com R=wiki_mail T=remote_smtp H=185.15.56.115 [185.15.56.115] I=[172.16.2.65]: SMTP error from remote mail server after RCPT TO:<ladsgroup+beta@gmail.com>: 550 Relay not permitted

Let me see what I can do about that.

I love this: the relay allows internal IPs.

hostlist relay_from_hosts = <; @[] ; 127.0.0.1 ; ::1 ; 172.16.0.0/12 ; 127.0.0.0/8 ; ::1/128

But the mail is being sent from WMCS's external IP:

2024-02-14 14:21:07 H=cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud) [185.15.56.238]:4961 I=[172.16.6.221]:25 F=<wiki@wikimedia.beta.wmflabs.org> rejected RCPT <ladsgroup+beta@gmail.com>: Relay not permitted
root@deployment-mx03:/var/log/exim4# dig -x 185.15.56.238 +short
cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org.
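
To spell out the mismatch, a minimal Python sketch of the host list check using the networks from relay_from_hosts above. The `@[]` entry (the server's own interface addresses) is left out because it cannot be known from here, and `relay_permitted` is a made-up name, not anything exim provides.

```python
import ipaddress

# Entries copied from the relay_from_hosts line above, minus "@[]".
RELAY_FROM_HOSTS = ["127.0.0.1", "::1", "172.16.0.0/12", "127.0.0.0/8", "::1/128"]


def relay_permitted(client_ip: str, allowed=RELAY_FROM_HOSTS) -> bool:
    """Return True if client_ip falls inside any listed address or network."""
    ip = ipaddress.ip_address(client_ip)
    for entry in allowed:
        network = ipaddress.ip_network(entry, strict=False)
        # Only compare addresses of the same family (IPv4 vs IPv6).
        if ip.version == network.version and ip in network:
            return True
    return False


print(relay_permitted("185.15.56.238"))  # NAT gateway address seen by mx03
print(relay_permitted("172.16.2.65"))    # internal address of deployment-mediawiki12
```

Connecting to mx03 on its floating IP makes the traffic leave through the NAT gateway, so mx03 sees 185.15.56.238 rather than the internal 172.16.2.65 that the list would allow.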

Removing the stray whitespace before the hostname (the \302\240 in the log above is a non-breaking space) fixed the lookup, and now it should be working. I got the mail as well.

Headers look fine too:

Received-SPF: pass (google.com: domain of wiki@wikimedia.beta.wmflabs.org designates 185.15.56.115 as permitted sender) client-ip=185.15.56.115;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@beta.wmflabs.org header.s=wikimedia header.b=mrUgdnQ9;
       spf=pass (google.com: domain of wiki@wikimedia.beta.wmflabs.org designates 185.15.56.115 as permitted sender) smtp.mailfrom=wiki@wikimedia.beta.wmflabs.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=wmflabs.org

An update:

Gmail is happy now but our own wikimedia.org servers don't get the mails:

2024-02-14 16:02:17 H=deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud [172.16.2.65]:57382 I=[172.16.6.221]:25 F=<wiki@wikimedia.beta.wmflabs.org> temporarily rejected RCPT <asarabadani@wikimedia.org>: Could not complete sender verify

I don't know whether enterprise Gmail is rejecting it, whether our own exim4 is being strict, or whether mx03 is refusing to send to a wikimedia.org recipient (why?).

Change 1003498 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] exim: Avoid considering wikimedia domains as local in WMCS

https://gerrit.wikimedia.org/r/1003498

I cherry-picked this on the beta cluster and it works.

This is basically fixed now; I'm only keeping the task open until the puppet patches are merged.

I also removed the MX record, as it is just one more thing to maintain; the A record is enough.

Change 1003420 merged by Ladsgroup:

[operations/puppet@production] mail: Add wikimail routing to wmcs as well

https://gerrit.wikimedia.org/r/1003420

Change 1003498 merged by Ladsgroup:

[operations/puppet@production] exim: Avoid considering wikimedia domains as local in WMCS

https://gerrit.wikimedia.org/r/1003498

Ladsgroup claimed this task.