Some emails from The Wikipedia Library aren't being received as expected
Closed, Resolved | Public | Spike

Description

Emails from The Wikipedia Library are largely broken right now, and have been for some weeks. We've been aware of some delivery issues, but it now appears that the problem affects at least all Gmail users. Emails, including application approvals, access codes, and comment notifications, are not being delivered as expected.

While we're working on a fix, @Samwalton9-WMF will manually re-send emails from wikipedialibrary@wikimedia.org on a regular basis. We will also display a notice in the library.

Solution
We are working on a temporary fix, which should get emails up and running again, but this solution involves substantial technical overhead in the long run.

We are currently investigating our options for the future, which may involve integrating some of the content of emails into the library directly (e.g. access to per-collection login details), and additional usage of Echo for notifying users. Ideally we wouldn't be using email at all.

Further updates on this potential solution will be posted in this ticket or its subtasks.

Original task
Numerous users have now reported that they haven't received their BNA access code email. I've asked once or twice whether the email is in spam, and users haven't found it there.

In The Wikipedia Library admin, djmail confirms that the emails were sent (I usually go here to copy the email text to send on), and when I tested this myself I received the email as expected, so I'm not sure what's going on here. This seems to be a transient issue.

We're seeing similar issues for coordinator emails (T350184), and received further confirmation today that access codes for another partner are not being delivered correctly.

Event Timeline

Samwalton9-WMF renamed this task from Some emails aren't being sent correctly from The Wikipedia Library to Some emails from The Wikipedia Library aren't being received as expected. Nov 30 2023, 3:49 PM

I used some information provided by @jsn.sherman to look for things in the exim4 logs on mx-out03.cloudinfra.eqiad1.wikimedia.cloud (the instance behind the mx-out03.wmcloud.org service name). It looks like the issue here is missing or invalid SPF or DKIM information for the "noreply@wikipedialibrary.wmflabs.org" sender address that is being used. Here's a lightly redacted set of logs for a representative message that failed to deliver:

2023-11-24 00:40:03 1r6KEl-0004G4-Ne <= noreply@wikipedialibrary.wmflabs.org H=prod0.twl.eqiad1.wikimedia.cloud (fd3645edb59a) [172.16.5.144]:47122 I=[172.16.6.237]:25 P=esmtp S=2216 id=170078640372.3660.2948080117478532382@fd3645edb59a

2023-11-24 00:40:04 1r6KEl-0004G4-Ne ** REDACTED@gmail.com R=dnslookup_unsigned T=remote_smtp_unsigned H=gmail-smtp-in.l.google.com [172.253.63.27] I=[172.16.6.237] X=TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=yes DN="CN=mx.google.com": SMTP error from remote mail server after pipelined end of data: 550-5.7.26 This mail has been blocked because the sender is unauthenticated.
550-5.7.26 Gmail requires all senders to authenticate with either SPF or DKIM.
550-5.7.26
550-5.7.26  Authentication results:
550-5.7.26  DKIM = did not pass
550-5.7.26  SPF [wikipedialibrary.wmflabs.org] with ip: [185.15.56.18] = did not
550-5.7.26 pass
550-5.7.26
550-5.7.26  To mitigate this issue, please visit Gmail's authentication guide
550-5.7.26 for instructions on setting up authentication:
550 5.7.26  https://support.google.com/mail/answer/81126#authentication i17-20020a05620a145100b007765135d96asi1949489qkl.390 - gsmtp

2023-11-24 00:40:04 1r6KEm-0004Gr-Cj <= <> R=1r6KEl-0004G4-Ne U=Debian-exim P=local S=5007

2023-11-24 00:40:04 1r6KEl-0004G4-Ne Completed

The 185.15.56.18 IPv4 that was checked for SPF is the public service IP of mx-out03.wmcloud.org.
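As a rough aid for spotting this kind of rejection in bulk, here is a minimal Python sketch that pulls the enhanced status code, the SPF domain, and the checked IP out of a bounce like the one above. The function name `parse_rejection` and the `BOUNCE` excerpt are illustrative assumptions, and the regular expressions are tuned only to the Gmail wording shown in this log.

```python
import re

# Excerpt of the multi-line SMTP response from the log above.
BOUNCE = """550-5.7.26 This mail has been blocked because the sender is unauthenticated.
550-5.7.26  DKIM = did not pass
550-5.7.26  SPF [wikipedialibrary.wmflabs.org] with ip: [185.15.56.18] = did not
550-5.7.26 pass"""


def parse_rejection(text):
    """Extract the enhanced status code, SPF domain and checked IP from a bounce."""
    # Strip the "550-5.7.26" / "550 5.7.26" prefixes so wrapped lines join up.
    body = " ".join(
        re.sub(r"^\d{3}[- ]\d\.\d+\.\d+\s*", "", line) for line in text.splitlines()
    )
    status = re.search(r"^\d{3}[- ](\d\.\d+\.\d+)", text, re.MULTILINE)
    spf = re.search(r"SPF \[([^\]]+)\] with ip: \[([^\]]+)\]", body)
    return {
        "status": status.group(1) if status else None,
        "spf_domain": spf.group(1) if spf else None,
        "checked_ip": spf.group(2) if spf else None,
    }


print(parse_rejection(BOUNCE))
```

The checked IP is the address the receiving server saw, which is why it is the public address of the relay rather than the 172.16.x.x address of the sending instance.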

There currently is no SPF record in DNS for the wikipedialibrary.wmflabs.org subdomain:

$ host -t txt wikipedialibrary.wmflabs.org
wikipedialibrary.wmflabs.org has no TXT record

There is an SPF record for our legacy wmflabs.org domain:

$ host -t txt wmflabs.org
wmflabs.org descriptive text "v=spf1 mx ip4:185.15.56.18 ip4:185.15.56.19 include:wikimedia.org ~all"

...and for the modern wmcloud.org domain:

$ host -t txt wmcloud.org
wmcloud.org descriptive text "v=spf1 ip4:185.15.56.18 ip4:185.15.56.19 ip4:208.80.152.0/22 ip6:2620:0:860::/46 ~all"

(aside: It feels like those records should be identical. I have a hunch that we have been managing them manually which leads to drift.)
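To make the drift concrete, a small Python sketch can diff the two records quoted above. It compares the terms as plain strings, which is an assumption of this sketch: it does not resolve `mx` or `include:` to see whether they end up covering the same addresses.

```python
def spf_terms(record):
    """Return the set of terms in an SPF record, ignoring the version tag."""
    return set(record.split()[1:])


wmflabs = spf_terms(
    "v=spf1 mx ip4:185.15.56.18 ip4:185.15.56.19 include:wikimedia.org ~all"
)
wmcloud = spf_terms(
    "v=spf1 ip4:185.15.56.18 ip4:185.15.56.19 ip4:208.80.152.0/22 "
    "ip6:2620:0:860::/46 ~all"
)

only_wmflabs = sorted(wmflabs - wmcloud)
only_wmcloud = sorted(wmcloud - wmflabs)
print("only in wmflabs.org:", only_wmflabs)
print("only in wmcloud.org:", only_wmcloud)
```

Both records share the two mx-out addresses and the `~all` policy; the difference is in how the wikimedia.org ranges are pulled in.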

I think changing the sender address to something like "no-reply.wikipedialibrary@wmcloud.org" would make Google's SPF check happy. Alternately an address matching some subdomain delegated to the twl Cloud VPS project could be used and an appropriate SPF record added for that subdomain via Horizon.

T47828: Implement mail aliases for Cloud-VPS projects (<novaproject>@wmcloud.org) is a semi-related feature request for Cloud VPS projects generally. If a <addr>@<project>.wmcloud.org variant of that were to be implemented it would be reasonable to also assume that SPF records would be made for each <project>.wmcloud.org subdomain. A lack of delegation of SPF record lookups to higher-level domains is a "feature" of the SPF protocol.
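The lack of inheritance can be illustrated with a simplified, offline SPF check in Python. This is a sketch under stated assumptions, not a full SPF implementation: `TXT_RECORDS` is a hypothetical stand-in for DNS filled with the `host -t txt` output above, and only `ip4`, `ip6`, and `all` are evaluated (`mx`, `a`, and `include` need live lookups and are skipped).

```python
import ipaddress

# Stand-in for DNS, copied from the `host -t txt` output above.
# The wikipedialibrary.wmflabs.org subdomain has no TXT record, so it is absent.
TXT_RECORDS = {
    "wmflabs.org": "v=spf1 mx ip4:185.15.56.18 ip4:185.15.56.19 include:wikimedia.org ~all",
    "wmcloud.org": "v=spf1 ip4:185.15.56.18 ip4:185.15.56.19 ip4:208.80.152.0/22 ip6:2620:0:860::/46 ~all",
}

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(domain, client_ip, records=TXT_RECORDS):
    """Simplified SPF check: exact-domain lookup, ip4/ip6/all mechanisms only."""
    record = records.get(domain)  # exact match only, no fallback to a parent domain
    if record is None or not record.startswith("v=spf1"):
        return "none"
    ip = ipaddress.ip_address(client_ip)
    for term in record.split()[1:]:
        qualifier = "+"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term == "all":
            return QUALIFIERS[qualifier]
        name, _, value = term.partition(":")
        if name in ("ip4", "ip6"):
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                return QUALIFIERS[qualifier]
        # mx, a, include, etc. would need DNS lookups and are skipped here.
    return "neutral"


print(check_spf("wikipedialibrary.wmflabs.org", "185.15.56.18"))
print(check_spf("wmcloud.org", "185.15.56.18"))
```

The first call finds no record for the subdomain even though wmflabs.org lists the same IP, which matches the failure in the log; the second passes on the explicit `ip4` entry.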

Thank you for looking into this! I think making the sender address change is the way to go: the less special config we need, the better. @Samwalton9-WMF do we need to make an announcement ahead of making such a change?

I think changing the sender address to something like "no-reply.wikipedialibrary@wmcloud.org" would make Google's SPF check happy. Alternately an address matching some subdomain delegated to the twl Cloud VPS project could be used and an appropriate SPF record added for that subdomain via Horizon.

A valid sender address would be a better idea so that if there are other issues in the future they would get reported to a mailbox somewhere. The logs on mx-out03 are also full of retry failure messages for attempts to tell noreply@wikipedialibrary.wmflabs.org that there were problems delivering messages. Unfortunately, I think using a valid email address would actually require the twl project to run its own email service instead of using mx-out03, as we do not currently have T47828: Implement mail aliases for Cloud-VPS projects (<novaproject>@wmcloud.org) or similar mail delivery for Cloud VPS projects.

That seems like something we're likely to do a poor job of maintaining. How are other vps projects handling outbound mail?

I should have looked more closely at the links you provided. I didn't realize that we could manage our own DNS zone. I've set up DMARC/DKIM/SPF for small orgs before, though I'm quite rusty. Since we're not using twl.wmflabs.org for mail right now, I should be able to iterate as needed without making things worse.

That seems like something we're likely to do a poor job of maintaining. How are other vps projects handling outbound mail?

Good question. It is all ad hoc today. Toolforge has its own domain, mail exchanges, and SPF record that are managed as part of the project. I would actually guess that the vast majority of Cloud VPS projects do not send outbound mail at all other than Puppet failure cronspam.

Thank you for looking into this! I think making the sender address change is the way to go: the less special config we need, the better. @Samwalton9-WMF do we need to make an announcement ahead of making such a change?

I can't think of any reason we'd need to - it should be fine to just make the switch.

I'll need to generate a DKIM key to follow the suggested fix, but for whatever reason I'm not currently able to shell into any of our Horizon instances; the web interface is also throwing errors, so after much fussing, I'll set this aside for now.

jsn.sherman changed the task status from Open to Stalled. Dec 1 2023, 3:42 PM

Apologies, we had a Cloud VPS network outage today. Things should be working again now!

The web interfaces for our cloud vps hosts are accessible, but I still can't shell in:

ssh -v prod.wikilink.eqiad1.wikimedia.cloud
OpenSSH_8.9p1 Ubuntu-3ubuntu0.4, OpenSSL 3.0.2 15 Mar 2022
[...]
debug1: Connecting to prod.wikilink.eqiad1.wikimedia.cloud [172.16.5.37] port 22.
debug1: connect to address 172.16.5.37 port 22: Connection timed out
ssh: connect to host prod.wikilink.eqiad1.wikimedia.cloud port 22: Connection timed out

In every case, the IP address shown in the SSH debug output matches what is shown in Horizon.

I'm able to reach the bastion with no problem. I tried restarting one of the hosts, but there was no change. I'm happy to move this over to T352539: 2023-12-01 Cloud VPS network outage if needed.

I can SSH to that VM (prod.wikilink.eqiad1.wikimedia.cloud), but there are some ongoing issues with Cloud VPS, please leave a comment in T352539: 2023-12-01 Cloud VPS network outage if this keeps on failing.

It just started working for me after quite a bit of fiddling and retrying.

jsn.sherman changed the task status from Stalled to In Progress. Dec 1 2023, 6:17 PM

Well, I don't seem to be able to create a TXT record in the twl.wmflabs.org zone, which means I can't add an SPF record. When I started thinking through the DKIM steps, I realized that would only work for the "run our own email service" option, which I don't think is a good one for us. For now, I'm going to implement the no-reply.wikipedialibrary@wmcloud.org sender to hopefully resolve the immediate delivery problem.

That change is now live; we'll need to see if delivery improves.

I tried sending an email from the Contact Us page but it never arrived, at either wikipedialibrary@ or my personal email. Djmail claims the email was sent.

Rolling back to the old address.

@Samwalton9 can you send yourself another contact email to see if we're back to the state we were before?

That arrived to the @wikimedia.org address as expected, though not to my personal email.

Is this a different outcome than when I sent you a test message last week?

Same result - the wikipedialibrary@wikimedia.org address receives it, but the individual does not.

If I understand @bd808's suggestions correctly, our current option for correcting email delivery would involve:

  • requesting and obtaining a floating ip for the project
  • attaching that to one of our servers
  • setting up an outbound mail service on that server
  • adding the appropriate dns records to allow those messages to pass dmarc checks

Do I have that right @bd808?

That seems like the basic steps to move outbound SMTP 100% into the control of your project, yes. If it is only outbound, I'm not sure actually that you would need a floating IP instead of letting the traffic be seen by the Internet as coming from the Cloud VPS NAT gateway.

That change is now live; we'll need to see if delivery improves.

I tried sending an email from the Contact Us page but it never arrived, at either wikipedialibrary@ or my personal email. Djmail claims the email was sent.

This was possibly one of the logged events similar to:

mainlog.3.gz:2023-12-01 23:40:24 1r9D7P-0007fL-J7 ** wikipedialibrary@wikimedia.org R=dnslookup_wmcloud_org_wmcs T=remote_smtp_wmcloud_org_wmcs H=mx2001.wikimedia.org [208.80.153.45] I=[172.16.6.237] X=TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=yes DN="CN=mx1001.wikimedia.org": SMTP error from remote mail server after RCPT TO:<wikipedialibrary@wikimedia.org>:
550-Verification failed for <no-reply.wikipedialibrary@wmcloud.org>
550-Address no-reply.wikipedialibrary@wmcloud.org does not exist
550 Sender verify failed

That is an MX for wikimedia.org rejecting an inbound message to wikipedialibrary@wikimedia.org because the envelope From of no-reply.wikipedialibrary@wmcloud.org was not acknowledged as a deliverable address by ... some host. We do not publish any MX records for the wmcloud.org domain, so I'm not completely sure what host mx{1001,2001} would have connected to when attempting to verify the envelope From address. I guess it would have ended up finding 185.15.56.49, which is the external interface of the wmcloud.org & wmflabs.org HTTP reverse proxy service. That really should be the same host that is found when attempting a deliverability check for "noreply@wikipedialibrary.wmflabs.org" as well. I feel like I'm getting to a "wait, how does any of this work" moment.

We talked about it in a planning meeting today, and we're going to try running our own deliveries for now while working out a plan to move most of our notifications away from email, as it's always been a complicated beast.

Do I have that right @bd808?

That seems like the basic steps to move outbound SMTP 100% into the control of your project, yes. If it is only outbound, I'm not sure actually that you would need a floating IP instead of letting the traffic be seen by the Internet as coming from the Cloud VPS NAT gateway.

@bd808 I don't actually seem to be able to create an SPF record with anything useful in it; I can create:

twl.wmflabs.org.	60	IN	TXT	"v=spf1"

but if I try to actually set any ip addresses or set a policy in the record, I get a 400 response back from openstack.

Horizon let me change that record to "v=spf1 +all" but then errored when I tried "v=spf1 mx +all". I then tried deleting that recordset and creating another using the "SPF" type, but it failed hard. I guess this needs a separate bug and some investigation of what if anything we can do to appease the strange and mysterious Designate service.

Created T352713: Horizon: cannot create/update an SPF DNS record

I'm thinking about other possible options. How is wikipedialibrary@wikimedia.org configured: is it a shared mailbox, a group, or something similar? Could we authenticate and send mail from that address, or set up a noreply-library wikimedia.org account for this purpose?

jsn.sherman changed the task status from In Progress to Stalled. Dec 5 2023, 1:21 PM

Okay, we now have an SPF record, but no DKIM. We did some manual mail tests via shell on prod and found that mail from the updated sender, noreply@twl.wmflabs.org, passes SPF and gets delivered to WMF and Gmail accounts. Since that's an improvement, I put a hotfix in the pipeline:
https://github.com/WikipediaLibrary/TWLight/compare/b0ffec8f51835885c5515dda00f587749ad9b373..c445d9a63792041337d8bad583f538486519e855

I generated a DKIM key pair, added the private key to our Docker swarm secrets, and added a DNS record for the public key.
PR to enable signing here:
https://github.com/WikipediaLibrary/TWLight/pull/1231

Moving back to in progress: I decided I can make this more testable by implementing the console DKIM email provider with a "public secret" dev key. I also set up a DKIM keypair and DNS entry for staging. I've added the config for staging to the PR, though I've left the default mail provider as console. I'll shell in and make some local changes to trigger an email to myself via the Django app. It won't be a perfect test, but it's as close as we'll get.

This has been really useful; I've fixed a number of bugs in my PR and also discovered that I needed to add MX records as well.

I had to fiddle to get the right key formats:

ssh-keygen -t rsa -b 1024 -m PEM -P "" -f dkim
ssh-keygen -f dkim.pub -m PKCS8 -e > dkim.pub.pkcs8

This gives us a private key (dkim) in the format our app expects and a public key (dkim.pub.pkcs8) in the format needed for the DNS records.
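For reference, turning the exported public key into a TXT value can be sketched in Python as below. The helper name `dkim_txt_value` is hypothetical, the PEM body is a placeholder rather than a real key, and the `v=DKIM1;t=s;p=` layout is simply the one used in the records on this task. It assumes the export is a PEM block with `BEGIN PUBLIC KEY` / `END PUBLIC KEY` markers.

```python
import base64


def dkim_txt_value(public_key_pem, flags="s"):
    """Build the TXT value for a DKIM selector from a PEM 'PUBLIC KEY' block."""
    lines = [line.strip() for line in public_key_pem.strip().splitlines()]
    if lines[0] != "-----BEGIN PUBLIC KEY-----" or lines[-1] != "-----END PUBLIC KEY-----":
        raise ValueError("expected a PEM-encoded PUBLIC KEY block")
    # The DNS record carries the base64 body on one line, without the PEM markers.
    key_b64 = "".join(lines[1:-1])
    base64.b64decode(key_b64, validate=True)  # raises if the body is not valid base64
    return f"v=DKIM1;t={flags};p={key_b64}"


# Placeholder body for illustration only; a real dkim.pub.pkcs8 goes here.
placeholder_body = base64.b64encode(b"placeholder").decode()
example_pem = (
    "-----BEGIN PUBLIC KEY-----\n"
    f"{placeholder_body}\n"
    "-----END PUBLIC KEY-----\n"
)
print(dkim_txt_value(example_pem))
```

The resulting string is what goes into the `<selector>._domainkey` TXT record for the zone.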

Okay, I was able to successfully send from staging. The DNS records have been made and the docker secret for the signing key has been added.
https://github.com/WikipediaLibrary/TWLight/pull/1231 is ready for review.

Here are the DNS records we ended up with for the twl.wmflabs.org zone:

twl.wmflabs.org.			60	IN	TXT	"v=spf1 a:mx-out03.wmcloud.org a:nat.cloudgw.eqiad1.wikimediacloud.org ip4:185.15.56.0/22 ~all"
twl.wmflabs.org.			60	IN	MX	10 mx1001.wikimedia.org.
twl.wmflabs.org.			60	IN	MX	50 mx2001.wikimedia.org.
staging._domainkey.twl.wmflabs.org. 	60	IN	TXT	"v=DKIM1;t=s;p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQC5PZBVrRGiXETcgPehD2Ih6voYfbMhSZMANXN5o7p6SrT4h5JHnMFu6mR4+bna7RQh2swwuPU4ZyPoRp4XmQ2P1E9DCMZAfHkYzESO259O/O5OOG9d7ypgQYWLfGayB65ZHG8gQPLmuvf0JAFqua5zx9Yc43L92qFFFTsh8rGHpQIDAQAB"
prod._domainkey.twl.wmflabs.org.	60	IN	TXT	"v=DKIM1;t=s;p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCpcQDsm6Pf/aeXPjl2jg+CQwEvfIzExUe7esNSftIqwKR5mlGa5a94cewcauhD8m3z8/sDAvF8ddQqOyKQytnU2/eWPLF9EbUbW3T84FNYTCLEEjC/OnU4E/lv1ajG4Teqp7mg8dUkxzt10GaJj/w+9a8AINiGH0KRKwgwD4rxvwIDAQAB"

I'll shorten the TTL once we get the change deployed and are confident that everything is working as intended.

It occurs to me that we should test this with personal email accounts too.

Okay, I was able to successfully send from staging. The DNS records have been made and the docker secret for the signing key has been added.
https://github.com/WikipediaLibrary/TWLight/pull/1231 is ready for review.

How do you feel about this solution in terms of longevity? It seems simpler than some of the solutions we were talking about before, but I don't want to make any assumptions.

Unfortunately, it still leaves us blind to delivery problems. This was the minimal solution to get mail flowing, but it doesn't do all of the things email should really do to be robust. @bd808 proposed 3 different solutions:

I think changing the sender address to something like "no-reply.wikipedialibrary@wmcloud.org" would make Google's SPF check happy.

didn't work

Alternately an address matching some subdomain delegated to the twl Cloud VPS project could be used and an appropriate SPF record added for that subdomain via Horizon.

This is what I'm doing for now, plus dkim signing

A valid sender address would be a better idea so that if there are other issues in the future they would get reported to a mailbox somewhere. The logs on mx-out03 are also full of retry failure messages for attempts to tell noreply@wikipedialibrary.wmflabs.org that there were problems delivering messages. Unfortunately, I think using a valid email address would actually require the twl project to run its own email service instead of using mx-out03

This would be the full email solution, since it would give us a return path for email. It's what we should do if email is going to stay our main channel of sending notifications. We should change the way we're notifying people instead of doing this.

It occurs to me that we should test this with personal email accounts too.

I successfully received a signed email sent to a Gmail account. Note that the contact form option to send the submitter a copy is broken, but that's an app bug, not an email infrastructure problem. I'll find or file a ticket for it.

I do think I can clean up this SPF record a bit, which I should do before we're using this in production:

twl.wmflabs.org.			60	IN	TXT	"v=spf1 a:mx-out03.wmcloud.org a:nat.cloudgw.eqiad1.wikimediacloud.org ip4:185.15.56.0/22 ~all"

Unfortunately, it still leaves us blind to delivery problems.

@bd808
I'd like to set up DMARC reports, which would catch some of these. Because of the way we're delegating email sending, we'd need to set up external destination verification (EDV). We have an address at lists.wikimedia.org, which is where an EDV record would need to go if we wanted to receive reports there. Is that a thing we can request? Because that's on the wikimedia.org domain, I'm not sure who to go to about it. I think it would make a lot of sense for wikimedia.org to have an EDV record in place for wmflabs.org. I think it would be something like

*.wmflabs.org._report._dmarc.wikimedia.org.			60	IN	TXT	"v=DMARC1"

or maybe

*.wmflabs.org._report._dmarc.lists.wikimedia.org.		60	IN	TXT	"v=DMARC1"

would do the trick
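As I read RFC 7489 section 7.1, the verification record is looked up under the host part of the report address, with the policy domain and `_report._dmarc` prepended, which would point at the second form if the reports go to an address at lists.wikimedia.org. A minimal Python sketch of that naming rule follows; the helper name `edv_record_name` and the mailbox `reports@lists.wikimedia.org` are hypothetical.

```python
from urllib.parse import urlparse


def edv_record_name(policy_domain, rua_uri):
    """Name of the TXT record that authorizes external DMARC report delivery."""
    parsed = urlparse(rua_uri)
    if parsed.scheme != "mailto" or "@" not in parsed.path:
        raise ValueError("expected a mailto: URI")
    # Drop an optional size limit such as "!10m" before taking the host part.
    address = parsed.path.split("!", 1)[0]
    destination_host = address.rsplit("@", 1)[1].lower()
    return f"{policy_domain.lower()}._report._dmarc.{destination_host}"


# Hypothetical report mailbox, used only to show the resulting record name.
print(edv_record_name("twl.wmflabs.org", "mailto:reports@lists.wikimedia.org"))
```

A wildcard such as the ones proposed above would then need to cover the name this returns for each sending subdomain.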

Okay, I cleaned up our spf record and verified that it's happy:

twl.wmflabs.org.	60	IN	TXT	"v=spf1 a:mx-out03.wmcloud.org a:mx-out04.wmcloud.org ~all"

@Scardenasmolinar the patch is ready for review, dmarc reports can come later

PR has been merged. Sticking in QA for further tests

I was able to verify delivery to a wikimedia address. The headers all look good, though we'll need to wait to hear about other domains from users.

Unfortunately, it still leaves us blind to delivery problems.

@bd808
I'd like to set up DMARC reports, which would catch some of these. ... Is that a thing we can request?

I'd say broadly, yes it can be requested. I think you will need help from an SRE and ideally one who understands something about DMARC, SPF, etc. Step 1 is probably splitting out a specific child task about the goal. @herron has helped with some similar things in the past under the T249237: Fix Cloud VPS and Toolforge mail servers to work with the modern internet umbrella, so maybe he can at least suggest someone to bother?

jsn.sherman changed the task status from Stalled to In Progress. Dec 6 2023, 6:46 PM

[...] note that the contact form option to send the submitter a copy is broken, but that's an app bug, not an email infrastructure problem. I'll find or file a ticket for it. [...]

Actually, this seems to be working just fine now.

Rather, it worked for me on staging, when I directly set my email address as the recipient instead of wikipedialibrary@wikimedia.org. It does not seem to be working in production. @Samwalton9-WMF what can you tell me about that wikipedialibrary@wikimedia.org email address: is it an email account, a google group, or something else?

That is an MX for wikimedia.org rejecting an inbound message to wikipedialibrary@wikimedia.org because the envelope From of no-reply.wikipedialibrary@wmcloud.org was not acknowledged as a deliverable address by ... some host. We do not publish any MX records for the wmcloud.org domain, so I'm not completely sure what host mx{1001,2001} would have connected to when attempting to verify the envelope From address. I guess it would have ended up finding 185.15.56.49, which is the external interface of the wmcloud.org & wmflabs.org HTTP reverse proxy service. That really should be the same host that is found when attempting a deliverability check for "noreply@wikipedialibrary.wmflabs.org" as well. I feel like I'm getting to a "wait, how does any of this work" moment.

This turned out to be a misconfiguration on the wikimedia.org mail servers which I've now fixed, so it's now possible to send mail from wmcloud.org to wikimedia.org addresses.

So, I've verified that wikipedialibrary@wikimedia.org is a Google group, and that we basically can't cc users when the group is the recipient.
I could hotfix out the cc option on the form, or I could do a slightly larger PR. We talked about replacing the form with just a link to our email info. @Samwalton9-WMF, what if I just made the contact info card the main content for this page? That would be a pretty straightforward PR, and we could skip an intermediate step.

As an outcome of this work we have decided to move away from emails as our method of sharing information and notifying users about library events. Further details in T353476.

Closing out since we've had consistent message delivery for a couple of weeks now. I've increased the TTL to 1 hour for our email-related DNS records since we're no longer rapidly iterating. I also cleaned out some now-unused records that were created along the way. I also documented the email signing on our server setup page:
https://github.com/WikipediaLibrary/TWLight/wiki/Debian-Server-setup
Feel free to reopen if we start having widespread email delivery issues again.