Fix Wikipedia Library survey emails - implement rate limit and set correct headers
Closed, Declined · Public

Description

It seems like prod0.twl.eqiad1.wikimedia.cloud has started sending emails at far too high a rate, causing the queue on our mail server to slow down. They're also sent without the standard Auto-Submitted: auto-generated header, and with an envelope-from address that causes bounces and similar messages to be sent to the Cloud VPS admins.

For now your mails are being blocked at the mail server level until the problems are fixed at your end.

Event Timeline

Restricted Application added a subscriber: Sadads.

Yes, we tried sending a survey out to our users. Clearly we need some backoff retry logic in addition to adding that header. Is the rate limit documented somewhere?

Thanks for posting this task in the log messages, by the way; I cancelled that run and stalled the survey task (T409420) on this one.

I would ask that you unblock us since I've cancelled out the survey run, which is manually triggered. That way our regular account communications can still go through while we fix the issues ahead of trying the survey again.

It seems like prod0.twl.eqiad1.wikimedia.cloud has started sending emails at far too high a rate, causing the queue on our mail server to slow down.

@taavi what is an appropriate rate to send messages? I poked around the docs, but didn't see a recommended rate.
https://wikitech.wikimedia.org/wiki/Help:Email_in_Cloud_VPS

jsn.sherman changed the task status from Open to In Progress. Dec 11 2025, 6:39 PM
jsn.sherman claimed this task.
jsn.sherman triaged this task as Unbreak Now! priority.
jsn.sherman moved this task from Ready to In Progress on the Moderator-Tools-Team (Kanban) board.

They're also sent [...] using an envelope-from address causing bounces and such to be sent to the Cloud VPS admins.

So far as the envelope-from address is concerned, we switched to root@wmflabs.org a few years ago while working with y'all on T347512 (Some emails from The Wikipedia Library aren't being received as expected), because the address we were setting was getting flagged as undeliverable by some host in the delivery path. I'd be happy to set it to an address we monitor, like wikipedialibrary@wikimedia.org, if that would be valid / deliverable. Can you advise on what our options are?

I would ask that you unblock us since I've cancelled out the survey run, which is manually triggered. That way our regular account communications can still go through while we fix the issues ahead of trying the survey again.

Done. Please do not resume the survey mailings without approval from us first.

For rate limits: our current system is primarily intended for transactional mail, where a specific strict rate limit isn't really applicable. Bulk mailings (like your survey) have very different characteristics in terms of mail deliverability compared to transactional mail. (For example, implementing one-click unsubscriptions with List-Unsubscribe: is much more important for bulk mail.) I can't give you an exact number without having more details on your campaign, but in general best practices include starting slow and ramping up the rate over time.

So far as the envelope-from address is concerned, we switched to root@wmflabs.org a few years ago while working with y'all on T347512 (Some emails from The Wikipedia Library aren't being received as expected), because the address we were setting was getting flagged as undeliverable by some host in the delivery path. I'd be happy to set it to an address we monitor, like wikipedialibrary@wikimedia.org, if that would be valid / deliverable. Can you advise on what our options are?

You can set the address you monitor as the Reply-To: header while keeping the From and envelope-from addresses using a Cloud VPS domain. (Although do consider moving to wmcloud.org at some point!)

Samwalton9-WMF renamed this task from prod0.twl.eqiad1.wikimedia.cloud started sending several emails per second to Fix Wikipedia Library survey emails - implement rate limit and set correct headers. Dec 16 2025, 9:54 AM
Samwalton9-WMF lowered the priority of this task from Unbreak Now! to High.

Thanks Taavi, and apologies for not running this by you first!

For this campaign, we planned to survey all Wikipedia Library users. That's theoretically ~60,000 emails, though we have some constraints as defined in T409420 that I would estimate probably lower that by up to 25%.

If I understand correctly the open tasks here, we need to:

  • Set a List-Unsubscribe header. From my quick reading online it looks like we want to use the mailto: method, since we have no URL for users to POST to - we're just collecting emails manually and recording them for future mailings.
  • Set an Auto-Submitted: auto-generated header.
  • Set Reply-To: to wikipedialibrary@wikimedia.org
  • Implement a rate limit - we'd appreciate some guidance here, even if it's just a ballpark figure or range to aim for. These emails don't need to be delivered rapidly, but obviously we'd like to get through them as quickly as is permissible.
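
The three header items in the list above could be sketched roughly as follows, using only Python's standard library. This is a minimal illustration, not the djmail/TWLight implementation: the function name `build_survey_message`, the placeholder sender address, and the choice of wikipedialibrary@wikimedia.org as the mailto: unsubscribe target are all assumptions made for the example.

```python
from email.message import EmailMessage

# Monitored address named in this thread; reusing it as the mailto:
# unsubscribe target is an assumption made for this sketch.
REPLY_TO = "wikipedialibrary@wikimedia.org"


def build_survey_message(from_addr, to_addr, subject, body):
    """Build one survey mail carrying the headers discussed in this task."""
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = subject
    # Human replies and autoresponders go to the monitored address.
    msg["Reply-To"] = REPLY_TO
    # Marks the mail as machine-generated (RFC 3834).
    msg["Auto-Submitted"] = "auto-generated"
    # mailto: form of List-Unsubscribe, since there is no URL to POST to.
    msg["List-Unsubscribe"] = f"<mailto:{REPLY_TO}?subject=unsubscribe>"
    msg.set_content(body)
    return msg


# Hypothetical addresses, for illustration only.
msg = build_survey_message(
    "sender@cloud-vps.example",
    "user@example.org",
    "Wikipedia Library survey",
    "Survey text goes here.",
)
print(msg["List-Unsubscribe"])
```

Note that these are message headers only; the envelope sender is set separately at SMTP time and is not affected by anything in this sketch.
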

Thanks for this @Samwalton9-WMF!

If I understand correctly the open tasks here, we need to:

  • Set a List-Unsubscribe header. From my quick reading online it looks like we want to use the mailto: method, since we have no URL for users to POST to - we're just collecting emails manually and recording them for future mailings.

I have a thought here: we should add a new user profile option to opt out of survey emails. Then we could link to the user profile page.

  • Set an Auto-Submitted: auto-generated header.
  • Set Reply-To: to wikipedialibrary@wikimedia.org
  • Implement a rate limit - we'd appreciate some guidance here, even if it's just a ballpark figure or range to aim for. These emails don't need to be delivered rapidly, but obviously we'd like to get through them as quickly as is permissible.

I'll throw out a suggestion here:

  • start with 1 email per second
  • each minute, ramp the rate by 1 email per second, to a maximum of 10 emails per second
  • upon receiving smtp 550, pause for 1 minute and reset rate to 1 email per second before resuming

what do you think @taavi?

I have a thought here: we should add a new user profile option to opt out of survey emails. Then we could link to the user profile page.

I agree in general, but for expediency we're happy collecting emails for now (we received 6 requests from what looks like ~5000 emails sent) and then we could upload these in the future if we add such a field/option.

Another thought: we could be batching these using BCC.

Is there any guidance on the maximum length of a BCC list for our delivery infrastructure?

For a lot of systems, it's ~500ish

Oh yeah, I meant to comment on:

In T412427#11463180, @Samwalton9-WMF wrote:

[...]

For this campaign, we planned to survey all Wikipedia Library users. That's theoretically ~60,000 emails, though we have some constraints as defined in T409420 that I would estimate probably lower that by up to 25%.

When I ran tests against production data, it was a little under ~40k

okay, I'm forking our email middleware and implementing changes there.
https://github.com/WikipediaLibrary/djmail/tree/Jsn.sherman/T412427

When that's good, we'll update our requirements to use it:
https://github.com/WikipediaLibrary/TWLight/pull/1498

I agree in general, but for expediency we're happy collecting emails for now (we received 6 requests from what looks like ~5000 emails sent) and then we could upload these in the future if we add such a field/option.

Unfortunately that's not including the couple dozen bounces that we got in via root@ that made me notice this in the first place.

So, the envelope sender/From header is still a bit of an open problem:

  • The envelope sender and From: must be on the same domain for deliverability reputation reasons.
  • That domain in use must also pass SPF+DKIM, which restricts us to wmcloud.org (or wmflabs for legacy stuff, but that's not applicable here).
  • Bounce messages are sent to the envelope sender, not to reply-to (which is used by humans and autoresponders), which means:
  • The envelope sender address must not bounce on its own, which in practice means using root@ or some of the other aliases on that domain routed to some WMCS SREs.

We've been tolerating the occasional bounces sent to root@, but for a mass mailing of this size it would be much better to send those to you, so that you can remove the addresses from your lists. It is unlikely that we'll find time to engineer a proper, general solution for that anytime soon, but I'll check if we can figure out some manual hack for this.

I'll throw out a suggestion here:

  • start with 1 email per second
  • each minute, ramp the rate by 1 email per second, to a maximum of 10 emails per second

Sending tens of thousands of messages per hour is way too high for a sender with no history of bulk mailing. Alternative proposal:

  • Start with one mail every 30 seconds
  • Ramp up the rate by 10% every hour, up to one mail every five seconds
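
To get a feel for what this schedule implies, here is a minimal sketch that computes the delay between mails for each hour and estimates how long a run would take. It reads "ramp up the rate by 10% every hour" as compounding growth, which is an assumption; the function names are hypothetical, and the 40,000 total is only the rough figure mentioned earlier in this thread.

```python
def interval_for_hour(hour, start_interval=30.0, min_interval=5.0, growth=1.10):
    """Seconds to wait between mails during the given hour (0-based).

    The sending rate grows by `growth` each hour (compounding), so the
    interval shrinks by the same factor until it reaches `min_interval`.
    """
    return max(start_interval / growth ** hour, min_interval)


def hours_to_send(total, **schedule):
    """Whole hours needed to send `total` mails under the ramp schedule."""
    sent = 0.0
    hour = 0
    while sent < total:
        # Mails that fit into this hour at the current interval.
        sent += 3600.0 / interval_for_hour(hour, **schedule)
        hour += 1
    return hour


for h in (0, 6, 12, 24):
    print(f"hour {h}: one mail every {interval_for_hour(h):.1f} s")
print("hours for 40000 mails:", hours_to_send(40_000))
```

Under these assumptions a run of that size takes on the order of days rather than hours, which is consistent with the point that these mails do not need to be delivered rapidly.
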
  • upon receiving smtp 550, pause for 1 minute and reset rate to 1 email per second before resuming

A 550 is not actually a very useful signal in this case, since all you're seeing with that is whether the emails were added to the queue of our mail relays, not whether they are actually being delivered. I quickly created https://grafana.wmcloud.org/d/291bdc44-a77a-41df-a7f6-a6087c68a286/mail to show the basic queue state for humans at least.

Another thought: we could be batching these using BCC.

This provides no advantage here. Please do not.

Also, how have you validated the addresses users have given to you? From the brief batch of mails that was sent out last week, I can see several total nonsense addresses that could have never worked (think of [something]@en.wikipedia.org or similar) so I'm not sure how those got to your list?

These are shared with us via Meta-Wiki OAuth and sync on login by default. Users may disable the sync and change the address, and we just do a basic format validation in that case. We don't do any kind of deliverability checks.
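
As an illustration of a cheap pre-send filter that sits between basic format validation and real deliverability checks, the sketch below rejects addresses whose domain is on a small denylist. Everything here is hypothetical: the function name, the regular expression, and the denylist contents (seeded only with the en.wikipedia.org style of address mentioned above) are assumptions, and this does not verify that a mailbox actually exists.

```python
import re

# Very loose shape check: something@domain.tld with no spaces or extra "@".
_BASIC_FORMAT = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical denylist of domains assumed never to accept user mail.
UNDELIVERABLE_DOMAINS = ("wikipedia.org",)


def looks_sendable(address):
    """Return True if the address passes the format and denylist checks."""
    if not _BASIC_FORMAT.match(address):
        return False
    domain = address.rsplit("@", 1)[1].lower()
    # Reject the listed domains and any of their subdomains.
    return not any(
        domain == bad or domain.endswith("." + bad)
        for bad in UNDELIVERABLE_DOMAINS
    )


candidates = ["user@example.org", "someone@en.wikipedia.org", "not-an-address"]
print([a for a in candidates if looks_sendable(a)])
```

A filter like this only removes addresses that are known in advance to be useless; bounces from the actual mailing would still be needed to clean the rest of the list.
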

So, the envelope sender/From header is still a bit of an open problem:

  • The envelope sender and From: must be on the same domain for deliverability reputation reasons.
  • That domain in use must also pass SPF+DKIM, which restricts us to wmcloud.org (or wmflabs for legacy stuff, but that's not applicable here).
  • Bounce messages are sent to the envelope sender, not to reply-to (which is used by humans and autoresponders), which means:
  • The envelope sender address must not bounce on its own, which in practice means using root@ or some of the other aliases on that domain routed to some WMCS SREs.

We've been tolerating the occasional bounces sent to root@, but for a mass mailing of this size it would be much better to send those to you, so that you can remove the addresses from your lists. It is unlikely that we'll find time to engineer a proper, general solution for that anytime soon, but I'll check if we can figure out some manual hack for this.

This is good to know, I thought I had misunderstood how the reply-to was treated and that there was some kind of behind-the-scenes magic I was unaware of previously.

I'll throw out a suggestion here:

  • start with 1 email per second
  • each minute, ramp the rate by 1 email per second, to a maximum of 10 emails per second

Sending tens of thousands of messages per hour is way too high for a sender with no history of bulk mailing. Alternative proposal:

  • Start with one mail every 30 seconds
  • Ramp up the rate by 10% every hour, up to one mail every five seconds

Helpful! I didn't understand what rough order of magnitude to shoot for.

  • upon receiving smtp 550, pause for 1 minute and reset rate to 1 email per second before resuming

A 550 is not actually a very useful signal in this case, since all you're seeing with that is whether the emails were added to the queue of our mail relays, not whether they are actually being delivered. I quickly created https://grafana.wmcloud.org/d/291bdc44-a77a-41df-a7f6-a6087c68a286/mail to show the basic queue state for humans at least.

oh yeah, I can see how little context we're getting.

Another thought: we could be batching these using BCC.

This provides no advantage here. Please do not.

Noted

Update: it's pretty clear that this bulk mailing is off-label for the Cloud VPS infrastructure. For this survey, we're going to try using the MediaWiki email user API instead. It's something we have previously explored, so we have some PoC code on hand already.

Thanks, that seems like a good approach if you want to get the mails out quickly, before the holiday break. I think we would be happy to talk about a long-term solution afterwards but, again, that will take some amount of effort from our side, so it will take some time to make happen.