Figure out average outgoing mail rates, then set up rate limiting in exim to enforce them.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Restricted Task | | | |
| Open | | None | T249237 Fix Cloud VPS and Toolforge mail servers to work with the modern internet |
| Resolved | | aborrero | T175964 Set up mail rate limiting for tools-mail |
Event Timeline
Running ratelimit.pl against the past 10 days' worth of exim logs, these appear to be the top 20 senders by rate. With addresses removed, it's more or less the high-water mark by hour and by day.
cat topsenderavg_hourly
17 : redacted@wikimedia.org
19 : redacted@tools.wmflabs.org
21 : redacted@tools.wmflabs.org
22 : redacted@tools.wmflabs.org
22 : redacted@tools.wmflabs.org
23 : redacted@tools.wmflabs.org
25 : redacted@tools.wmflabs.org
26 : redacted@tools.wmflabs.org
46 : redacted@tools.wmflabs.org
48 : redacted@tools.wmflabs.org
48 : redacted@tools.wmflabs.org
57 : redacted@tools.wmflabs.org
75 : redacted@tools.wmflabs.org
147 : redacted@tools.wmflabs.org
226 : redacted@tools.wmflabs.org
232 : redacted@tools.wmflabs.org
294 : redacted@tools.wmflabs.org
45772 : <>
54646 : redacted@tools.wmflabs.org

cat topsenderavg_daily
146 : redacted@tools.wmflabs.org
147 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
149 : redacted@tools.wmflabs.org
186 : redacted@tools.wmflabs.org
187 : redacted@tools.wmflabs.org
188 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
190 : redacted@tools.wmflabs.org
228 : redacted@tools.wmflabs.org
244 : redacted@tools.wmflabs.org
300 : redacted@tools.wmflabs.org
61079 : <>
147773 : redacted@tools.wmflabs.org
The bottom two in each case relate to the recent spam event. Many of the empty envelopes are from bounces but interestingly some appear to have originated from tools-exec hosts as well.
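For anyone wanting to reproduce this kind of per-sender high-water mark without ratelimit.pl, here is a rough Python sketch. It is not ratelimit.pl's algorithm; it simply counts message arrivals (the `<=` lines in the exim main log) per envelope sender per clock hour and reports the busiest hour. The function names and the sample log lines are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Exim main log arrival line: "<date> <time> <message-id> <= <envelope sender> ..."
# Bounces show up with the empty envelope sender "<>".
ARRIVAL = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2} \S+ <= (\S+)")


def peak_hourly_rates(lines):
    """Return {sender: highest number of arrivals seen in any one clock hour}."""
    per_hour = defaultdict(Counter)  # sender -> Counter keyed by (day, hour)
    for line in lines:
        match = ARRIVAL.match(line)
        if match is None:
            continue  # deliveries, warnings, and other non-arrival lines
        day, hour, sender = match.groups()
        per_hour[sender][(day, hour)] += 1
    return {sender: max(counts.values()) for sender, counts in per_hour.items()}


def format_report(rates):
    """Format rates lowest first, in the same 'count : sender' layout as above."""
    ordered = sorted(rates.items(), key=lambda item: item[1])
    return [f"{count} : {sender}" for sender, count in ordered]


# Hypothetical sample lines, only to show the expected input shape.
SAMPLE_LOG = [
    "2017-09-15 10:01:02 1dsABC-0000aa-01 <= tool-a@tools.wmflabs.org U=tool-a P=local S=1024",
    "2017-09-15 10:15:00 1dsABD-0000ab-02 <= tool-a@tools.wmflabs.org U=tool-a P=local S=512",
    "2017-09-15 11:00:01 1dsABE-0000ac-03 <= tool-a@tools.wmflabs.org U=tool-a P=local S=700",
    "2017-09-15 10:30:00 1dsABF-0000ad-04 <= <> R=1dsABC-0000aa-01 U=Debian-exim P=local S=2048",
    "2017-09-15 10:30:05 1dsABF-0000ad-04 => someone@example.org R=dnslookup T=remote_smtp",
]

if __name__ == "__main__":
    for row in format_report(peak_hourly_rates(SAMPLE_LOG)):
        print(row)
```

A daily figure would follow the same pattern with the bucket keyed by day only.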
How about rate limiting at 100 messages per sender address per hour for starters?
That seems on the surface to be pretty reasonable. A couple of questions:
- When the limit is hit, will outbound message delivery be throttled or will the submission of new messages be blocked?
- Would we have a way to give a particular sender a higher limit if good cause could be shown?
In T175964#3611065, @bd808 wrote:
- When the limit is hit, will outbound message delivery be throttled or will the submission of new messages be blocked?
Since it's an ACL attribute, there is lots of flexibility in what happens when the limit is reached. To me, throttling makes sense. When a sender exceeds the rate limit we could issue a 4xx temporary error until the time period refreshes. The sending MTA would queue the message and attempt delivery again later. With queue monitoring in place, alerts would then fire if a client mail queue exceeds the "normal" threshold. This should detect and help mitigate future issues similar to T175837.
OTOH a downside to temp fail is that large quantities of mail would be slowed down but not blocked. Flying under the radar you could get ~2400 messages out per day (100 per hour, around the clock). We could layer on some additional longer-term limits with more severe actions, but I think I'd want to live with temp fails first before actually dropping mail.
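To make the temp-fail behaviour and the ~2400/day ceiling concrete, here is a small Python simulation. It is a simplified fixed-window counter written for illustration only; it does not reproduce how exim's `ratelimit` condition measures rates (the exim spec describes a smoothed average), and the class name, reply strings, and sender address are all hypothetical.

```python
from collections import defaultdict

ACCEPT = "250 OK"
DEFER = "451 sender rate limit exceeded, try again later"  # 4xx: temporary failure


class HourlyLimiter:
    """Fixed-window limiter: at most `limit` accepted messages per sender per window."""

    def __init__(self, limit=100, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self.accepted = defaultdict(int)  # (sender, window index) -> accepted count

    def check(self, sender, now):
        """Return the SMTP-style reply for one submission at time `now` (seconds)."""
        bucket = (sender, int(now // self.window_seconds))
        if self.accepted[bucket] >= self.limit:
            # Deferred mail stays queued on the sending host and is retried later.
            return DEFER
        self.accepted[bucket] += 1
        return ACCEPT


def max_accepted_per_day(limit_per_hour=100):
    """Upper bound for a sender that fills every hourly window."""
    return limit_per_hour * 24


if __name__ == "__main__":
    limiter = HourlyLimiter(limit=100)
    # 150 submissions within the first hour, one per second.
    replies = [limiter.check("tool-a@tools.wmflabs.org", now=t) for t in range(150)]
    print(replies.count(ACCEPT), "accepted,", replies.count(DEFER), "deferred")
    # A retry in the next window goes through again.
    print(limiter.check("tool-a@tools.wmflabs.org", now=3600))
    print(max_accepted_per_day(100), "messages per day at most")
```

The point of the sketch is that deferral only delays mail: a sender who keeps retrying still gets a full window's worth accepted every hour, which is why longer-term limits might be layered on later.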
In T175964#3611065, @bd808 wrote:
- Would we have a way to give a particular sender a higher limit if good cause could be shown?
Sure, a simple way would be to maintain a list of senders who can skip the rate limit acl. Or we could get more complicated with a lookup config for individualized custom rate limits if necessary.
https://gerrit.wikimedia.org/r/#/c/379239/ created for the hourly rate limit implementation. I went ahead and set it up with a file lookup for overrides and for the default limit as well. I also included a per-host rate limit and took a stab at a default there. The ACL's action is warn initially, so this can run for a period of time to validate our assumptions before deferring mail.
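As an illustration of the "file lookup for overrides plus a default" idea, here is a Python sketch. The file layout shown (one `key: limit` pair per line, with `*` as the default entry) is an assumption made for this example and is not taken from the patch; the function names and the addresses are hypothetical too.

```python
def parse_limits(text):
    """Parse 'key: limit' lines into a dict, skipping blanks and '#' comments."""
    limits = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        limits[key.strip().lower()] = int(value.strip())
    return limits


def limit_for(sender, limits, fallback=100):
    """Per-sender override if present, else the '*' entry, else `fallback`."""
    return limits.get(sender.lower(), limits.get("*", fallback))


# Hypothetical override file contents.
SAMPLE_LIMITS = """
# messages per hour
*: 100
busy-tool@tools.wmflabs.org: 500
"""

if __name__ == "__main__":
    table = parse_limits(SAMPLE_LIMITS)
    print(limit_for("busy-tool@tools.wmflabs.org", table))   # sender with an override
    print(limit_for("other-tool@tools.wmflabs.org", table))  # falls back to '*'
```

Keeping the default in the same file as the overrides means the limit can be tuned without touching the ACL itself.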
We have the rate-limiting in place. There is nothing else to do? Closing task now, feel free to reopen if required.
We have rate-limiting, but all it does is a warn action. It doesn't enforce the limit. Looking at the config, that's still in place. I believe we need to add a drop or bounce action or something.
Oh, you are right.
This is why I was tricked. I was generating emails using telnet and saw the log entry in the server:
2020-06-23 11:41:58 H=(endurance) [213.194.138.68] Warning: Sender address arturo@debian.org has exceeded rate limit of messages per 1h
And then I never got the email. Except that I actually did get it, but never saw it because it landed in the spam folder.
To enforce it in a user-friendly-ish way I'd suggest changing from warn to defer. That would result in rate limited mails queueing up on the sending host with an informational 4xx error, and allow the sender to address the issue in their outgoing queue and/or try again later.
I'm curious why the log entry I pasted above contains an empty string in $sender_rate_limit. According to the docs http://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html the value should be populated by exim.
That's weird. Maybe it's the ordering of when it looks up the file vs. the log message?
This was just enabled in tools and toolsbeta.
In case of errors, please revert the patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320
I plan to work soon on proper grafana dashboards to monitor exim behavior beyond what we have now, which seems insufficient.
Closing task now, please feel free to reopen if required.