Set up mail rate limiting for tools-mail
Closed, Resolved (Public)

Description

Figure out average outgoing mail rates, then set up rate limiting in exim to enforce them.

https://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html#SECTratelimiting

Event Timeline

herron created this object with visibility "Custom Policy".

Running ratelimit.pl against the past 10 days' worth of exim logs, these appear to be the top 20 senders by rate. With addresses removed, it's more or less the high-water mark by hour and by day.

cat topsenderavg_hourly
17 : redacted@wikimedia.org
19 : redacted@tools.wmflabs.org
21 : redacted@tools.wmflabs.org
22 : redacted@tools.wmflabs.org
22 : redacted@tools.wmflabs.org
23 : redacted@tools.wmflabs.org
25 : redacted@tools.wmflabs.org
26 : redacted@tools.wmflabs.org
46 : redacted@tools.wmflabs.org
48 : redacted@tools.wmflabs.org
48 : redacted@tools.wmflabs.org
57 : redacted@tools.wmflabs.org
75 : redacted@tools.wmflabs.org
147 : redacted@tools.wmflabs.org
226 : redacted@tools.wmflabs.org
232 : redacted@tools.wmflabs.org
294 : redacted@tools.wmflabs.org
45772 : <>
54646 : redacted@tools.wmflabs.org
cat topsenderavg_daily
146 : redacted@tools.wmflabs.org
147 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
149 : redacted@tools.wmflabs.org
186 : redacted@tools.wmflabs.org
187 : redacted@tools.wmflabs.org
188 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
190 : redacted@tools.wmflabs.org
228 : redacted@tools.wmflabs.org
244 : redacted@tools.wmflabs.org
300 : redacted@tools.wmflabs.org
61079 : <>
147773 : redacted@tools.wmflabs.org

The bottom two in each case relate to the recent spam event. Many of the empty envelopes are from bounces but interestingly some appear to have originated from tools-exec hosts as well.
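As a rough illustration of the kind of per-sender high-water mark shown above, here is a minimal Python sketch. It is not how ratelimit.pl works internally; it simply counts message arrivals (the `<=` lines in an exim main log) per sender per clock hour and keeps the peak. The sample log lines, message IDs, and the `hourly_peaks` helper are invented for illustration, and the regex assumes the default log line layout without extra `log_selector` fields.

```python
import re
from collections import Counter

# Matches an exim main-log arrival line:
#   "YYYY-MM-DD HH:MM:SS <message-id> <= <envelope-sender> ..."
ARRIVAL = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2} \S+ <= (\S+)")

def hourly_peaks(lines):
    """Return {sender: highest number of arrivals seen in any one clock hour}."""
    per_hour = Counter()
    for line in lines:
        m = ARRIVAL.match(line)
        if m:
            day, hour, sender = m.groups()
            per_hour[(sender, day, hour)] += 1
    peaks = {}
    for (sender, _day, _hour), count in per_hour.items():
        peaks[sender] = max(peaks.get(sender, 0), count)
    return peaks

# Invented sample lines; "<>" is the empty envelope sender used by bounces.
sample = [
    "2017-09-18 10:01:02 1dtABC-0001aB-Cd <= tool-a@tools.wmflabs.org U=tool-a P=local S=512",
    "2017-09-18 10:15:40 1dtABD-0001aC-Ef <= tool-a@tools.wmflabs.org U=tool-a P=local S=498",
    "2017-09-18 11:02:11 1dtABE-0001aD-Gh <= tool-a@tools.wmflabs.org U=tool-a P=local S=530",
    "2017-09-18 11:05:00 1dtABF-0001aE-Ij <= <> R=1dtABC-0001aB-Cd U=Debian-exim P=local S=2048",
    "2017-09-18 11:05:01 1dtABC-0001aB-Cd => someone@example.org R=dnslookup T=remote_smtp",
]
peaks = hourly_peaks(sample)
for sender, peak in sorted(peaks.items(), key=lambda kv: kv[1]):
    print(f"{peak} : {sender}")
```

Delivery lines (`=>`) are ignored on purpose, so only accepted submissions count toward a sender's rate.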

How about rate limiting at 100 messages per sender address per hour for starters?

That seems on the surface to be pretty reasonable. A couple of questions:

  • When the limit is hit, will outbound message delivery be throttled, or will the submission of new messages be blocked?
  • Would we have a way to give a particular sender a higher limit if good cause could be shown?

In T175964#3611065, @bd808 wrote:

  • When the limit is hit, will outbound message delivery be throttled, or will the submission of new messages be blocked?

Since it's an ACL condition, there is a lot of flexibility in what happens when the limit is reached. To me, throttling makes sense. When a sender exceeds the rate limit we could issue a 4xx temporary error until the time period refreshes. The sending MTA would queue the message and attempt delivery again later. With queue monitoring in place, alerts would then fire if a client mail queue exceeds the "normal" threshold. This should detect and help mitigate future issues similar to T175837.

OTOH a downside to temp fail is that large quantities of mail would be slowed down but not blocked. Flying under the radar you could get ~2400 messages out per day. We could layer on some additional longer-term limits with more severe actions but I think I'd want to live with temp fails first before actually dropping mail.
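To make the throttling trade-off concrete, here is a small Python simulation. It is a simplified fixed-window model, assumed only for illustration: exim's own `ratelimit` condition computes a smoothed rate rather than resetting a counter each hour, so real behaviour will differ in detail. The `HourlyLimiter` class and the sender address are hypothetical.

```python
from collections import defaultdict

class HourlyLimiter:
    """Simplified model: at most `limit` accepted messages per sender per clock hour."""

    def __init__(self, limit=100):
        self.limit = limit
        self.counts = defaultdict(int)

    def check(self, sender, epoch_seconds):
        window = epoch_seconds // 3600  # index of the clock hour
        key = (sender, window)
        if self.counts[key] >= self.limit:
            # Models a 4xx temporary failure: the sending MTA keeps the
            # message in its own queue and retries later.
            return "defer"
        self.counts[key] += 1
        return "accept"

limiter = HourlyLimiter(limit=100)
accepted = deferred = 0

# One hypothetical sender attempting 500 messages every hour for a full day.
for hour in range(24):
    for i in range(500):
        verdict = limiter.check("tool-a@tools.wmflabs.org", hour * 3600 + i)
        if verdict == "accept":
            accepted += 1
        else:
            deferred += 1

print(accepted, deferred)  # 2400 9600
```

The 2400 accepted messages match the "~2400 messages out per day" estimate: 100 per hour over 24 hours. The 9600 deferred attempts stay in the sender's queue, which is what queue monitoring would pick up.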

In T175964#3611065, @bd808 wrote:

  • Would we have a way to give a particular sender a higher limit if good cause could be shown?

Sure, a simple way would be to maintain a list of senders who can skip the rate limit ACL. Or we could get more complicated with a lookup config for individualized custom rate limits if necessary.
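The second option, a per-sender lookup with a default, could be sketched as follows in Python. The file format here (`key: value` lines, `#` comments, `*` as the fallback key) is an assumption loosely modeled on key/value lookup files; the actual format used in the deployed configuration is not shown in this thread. The helper names and the example address are hypothetical.

```python
def parse_limits(text):
    """Parse 'sender: limit' lines, ignoring blank lines and '#' comments."""
    limits = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        limits[key.strip()] = int(value.strip())
    return limits

def limit_for(sender, limits, default=100):
    """Return the sender's own limit, else the '*' entry, else `default`."""
    return limits.get(sender, limits.get("*", default))

# Hypothetical overrides file content.
overrides = parse_limits("""
# sender address: messages per hour
busy-tool@tools.wmflabs.org: 1000
*: 100
""")

print(limit_for("busy-tool@tools.wmflabs.org", overrides))   # 1000
print(limit_for("other-tool@tools.wmflabs.org", overrides))  # 100
```

Keeping the default in the same file as the overrides means a limit change needs only a data edit, not a change to the ACL itself.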

I like the answers and the options for future changes. +1

https://gerrit.wikimedia.org/r/#/c/379239/ created for the hourly rate limit implementation. I went ahead and set it up with a file lookup for both the overrides and the default limit. I also included a per-host rate limit and took a stab at a default there. The ACL's action is warn initially, so this can run for a period of time to validate our assumptions before we start deferring mail.

I plan to follow up on this with @herron soon.

We have the rate limiting in place. Is there anything else to do? Closing the task now; feel free to reopen if required.

Bstorm added a subscriber: Bstorm.

We have rate-limiting, but all it does is a warn action. It doesn't enforce the limit. Looking at the config, that's still in place. I believe we need to add a drop or bounce action or something.

Oh, you are right.

This is why I was tricked. I was generating emails using telnet and saw this log entry on the server:

2020-06-23 11:41:58 H=(endurance) [213.194.138.68] Warning: Sender address arturo@debian.org has exceeded rate limit of  messages per 1h

And then I never got the email. Except that I actually did get it, but never saw it because it landed in the spam folder.

To enforce it in a user-friendly-ish way I'd suggest changing from warn to defer. That would result in rate limited mails queueing up on the sending host with an informational 4xx error, and allow the sender to address the issue in their outgoing queue and/or try again later.

I'm curious why the log entry I pasted above contains an empty string in $sender_rate_limit. According to the docs http://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html the value should be populated by exim.

That's weird. Maybe it's the ordering of when it looks up the file vs. the log message?

aborrero moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

This was just enabled in tools and toolsbeta.
In case of errors, please revert the patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320

I plan to work soon on proper Grafana dashboards to monitor exim behavior beyond what we have now, which seems insufficient.

Closing task now, please feel free to reopen if required.

bd808 changed the visibility from "Custom Policy" to "Public (No Login Required)". Jun 24 2020, 4:18 PM