Set up mail rate limiting for tools-mail
Closed, Resolved, Public

Assigned To

Authored By

	herron
	Sep 14 2017, 9:01 PM

Description

Figure out average outgoing mail rates, then set up rate limiting in exim to enforce them.

https://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html#SECTratelimiting

Details

Subject	Repo	Branch	Lines +/-
toolforge: mailrelay: collect exim metrics using prometheus	operations/puppet	production	+17 -1
toolforge: mailrelay: enforce ratelimiting	operations/puppet	production	+7 -9
Add rate limiting to profile::toolforge::mailrelay with warn action	operations/puppet	production	+82 -0


Related Objects

Status	Assigned	Task
		Restricted Task
Resolved	• taavi	T249237 Fix Cloud VPS and Toolforge mail servers to work with the modern internet
Resolved	aborrero	T175964 Set up mail rate limiting for tools-mail

Event Timeline

herron created this task. Sep 14 2017, 9:01 PM

herron created this object with visibility "Custom Policy".

Restricted Application added a subscriber: Aklapper. Sep 14 2017, 9:01 PM

Using ratelimit.pl against the past 10 days' worth of exim logs, these appear to be the top 20 senders by rate. With addresses removed, it's more or less the high-water mark by hour and by day.

cat topsenderavg_hourly
17 : redacted@wikimedia.org
19 : redacted@tools.wmflabs.org
21 : redacted@tools.wmflabs.org
22 : redacted@tools.wmflabs.org
22 : redacted@tools.wmflabs.org
23 : redacted@tools.wmflabs.org
25 : redacted@tools.wmflabs.org
26 : redacted@tools.wmflabs.org
46 : redacted@tools.wmflabs.org
48 : redacted@tools.wmflabs.org
48 : redacted@tools.wmflabs.org
57 : redacted@tools.wmflabs.org
75 : redacted@tools.wmflabs.org
147 : redacted@tools.wmflabs.org
226 : redacted@tools.wmflabs.org
232 : redacted@tools.wmflabs.org
294 : redacted@tools.wmflabs.org
45772 : <>
54646 : redacted@tools.wmflabs.org

cat topsenderavg_daily
146 : redacted@tools.wmflabs.org
147 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
148 : redacted@tools.wmflabs.org
149 : redacted@tools.wmflabs.org
186 : redacted@tools.wmflabs.org
187 : redacted@tools.wmflabs.org
188 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
189 : redacted@tools.wmflabs.org
190 : redacted@tools.wmflabs.org
228 : redacted@tools.wmflabs.org
244 : redacted@tools.wmflabs.org
300 : redacted@tools.wmflabs.org
61079 : <>
147773 : redacted@tools.wmflabs.org

The bottom two in each case relate to the recent spam event. Many of the empty envelopes are from bounces but interestingly some appear to have originated from tools-exec hosts as well.

How about rate limiting at 100 messages per sender address per hour for starters?
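
For anyone who wants to sanity-check numbers like these without ratelimit.pl, here is a rough sketch of the idea in Python. It is not a reimplementation of ratelimit.pl; it simply counts message arrivals (`<=` lines) per sender per clock hour and reports each sender's busiest hour. It assumes the default exim main-log layout (date, time, message id, `<=`, sender); the sample lines and addresses are made up.

```python
import re
from collections import Counter

# Assumed layout of an exim main-log arrival line:
#   "<date> <time> <message-id> <= <sender> ..."
ARRIVAL = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2} \S+ <= (\S+)")


def peak_hourly_rate(lines):
    """Return {sender: most messages received in any single clock hour}."""
    per_hour = Counter()
    for line in lines:
        match = ARRIVAL.match(line)
        if match:
            day, hour, sender = match.groups()
            per_hour[(sender, day, hour)] += 1
    peaks = {}
    for (sender, _day, _hour), count in per_hour.items():
        peaks[sender] = max(peaks.get(sender, 0), count)
    return peaks


# Hypothetical sample lines, not taken from the real logs.
SAMPLE = [
    "2017-09-14 10:01:02 1dsAAA-0000aa-01 <= tool-a@example.org H=host1 [192.0.2.1] P=esmtp S=1234",
    "2017-09-14 10:05:00 1dsAAA-0000aa-02 <= tool-a@example.org H=host1 [192.0.2.1] P=esmtp S=1234",
    "2017-09-14 10:59:59 1dsAAA-0000aa-03 <= tool-a@example.org H=host1 [192.0.2.1] P=esmtp S=1234",
    "2017-09-14 11:00:00 1dsAAA-0000aa-04 <= tool-a@example.org H=host1 [192.0.2.1] P=esmtp S=1234",
    "2017-09-14 10:10:00 1dsAAA-0000aa-05 <= <> R=1dsAAA-0000aa-01 P=local S=2000",
    "2017-09-14 10:11:00 1dsAAA-0000aa-06 <= <> R=1dsAAA-0000aa-02 P=local S=2000",
    "2017-09-14 10:12:00 1dsAAA-0000aa-01 => someone@example.net R=dnslookup T=remote_smtp",
    "2017-09-14 10:13:00 1dsAAA-0000aa-01 Completed",
]

if __name__ == "__main__":
    for sender, peak in sorted(peak_hourly_rate(SAMPLE).items(), key=lambda kv: kv[1]):
        print(f"{peak} : {sender}")
```

Delivery (`=>`) and `Completed` lines are ignored, so each message is counted once, on arrival. Bounces show up under the empty envelope sender `<>`.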

In T175964#3611030, @herron wrote:

How about rate limiting at 100 messages per sender address per hour for starters?

That seems on the surface to be pretty reasonable. A couple of questions:

When the limit is hit, will outbound message delivery be throttled or will the submission of new messages be blocked?
Would we have a way to give a particular sender a higher limit if good cause could be shown?

In T175964#3611065, @bd808 wrote:

When the limit is hit, will outbound message delivery be throttled or will the submission of new messages be blocked?

Since it's an ACL attribute, there is lots of flexibility in what happens when the limit is reached. To me, throttling makes sense. When a sender exceeds the rate limit we could issue a 4xx temporary error until the time period refreshes. The sending MTA would queue the message and attempt delivery again later. With queue monitoring in place, alerts would then fire if a client mail queue exceeds the "normal" threshold. This should detect and help mitigate future issues similar to T175837.

OTOH a downside to temp fail is that large quantities of mail would be slowed down but not blocked. Flying under the radar you could get ~2400 messages out per day. We could layer on some additional longer-term limits with more severe actions but I think I'd want to live with temp fails first before actually dropping mail.
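
To make the ~2400 figure concrete: 100 messages per hour times 24 hours is 2400 per day. The sketch below models this with a simple fixed one-hour window per sender. That is a simplification chosen for illustration (exim's own ratelimit condition computes a smoothed rate rather than a fixed window), and the `FixedWindowLimiter` and `simulate_day` names are made up. The 451 code stands in for "some 4xx temporary error".

```python
from collections import defaultdict

LIMIT_PER_HOUR = 100  # the starting limit proposed in this task


def max_daily_throughput(limit_per_hour, hours=24):
    """Most messages one sender can get accepted per day under temp-fail."""
    return limit_per_hour * hours


class FixedWindowLimiter:
    """Toy limiter: count messages per (sender, hour) and defer the excess."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)

    def check(self, sender, hour):
        """Return 250 (accepted) or 451 (temporary failure, try again later)."""
        key = (sender, hour)
        if self.counts[key] >= self.limit:
            return 451
        self.counts[key] += 1
        return 250


def simulate_day(limiter, sender, attempts_per_hour, hours=24):
    """Submit attempts_per_hour messages every hour; return (accepted, deferred)."""
    accepted = deferred = 0
    for hour in range(hours):
        for _ in range(attempts_per_hour):
            if limiter.check(sender, hour) == 250:
                accepted += 1
            else:
                deferred += 1
    return accepted, deferred


if __name__ == "__main__":
    print(max_daily_throughput(LIMIT_PER_HOUR))
    print(simulate_day(FixedWindowLimiter(LIMIT_PER_HOUR), "tool-a@example.org", 150))
```

A sender attempting 150 messages every hour still gets 100 of them accepted each hour, which is why temp-failing slows a spam run down but does not stop it.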

In T175964#3611065, @bd808 wrote:

Would we have a way to give a particular sender a higher limit if good cause could be shown?

Sure, a simple way would be to maintain a list of senders who can skip the rate limit acl. Or we could get more complicated with a lookup config for individualized custom rate limits if necessary.
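
Both options can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not the exim configuration: the `key: value` file layout with `*` as the default entry is an assumption, and the addresses and function names are hypothetical.

```python
def parse_limits(text):
    """Parse 'key: value' lines into a dict; '*' is the default entry.

    Blank lines and lines starting with '#' are skipped.
    """
    limits = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        limits[key.strip()] = int(value.strip())
    return limits


def limit_for(sender, limits, exempt=frozenset()):
    """Return the hourly limit for sender, or None if the sender skips the ACL."""
    if sender in exempt:
        return None
    return limits.get(sender, limits["*"])


# Hypothetical override file contents.
EXAMPLE = """
# sender: messages per hour
*: 100
busy-tool@example.org: 500
"""

if __name__ == "__main__":
    limits = parse_limits(EXAMPLE)
    print(limit_for("busy-tool@example.org", limits))
    print(limit_for("other-tool@example.org", limits))
    print(limit_for("trusted@example.org", limits, exempt={"trusted@example.org"}))
```

The exempt set corresponds to the simple "skip the rate limit" list, and the per-sender entries correspond to individualized limits that fall back to the default.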

I like the answers and the options for future changes. +1

greg added a project: Wikimedia-Incident. Sep 19 2017, 6:49 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

https://gerrit.wikimedia.org/r/#/c/379239/ created for the hourly rate limit implementation. Went ahead and set it up with a file lookup for overrides and for the default limit as well. Also included a per-host rate limit and took a stab at a default there. The ACL action is warn initially so this can run for a period of time to validate our assumptions before deferring mail.

• JHedden triaged this task as Medium priority. Apr 21 2020, 4:47 PM

• JHedden added a project: cloud-services-team (Kanban).

• JHedden added a parent task: T249237: Fix Cloud VPS and Toolforge mail servers to work with the modern internet.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

• JHedden assigned this task to aborrero. May 5 2020, 4:37 PM

• JHedden updated the task description.

I plan to follow up on this with @herron soon.

We have the rate limiting in place. Is there anything else to do? Closing the task now; feel free to reopen if required.

We have rate-limiting, but all it does is a warn action. It doesn't enforce the limit. Looking at the config, that's still in place. I believe we need to add a drop or bounce action or something.

In T175964#6248688, @Bstorm wrote:

We have rate-limiting, but all it does is a warn action. It doesn't enforce the limit. Looking at the config, that's still in place. I believe we need to add a drop or bounce action or something.

Oh, you are right.

This is why I was tricked. I was generating emails using telnet and saw the log entry in the server:

2020-06-23 11:41:58 H=(endurance) [213.194.138.68] Warning: Sender address arturo@debian.org has exceeded rate limit of  messages per 1h

And then I never got the email. Except that I actually did get it, but never saw it because it landed in the spam folder.

In T175964#6248688, @Bstorm wrote:

We have rate-limiting, but all it does is a warn action. It doesn't enforce the limit. Looking at the config, that's still in place. I believe we need to add a drop or bounce action or something.

To enforce it in a user-friendly-ish way I'd suggest changing from warn to defer. That would result in rate limited mails queueing up on the sending host with an informational 4xx error, and allow the sender to address the issue in their outgoing queue and/or try again later.
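
The difference between the two verbs can be modelled roughly as below. This is a toy model of ACL processing in Python, not exim itself: statements run in order, warn only logs and carries on, defer ends processing with a temporary error, and an ACL that runs off the end denies. The statement tuples and the specific 451/550 codes are illustrative assumptions.

```python
def run_acl(statements, over_limit):
    """Run a toy ACL; return (smtp_code, log_lines).

    Each statement is (verb, only_when_over_limit). A statement whose
    condition is not met is skipped.
    """
    log = []
    for verb, only_when_over_limit in statements:
        if only_when_over_limit and not over_limit:
            continue
        if verb == "warn":
            # warn never ends ACL processing, it only leaves a trace.
            log.append("Warning: sender has exceeded rate limit")
            continue
        if verb == "defer":
            return 451, log  # temporary failure, sender should retry later
        if verb == "accept":
            return 250, log
    return 550, log  # nothing accepted the message, so it is denied


# Rate-limit statement followed by a plain accept.
WARN_ACL = [("warn", True), ("accept", False)]
DEFER_ACL = [("defer", True), ("accept", False)]

if __name__ == "__main__":
    print(run_acl(WARN_ACL, over_limit=True))
    print(run_acl(DEFER_ACL, over_limit=True))
```

With warn, a sender over the limit still gets a 250 and the message goes through; only the log shows it. With defer, the same sender gets a 4xx and the message stays in the sending host's queue.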

I'm curious why the log entry I pasted above contains an empty string in $sender_rate_limit. According to the docs http://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html the value should be populated by exim.

In T175964#6249164, @aborrero wrote:

I'm curious why the log entry I pasted above contains an empty string in $sender_rate_limit. According to the docs http://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html the value should be populated by exim.

That's weird. Maybe it's the ordering of when it looks up the file vs. the log message?

This was just enabled in tools and toolsbeta.
In case of errors, please revert the patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320

I plan to work soon on proper Grafana dashboards to monitor exim behavior beyond what we have now, which seems insufficient.

Closing task now, please feel free to reopen if required.

bd808 changed the visibility from "Custom Policy" to "Public (No Login Required)". Jun 24 2020, 4:18 PM
