
large MX queues should page
Closed, Resolved · Public

Description

In reaction to the 2021-12-03 mx incident (T297017, T297127),
@faidon said: "large MX queues (threshold to be determined) should page"

Event Timeline

Restricted Application added a subscriber: Aklapper.

Deep link to the existing Icinga check:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mx2001&service=exim+queue

As you can see there, the current alerting threshold is 2000, per the status text "OK: Less than 2000 mails in exim queue."
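For illustration only, here is a minimal sketch of what a queue-size check with that threshold could look like. It is not the actual check deployed on the mx hosts. The function names, the CRITICAL message text, and the choice to alert at exactly 2000 are assumptions; only the OK text and the 2000 threshold come from the Icinga output quoted above. It relies on `exim -bpc`, which prints the number of messages in the queue.

```python
import subprocess

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def classify_queue(count, threshold=2000):
    """Map a queue size to (exit_code, status_text).

    Assumption: the alert fires when count >= threshold; the real check
    may draw the boundary differently.
    """
    if count < threshold:
        return OK, f"OK: Less than {threshold} mails in exim queue."
    # Hypothetical wording; the real CRITICAL text is not shown in this task.
    return CRITICAL, f"CRITICAL: {count} mails in exim queue."


def read_queue_size():
    """Return the number of queued messages as reported by `exim -bpc`."""
    result = subprocess.run(
        ["exim", "-bpc"], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def main():
    """Print the status line and return the plugin exit code."""
    try:
        code, message = classify_queue(read_queue_size())
    except (OSError, subprocess.CalledProcessError, ValueError) as err:
        # exim missing, exim failing, or unparsable output.
        code, message = UNKNOWN, f"UNKNOWN: could not read exim queue: {err}"
    print(message)
    return code
```

A wrapper script would call `sys.exit(main())`. Whether the resulting alert pages is decided in the monitoring configuration, not in the check itself.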

But here alerting only means IRC output. (Should it also mean an email is sent? See T253733.) Either way, it does not currently page: "critical => true" is not set, regardless of what we set as the threshold.

We'll also want to think about the failure modes for this alert specifically, e.g. if mail is significantly impacted, how will the page go out? T294166 is also of interest on that front.

Marostegui triaged this task as Medium priority. Dec 14 2021, 2:18 PM

Change 747128 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mx: make exim queue alert paging

https://gerrit.wikimedia.org/r/747128

Change 747128 merged by Herron:

[operations/puppet@production] mx: make exim queue alert paging

https://gerrit.wikimedia.org/r/747128

herron claimed this task.

I know the task description says "threshold to be determined", but calling more attention to the current check would have helped in the related incident. So that check is now paging, and we can continue to tune/adjust/improve the monitoring and thresholds via the related tasks.