As a reaction to incident 2021-12-03 mx (T297017 T297127)
@faidon said: "large MX queues (threshold to be determined) should page"
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
mx: make exim queue alert paging | operations/puppet | production | +1 -0
Status | Assigned | Task
---|---|---
Resolved | Dzahn | T297017 MX record issue on mx2001.wikimedia.org
Resolved | herron | T297127 Incident: 2021-12-03 mx2001->Gmail delivery issues
Resolved | herron | T297144 large MX queues should page
Open | None | T275867 Add exim queue size to grafana graph
Open | None | T294166 Alert that should have paged via VictorOps was delayed because of partial networking outage
Event Timeline
deep link to existing Icinga check:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mx2001&service=exim+queue
As you can see there, the current alerting threshold is 2000, per "OK: Less than 2000 mails in exim queue.".
But here "alerting" only means IRC output (should it also mean an email is sent? T253733?). Either way it does not currently page: no "critical => true" is set, regardless of what we set as the threshold.
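For illustration, the threshold logic of a check like this can be sketched roughly as below. This is not the actual check deployed on the mx hosts: the function names, the CRITICAL message text and the choice to go straight to CRITICAL at the threshold are assumptions; only the 2000 threshold and the OK message come from the Icinga output quoted above. `exim -bpc` prints the number of messages in the queue. Note that whether a CRITICAL result pages is decided by the monitoring configuration (the "critical => true" setting), not by the check itself.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def evaluate_queue(count, threshold=2000):
    """Return (exit_code, message) for a given exim queue size.

    Hypothetical helper: the threshold default and the OK wording mirror
    the existing check output; the CRITICAL wording is an assumption.
    """
    if count < threshold:
        return OK, f"OK: Less than {threshold} mails in exim queue."
    return CRITICAL, f"CRITICAL: {count} mails in exim queue."


def read_queue_size():
    """Ask exim for the number of queued messages via `exim -bpc`."""
    result = subprocess.run(
        ["exim", "-bpc"], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())
```

On a mail host, `evaluate_queue(read_queue_size())` would give the exit code and status line to report back to Icinga.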
We'll also want to think about the failure modes for this alert specifically, e.g. if mail is significantly impacted, how will the page go out? T294166 is also of interest on that front.
Change 747128 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] mx: make exim queue alert paging
Change 747128 merged by Herron:
[operations/puppet@production] mx: make exim queue alert paging
I know the task description says "threshold to be determined", but calling more attention to the current check would have helped in the related incident. So that check is now paging, and we can continue to tune/adjust/improve the monitoring and thresholds via the related tasks.