People seem to be down with trying out the icinga list, so now we need to gather some metrics about frtechmail.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Jgreen | T91508 [Epic] overhaul fundraising cluster monitoring
Declined | | None | T202419 Page fr-tech about spikes in frtechmail
Declined | | None | T207511 Sort out fr-tech work phone situation
Resolved | | Jgreen | T212252 Reconfigure fundraising check_endpoints
Event Timeline
I think all the failmailing hosts are now reporting: https://grafana.wikimedia.org/dashboard/db/fundraising-overview?orgId=1&panelId=23&fullscreen&refresh=1m&from=now-1h&to=now
I'm not sure what the others are, but we should avoid using root@ where possible, in favor of a fundraising-specific address to prevent bouncing messages outside of our team.
Reopening: now that we have our own prometheus/grafana instance, would it make sense to alert from it?
Thinking about this, I wonder if we would be better served by alerting on log spikes. We already have monitoring tools that will count log lines matching regexes and alert accordingly.
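For illustration only, the log-spike idea could look roughly like the sketch below. It is not how our existing monitoring tools are implemented; the pattern, the threshold, and the function names are all hypothetical and would need to be tuned against real frtechmail volume.

```python
import re
from typing import Iterable

# Hypothetical pattern and threshold, chosen only for illustration.
FAILMAIL_PATTERN = re.compile(r"failmail", re.IGNORECASE)
SPIKE_THRESHOLD = 50  # matching lines allowed per check interval


def count_matches(lines: Iterable[str], pattern: re.Pattern) -> int:
    """Count the log lines that contain a match for the pattern."""
    return sum(1 for line in lines if pattern.search(line))


def is_spike(
    lines: Iterable[str],
    pattern: re.Pattern = FAILMAIL_PATTERN,
    threshold: int = SPIKE_THRESHOLD,
) -> bool:
    """Return True when the number of matching lines exceeds the threshold."""
    return count_matches(lines, pattern) > threshold
```

A real check would only feed in the lines written since the previous run, so that the count reflects one check interval rather than the whole log.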