prometheus: some sort of IRC alerts on restarts?
Closed, Resolved (Public)

Description

(part of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190425-prometheus)

It's known that prometheus restarts can cause strange monitoring artifacts. A one-time Icinga notification on IRC might be nice. Maybe we just look for process uptime < 10 minutes?

Event Timeline

Checking process uptime sounds good to me. If I understood the one-time Icinga notification correctly, the alert would self-recover once uptime is no longer in breach. Is that correct?

I can think of at least two ways to implement this: checking the activation time of the systemd unit(s), or checking the process_start_time_seconds Prometheus metric from the Prometheus server itself.
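To illustrate the second option, here is a minimal sketch of the uptime check, assuming the metric value is a Unix timestamp in seconds. The function name and the 10-minute window (taken from the suggestion in the description) are illustrative only, not what any eventual patch does. Because the result depends only on current uptime, the alert clears by itself once the window has passed.

```python
import time

# Threshold from the task description ("process uptime < 10 minutes").
RESTART_WINDOW_SECONDS = 10 * 60


def recently_restarted(process_start_time_seconds, now=None,
                       window=RESTART_WINDOW_SECONDS):
    """Return True while the process has been up for less than `window` seconds.

    `process_start_time_seconds` is assumed to be a Unix timestamp,
    as exposed by the metric of the same name.
    """
    if now is None:
        now = time.time()
    uptime = now - process_start_time_seconds
    return uptime < window
```

With a start time of 1000.0, the check fires at now=1300.0 (5 minutes of uptime) and has recovered by now=1601.0.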

Dzahn triaged this task as Medium priority. Apr 30 2019, 9:38 PM

Change 508011 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus: one-shot alert on restarts

https://gerrit.wikimedia.org/r/508011

Change 508011 merged by CDanis:
[operations/puppet@production] prometheus: one-shot alert on restarts

https://gerrit.wikimedia.org/r/508011

Change 508356 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus uptime alert: fix query

https://gerrit.wikimedia.org/r/508356

Change 508356 merged by CDanis:
[operations/puppet@production] prometheus uptime alert: fix query

https://gerrit.wikimedia.org/r/508356

CDanis claimed this task.

We now have IRC alerting based on scraping each Prometheus server for its process_start_time_seconds metric.
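For readers unfamiliar with the metric, the sketch below shows roughly how its value could be pulled out of a scraped /metrics payload in the Prometheus text exposition format. This is not the merged Puppet change; parse_start_time is a hypothetical helper, and it deliberately ignores edge cases such as label values containing spaces.

```python
def parse_start_time(metrics_text, name="process_start_time_seconds"):
    """Return the first sample value for `name` as a float, or None if absent.

    Simplified reading of the text exposition format: comment lines start
    with '#', sample lines look like `metric_name[{labels}] value [timestamp]`.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        metric = parts[0].split("{", 1)[0]  # drop any label set
        if metric == name and len(parts) >= 2:
            return float(parts[1])
    return None


# Hypothetical payload for illustration; the timestamp is made up.
sample = (
    "# TYPE process_start_time_seconds gauge\n"
    "process_start_time_seconds 1.5562e+09\n"
)
start = parse_start_time(sample)
```

The returned timestamp can then be compared against the current time to decide whether the server restarted recently.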