
Icinga check for ircecho should check for actual activity
Closed, ResolvedPublic

Description

A restart of systemd-journald on irc.wikimedia.org caused ircecho.service to stop sending out further updates (https://phabricator.wikimedia.org/T216607#4968319).
While it was broken, the Icinga status for kraz was all fine: the "ircecho bot process" check only verifies that the udpmixecho.py process is present.

I'm not 100% sure how best to improve the check so that it detects such an error. If there are logs, the check could verify that there was an update within the last 10 seconds or so. Alternatively, given that some channels are very busy, the check could join a specific channel and warn if there was no activity for 5-10 seconds or so.
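Either variant boils down to comparing the time of the last observed update against a threshold. A minimal sketch of that comparison, assuming the timestamp of the last relayed message is available from some source (a log line or a message seen in a channel); the function name `check_activity` and the 10s/60s thresholds are hypothetical and not taken from any existing check:

```python
import time

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_activity(last_message_ts, now=None, warn_after=10.0, crit_after=60.0):
    """Return (exit_code, message) based on seconds since the last update.

    last_message_ts: Unix timestamp of the most recent relayed message
                     (hypothetical input; how it is obtained is left open).
    warn_after / crit_after: illustrative thresholds in seconds.
    """
    if now is None:
        now = time.time()
    idle = now - last_message_ts
    if idle >= crit_after:
        return CRITICAL, f"CRITICAL: no update relayed for {idle:.0f}s"
    if idle >= warn_after:
        return WARNING, f"WARNING: no update relayed for {idle:.0f}s"
    return OK, f"OK: last update relayed {idle:.0f}s ago"
```

A plugin wrapper would print the message and exit with the returned code. Unlike a process-presence check, this goes non-OK when the process is alive but silent.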

Event Timeline

MoritzMuehlenhoff renamed this task from Icnga check for ircecho should check for actual activity to Icinga check for ircecho should check for actual activity. Feb 20 2019, 12:33 PM
MoritzMuehlenhoff triaged this task as Medium priority.

Change 662124 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] mw_rc_irc: add metrics endpoint to udpmxircecho

https://gerrit.wikimedia.org/r/662124

Change 662125 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: add prometheus job for udpmxircecho

https://gerrit.wikimedia.org/r/662125

Change 662124 merged by Cwhite:
[operations/puppet@production] mw_rc_irc: add metrics endpoint to udpmxircecho

https://gerrit.wikimedia.org/r/662124

Mentioned in SAL (#wikimedia-operations) [2021-02-18T16:23:35Z] <shdubsh> restart ircecho on kraz -- deploying new metrics endpoint T216611

Change 662125 merged by Cwhite:
[operations/puppet@production] profile: add prometheus job for udpmxircecho

https://gerrit.wikimedia.org/r/662125

Change 665129 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] mw_rc_irc: add check_prometheus alert on no messages being relayed

https://gerrit.wikimedia.org/r/665129

Change 665129 merged by Cwhite:
[operations/puppet@production] mw_rc_irc: add check_prometheus alert on no messages being relayed

https://gerrit.wikimedia.org/r/665129

colewhite claimed this task.
colewhite added a subscriber: colewhite.

Updated monitoring has been deployed.
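For illustration only, the idea behind an alert on "no messages being relayed" can be sketched as checking whether a message counter increased over a window of scraped samples. This is not the deployed check_prometheus rule; the function names and the sample format are assumptions made for the example:

```python
def messages_relayed(samples):
    """Sum the increase of a monotonic counter across consecutive samples.

    samples: list of (timestamp, counter_value) tuples, oldest first
             (hypothetical format for scraped values).
    A drop in value is treated as a counter reset (e.g. after a service
    restart), so the new value is counted from zero.
    """
    total = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def relay_is_stalled(samples, min_messages=1):
    """True if fewer than min_messages were relayed within the window."""
    return messages_relayed(samples) < min_messages
```

For example, `relay_is_stalled([(0, 10), (15, 10), (30, 10)])` is true because the counter never moved, which is the situation the process-presence check could not see.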

This has been flapping in Icinga, e.g. for today:

[11:13] <icinga-wm> PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki
[11:22] <icinga-wm> RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:24] <icinga-wm> PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:25] <icinga-wm> RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets

The irc service itself was working just fine (I had a connection to #de-wikipedia open, which was streaming edits without problems).

Digging into this a bit, I found that there is some breakdown in communication between prometheus-ircd-exporter and ircd on the new irc2001 host: T224579

ircecho does not appear affected.