
vrts - spamassassin icinga alerts
Closed, Resolved · Public

Description

Every once in a while we get these:

<icinga-wm> PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

This has only become visible recently, since the spamassassin update was moved from cron to a systemd timer.

That is why it now triggers alerts. It does not mean the failure wasn't already happening under cron.

Event Timeline

Restricted Application added a subscriber: Aklapper.
[otrs1001:~] $ sudo systemctl status spamassassin_updates
● spamassassin_updates.service - Spamassassin definitions update
   Loaded: loaded (/lib/systemd/system/spamassassin_updates.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2022-09-01 09:17:34 UTC; 10h ago
 Main PID: 15567 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

@Arnoldokoth, if you feel like taking a look: maybe we can run the updates manually a couple of times and capture some logs to figure out what occasionally fails here.

Dzahn triaged this task as Low priority. · Sep 1 2022, 7:58 PM

Mentioned in SAL (#wikimedia-operations) [2022-09-01T19:58:53Z] <mutante> otrs1001 - sudo systemctl reset-failed - T316903

Sep 01 20:19:06 otrs1001 systemd[1]: spamassassin_updates.service: Succeeded

This is all I got from the logs after manually starting it.
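Since a single manual run succeeded, one way to catch the intermittent failure is to repeat the run and keep the output of any attempt that fails. The following is a minimal sketch of that idea, not the approach taken in this task; the function name `run_and_keep_failures`, the `LOG_DIR` variable and its default path are hypothetical.

```shell
# Sketch: run a command several times and keep output only from failing runs.
# run_and_keep_failures and LOG_DIR are hypothetical names, not from this task.

# Usage: run_and_keep_failures ATTEMPTS CMD [ARGS...]
# Prints the number of failed runs. Each failed run leaves a log file in
# LOG_DIR containing the command output and its exit status.
run_and_keep_failures() (
    attempts="$1"; shift
    log_dir="${LOG_DIR:-/tmp/sa_update_debug}"
    failures=0
    i=1
    mkdir -p "$log_dir" || exit 1
    while [ "$i" -le "$attempts" ]; do
        out="$log_dir/run-$i.log"
        status=0
        # Capture stdout and stderr; remember the exit status on failure.
        "$@" >"$out" 2>&1 || status=$?
        if [ "$status" -eq 0 ]; then
            rm -f "$out"                      # successful runs are not kept
        else
            echo "exit status: $status" >>"$out"
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
)
```

On the host this might be called as `run_and_keep_failures 5 sudo systemctl start spamassassin_updates.service`, assuming the unit is of a type where `systemctl start` blocks until the job finishes and reflects its result. The unit's own messages would still need to be read with `sudo journalctl -u spamassassin_updates.service --no-pager` before the journal rotates.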

ACK. It seems to be transitory, just happens every once in a while. Let's keep an eye open for the Icinga alert (maybe make it send email?) to catch it next time before logs are rotated.

It seems weird if the script didn't also sometimes fail under cron. Perhaps we just didn't notice.
But yes, maybe we just need to modify the script a bit and have it send an email, and then notice the alert as a failed service. It is just SpamAssassin definition updates; it's not super critical if they are a day behind.
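If the script is modified, one detail worth checking is how it treats exit codes. The sa-update(1) manual documents exit code 0 as "an update was installed" and 1 as "no fresh updates were available", with higher codes indicating lint or download problems. Whether that explains the `status=1/FAILURE` seen above is not established in this task, and the unit may run a wrapper rather than `sa-update` directly. As a sketch under those assumptions (both function names are hypothetical):

```shell
# Sketch: treat "no fresh updates" as success, report anything else.
# classify_sa_update_status and run_update_and_report are hypothetical names.

# Map an exit code to a label, based on the EXIT CODES section of sa-update(1).
classify_sa_update_status() {
    case "$1" in
        0) echo "updated" ;;      # an update was downloaded and installed
        1) echo "no-updates" ;;   # nothing new to fetch, not a real failure
        *) echo "error" ;;        # lint failure, download error, etc.
    esac
}

# Usage: run_update_and_report CMD [ARGS...]
# Runs CMD (sa-update in real use). Exits 0 for "updated" and "no-updates";
# otherwise writes a message to stderr and exits with the original status.
run_update_and_report() (
    status=0
    "$@" || status=$?
    label="$(classify_sa_update_status "$status")"
    case "$label" in
        updated|no-updates)
            exit 0
            ;;
        *)
            echo "update command failed with exit status $status" >&2
            exit "$status"
            ;;
    esac
)
```

Under systemd, the stderr message ends up in the journal for the unit, and the non-zero exit still marks the service as failed so the Icinga check fires only for real errors. An email step could be added in the error branch; it is left out here because the mail setup on the host is not described in this task.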

Change 829108 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] C:spamassassin Allow debugging of why service fails.

https://gerrit.wikimedia.org/r/829108

It seems weird if the script didn't also sometimes fail under cron. Perhaps we just didn't notice.

cron silently swallows most errors; moving to systemd timers, which fail properly, uncovered plenty of other problems that we eventually fixed for good.

Change 829108 merged by Slyngshede:

[operations/puppet@production] C:spamassassin Allow debugging of why service fails.

https://gerrit.wikimedia.org/r/829108

Dzahn claimed this task.

We think this should be fixed and will reopen if it happens again.