Fix paniclog alert to only send mails once
Open, Medium · Public

Description

Currently, if any of our mail servers has a paniclog, it gets mailed to root@. Virtually all of these are harmless and result from day-to-day operation, e.g.

2020-07-02 23:34:59 1jr8jB-0000za-CO spam acl condition: all spamd servers failed

which apparently was caused by a spamassassin restart.

The alert, however, is re-sent daily until someone manually removes the paniclog file. We should add a systemd timer that removes paniclog files which haven't changed for more than 24 hours, to reduce noise.
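
The cleanup such a timer would trigger could look roughly like the sketch below. It is only an illustration, not the change that was eventually merged: the function name `remove_if_stale` is hypothetical, and the path assumes Debian's default exim4 log location.

```python
import time
from pathlib import Path

# Assumption: Debian's default exim4 paniclog location; adjust for other layouts.
DEFAULT_PANICLOG = Path("/var/log/exim4/paniclog")
MAX_AGE_SECONDS = 24 * 60 * 60  # the 24 hours proposed in the description


def remove_if_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Delete `path` if it was last modified more than `max_age` seconds ago.

    Returns True if the file was removed, False otherwise.
    """
    now = time.time() if now is None else now
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False  # no paniclog, nothing to clean up
    if now - mtime <= max_age:
        return False  # changed recently, keep it so the alert still fires
    path.unlink()
    return True


# A daily timer would then run something like:
#     remove_if_stale(DEFAULT_PANICLOG)
```

Checking the modification time rather than deleting unconditionally means a paniclog that is still being written to keeps alerting until the underlying problem stops.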

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 3 2020, 6:34 AM
herron added a subscriber: herron. · Jul 4 2020, 7:39 PM

One idea to avoid causing alerts during spamassassin restarts is to have the two MXes act as fallback spamd servers for each other. This way, when the local spamd goes down for a restart, exim will try the spamd of the other MX before logging 'all spamd servers failed'.
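
The ordering behind this idea (local spamd first, the peer MX second, failure only when both are down) can be sketched as follows. This illustrates the fallback logic only and is not how exim implements it; the helper `first_reachable` and the host names are hypothetical, and 783 is spamd's default port.

```python
import socket


def first_reachable(servers, timeout=2.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection, or None.

    `servers` is tried in order, so the local spamd should come first
    and the other MX second.
    """
    for host, port in servers:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            continue  # this spamd is down (e.g. restarting), try the next one
        conn.close()
        return (host, port)
    return None  # the equivalent of 'all spamd servers failed'


# Hypothetical usage: local spamd first, the peer MX as fallback.
#     first_reachable([("127.0.0.1", 783), ("other-mx.example.org", 783)])
```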

fgiunchedi moved this task from Inbox to Backlog on the observability board. · Jul 6 2020, 11:26 AM
Dzahn added a project: Mail. · Jul 8 2020, 6:15 PM

Change 616524 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] mx: add paniclog to exim logrotate

https://gerrit.wikimedia.org/r/616524

herron triaged this task as Medium priority. · Mon, Jul 27, 2:05 PM

Change 616524 merged by Herron:
[operations/puppet@production] mx: add paniclog to exim logrotate

https://gerrit.wikimedia.org/r/616524

@herron Seems to work fine, didn't see a paniclog mail today \o/

Actually, the paniclog mail for kubernetes2002 was sent again.

Change 617529 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] exim4: move daily paniclog rotate from exim4-base to exim4-paniclog

https://gerrit.wikimedia.org/r/617529

Ah, yes, I was only rotating the paniclog on the MXes. Other hosts should be less noisy overall, but we might as well rotate the paniclog daily across the fleet.

Change 617529 merged by Herron:
[operations/puppet@production] exim4: move daily paniclog rotate from exim4-base to exim4-paniclog

https://gerrit.wikimedia.org/r/617529