
Check for an oversized exim4 queue indicating mail delivery failures
Closed, ResolvedPublic

Description

Queueing over 9000 emails without any warning is just embarrassing. Let's set up at least:

  • a check for an oversized exim4 queue ("exipick -i | wc -l" or something similar)
  • a check for a large number of mails queued for more than N hours.
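
The first check proposed above could be sketched roughly as follows. This is a minimal sketch, assuming the message IDs arrive on stdin one per line, as "exipick -i" prints them. The helper names count_ids and queue_too_large are hypothetical, and the limit is whatever the caller passes in.

```shell
#!/bin/bash
# Sketch only: helpers for a queue-size check, not a finished plugin.

# Count queued message IDs read from stdin, one per line.
# grep -c counts non-empty lines; "|| true" keeps the exit status at 0
# when the queue is empty (grep itself exits 1 on zero matches).
count_ids() {
    grep -c . || true
}

# Hypothetical helper: succeed when the count is strictly above the limit.
queue_too_large() {
    local count=$1
    local limit=$2
    [ "$count" -gt "$limit" ]
}
```

On a mail host this would be fed with something like "exipick -i | count_ids", and the result compared against a limit such as 9000 via queue_too_large.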

Event Timeline

We have this for the lists server.

files/icinga/check_mailman_queue

modules/role/manifests/lists/server.pp: nrpe_command => '/usr/bin/sudo -u list /usr/local/lib/nagios/plugins/check_mailman_queue 25 25 25',

Change 361023 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: add plugin to check exim queue sizes

https://gerrit.wikimedia.org/r/361023

^ Output with the script from the change above:

[mx1001:~] $ ./check_exim_queue -w 1000 -c 5000
OK: Less than 1000 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 300 -c 5000
WARNING: 329 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 100 -c 200
CRITICAL: 330 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 100 
Usage: ./check_exim_queue -w <warn> -c <crit>
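
The threshold behaviour shown in the transcript above can be approximated as follows. This is a sketch and not the merged plugin: it assumes a count equal to a threshold already triggers that state, it takes the queue count as a trailing argument instead of querying exim itself, and the function names classify_queue and check_queue are hypothetical. The exit codes follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/bash
# Sketch of a Nagios-style threshold check; not the merged plugin.

# Print a status line for a queue count and return the matching exit code.
# Assumption: a count equal to a threshold already triggers that state.
classify_queue() {
    local count=$1
    local warn=$2
    local crit=$3
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL: $count mails in exim queue."
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING: $count mails in exim queue."
        return 1
    fi
    echo "OK: Less than $warn mails in exim queue."
    return 0
}

# Parse -w/-c, then classify the count given as the last argument.
# Hypothetical interface: a real plugin would obtain the count itself.
check_queue() {
    local warn=""
    local crit=""
    local opt
    local OPTIND=1
    while getopts "w:c:" opt; do
        case $opt in
            w) warn=$OPTARG ;;
            c) crit=$OPTARG ;;
            *) warn=""; break ;;
        esac
    done
    shift $((OPTIND - 1))
    # Both thresholds and a count are required; otherwise report UNKNOWN.
    if [ -z "$warn" ] || [ -z "$crit" ] || [ -z "$1" ]; then
        echo "Usage: check_queue -w <warn> -c <crit> <count>"
        return 3
    fi
    classify_queue "$1" "$warn" "$crit"
}
```

For example, "check_queue -w 300 -c 5000 329" prints the WARNING line and returns 1, while omitting -c prints the usage line and returns 3.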

While we are at it, I'd suggest removing the disk I/O check, which hasn't yielded good results.

Change 361023 merged by Dzahn:
[operations/puppet@production] icinga/role:mail::mx: add monitoring of exim queue size

https://gerrit.wikimedia.org/r/361023

[mx1001:~] $ sudo -u nagios /usr/local/lib/nagios/plugins/check_exim_queue -w 1000 -c 3000
OK: Less than 1000 mails in exim queue.
# This file is managed by Puppet!

nagios ALL = NOPASSWD: /usr/sbin/exipick -bpc -o [[\:digit\:]][[\:digit\:]][mh]

Added and working here:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=exim+queue

What do you think about the numbers "1000" and "3000" specifically?

Change 368110 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] lists/icinga: remove I/O monitoring on lists server

https://gerrit.wikimedia.org/r/368110

Change 368110 merged by Dzahn:
[operations/puppet@production] lists/icinga: remove I/O monitoring on lists server

https://gerrit.wikimedia.org/r/368110