Queueing over 9000 emails without any warning is just embarrassing. Let's set up at least:
- a check for an oversized exim4 queue ("exipick -i | wc -l" or something)
- a check for a large number of mails queued for more than N hours.
We have this for the lists server.
files/icinga/check_mailman_queue
modules/role/manifests/lists/server.pp: nrpe_command => '/usr/bin/sudo -u list /usr/local/lib/nagios/plugins/check_mailman_queue 25 25 25',
Change 361023 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: add plugin to check exim queue sizes
^ with script above:
[mx1001:~] $ ./check_exim_queue -w 1000 -c 5000
OK: Less than 1000 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 300 -c 5000
WARNING: 329 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 100 -c 200
CRITICAL: 330 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 100
Usage: ./check_exim_queue -w <warn> -c <crit>
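For anyone reading along without the patch open, the threshold behaviour in the output above can be sketched roughly like this. This is not the merged plugin: the function name `queue_status` is made up for illustration, and comparing with "greater than or equal" is an assumption (the output only shows that 329 trips a 300 warning and 330 trips a 200 critical). The exit codes follow the usual Nagios convention of 0 for OK, 1 for WARNING and 2 for CRITICAL.

```shell
#!/bin/sh
# Hypothetical sketch of the threshold logic, not the script from the change.

# queue_status COUNT WARN CRIT
# Prints a Nagios-style status line and returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
queue_status() {
    qs_count=$1
    qs_warn=$2
    qs_crit=$3

    # Assumption: a threshold is tripped when the count reaches it (>=).
    if [ "$qs_count" -ge "$qs_crit" ]; then
        echo "CRITICAL: $qs_count mails in exim queue."
        return 2
    elif [ "$qs_count" -ge "$qs_warn" ]; then
        echo "WARNING: $qs_count mails in exim queue."
        return 1
    fi

    echo "OK: Less than $qs_warn mails in exim queue."
    return 0
}
```

On a mail host the count would come from something like `exipick -bpc`, which prints the number of queued messages, so a call could look like `queue_status "$(exipick -bpc)" 1000 3000`.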
While we are at it, I'd suggest removing the disk I/O check, which hasn't yielded good results.
Change 361023 merged by Dzahn:
[operations/puppet@production] icinga/role:mail::mx: add monitoring of exim queue size
[mx1001:~] $ sudo -u nagios /usr/local/lib/nagios/plugins/check_exim_queue -w 1000 -c 3000
OK: Less than 1000 mails in exim queue.
# This file is managed by Puppet!
nagios ALL = NOPASSWD: /usr/sbin/exipick -bpc -o [[\:digit\:]][[\:digit\:]][mh]
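To make the sudoers wildcard above easier to read: it only lets nagios pass an age argument of exactly two digits followed by "m" or "h" (the colons are backslash-escaped because sudoers treats ":" as special). A small sketch of the same shape as a shell glob, with a hypothetical helper name `age_allowed`:

```shell
#!/bin/sh
# Hypothetical helper, only to illustrate which "-o" values the sudoers
# pattern [[:digit:]][[:digit:]][mh] would accept.

# age_allowed VALUE
# Returns 0 if VALUE is two digits followed by "m" or "h", else 1.
age_allowed() {
    case "$1" in
        [[:digit:]][[:digit:]][mh]) return 0 ;;
        *) return 1 ;;
    esac
}
```

So values like 10m or 24h would pass, while 5m (one digit), 100m (three digits) or 10d (different suffix) would be refused by sudo under this rule.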
Added and working here:
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=exim+queue
What do you think about the numbers "1000" and "3000" specifically?
Change 368110 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] lists/icinga: remove I/O monitoring on lists server
Change 368110 merged by Dzahn:
[operations/puppet@production] lists/icinga: remove I/O monitoring on lists server