Check for an oversized exim4 queue indicating mail delivery failures
Closed, Resolved · Public

Assigned To

Authored By

	faidon
	Apr 19 2016, 9:33 PM

Description

Queueing over 9000 emails without any warning is just embarrassing. Let's set up at least:

a check for an oversized exim4 queue ("exipick -i | wc -l" or something)
a check for a large number of mails queued for more than N hours.

Details

	Subject	Repo	Branch	Lines +/-
	lists/icinga: remove I/O monitoring on lists server	operations/puppet	production	+0 -26
	icinga/role:mail::mx: add monitoring of exim queue size	operations/puppet	production	+92 -0


Event Timeline

faidon created this task. Apr 19 2016, 9:33 PM

Restricted Application added subscribers: TerraCodes, Aklapper. Apr 19 2016, 9:33 PM

We have this for the lists server.

files/icinga/check_mailman_queue

modules/role/manifests/lists/server.pp: nrpe_command => '/usr/bin/sudo -u list /usr/local/lib/nagios/plugins/check_mailman_queue 25 25 25',

Dzahn claimed this task. Apr 20 2017, 2:50 AM

Change 361023 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: add plugin to check exim queue sizes

https://gerrit.wikimedia.org/r/361023

gerritbot added a project: Patch-For-Review. Jun 23 2017, 4:37 AM

^ with script above:

[mx1001:~] $ ./check_exim_queue -w 1000 -c 5000
OK: Less than 1000 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 300 -c 5000
WARNING: 329 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 100 -c 200
CRITICAL: 330 mails in exim queue.
[mx1001:~] $ ./check_exim_queue -w 100 
Usage: ./check_exim_queue -w <warn> -c <crit>
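
For readers who have not opened the Gerrit change, the threshold behaviour in the transcript above can be sketched roughly as follows. This is a hypothetical reconstruction, not the merged plugin: the function name classify_queue is invented for illustration, and whether the real script compares with "greater than" or "greater than or equal" is an assumption (the transcript does not show a boundary case). Only the message wording and the usual Nagios exit codes (0, 1, 2) are taken from the output shown.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the plugin from change 361023.
# classify_queue COUNT WARN CRIT
# Prints a Nagios-style status line and returns 0 (OK), 1 (WARNING)
# or 2 (CRITICAL). The ">=" comparisons are an assumption.
classify_queue() {
    local count=$1 warn=$2 crit=$3

    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL: $count mails in exim queue."
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING: $count mails in exim queue."
        return 1
    fi

    # Below the warning threshold: report the threshold, not the count,
    # mirroring the "OK: Less than 1000 mails" wording in the transcript.
    echo "OK: Less than $warn mails in exim queue."
    return 0
}
```

In a real check the count would have to come from exim itself, for example from the exipick -bpc invocation that the sudoers entry later in this task permits; that part is left out here so the logic can be read on its own.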

faidon moved this task from Inbox to In progress on the observability board. Jul 10 2017, 1:00 PM

faidon moved this task from In progress to Up next on the observability board. Jul 24 2017, 3:09 PM

While we are at it, I'd suggest removing the disk I/O check, which hasn't yielded good results.

Change 361023 merged by Dzahn:
[operations/puppet@production] icinga/role:mail::mx: add monitoring of exim queue size

https://gerrit.wikimedia.org/r/361023

[mx1001:~] $ sudo -u nagios /usr/local/lib/nagios/plugins/check_exim_queue -w 1000 -c 3000
OK: Less than 1000 mails in exim queue.

# This file is managed by Puppet!

nagios ALL = NOPASSWD: /usr/sbin/exipick -bpc -o [[\:digit\:]][[\:digit\:]][mh]
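
As I read the sudoers entry above, the trailing pattern only lets the nagios user pass an age argument of exactly two digits followed by m or h (the backslashes escape the colons for sudoers). A small sketch of the same glob in shell, to make the accepted values concrete; the function name age_allowed is made up for this illustration and is not part of the Puppet change:

```shell
#!/bin/bash
# Hypothetical helper, for illustration only: mirrors the glob
# [[:digit:]][[:digit:]][mh] from the sudoers entry.
# Returns 0 if the argument would match, 1 otherwise.
age_allowed() {
    case "$1" in
        [[:digit:]][[:digit:]][mh]) return 0 ;;
        *) return 1 ;;
    esac
}
```

Under that reading, values such as 30m or 24h would be accepted, while 5h (one digit), 100m (three digits) or 10d (different suffix) would not match the rule.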

Added and working here:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=exim+queue

What do you think about the numbers "1000" and "3000" specifically?

Change 368110 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] lists/icinga: remove I/O monitoring on lists server

https://gerrit.wikimedia.org/r/368110

Change 368110 merged by Dzahn:
[operations/puppet@production] lists/icinga: remove I/O monitoring on lists server

https://gerrit.wikimedia.org/r/368110

Dzahn closed this task as Resolved. Jul 27 2017, 12:00 AM
