
mailman check_queue recurrent alarm/recovery
Closed, Resolved · Public

Description

It looks like mailman_queue_size regularly fires and recovers during the UTC morning. Note that the threshold was raised in rOPUPaebf548041c4b, so it likely needs some more tuning. cc @Dzahn

06:05 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 2 mailman queue(s) above 100
06:29 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:05 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:21 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:04 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:21 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:05 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:18 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:05 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:17 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:04  <icinga-wm_> PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:19  <icinga-wm_> RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:04 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:18 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
09:04 -icinga-wm:#wikimedia-operations- PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
09:26 -icinga-wm:#wikimedia-operations- RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100

Details

Related Gerrit Patches:
operations/puppet : production : mailman: increase out queue to 300 check
operations/puppet : production : mailman: add cron to gather queue data
operations/puppet : production : mailman: queue monitoring, enable multi thresholds

Event Timeline

fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description.
fgiunchedi added a project: acl*sre-team.
fgiunchedi added subscribers: fgiunchedi, Dzahn.
Restricted Application added subscribers: Matanya, Aklapper. Oct 7 2015, 8:38 AM
Dzahn claimed this task. Oct 7 2015, 2:24 PM

Needs per-queue levels.

The issue is that all bounce emails and all digest emails go out at once, which easily adds up to 500+ emails at any one time.

Dzahn added a comment. Oct 7 2015, 9:59 PM

@JohnLewis I agree, but I also wonder: if we had per-queue levels, would we still monitor this queue, and would it be useful with a threshold that high? I mean, as opposed to just monitoring the other queues.

Change 244366 had a related patch set uploaded (by Dzahn):
mailman: queue monitoring, enable multi thresholds

https://gerrit.wikimedia.org/r/244366

Change 244366 merged by Dzahn:
mailman: queue monitoring, enable multi thresholds

https://gerrit.wikimedia.org/r/244366

Dzahn added a comment. Oct 8 2015, 10:12 PM

The script now takes separate limits for each of the 4 queues we monitor (in, out, bounces, virgin), which allows us to be more specific.
Follow-up is finding the right values.
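
For illustration only, a per-queue check along these lines could look roughly like the sketch below. This is not the script that was merged; the spool path, the function names and all limits except the 300 for "out" (taken from the patch title in this task) are assumptions.

```python
import os

# Assumed Mailman 2 spool directory; the actual path on the host is not given in this task.
QUEUE_ROOT = "/var/lib/mailman/qfiles"

# "out" at 300 follows the patch title in this task; the other limits are illustrative only.
LIMITS = {"in": 100, "out": 300, "bounces": 100, "virgin": 100}


def queue_sizes(root, queues):
    """Count entries in each queue directory; a missing directory counts as 0."""
    sizes = {}
    for name in queues:
        path = os.path.join(root, name)
        sizes[name] = len(os.listdir(path)) if os.path.isdir(path) else 0
    return sizes


def check(sizes, limits):
    """Compare each queue against its own limit; return (exit_code, message)."""
    over = sorted(q for q, limit in limits.items() if sizes.get(q, 0) > limit)
    if over:
        detail = ", ".join(f"{q}={sizes.get(q, 0)}>{limits[q]}" for q in over)
        return 2, f"CRITICAL: {len(over)} mailman queue(s) above limit: {detail}"
    return 0, "OK: all mailman queues are within their limits"
```

A caller would run `check(queue_sizes(QUEUE_ROOT, LIMITS), LIMITS)`, print the message and exit with the returned code (2 is the usual plugin exit code for CRITICAL, 0 for OK).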

Dzahn added a comment. Oct 8 2015, 10:18 PM

Needs per-queue levels.
The issue is that all bounce emails and all digest emails go out at once, which easily adds up to 500+ emails at any one time.

So "bounces" and "out" to ... ? We should measure it with a little script (and keep notifications switched off, which I did until we know better).

Change 247472 had a related patch set uploaded (by John F. Lewis):
mailman: add cron to gather queue data

https://gerrit.wikimedia.org/r/247472

Stealing assign.

Change 247472 merged by Dzahn:
mailman: add cron to gather queue data

https://gerrit.wikimedia.org/r/247472

I wrote a script for this:

https://gerrit.wikimedia.org/r/#/c/247349/

that is now used by the cron job John added:

https://gerrit.wikimedia.org/r/#/c/247472/1

Dzahn set Security to None.
Dzahn added a project: Wikimedia-Mailing-lists.
Dzahn added a comment. Edited Oct 19 2015, 10:49 PM

on fermium:
Notice: /Stage[main]/Mailman::Cron/Cron[queue_data]/ensure: created

# Puppet Name: queue_data
2 * * * * /usr/local/sbin/queue_data -a >> /var/www/qdata.html
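
The cron entry above appends one run's output per hour (at minute 2). The output format of queue_data is not shown in this task, so the following is only a hypothetical sketch of how timestamped samples could be recorded and later reduced to a per-queue peak when picking thresholds; `snapshot_line` and `peaks` are made-up names.

```python
import time


def snapshot_line(sizes, now=None):
    """Format one timestamped record of queue sizes (UTC), suitable for appending to a file."""
    stamp = time.strftime("%Y-%m-%d %H:%M", time.gmtime(now))
    fields = " ".join(f"{name}={count}" for name, count in sorted(sizes.items()))
    return f"{stamp} UTC {fields}"


def peaks(lines):
    """Return the highest count seen per queue across all recorded lines."""
    best = {}
    for line in lines:
        for field in line.split():
            name, sep, count = field.partition("=")
            # Only name=number fields are samples; date and time tokens are skipped.
            if sep and count.isdigit():
                best[name] = max(best.get(name, 0), int(count))
    return best
```

After a few days of samples, the peaks give a lower bound for each queue's limit, for example the "out" spike during digest delivery.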

FWIW, you should be able to obtain the same with a diamond collector, exporting the data into Graphite for graphing (and possibly alerting).
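
As a rough sketch of the Graphite side of that suggestion: the code below does not use diamond at all, it formats queue sizes for Graphite's plaintext protocol ("path value timestamp", one metric per line, conventionally TCP port 2003) and sends them directly. The metric prefix and function names are assumptions for illustration.

```python
import socket
import time


def graphite_lines(sizes, prefix="mailman.queues", now=None):
    """Build one plaintext-protocol line per queue: '<path> <value> <unix timestamp>'."""
    ts = int(time.time() if now is None else now)
    return [f"{prefix}.{name} {count} {ts}" for name, count in sorted(sizes.items())]


def send(lines, host, port=2003):
    """Send newline-terminated metric lines to a carbon plaintext listener."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

With the data in Graphite, the threshold question becomes a matter of looking at the graph of each queue over a few mornings rather than reading an appended file.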

Change 247604 had a related patch set uploaded (by John F. Lewis):
mailman: increase out queue to 300 check

https://gerrit.wikimedia.org/r/247604

Change 247604 merged by Dzahn:
mailman: increase out queue to 300 check

https://gerrit.wikimedia.org/r/247604

JohnLewis closed this task as Resolved. Oct 20 2015, 4:29 PM

The above commit will resolve this. Unsilenced the icinga check.