
Add exim queue size to grafana graph
Closed, Invalid (Public)

Description

Can we add the exim queue size to the Grafana graphs for mail? It looks like the current metrics are derived using mtail, which cannot collect the active queue size, so some other tool will likely be required.

It would be good to split the graph by:

frozen: exiqgrep -z -c
unfrozen: exiqgrep -x -c
total: exiqgrep -c

Worth noting: all of these values can be derived from a single command, e.g.

$ exiqgrep -x -c
830 matches out of 1721 messages

1721: total messages
830: unfrozen messages
891: frozen messages (1721-830)
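
The arithmetic above could be sketched as follows. This is a minimal illustration only: it assumes the summary line always has the "N matches out of M messages" shape shown in the example, and `queue_sizes` is a hypothetical helper name, not part of any existing exporter.

```python
import re

# Assumed summary format, taken from the example output in this task.
_SUMMARY = re.compile(r"(\d+) matches out of (\d+) messages")


def queue_sizes(summary):
    """Derive total, unfrozen and frozen counts from `exiqgrep -x -c` output."""
    match = _SUMMARY.search(summary)
    if match is None:
        raise ValueError(f"unrecognised exiqgrep output: {summary!r}")
    unfrozen = int(match.group(1))
    total = int(match.group(2))
    # Frozen messages are whatever is left after removing the unfrozen ones.
    return {"total": total, "unfrozen": unfrozen, "frozen": total - unfrozen}


print(queue_sizes("830 matches out of 1721 messages"))
# {'total': 1721, 'unfrozen': 830, 'frozen': 891}
```

Running one command and deriving the third value keeps the three series consistent with each other, which separate invocations against a changing queue would not guarantee.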

Event Timeline

jbond triaged this task as Medium priority.

In addition to the overall queue totals, exiqsumm provides a breakdown by destination domain. It would be nice to have labels representing this, or perhaps a subset of domains. For example, it would be useful to see the number of deferred messages specifically to wikimedia.org addresses.

akosiaris subscribed.

Removing SRE; this has already been triaged to a more specific SRE subteam.

Volans added subscribers: jhathaway, Volans.

The mail dashboard already has a quick display of the queues; I've added a graph to see both frozen and unfrozen queues.
As for the breakdown by domain, that will need to be added to the Prometheus exporter (using something like exim -bp | exiqsumm -c | head -n 12), and I'll leave it to @jhathaway to decide whether it's needed.
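
A rough sketch of what such a per-domain breakdown might look like, purely for illustration. It assumes exiqsumm prints five whitespace-separated columns per row (count, volume, oldest, newest, domain) followed by a TOTAL row; the sample text, the helper names and the metric name `exim_queue_domain_messages` are all hypothetical and not taken from the existing exporter.

```python
# Hypothetical sample of exiqsumm output; the column layout is an assumption.
SAMPLE = """\
Count  Volume  Oldest  Newest  Domain
-----  ------  ------  ------  ------

   12    48KB     3d     2h  example.org
    5    20KB     1d    10m  wikimedia.org
---------------------------------------------------------------
   17    68KB     3d    10m  TOTAL
"""


def domain_counts(exiqsumm_output):
    """Map each destination domain to its queued message count."""
    counts = {}
    for line in exiqsumm_output.splitlines():
        fields = line.split()
        # Skip headers, separators and blank lines: data rows are assumed
        # to have exactly five columns starting with a numeric count.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        if fields[4] == "TOTAL":
            continue
        counts[fields[4]] = int(fields[0])
    return counts


def to_prometheus(counts, metric="exim_queue_domain_messages"):
    """Render the counts as Prometheus text-format sample lines."""
    return [f'{metric}{{domain="{d}"}} {n}' for d, n in sorted(counts.items())]


for sample_line in to_prometheus(domain_counts(SAMPLE)):
    print(sample_line)
```

Limiting the labels to a fixed subset of domains, as suggested earlier in this task, would keep label cardinality bounded.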

Let me note that we also have an alert on exim_queue_length per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/mail.yaml#17. Severity is page, so we should be OK from that side. I am not sure a per-domain breakdown actually helps in a graph, but that's just my 2 cents.

I am inclined to say we resolve this, for what it's worth.

fgiunchedi subscribed.

No longer valid, I think; also, the MXes now use Postfix.