Requests for new JobQueue monitoring capabilities
Closed, Resolved · Public

Description

As we're building the new JobQueue on top of Event-Platform, we have an opportunity to rethink how the queue is monitored.

By using ChangeProp we get some nice perks out of the box that we can monitor:

  • Rate of posting to the queue by job type
  • Rate of processing of each job type
  • Delay between when the root job was posted and when the leaf job was executed
  • Backlog in number of messages per job type
  • Rates of retries
  • Deduplication rates
  • ChangeProp memory usage stats
  • Redis usage stats
  • Kafka-ChangeProp RTT
  • Detailed breakdown of the root jobs being processed right now - which templates were edited, when, etc
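
To make the per-job-type rates above concrete, here is a minimal sketch of how they could be derived from a stream of queue events. The event shape (`type` and `kind` keys) and the `kind` values are assumptions made for illustration; they are not the actual ChangeProp or Event-Platform schema.

```python
from collections import Counter


def per_type_rates(events, window_seconds):
    """Return events per second for each (job type, event kind) pair.

    `events` is an iterable of dicts with hypothetical keys:
      type: the job type, e.g. "refreshLinks"
      kind: one of "posted", "processed", "retried", "deduplicated"
    All events are assumed to fall inside one window of `window_seconds`.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = Counter((event["type"], event["kind"]) for event in events)
    return {key: count / window_seconds for key, count in counts.items()}
```

Keying on the pair keeps posting, processing, retry and deduplication rates in one table, so they can be compared per job type.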

I think we can do better and create more interesting metrics or scripts to get even more insight into what's going on. I've created this brainstorming task to collect ideas about what people want to see when monitoring and debugging the new queue. Please add your wildest ideas here; we can then discuss whether each one is possible to implement and, if so, how.

Event Timeline

This is very promising. I was in the process of writing down my own requirements, and it seems most things are already covered, although it's not clear from your post whether we can have per-wiki stats as well as per-job stats.

Since we can no longer count the jobs in the queue properly, we would like the following metrics to be collected and made available:
 * Processing lag (defined as the Xth percentile of age of rootJobTimestamps in the last minute, in the current model for the jobqueue), for each job type and for each wiki.
 * Rates of job insertion, again for each job type on each wiki, and corresponding totals
 * Rate of job processing, as above
 * Rate of job failures, as above
 * Rate of jobs dropped because of deduplication, as above
 * Ops and services will have to come up with sensible alerts to set up based on those metrics.
 * Every changeprop instance should expose an endpoint for health-checking (this is distinct from the global metrics above)
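
As an illustration of the processing-lag definition above, here is a minimal sketch that computes a percentile of root-job age per job type and per wiki over the last minute. The record fields (`type`, `wiki`, `root_job_timestamp`, `executed_at`), the nearest-rank percentile method, and measuring age at execution time are all assumptions for the sake of the example, not the actual implementation.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def processing_lag(executions, now, pct=99, window_seconds=60):
    """Return {(job type, wiki): pct-th percentile of root-job age}.

    `executions` is an iterable of dicts with hypothetical keys:
      type, wiki: job type and wiki name
      root_job_timestamp: when the root job was posted (epoch seconds)
      executed_at: when the leaf job was executed (epoch seconds)
    Only jobs executed within the last `window_seconds` are considered.
    """
    ages = defaultdict(list)
    for job in executions:
        if now - window_seconds <= job["executed_at"] <= now:
            age = job["executed_at"] - job["root_job_timestamp"]
            ages[(job["type"], job["wiki"])].append(age)
    return {key: percentile(vals, pct) for key, vals in ages.items()}
```

Summing or regrouping the same ages by job type alone would give the corresponding per-type totals.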

There is a second part to my list of "requirements", but I'll post those to a separate ticket.

I think we've got all of this except per-wiki metrics. Those are tracked in T175952, so I'm going to close this one as Resolved.