Better metrics from the job queue
Closed, Resolved
Public

Description

It'd be really helpful to get some information from the job queue per job type (and maybe per wiki?):

  1. time spent processing that job type
  2. number of jobs of that type started
  3. average execution time (basically 1 divided by 2 if both are rrdtool counters)
  4. number of jobs in the queue
  5. average time between job finished and job queued
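
As a rough illustration of item 3, here is a minimal Python sketch of deriving average execution time from items 1 and 2 when both are monotonically increasing counters. The `CounterSample` type and its field names are hypothetical and not part of JobQueue or rrdtool; the sketch only assumes the two counters are sampled at the same instants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSample:
    """One reading of the two hypothetical per-job-type counters."""
    total_seconds: float  # cumulative time spent processing this job type (item 1)
    jobs_started: int     # cumulative number of jobs of this type started (item 2)


def average_execution_time(earlier: CounterSample,
                           later: CounterSample) -> Optional[float]:
    """Return mean seconds per job between two samples, or None if undefined."""
    delta_seconds = later.total_seconds - earlier.total_seconds
    delta_jobs = later.jobs_started - earlier.jobs_started
    if delta_jobs <= 0 or delta_seconds < 0:
        # No jobs started in the interval, or a counter was reset.
        return None
    return delta_seconds / delta_jobs


# Example: 10 more jobs and 30 more seconds of processing between samples.
earlier = CounterSample(total_seconds=120.0, jobs_started=40)
later = CounterSample(total_seconds=150.0, jobs_started=50)
print(average_execution_time(earlier, later))  # 3.0
```

Dividing the deltas rather than the raw totals gives the average over the sampling interval, which is what a graph of two counters would show.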

Linked from: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150825-Redis

Details

Reference
bz60105

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:57 AM
bzimport set Reference to bz60105.
bzimport added a subscriber: Unknown Object (MLST).

Nik: Do you know if this would require changes in JobQueue ("MediaWiki" product) first, or server setup only (ganglia etc.), or both?

I think it is likely.

Sorry, to be clearer: I imagine that getting all of them would require changing MediaWiki in some way.

Krenair added a subscriber: Krenair. May 3 2015, 5:14 PM

Note that per-status stats for each wiki are a non-starter; see T95913 (essentially, too many metrics).

fgiunchedi merged a task: Restricted Task. Jun 1 2015, 11:06 PM
fgiunchedi added a subscriber: Joe.
jcrespo raised the priority of this task from Medium to High. Aug 25 2015, 9:20 AM
jcrespo added a subscriber: jcrespo.

Elevating this to High after polling other ops and devs on IRC: these metrics could have detected a recent outage (and maybe prevented most of its consequences?) before it was too late.

I see a much improved graph here: https://grafana.wikimedia.org/dashboard/db/job-queue-rate

Maybe this has already been resolved?

Krinkle moved this task from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Krinkle updated the task description.
Krinkle removed a subscriber: wikibugs-l-list.