Requests for new JobQueue monitoring capabilities
Closed, ResolvedPublic

Assigned To

Authored By

	• Pchelolo
	Sep 13 2017, 2:48 AM

Description

As we're building the new JobQueue based on Event-Platform, we have an opportunity to rethink the monitoring of the queue.

By using ChangeProp we get some nice perks out of the box that we can monitor:

* Rate of posting to the queue, by job type
* Rate of processing of each job type
* Delay between the posting of the root job and the execution of the leaf job
* Backlog, in number of messages, per job type
* Rates of retries
* Deduplication rates
* ChangeProp memory usage stats
* Redis usage stats
* Kafka-ChangeProp RTT
* Detailed breakdown of the root jobs being processed right now: which templates were edited, when, etc.
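As an illustration of the backlog metric listed above, here is a minimal sketch of how a per-job-type backlog could be derived from Kafka offsets. It assumes each job type maps to its own topic, and that the log end offsets and the consumer group's committed offsets have already been fetched; the function name and the input shape are hypothetical and not part of ChangeProp.

```python
from collections import defaultdict


def backlog_per_job_type(end_offsets, committed_offsets):
    """Return the number of unprocessed messages per job type.

    Both arguments map (job_type, partition) -> offset. This input shape
    is an assumption made for the sketch, not an existing API.
    """
    backlog = defaultdict(int)
    for (job_type, partition), end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get((job_type, partition), 0)
        # Clamp at zero in case the two snapshots were taken at different times.
        backlog[job_type] += max(end - committed, 0)
    return dict(backlog)
```

The backlog is summed over partitions, so a job type whose topic has several partitions still yields a single number.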

I think we can do better and create some more interesting metrics or scripts to get even more insight into what's going on. I've created this brainstorming task to collect ideas about what people want to see with regard to monitoring and debugging the new queue. Please add your wildest ideas here, and then we can discuss whether each one is possible to implement and, if it is, how to do it.

Related Objects

Status	Assigned	Task
Resolved	• Pchelolo	T157088 [EPIC] Develop a JobQueue backend based on EventBus
Resolved	• Pchelolo	T175780 Requests for new JobQueue monitoring capabilities
Declined	• Pchelolo	T175952 Split ChangeProp metrics by wiki

Event Timeline

• Pchelolo created this task. Sep 13 2017, 2:48 AM

Restricted Application added a project: Analytics. Sep 13 2017, 2:48 AM

Restricted Application added a subscriber: Aklapper.

This is very promising. I was in the process of writing down my own requirements, and it seems most things are already covered, although it's not clear from your post whether we can have per-wiki stats as well as per-job stats.

Since we cannot count jobs in the queue properly anymore, we would like the following metrics to be collected and made available:
* Processing lag (defined as the Xth percentile of the age of rootJobTimestamps in the last minute, in the current model for the jobqueue), for each job type and for each wiki.
* Rates of job insertion, again for each job type on each wiki, and corresponding totals
* Rate of job processing, as above
* Rate of job failures, as above
* Rate of jobs dropped because of deduplication, as above
* Ops and services will have to come up with sensible alerts to set up based on those metrics.
* Every ChangeProp instance should expose an endpoint for health-checking (this is distinct from the global metrics above)

There is a second part to my list of "requirements", but I'll post those to a separate ticket.
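To make the processing lag definition above concrete, here is a minimal sketch under stated assumptions. It takes one sample per executed job, keeps those executed within the last minute, and reports a nearest-rank percentile of the root job's age for each (wiki, job type) pair. The sample tuple layout, the function name, and the choice of the nearest-rank method are all assumptions made for illustration; nothing here reflects how ChangeProp actually reports metrics.

```python
import math
from collections import defaultdict


def processing_lag(samples, now, percentile=99, window=60):
    """Xth percentile of root job age per (wiki, job_type) over the last `window` seconds.

    `samples` is an iterable of (wiki, job_type, root_job_timestamp, executed_at),
    all times in seconds. This layout is hypothetical.
    """
    ages = defaultdict(list)
    for wiki, job_type, root_ts, executed_at in samples:
        # Keep only jobs executed inside the window ending at `now`.
        if now - window <= executed_at <= now:
            ages[(wiki, job_type)].append(executed_at - root_ts)

    result = {}
    for key, values in ages.items():
        values.sort()
        # Nearest-rank percentile: smallest value covering `percentile` % of samples.
        rank = max(math.ceil(percentile / 100 * len(values)), 1)
        result[key] = values[rank - 1]
    return result
```

Keying on the (wiki, job type) pair keeps the per-wiki and per-job-type breakdowns in one structure; totals per job type could be computed by merging the age lists before taking the percentile rather than by averaging percentiles.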

• mobrovac added a project: MediaWiki-Core-JobQueue. Sep 14 2017, 1:40 PM

• Pchelolo created subtask T175952: Split ChangeProp metrics by wiki. Sep 14 2017, 7:34 PM

• fdans moved this task from Incoming to Radar on the Analytics board. Sep 21 2017, 4:36 PM

• mobrovac added a project: Platform Team Legacy (Designing). Dec 20 2018, 12:55 PM

I think we got all of this, except per-wiki metrics. That's tracked in T175952, so I'm gonna close this one as Resolved.

• Pchelolo closed subtask T175952: Split ChangeProp metrics by wiki as Declined. Mar 18 2020, 5:12 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM
