
job queue insert rate metrics gone from Grafana
Closed, Resolved · Public

Description

https://grafana.wikimedia.org/d/000000107/job-queue-health (deprecated)
and
https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus (considered canonical)

are showing N/A insert rates.

All other metrics, including the processing rate, seem to be OK. I'm not sure whether this is a dashboard or a Prometheus issue, but the metric seems important enough to file a bug for.

Event Timeline

jcrespo created this task. Nov 14 2019, 6:38 AM
Restricted Application added a subscriber: Aklapper. Nov 14 2019, 6:38 AM
elukey added a subscriber: elukey. Nov 14 2019, 8:30 AM

Just "fixed" https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus by removing the cluster='eventbus' label filter from the queries of the broken metrics.
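
For anyone making the same kind of fix on another dashboard, here is a rough sketch of what dropping a label matcher from a query expression amounts to. It is only an illustration: the metric name `hypothetical_job_inserts_total` and the helper `drop_label_matcher` are made up for this example and are not taken from the actual dashboard, and the regex is not a full PromQL parser (it will misbehave on label values that contain braces or matcher-like text).

```python
import re


def drop_label_matcher(expr: str, label: str) -> str:
    """Remove every matcher on `label` from the {...} selectors in `expr`."""
    # One matcher: label name, an operator (=, !=, =~, !~), a quoted value,
    # and an optional trailing comma.
    matcher = re.compile(
        r'\s*\b' + re.escape(label)
        + r'''\s*(?:=~|!~|!=|=)\s*(?:"[^"]*"|'[^']*')\s*,?'''
    )

    def clean(selector: re.Match) -> str:
        # Strip the matcher, then tidy any comma left dangling at the end.
        body = matcher.sub("", selector.group(1)).strip().rstrip(",").strip()
        # Drop the braces entirely if no matchers remain.
        return "{" + body + "}" if body else ""

    return re.sub(r"\{([^{}]*)\}", clean, expr)


if __name__ == "__main__":
    # Hypothetical query, for illustration only.
    broken = (
        "sum(rate(hypothetical_job_inserts_total"
        "{cluster='eventbus', type=\"refreshLinks\"}[5m]))"
    )
    print(drop_label_matcher(broken, "cluster"))
```

Removing the matcher widens the selector to every series of that metric regardless of its `cluster` value, so it is worth checking that no unrelated series get summed in.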

As discussed on IRC, the EventBus service has been deprecated in favor of EventGate main, and this is the new dashboard: https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&orgId=1
Not sure if we still need the old eventbus one or if it can be deleted; let's ask CPT for confirmation :)

Either way, deprecated or not, we will probably want better discoverability (tags) for the new dashboards, an update to the documentation at https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue, and potentially a link to the above dashboards on the old one (just a suggested fix).

My personal use case is "did I break the job queue on master failover? Let's check the metrics", and I didn't find that new dashboard by searching for "job queue" on Grafana or Wikitech.

Yes yes, we need it! All the JobQueue stats are there, so please don't remove it.

Just to be clear, I wasn't suggesting removing it; it was mostly about fixing the missing metrics and making things easier to find and document.

fgiunchedi moved this task from Inbox to Radar on the observability board. Dec 9 2019, 11:15 AM
Pchelolo closed this task as Resolved. Wed, Jan 8, 7:18 PM
Pchelolo claimed this task.
Pchelolo added a subscriber: Pchelolo.

I've fixed up the last few broken metrics. It seems that the job queue dashboard is in good condition now.

As an addendum, could something be improved regarding T238296#5662905? It seems there are at least 3 dashboards related to job queue health, and while https://grafana.wikimedia.org/d/000000107/job-queue-health is clearly marked as deprecated, I didn't know about T238296#5662894. I would suggest clarifying the current status on Wikitech or in Grafana itself.