
Improve / extend prometheus metrics exported by mercurius
Open, Low, Public

Description

With a couple of weeks of mileage using mercurius to drive videoscaling, there's been some discussion about potential improvements to the prometheus metrics it exports.

Collecting some ideas from discussion with @Joe and @hnowlan, a non-exhaustive set of items that have come up includes:

New metrics:

  • Instance state: A running mercurius instance has two states, active and draining (i.e., consumer shut down, still processing jobs). It might be nice to have an instance state gauge metric - e.g., to know when state transitions happen, how many draining instances exist, etc.
  • Concurrent jobs: Having a gauge that exports the number of in-flight jobs might be nice - e.g., for a single view of total concurrency across instances, for answering questions about calibrating concurrency limits, etc.
  • Backlog time: While changeprop exports a distribution of job backlog times, mercurius does not. Having a wallclock time to quantify backlog, alongside the Kafka lag reported by Burrow (log offset), would be nice to have. Doing this properly gets complicated in the presence of recursive jobs, etc. (i.e., current vs. root event time), and opens a potential can of worms related to message encoding / schema (which mercurius is pleasantly oblivious to at the moment).
  • Upstream errors: We can see when mercurius fails to process a job, but it doesn't appear that we have a way to graph upstream errors coming from shellbox, which would be nice even upon retries.
  • Worker state: This may be useful in debugging edge cases, especially when tearing down superseded instances - how many workers are shut down/shutting down/active.
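As a rough illustration of the concurrent jobs, instance state, and backlog time ideas above, here is a minimal Python sketch. It is not mercurius code: the `Gauge` class is a stdlib stand-in for a real Prometheus client gauge, and every name in it (`in_flight_jobs`, `instance_draining`, `track_job`, `backlog_seconds`) is hypothetical. The backlog helper assumes an event timestamp is available from the message, which is exactly the encoding / schema question noted above.

```python
import threading
from contextlib import contextmanager


class Gauge:
    """Minimal thread-safe gauge; a stand-in for a Prometheus client gauge."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self, n=1):
        with self._lock:
            self._value += n

    def dec(self, n=1):
        self.inc(-n)

    def set(self, value):
        with self._lock:
            self._value = value

    def value(self):
        with self._lock:
            return self._value


# Hypothetical metrics, named for illustration only.
in_flight_jobs = Gauge()     # number of jobs currently being processed
instance_draining = Gauge()  # 0 = active, 1 = draining


@contextmanager
def track_job(gauge):
    """Count a job as in flight for the duration of the with-block."""
    gauge.inc()
    try:
        yield
    finally:
        # Runs on success and on failure, so the gauge cannot leak upward.
        gauge.dec()


def backlog_seconds(event_time, now):
    """Wallclock backlog: time since the event was produced, floored at zero.

    Assumes event_time (seconds since the epoch) can be read from the message.
    For recursive jobs, the caller must choose current vs. root event time.
    """
    return max(0.0, now - event_time)
```

Summing such an in-flight gauge across instances would give the single view of total concurrency, and the draining gauge would make state transitions visible as steps on a graph.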

Metrics improvements:

  • Job provenance: Once a message is consumed from Kafka and mercurius_kafka_consumer_messages is updated, we don’t carry provenance information to "downstream" metrics. It would be nice if we at least annotated the latter with topic (which is 1:1 with job type).
  • Processing time: Mercurius currently advertises a mercurius_job_duration_seconds histogram, but it does not appear to be useful as it exists now (e.g., buckets need a much wider range).
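To make the two improvements above concrete, here is a small Python sketch of exponentially spaced histogram buckets combined with a per-topic label. It is only an illustration under stated assumptions: the bucket parameters (start of 1 second, factor of 4, 8 buckets) and the topic name are made up, and the helpers `exponential_buckets` and `observe` are hypothetical rather than taken from mercurius.

```python
import bisect
import math
from collections import Counter


def exponential_buckets(start, factor, count):
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor**i for i in range(count)]


def bucket_for(value, bounds):
    """Smallest upper bound that is >= value, or +Inf if none is large enough."""
    i = bisect.bisect_left(bounds, value)
    return bounds[i] if i < len(bounds) else math.inf


# Illustrative bounds only: 1s up to 16384s (about 4.5 hours).
BOUNDS = exponential_buckets(1.0, 4.0, 8)

# Observation counts keyed by (topic, upper bound). These are per-bucket
# counts; a Prometheus histogram exposes them cumulatively.
observations = Counter()


def observe(topic, duration_seconds):
    """Record one job duration under its Kafka topic label."""
    observations[(topic, bucket_for(duration_seconds, BOUNDS))] += 1


# Hypothetical topic name, for illustration.
observe("example-topic", 90.0)     # lands in the 256s bucket
observe("example-topic", 20000.0)  # exceeds every bound, lands in +Inf
```

A geometric spacing like this covers several orders of magnitude with few buckets, which suits durations that range from seconds to hours. The cost is coarser resolution within each bucket.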

The purpose of this task is to track improvements covering some set of the above, and possibly others.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Custom buckets for mercurius_job_duration_seconds | repos/sre/mercurius!21 | swfrench | work/swfrench/metrics-improvements-5-custom-buckets | main
Add a server status gauge metric | repos/sre/mercurius!20 | swfrench | work/swfrench/metrics-improvements-4-server-status | main
Add a gauge to report active workers | repos/sre/mercurius!19 | swfrench | work/swfrench/metrics-improvements-3-active-workers | main
Add Kafka topic label to processing metrics | repos/sre/mercurius!17 | swfrench | work/swfrench/metrics-improvements-2-topic-labels | main
Restructure / correct existing prometheus metrics | repos/sre/mercurius!16 | swfrench | work/swfrench/metrics-improvements-1-restructure | main

Event Timeline

I have some lingering patches from a quiet stretch just before the end-of-year holiday, which should address a subset of these - including job provenance, processing time, concurrent jobs, and instance state.

jijiki triaged this task as Medium priority. Feb 3 2025, 1:34 PM
jijiki moved this task from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
jijiki moved this task from 🌻Mediawiki to Doing 😎 on the serviceops board.
Scott_French lowered the priority of this task from Medium to Low. Feb 7 2025, 1:46 AM

Alright, that should cover everything I've got lingering from December.