With a couple of weeks of mileage using mercurius to drive videoscaling, there's been some discussion about potential improvements to the prometheus metrics it exports.
Collecting some ideas from discussion with @Joe and @hnowlan, here is a non-exhaustive set of items that have come up:
New metrics:
- Instance state: A running mercurius instance has two states, active and draining (i.e., consumer shut down, still processing jobs). It might be nice to have an instance state gauge metric - e.g., to know when state transitions happen, how many draining instances exist, etc.
- Concurrent jobs: Having a gauge that exports the number of in-flight jobs might be nice - e.g., for a single view of total concurrency across instances, for answering questions about calibrating concurrency limits, etc.
- Backlog time: While changeprop exports a distribution of job backlog times, mercurius does not. It would be nice to have a wallclock time to quantify backlog, alongside the Kafka lag reported by Burrow (log offset). Doing this properly gets complicated in the presence of recursive jobs, etc. (i.e., current vs. root event time) and opens a potential can of worms related to message encoding / schema (which mercurius is pleasantly oblivious to at the moment).
- Upstream errors: We can see when mercurius fails to process a job, but it doesn't appear that we have a way to graph upstream errors coming from shellbox. That would be nice to have even for errors that are subsequently retried.
- Worker state: This may be useful in debugging edge cases, especially when tearing down superseded instances - how many workers are shut down/shutting down/active.
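To make the instance state and concurrent jobs ideas a bit more concrete, here is a rough, library-free sketch of the values such gauges might report. Everything in it is an assumption for illustration: the metric names `mercurius_instance_state` and `mercurius_jobs_in_flight`, the `InstanceGauges` class, and the choice of Python are hypothetical and not taken from mercurius itself. The state gauge is modeled as one series per state (1 for the current state, 0 otherwise), so that summing the `draining` series across instances would give the number of draining instances.

```python
import threading
from enum import Enum


class InstanceState(Enum):
    # The two instance states described in the task.
    ACTIVE = "active"
    DRAINING = "draining"


class InstanceGauges:
    """Holds the values behind two hypothetical gauges (names are assumptions)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = InstanceState.ACTIVE
        self._in_flight = 0

    def job_started(self):
        with self._lock:
            self._in_flight += 1

    def job_finished(self):
        with self._lock:
            self._in_flight -= 1

    def start_draining(self):
        # Consumer shut down, in-flight jobs still being processed.
        with self._lock:
            self._state = InstanceState.DRAINING

    def samples(self):
        """Return {(metric_name, state_label): value} for a scrape."""
        with self._lock:
            out = {
                ("mercurius_instance_state", s.value): 1 if s is self._state else 0
                for s in InstanceState
            }
            out[("mercurius_jobs_in_flight", "")] = self._in_flight
            return out
```

With this shape, a state transition shows up as the `active` series dropping to 0 at the same scrape where `draining` rises to 1, and the in-flight gauge can keep reporting while the instance drains.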
Metrics improvements:
- Job provenance: Once a message is consumed from Kafka and mercurius_kafka_consumer_messages is updated, we don’t carry provenance information to "downstream" metrics. It would be nice if we at least annotated the latter with the topic (which is 1:1 with job type).
- Processing time: Mercurius currently advertises a mercurius_job_duration_seconds histogram, but it does not appear to be useful as it exists now (e.g., buckets need a much wider range).
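As a rough illustration of what "a much wider range" could mean for the mercurius_job_duration_seconds buckets, the sketch below generates exponentially spaced upper bounds. The helper `exponential_buckets` and the specific start, factor, and count are assumptions chosen for illustration, not a proposal for the actual values; Prometheus client libraries generally ship an equivalent helper, so this is only meant to show the resulting layout.

```python
def exponential_buckets(start, factor, count):
    """Return `count` bucket upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0:
        raise ValueError("start must be positive")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return [start * factor**i for i in range(count)]


# Hypothetical layout in seconds: 1, 4, 16, 64, 256, 1024, 4096, 16384
buckets = exponential_buckets(1.0, 4.0, 8)
```

Eight buckets growing by a factor of 4 span from one second to roughly four and a half hours, which keeps the series count low while covering both short and very long jobs. The right bounds would depend on the observed duration distribution per job type.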
The purpose of this task is to track improvements covering some set of the above, and possibly others.