Page MenuHomePhabricator

Review prometheus ORES rules for completeness
Closed, ResolvedPublic

Description

Event Timeline

Halfak triaged this task as Medium priority.Oct 9 2019, 8:45 PM
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

If the statsd-exporter sidecar approach is appropriate for ORES, there are quite a few metrics with unclear type and meaning. I've constructed a tree to assist us in defining them.

Each item wrapped in parenthesis is a metric that needs a type and description. On some cursory playing with the data, it appears that a sum of the sub-metrics does not always equal the parent count. For these, it's not safe to assume that the type is count and the sub-metrics are more specifically-defined parts of the parent.

  • ores
    • <host>
      • datasources_extracted (type: timing description: ?)
        • context (type: timing description: ?)
          • model (type: timing description: ?)
      • locking_response_time (type: timing description: ?)
        • poolcounter (type: timing description: ?)
      • precache_cache_hit (type: count description: cache hit count overall total)
        • context (type: count description: cache hit count per wiki total)
          • model (type: count description: cache hit count per conclusion total)
      • precache_cache_miss (type: count description: cache miss count overall total)
        • context (type: count description: cache miss count per wiki total)
          • model (type: count description: cache miss count per conclusion total)
      • precache_request (type: timing description: ?)
        • context (type: timing description: ?)
          • model (type: timing description: ?)
      • response (type: count description: total count of responses)
        • <http status> (type: count description: total count of responses of status code)
          • <wiki?> (type: count description: total count of responses of status code for wiki)
      • revision_scored (type: count description: ?)
        • context (type: count description: ?)
          • model (type: count description: ?)
      • score_cache_hit (type: count description: cache hit count overall total)
        • context (type: count description: cache hit count total for wiki)
          • model (type: count description: cache hit count)
      • score_cache_miss (type: count description: cache hit count overall total)
        • context (type: count description: cache miss count total for wiki)
          • model (type: count description: cache miss count)
      • score_errored (type: count description: ?)
        • context (type: count description: ?)
          • model (type: count description: ?)
      • score_processed (type: timing description: ?)
        • context (type: timing description: ?)
          • model (type: timing description: ?)
      • score_timed_out (type: timing description: ?)
        • context (type: timing description: ?)
          • model (type: timing description: ?)
      • scores_request (type: timing description: ?)
        • context (type: timing description: ?)
          • model (type: timing description: ?)

Change 548938 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores matching rules and drop undefined metrics

https://gerrit.wikimedia.org/r/548938

Change 548938 merged by Cwhite:
[operations/puppet@production] hiera: update ores matching rules and drop undefined metrics

https://gerrit.wikimedia.org/r/548938

faidon renamed this task from Review promethius ORES rules for completeness to Review prometheus ORES rules for completeness.Nov 20 2019, 3:48 PM

I did more research and found a usage pattern that didn't initially occur to me.

Assuming a namespace template of ores.<hostname>.<action>.<context>.<model>, it appears that for each action, a duration is recorded at that 'top' level. Then, for that action, the same duration is recorded for the context of that action. Then, for each matching model, that same duration is recorded. This is the reason metrics at the action level are not a sum of all durations at the model level. In short, the namespace appears to be inverted. This is not necessarily true. Instead, perhaps the timings are all in a greater "requests" bucket?

If the template were ores.<hostname>.<model>.<action>.<context> or ores.<hostname>.<model>.<context>.<action>, it would stand to reason that actions and contexts are attributes of models. Writing to multiple namespaces in this pattern would not affect roll-up aggregation. While it would not affect roll-up aggregation, it makes the metrics difficult to work with. After reviewing dashboards, it appears that model isn't used at all.

What this means for conversion to Prometheus is models should be the top level metric and contexts|actions should be labels.

I'm curious to hear your thoughts on the matter, but as we're quickly coming up on the end of the quarter we need something to meet the goal. I will start implementing the top-level model approach something, and we can revisit it at a more convenient time.

Since it's not used in dashboards, what do we do with the model? I imagine it's useful, but I'm not sure how.

Change 554656 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores statsd exporter mappings

https://gerrit.wikimedia.org/r/554656

Change 554656 merged by Cwhite:
[operations/puppet@production] hiera: update ores statsd exporter mappings

https://gerrit.wikimedia.org/r/554656

Nice work, Cole. You've reverse engineered correctly!

Re. the sub-type counts not adding up to larger metrics, that makes sense in a lot of cases, but we can certainly change the order in which metric-names are defined. See https://github.com/wikimedia/ores/blob/master/ores/metrics_collectors/statsd.py

From this file, you can see what type of events we capture with each metric. We could re-order the names in the keys as necessary. It would take a little work to clean up our dashboards, but this seems do-able with a day or two of effort.

Should we do the cleanup that I proposed above?

I see the value in a refactor/cleanup if what is currently being captured is not everything we need to (at least) recreate the current dashboards.

I propose we update the ORES dashboards to use the Prometheus metrics first, and then re-evaluate from there.

OK that sounds good. What are the next steps for updating the dashboards? How do we map our current metrics onto Promethius-generated metrics.

Change 572065 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: fix typo in ores statsd exporter mapping

https://gerrit.wikimedia.org/r/572065

Change 572065 merged by Cwhite:
[operations/puppet@production] hiera: fix typo in ores statsd exporter mapping

https://gerrit.wikimedia.org/r/572065

Change 572082 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings

https://gerrit.wikimedia.org/r/572082

Change 572082 merged by Cwhite:
[operations/puppet@production] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings

https://gerrit.wikimedia.org/r/572082

I went ahead and updated this dashboard and added the Prometheus version next to the Graphite version as an example. During the process, I amended a couple metrics that were missed or misconfigured.

Hi @colewhite! I'm sorry for the big delay on this. Was looking this over today and I think I might need a tutorial for how to convert grafana graphs to promethius. I think I'll be able to complete this process if I simply finish updating our graphs. Do you have any resources you recommend?

I've found the Prometheus docs and this blog post very helpful for translating metrics.

@Halfak I got a code example from the search team on how they are defining and using Prometheus metrics, might be helpful: https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/kafka/bulk_daemon.py#L74

Change 608102 had a related patch set uploaded (by Accraze; owner: Accraze):
[operations/puppet@production] hiera: add missing metrics for ores statsd exporter

https://gerrit.wikimedia.org/r/608102

@colewhite I found a couple of missing metrics and added them in the above patch set. I'm still slowly converting the ORES advanced metrics dashboard, but things are looking good for the most part.

One question I do have is about how to monitor our celery/redis queues. Is there a redis exporter set up somewhere? It seems like using redis_key_size{key="celery"} yields 'No data' on both eqiad and codfw.

Change 608102 merged by Cwhite:
[operations/puppet@production] hiera: add missing metrics for ores statsd exporter

https://gerrit.wikimedia.org/r/c/operations/puppet/ /608102

There is a redis exporter available and installed on the rdb servers. However, there are no instances of the redis exporter configured to export key length/size.

Sorry for the delay @colewhite, everything looks good for our current needs! Moving to done on our workboard.