Review prometheus ORES rules for completeness
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Halfak
	Sep 20 2019, 6:48 PM

Description

See https://gerrit.wikimedia.org/r/c/operations/puppet/+/479563/4/hieradata/role/common/ores.yaml.

Does this cover all of metrics we need to care about?

Details

Subject	Repo	Branch	Lines +/-
hiera: add missing metrics for ores statsd exporter	operations/puppet	production	+4 -4
hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings	operations/puppet	production	+3 -0
hiera: fix typo in ores statsd exporter mapping	operations/puppet	production	+1 -1
hiera: update ores statsd exporter mappings	operations/puppet	production	+42 -116
hiera: update ores matching rules and drop undefined metrics	operations/puppet	production	+9 -4

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Open	None	T228380 Tech debt: sunsetting of Graphite (part 1)
Open	colewhite	T205870 Fully migrate producers off statsd
Resolved	• ACraze	T233448 Review prometheus ORES rules for completeness
Declined	colewhite	T239833 StatsD Exporter drops relayed metrics

Event Timeline

Halfak created this task.Sep 20 2019, 6:48 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 20 2019, 6:48 PM

Halfak added subscribers: colewhite, • ACraze.Sep 20 2019, 6:48 PM

Halfak triaged this task as Medium priority.Oct 9 2019, 8:45 PM

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

If the statsd-exporter sidecar approach is appropriate for ORES, there are quite a few metrics with unclear type and meaning. I've constructed a tree to assist us in defining them.

Each item wrapped in parenthesis is a metric that needs a type and description. On some cursory playing with the data, it appears that a sum of the sub-metrics does not always equal the parent count. For these, it's not safe to assume that the type is count and the sub-metrics are more specifically-defined parts of the parent.

ores
- <host>
  - datasources_extracted (type: timing description: ?)
    - context (type: timing description: ?)
      - model (type: timing description: ?)
  - locking_response_time (type: timing description: ?)
    - poolcounter (type: timing description: ?)
  - precache_cache_hit (type: count description: cache hit count overall total)
    - context (type: count description: cache hit count per wiki total)
      - model (type: count description: cache hit count per conclusion total)
  - precache_cache_miss (type: count description: cache miss count overall total)
    - context (type: count description: cache miss count per wiki total)
      - model (type: count description: cache miss count per conclusion total)
  - precache_request (type: timing description: ?)
    - context (type: timing description: ?)
      - model (type: timing description: ?)
  - response (type: count description: total count of responses)
    - <http status> (type: count description: total count of responses of status code)
      - <wiki?> (type: count description: total count of responses of status code for wiki)
  - revision_scored (type: count description: ?)
    - context (type: count description: ?)
      - model (type: count description: ?)
  - score_cache_hit (type: count description: cache hit count overall total)
    - context (type: count description: cache hit count total for wiki)
      - model (type: count description: cache hit count)
  - score_cache_miss (type: count description: cache hit count overall total)
    - context (type: count description: cache miss count total for wiki)
      - model (type: count description: cache miss count)
  - score_errored (type: count description: ?)
    - context (type: count description: ?)
      - model (type: count description: ?)
  - score_processed (type: timing description: ?)
    - context (type: timing description: ?)
      - model (type: timing description: ?)
  - score_timed_out (type: timing description: ?)
    - context (type: timing description: ?)
      - model (type: timing description: ?)
  - scores_request (type: timing description: ?)
    - context (type: timing description: ?)
      - model (type: timing description: ?)

colewhite added a parent task: T205870: Fully migrate producers off statsd.Nov 6 2019, 12:15 AM

Change 548938 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores matching rules and drop undefined metrics

https://gerrit.wikimedia.org/r/548938

gerritbot added a project: Patch-For-Review.Nov 6 2019, 12:18 AM

Change 548938 merged by Cwhite:
[operations/puppet@production] hiera: update ores matching rules and drop undefined metrics

https://gerrit.wikimedia.org/r/548938

Maintenance_bot removed a project: Patch-For-Review.Nov 12 2019, 10:10 PM

faidon renamed this task from Review promethius ORES rules for completeness to Review prometheus ORES rules for completeness.Nov 20 2019, 3:48 PM

I did more research and found a usage pattern that didn't initially occur to me.

Assuming a namespace template of ores.<hostname>.<action>.<context>.<model>, it appears that for each action, a duration is recorded at that 'top' level. Then, for that action, the same duration is recorded for the context of that action. Then, for each matching model, that same duration is recorded. This is the reason metrics at the action level are not a sum of all durations at the model level. ~~In short, the namespace appears to be inverted.~~ This is not necessarily true. Instead, perhaps the timings are all in a greater "requests" bucket?

If the template were ores.<hostname>.<model>.<action>.<context> or ores.<hostname>.<model>.<context>.<action>, it would stand to reason that actions and contexts are attributes of models. Writing to multiple namespaces in this pattern would not affect roll-up aggregation. While it would not affect roll-up aggregation, it makes the metrics difficult to work with. After reviewing dashboards, it appears that model isn't used at all.

~~What this means for conversion to Prometheus is models should be the top level metric and contexts|actions should be labels.~~

I'm curious to hear your thoughts on the matter, but as we're quickly coming up on the end of the quarter we need something to meet the goal. I will start implementing ~~the top-level model approach~~ something, and we can revisit it at a more convenient time.

colewhite added a subtask: T239833: StatsD Exporter drops relayed metrics.Dec 4 2019, 8:17 PM

Since it's not used in dashboards, what do we do with the model? I imagine it's useful, but I'm not sure how.

Change 554656 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores statsd exporter mappings

https://gerrit.wikimedia.org/r/554656

gerritbot added a project: Patch-For-Review.Dec 5 2019, 1:19 AM

Change 554656 merged by Cwhite:
[operations/puppet@production] hiera: update ores statsd exporter mappings

https://gerrit.wikimedia.org/r/554656

Nice work, Cole. You've reverse engineered correctly!

Re. the sub-type counts not adding up to larger metrics, that makes sense in a lot of cases, but we can certainly change the order in which metric-names are defined. See https://github.com/wikimedia/ores/blob/master/ores/metrics_collectors/statsd.py

From this file, you can see what type of events we capture with each metric. We could re-order the names in the keys as necessary. It would take a little work to clean up our dashboards, but this seems do-able with a day or two of effort.

Should we do the cleanup that I proposed above?

I see the value in a refactor/cleanup if what is currently being captured is not everything we need to (at least) recreate the current dashboards.

I propose we update the ORES dashboards to use the Prometheus metrics first, and then re-evaluate from there.

OK that sounds good. What are the next steps for updating the dashboards? How do we map our current metrics onto Promethius-generated metrics.

Change 572065 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: fix typo in ores statsd exporter mapping

https://gerrit.wikimedia.org/r/572065

Change 572065 merged by Cwhite:
[operations/puppet@production] hiera: fix typo in ores statsd exporter mapping

https://gerrit.wikimedia.org/r/572065

Change 572082 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings

https://gerrit.wikimedia.org/r/572082

Change 572082 merged by Cwhite:
[operations/puppet@production] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings

https://gerrit.wikimedia.org/r/572082

I went ahead and updated this dashboard and added the Prometheus version next to the Graphite version as an example. During the process, I amended a couple metrics that were missed or misconfigured.

Hi @colewhite! I'm sorry for the big delay on this. Was looking this over today and I think I might need a tutorial for how to convert grafana graphs to promethius. I think I'll be able to complete this process if I simply finish updating our graphs. Do you have any resources you recommend?

I've found the Prometheus docs and this blog post very helpful for translating metrics.

Halfak claimed this task.May 4 2020, 5:18 PM

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

@Halfak I got a code example from the search team on how they are defining and using Prometheus metrics, might be helpful: https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/kafka/bulk_daemon.py#L74

Halfak reassigned this task from Halfak to • ACraze.Jun 1 2020, 4:55 PM

colewhite closed subtask T239833: StatsD Exporter drops relayed metrics as Declined.Jun 8 2020, 3:13 PM

Change 608102 had a related patch set uploaded (by Accraze; owner: Accraze):
[operations/puppet@production] hiera: add missing metrics for ores statsd exporter

https://gerrit.wikimedia.org/r/608102

@colewhite I found a couple of missing metrics and added them in the above patch set. I'm still slowly converting the ORES advanced metrics dashboard, but things are looking good for the most part.

One question I do have is about how to monitor our celery/redis queues. Is there a redis exporter set up somewhere? It seems like using redis_key_size{key="celery"} yields 'No data' on both eqiad and codfw.

Change 608102 merged by Cwhite:
[operations/puppet@production] hiera: add missing metrics for ores statsd exporter

https://gerrit.wikimedia.org/r/c/operations/puppet/ /608102

There is a redis exporter available and installed on the rdb servers. However, there are no instances of the redis exporter configured to export key length/size.