See https://gerrit.wikimedia.org/r/c/operations/puppet/+/479563/4/hieradata/role/common/ores.yaml.
Does this cover all of the metrics we need to care about?
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T228380 Tech debt: sunsetting of Graphite (part 1)
Open | | colewhite | T205870 Fully migrate producers off statsd
Resolved | | ACraze | T233448 Review prometheus ORES rules for completeness
Declined | | colewhite | T239833 StatsD Exporter drops relayed metrics
If the statsd-exporter sidecar approach is appropriate for ORES, there are quite a few metrics with unclear type and meaning. I've constructed a tree to assist us in defining them.
Each item wrapped in parentheses is a metric that needs a type and description. After some cursory exploration of the data, it appears that the sum of the sub-metrics does not always equal the parent count. For these, it is not safe to assume that the type is a count and that the sub-metrics are more specifically-defined parts of the parent.
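As a starting point for auditing the tree, here is a minimal sketch (the metric names are made up for illustration, not actual ORES keys) of grouping flat dotted Graphite names into a nested tree so the undefined items can be enumerated:

```python
def build_tree(metric_names):
    """Group flat dotted Graphite names into a nested dict tree, as a
    quick way to enumerate the (metric) items that still need a type
    and description."""
    tree = {}
    for name in metric_names:
        node = tree
        for part in name.split("."):
            node = node.setdefault(part, {})
    return tree

# Hypothetical metric names for illustration only.
names = [
    "ores.scores_request",
    "ores.scores_request.enwiki",
    "ores.scores_request.enwiki.damaging",
]
print(build_tree(names))
# {'ores': {'scores_request': {'enwiki': {'damaging': {}}}}}
```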
Change 548938 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores matching rules and drop undefined metrics
Change 548938 merged by Cwhite:
[operations/puppet@production] hiera: update ores matching rules and drop undefined metrics
I did more research and found a usage pattern that didn't initially occur to me.
Assuming a namespace template of ores.<hostname>.<action>.<context>.<model>, it appears that for each action, a duration is recorded at that 'top' level. Then, for that action, the same duration is recorded under the action's context. Then, for each matching model, that same duration is recorded again. This explains why metrics at the action level are not a sum of all durations at the model level. In short, the namespace appears to be inverted. That said, this is not necessarily the case; perhaps the timings all land in a broader "requests" bucket instead?
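A minimal simulation of the suspected write pattern (key names and values are assumptions for illustration, not the actual ORES keys) shows why the action-level count need not match the model-level sum:

```python
from collections import Counter

counts = Counter()

def record_timing(action, context, models, duration_ms):
    """Simulate the suspected statsd write pattern: one timing at the
    action level, one at action.context, and one per matching model.
    The duration itself is elided; we only count writes here."""
    counts[f"ores.{action}"] += 1
    counts[f"ores.{action}.{context}"] += 1
    for model in models:
        counts[f"ores.{action}.{context}.{model}"] += 1

# One request scored against two models:
record_timing("scores_request", "enwiki", ["damaging", "goodfaith"], 42)

# Model-level keys have three dots in this toy namespace.
model_sum = sum(v for k, v in counts.items() if k.count(".") == 3)
print(counts["ores.scores_request"], model_sum)  # 1 2 -- not equal
```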
If the template were ores.<hostname>.<model>.<action>.<context> or ores.<hostname>.<model>.<context>.<action>, it would stand to reason that actions and contexts are attributes of models. Writing to multiple namespaces in this pattern would not affect roll-up aggregation, but it does make the metrics difficult to work with. After reviewing dashboards, it appears that model isn't used at all.
What this means for the conversion to Prometheus is that models should be the top-level metric and contexts/actions should be labels.
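To make that shape concrete, here is a hedged sketch using a plain dict as a stand-in for a labelled Prometheus counter (the real setup would go through statsd-exporter mapping rules or a client library; all names here are illustrative):

```python
from collections import defaultdict

# Minimal stand-in for a labelled Prometheus counter,
# keyed by (model, context, action) label values.
scores = defaultdict(int)

def observe(model, context, action):
    scores[(model, context, action)] += 1

observe("damaging", "enwiki", "scores_request")
observe("damaging", "frwiki", "scores_request")

# With context/action as labels, roll-ups become simple label
# aggregations, e.g. the total for one model across all contexts:
total = sum(v for (m, c, a), v in scores.items() if m == "damaging")
print(total)  # 2
```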
I'm curious to hear your thoughts on the matter, but as we're quickly coming up on the end of the quarter, we need something in place to meet the goal. I will start implementing the top-level model approach now, and we can revisit it at a more convenient time.
Since it's not used in dashboards, what do we do with the model? I imagine it's useful, but I'm not sure how.
Change 554656 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: update ores statsd exporter mappings
Change 554656 merged by Cwhite:
[operations/puppet@production] hiera: update ores statsd exporter mappings
Nice work, Cole. You've reverse engineered correctly!
Re. the sub-type counts not adding up to larger metrics, that makes sense in a lot of cases, but we can certainly change the order in which metric-names are defined. See https://github.com/wikimedia/ores/blob/master/ores/metrics_collectors/statsd.py
From this file, you can see what type of events we capture with each metric. We could re-order the names in the keys as necessary. It would take a little work to clean up our dashboards, but this seems do-able with a day or two of effort.
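For example, re-ordering the dotted keys could be as simple as a helper like the one below (a sketch only; the index order assumes the ores.<hostname>.<action>.<context>.<model> template discussed above and is not taken from statsd.py):

```python
def reorder_key(key, order=(0, 1, 4, 2, 3)):
    """Hypothetical helper: rewrite a dotted statsd key from
    ores.<hostname>.<action>.<context>.<model> to
    ores.<hostname>.<model>.<action>.<context>."""
    parts = key.split(".")
    return ".".join(parts[i] for i in order)

# Illustrative key only:
print(reorder_key("ores.ores1001.scores_request.enwiki.damaging"))
# ores.ores1001.damaging.scores_request.enwiki
```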
I see the value in a refactor/cleanup if what is currently being captured is not everything we need to (at least) recreate the current dashboards.
I propose we update the ORES dashboards to use the Prometheus metrics first, and then re-evaluate from there.
OK, that sounds good. What are the next steps for updating the dashboards? How do we map our current metrics onto the Prometheus-generated metrics?
Change 572065 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: fix typo in ores statsd exporter mapping
Change 572065 merged by Cwhite:
[operations/puppet@production] hiera: fix typo in ores statsd exporter mapping
Change 572082 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings
Change 572082 merged by Cwhite:
[operations/puppet@production] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings
I went ahead and updated this dashboard, adding the Prometheus version next to the Graphite version as an example. During the process, I amended a couple of metrics that were missed or misconfigured.
Hi @colewhite! I'm sorry for the big delay on this. I was looking this over today, and I think I might need a tutorial on how to convert Grafana graphs to Prometheus. I think I'll be able to wrap up this process once I finish updating our graphs. Do you have any resources you'd recommend?
I've found the Prometheus docs and this blog post very helpful for translating metrics.
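As a rough illustration of the kind of translation involved (the Prometheus metric name, label, and window below are assumptions about what the exporter mappings produce, not verified names):

```
# Graphite (current dashboard panel):
sumSeries(ores.*.scores_request.enwiki.count)

# Prometheus equivalent, assuming a counter ores_scores_request_total
# with a context label:
sum(rate(ores_scores_request_total{context="enwiki"}[5m]))
```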
@Halfak I got a code example from the search team on how they are defining and using Prometheus metrics, might be helpful: https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/kafka/bulk_daemon.py#L74
Change 608102 had a related patch set uploaded (by Accraze; owner: Accraze):
[operations/puppet@production] hiera: add missing metrics for ores statsd exporter
@colewhite I found a couple of missing metrics and added them in the above patch set. I'm still slowly converting the ORES advanced metrics dashboard, but things are looking good for the most part.
One question I do have is about how to monitor our celery/redis queues. Is there a redis exporter set up somewhere? It seems like querying `redis_key_size{key="celery"}` yields 'No data' on both eqiad and codfw.
Change 608102 merged by Cwhite:
[operations/puppet@production] hiera: add missing metrics for ores statsd exporter
There is a redis exporter available and installed on the rdb servers. However, there are no instances of the redis exporter configured to export key length/size.
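If key sizes are wanted, one option might be the check-keys feature of the redis exporter; a configuration sketch follows (flag name and syntax per my reading of oliver006/redis_exporter, worth verifying against the installed version):

```
# redis_key_size is only emitted for keys the exporter is explicitly
# told to watch, e.g.:
redis_exporter -check-single-keys "db0=celery"
```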
Sorry for the delay @colewhite, everything looks good for our current needs! Moving to done on our workboard.