Standardize an SLI metrics naming/storage/mapping scheme
Open, Medium, Public

Description

To make SLO querying, dashboarding, etc. more straightforward and maintainable in the longer term, we should identify an approach to organizing and naming our SLI metrics in a reasonably standard way, and explore which tools and approaches could help keep this maintainable as SLO coverage grows to more services.

Since we're relatively early in the process (in terms of service coverage), I think a sufficient milestone for completing this task would be to find a common ground that suits the majority of services and makes it easier to template queries and dashboards. In my mind that means covering the 80% use case rather than 100%, while planning and allowing for per-service customizations where needed.

Here's a first take on a checklist for the purposes of a near-term deployment (please expand, update, clarify, etc. and as the steps become more concrete we can branch off into more detailed subtasks):

  • Manually build two more SLO dashboards (three in total, including etcd), approximately following the scheme outlined by https://grafana.wikimedia.org/d/iyumW7LGz/etcd-slos?orgId=1 (optimizing/updating where necessary, but treating it as a draft template)
  • Using these dashboards, identify where we have commonality in terms of queries, graphs, percentages, and think/discuss how to abstract these into a generalized SLI scheme. (e.g. how do we best map latency metrics for many services into a common scheme, as opposed to allowing dashboards to sprawl with per-service metric customizations?)
  • Propose/identify naming/organizing schemes that support this, and investigate approaches/tooling to keep these maintainable as the usage and number of dashboards grows.
  • Experiment with the tooling identified: package, deploy, puppetize, etc.
  • Draft guidelines for the SLI scheme, including naming, mapping, constraints, and how to approach per-service customizations.
  • Deploy practices/tooling to production (metrics gathering, dashboarding, etc.)
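
To make the "common scheme" idea in the checklist a bit more concrete, here is a minimal sketch of how per-service metrics could be mapped onto one shared availability SLI so that queries can be templated. Everything in it is an assumption for discussion: it assumes Prometheus-style counters and PromQL, and the metric names, the `SliMapping` class, and the `sli:<name>:ratio_rate<window>` naming pattern are hypothetical, not an agreed scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliMapping:
    """Maps one service's native metrics onto a shared availability SLI."""
    service: str
    total_metric: str  # counter of all requests (hypothetical name)
    error_metric: str  # counter of failed requests (hypothetical name)


def availability_query(m: SliMapping, window: str = "5m") -> str:
    """Render a PromQL-style expression for the fraction of successful requests."""
    errors = f"sum(rate({m.error_metric}[{window}]))"
    total = f"sum(rate({m.total_metric}[{window}]))"
    return f"1 - ({errors} / {total})"


def sli_series(m: SliMapping, sli: str, window: str = "5m") -> str:
    """Build a standardized series selector that dashboards can template on `service`."""
    return f"sli:{sli}:ratio_rate{window}" + f'{{service="{m.service}"}}'


# Hypothetical mapping for one service; the metric names are placeholders.
etcd = SliMapping("etcd", "etcd_requests_total", "etcd_requests_failed_total")
print(availability_query(etcd))
print(sli_series(etcd, "availability"))
```

With something like this, a per-service customization would be limited to the mapping entry, while the query shape and the resulting series name stay the same across dashboards.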

Event Timeline

Something worth considering here, in addition to the naming scheme for the SLIs themselves, is recording metrics that represent the SLO values (e.g. 0.1 percent).

My thinking is that since the SLOs are subject to tuning and change over time, it would be useful to have a record in our metrics that tracks what the SLO was for a given service at a given time. It would also help us keep our dashboards, alert rules, etc. in sync and generalized.
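
As a rough illustration of the idea above, the sketch below keeps a time-ordered record of SLO targets and looks up which target applied at a given moment. It is only a sketch under assumptions: the `SloHistory` class, the `slo_target_error_ratio` metric name, and the `service` label are hypothetical, and in practice exporting the current target as a gauge would let the metrics store keep this history for us.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class SloHistory:
    """Time-ordered record of SLO targets (allowed error ratio) for one service."""

    def __init__(self, service: str):
        self.service = service
        self._times = []    # effective-from timestamps, kept sorted
        self._targets = []  # target in force from the matching timestamp

    def record(self, effective: datetime, target: float) -> None:
        """Register a new target that applies from `effective` onwards."""
        i = bisect_right(self._times, effective)
        self._times.insert(i, effective)
        self._targets.insert(i, target)

    def target_at(self, when: datetime):
        """Return the target in force at `when`, or None if none was set yet."""
        i = bisect_right(self._times, when)
        return self._targets[i - 1] if i else None

    def exposition_line(self, when: datetime) -> str:
        """Render the target as a text-format sample (hypothetical metric name)."""
        return f'slo_target_error_ratio{{service="{self.service}"}} {self.target_at(when)}'


def within_slo(observed_error_ratio: float, target: float) -> bool:
    """True when the observed error ratio does not exceed the allowed ratio."""
    return observed_error_ratio <= target


history = SloHistory("etcd")
history.record(datetime(2021, 1, 1, tzinfo=timezone.utc), 0.001)  # 0.1 percent
history.record(datetime(2021, 6, 1, tzinfo=timezone.utc), 0.002)  # later retuned
print(history.exposition_line(datetime(2021, 3, 1, tzinfo=timezone.utc)))
```

Dashboards and alert rules could then compare the SLI against the recorded target instead of hard-coding the threshold in each place.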

lmata triaged this task as Medium priority. Nov 16 2021, 4:46 PM
lmata raised the priority of this task from Medium to Needs Triage.
lmata triaged this task as Medium priority.

@herron: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!