To make SLO querying, dashboarding, etc. more straightforward and maintainable over the longer term, we should identify an approach for organizing and naming our SLI metrics in a reasonably standard way, and explore which tools and approaches could help keep this maintainable as SLO coverage grows to more services.
Since we're relatively early in the process (in terms of service coverage), I think finding common ground that suits the majority of services, and that makes it easier to template queries and dashboards, would be a sufficient milestone to consider this task complete. In my mind that essentially means covering the 80% use case rather than 100%, while planning and allowing for per-service customizations where needed.
Here's a first take on a checklist for a near-term deployment (please expand, update, clarify, etc.; as the steps become more concrete we can branch off into more detailed subtasks):
- Manually build two more SLO dashboards (for three in total, including etcd), approximately following the scheme outlined by https://grafana.wikimedia.org/d/iyumW7LGz/etcd-slos?orgId=1 (optimizing/updating where necessary, but trying to follow this as a draft template)
- Using these dashboards, identify where we have commonality in terms of queries, graphs, and percentages, and discuss how to abstract these into a generalized SLI scheme (e.g. how do we best map latency metrics for many services into a common scheme, as opposed to allowing dashboards to sprawl with per-service metric customizations?)
- Propose/identify naming/organizing schemes that support this, and investigate approaches/tooling to keep these maintainable as the usage and number of dashboards grows.
- Experiment with the tooling identified: package, deploy, puppetize, etc.
- Draft guidelines for the SLI scheme, including naming, mapping, constraints, and how to approach per-service customizations.
- Deploy practices/tooling to production (metrics gathering, dashboarding, etc.)
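
To make the "common scheme" idea above a bit more concrete, here is a rough sketch of how a standardized naming convention could let us template queries instead of hand-writing them per service. Everything in it is an assumption for discussion only: the metric names (`sli_good_events_total`, `sli_events_total`), the `service` and `sli` labels, and the 28-day window are hypothetical and not something we have agreed on or deployed.

```python
from dataclasses import dataclass

# Hypothetical standardized counter names; no scheme has been decided yet.
GOOD_EVENTS = "sli_good_events_total"
ALL_EVENTS = "sli_events_total"


@dataclass(frozen=True)
class Sli:
    service: str   # e.g. "etcd"
    name: str      # e.g. "latency" or "availability"
    target: float  # SLO target as a fraction, e.g. 0.999


def ratio_query(sli: Sli, window: str = "28d") -> str:
    """Build a PromQL expression for the good/total event ratio of one SLI."""
    selector = f'{{service="{sli.service}",sli="{sli.name}"}}'
    good = f"sum(rate({GOOD_EVENTS}{selector}[{window}]))"
    total = f"sum(rate({ALL_EVENTS}{selector}[{window}]))"
    return f"{good} / {total}"


def error_budget_remaining(sli: Sli, observed_ratio: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < sli.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed = 1.0 - sli.target      # share of events allowed to be bad
    spent = 1.0 - observed_ratio    # share of events that actually were bad
    return 1.0 - spent / allowed


if __name__ == "__main__":
    etcd_latency = Sli(service="etcd", name="latency", target=0.999)
    print(ratio_query(etcd_latency))
```

The point of the sketch is that if every service maps its own latency/availability metrics onto the same two counters (for example via recording rules), then dashboards and queries only vary by label values, and per-service customization stays confined to that mapping step.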