Standardize a SLI metrics naming/storage/mapping scheme
Open, Medium, Public

Assigned To

None

Authored By

	herron
	Feb 12 2021, 5:20 PM

Description

To make SLO querying, dashboarding, and related work more straightforward and maintainable in the long term, we should identify an approach to organizing and naming our SLI metrics in a reasonably standard way, and explore which tools and approaches could keep this maintainable as SLO coverage grows to more services.

Since we're relatively early in the process (in terms of service coverage), I think a sufficient milestone for completing this task would be finding common ground that suits the majority of services and makes it easier to template queries and dashboards. In my mind that means covering the 80% use case rather than 100%, and planning for per-service customizations where needed.

Here's a first take on a checklist for the purposes of a near-term deployment (please expand, update, clarify, etc. and as the steps become more concrete we can branch off into more detailed subtasks):

Manually build two more SLO dashboards (three in total, including etcd), approximately following the scheme outlined by https://grafana.wikimedia.org/d/iyumW7LGz/etcd-slos?orgId=1 (optimizing/updating where necessary, but trying to follow this as a draft template)
Using these dashboards, identify where we have commonality in terms of queries, graphs, percentages, and think/discuss how to abstract these into a generalized SLI scheme. (e.g. how do we best map latency metrics for many services into a common scheme, as opposed to allowing dashboards to sprawl with per-service metric customizations?)
Propose/identify naming/organizing schemes that support this, and investigate approaches/tooling to keep these maintainable as the usage and number of dashboards grows.
Experiment with the tooling identified: package, deploy, puppetize, etc.
Draft guidelines for SLI scheme, including naming, mapping, constraints, and how to approach per service customizations.
Deploy practices/tooling to production (metrics gathering, dashboarding, etc.)
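
To make the "common scheme" idea in the checklist above a bit more concrete, here is a minimal Python sketch of one way per-service queries could be mapped onto a shared SLI name. Everything in it is an assumption for illustration: the `sli:<kind>:ratio` name, the `SliMapping` class, the list of SLI kinds, and the `example_*` metric names are all hypothetical and not something this task has decided on.

```python
from dataclasses import dataclass

# Hypothetical set of SLI kinds; the task does not fix such a list.
SLI_KINDS = {"availability", "latency"}


@dataclass(frozen=True)
class SliMapping:
    """Maps one service's own metrics onto a shared SLI name (illustrative only)."""

    service: str      # e.g. "etcd"
    kind: str         # one of SLI_KINDS
    good_query: str   # service-specific expression counting "good" events
    total_query: str  # service-specific expression counting all events

    def ratio_name(self) -> str:
        # The same name for every service; the service goes in a label instead.
        if self.kind not in SLI_KINDS:
            raise ValueError(f"unknown SLI kind: {self.kind}")
        return f"sli:{self.kind}:ratio"

    def ratio_expr(self) -> str:
        # Good events divided by total events, as a query string.
        return f"({self.good_query}) / ({self.total_query})"

    def labels(self) -> dict:
        return {"service": self.service}


# Placeholder metric names, not real etcd metrics.
etcd_latency = SliMapping(
    service="etcd",
    kind="latency",
    good_query="sum(rate(example_fast_requests_total[5m]))",
    total_query="sum(rate(example_requests_total[5m]))",
)
```

With a mapping like this, a dashboard template would only need to query the shared name and filter on the `service` label, while the per-service customization stays confined to the two query strings.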

Related Objects

Status	Assigned	Task
Resolved	herron	T274665 Design and implement SLO Dashboard tooling
Open	None	T274668 Standardize a SLI metrics naming/storage/mapping scheme
Open	herron	T289615 Migrate existing SLO related metrics to recording rules

Event Timeline

herron created this task. Feb 12 2021, 5:20 PM

RLazarus subscribed. Feb 16 2021, 5:12 PM

herron moved this task from Inbox to In progress on the observability board. Feb 22 2021, 3:36 PM

Something worth considering here, in addition to the naming scheme for the SLIs themselves, is recording metrics that represent the SLO values themselves (e.g. 0.1 percent).

My thinking is that since the SLOs are subject to tuning/change over time, it would be useful to have a record in our metrics that tracks what the SLO was for a given service at a given time. And it would also help us to keep in sync and generalize our dashboards, alert rules, etc.
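
As a rough illustration of why a time-stamped record of SLO values helps, the Python sketch below looks up which target was in effect on a given day and computes how much error budget an observed SLI consumed against it. The history values, the `slo_at` and `budget_consumed` helpers, and the idea of storing targets as (effective date, target) pairs are assumptions made for the example, not real SLOs or an agreed design.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of SLO targets for one service, sorted by start date.
# The values are illustrative only.
HISTORY = [
    (date(2021, 1, 1), 0.999),
    (date(2021, 7, 1), 0.9995),
]


def slo_at(history, day):
    """Return the target in effect on `day`, or None if none was set yet."""
    starts = [start for start, _ in history]
    i = bisect_right(starts, day)  # number of entries starting on or before `day`
    return history[i - 1][1] if i else None


def budget_consumed(sli, target):
    """Fraction of the error budget used: (1 - sli) / (1 - target)."""
    return (1.0 - sli) / (1.0 - target)
```

For example, `slo_at(HISTORY, date(2021, 3, 1))` returns the earlier target, so a dashboard or alert evaluated over March would be judged against the SLO that applied at the time rather than the current one.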

lmata edited projects, added SRE Observability (FY2021/2022-Q1); removed observability. Jul 12 2021, 2:20 AM

lmata moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q1) board.

lmata assigned this task to herron. Aug 18 2021, 4:40 PM

herron moved this task from FY2021/2022-Q1 to FY2021/2022-Q2 on the SRE Observability board. Oct 1 2021, 4:08 PM

herron edited projects, added SRE Observability (FY2021/2022-Q2); removed SRE Observability (FY2021/2022-Q1).

lmata triaged this task as Medium priority. Nov 16 2021, 4:46 PM

lmata raised the priority of this task from Medium to Needs Triage.

lmata triaged this task as Medium priority.

lmata moved this task from FY2021/2022-Q2 to FY2021/2022-Q3 on the SRE Observability board. Jan 13 2022, 2:02 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q3); removed SRE Observability (FY2021/2022-Q2).

lmata edited projects, added Observability-Metrics; removed SRE Observability (FY2021/2022-Q3). Apr 11 2022, 1:02 PM

lmata moved this task from Inbox to Backlog on the Observability-Metrics board.

@herron: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action... → Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!
