
Scap canary has a shifting baseline
Closed, Resolved · Public

Description

The MediaWiki canary check performed by scap ensures that the error rate hasn't increased significantly (read: 10x) since the last deployment. The problem is that after a successful deployment (i.e., one that hits all of production) that causes a significant increase in the error rate, subsequent deployments' error rates are judged against a bad baseline.
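The shifting-baseline problem can be seen in a minimal sketch of the check described above (function and parameter names here are hypothetical; scap's real logic lives in its logstash checker):

```python
def canary_check(baseline_errors_per_min: float,
                 canary_errors_per_min: float,
                 threshold: float = 10.0) -> bool:
    """Pass the canary if the error rate has not grown by more than
    `threshold` times relative to the baseline window.

    The baseline is sampled from just before the current deployment,
    which is exactly why a bad (but completed) deploy shifts it: the
    next deploy's baseline already includes the elevated error rate.
    """
    # Avoid division by zero when the baseline window is clean.
    baseline = max(baseline_errors_per_min, 1e-9)
    return canary_errors_per_min / baseline < threshold
```

With a clean baseline of 2 errors/min, a canary at 50 errors/min fails; but if a bad deploy already pushed the baseline to 50, an equally bad follow-up deploy passes.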

Event Timeline

Should we delay the baseline by 24 hours to avoid recent changes from affecting it?

@mmodell A dynamic query for 24h ago seems somewhat fragile, given that 1) traffic volume has strong seasonality (varying up to 4x on a regular basis), and 2) 24h ago could coincidentally fall during another bad deploy.

Alternative idea: Default to the same behavior as now, but conditionally override to a specific timestamp if the last deploy was a bad deploy. E.g., store a timestamp value somewhere when a deploy goes bad; if that value is set, compute the baseline from before that time instead. If it isn't set (or has been cleared by a subsequent good deploy to all of prod), it would check recent data as usual.

I don't know if that is feasible, given we'd need a place to store it, but maybe we've got a place to store it already?
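The proposal above can be sketched as follows. This is an illustration only: the state-file location, names, and JSON shape are all hypothetical, and where scap would actually persist such a marker is exactly the open question.

```python
import json
import os
import time

# Hypothetical location; the real storage place is undecided in the task.
STATE_FILE = "/tmp/last_bad_deploy.json"


def record_deploy(success: bool, now: float = None) -> None:
    """After a full-production deploy, update the bad-deploy marker.

    A good deploy clears any stored marker; a bad deploy records its
    time so later baseline queries can reach back before it.
    """
    now = time.time() if now is None else now
    if success:
        if os.path.exists(STATE_FILE):
            os.remove(STATE_FILE)
    else:
        with open(STATE_FILE, "w") as f:
            json.dump({"bad_deploy_at": now}, f)


def baseline_start(default_window_start: float) -> float:
    """Timestamp the baseline query should start from: before the last
    bad deploy if one is recorded, otherwise the default recent window."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            bad_at = json.load(f)["bad_deploy_at"]
        return min(default_window_start, bad_at)
    return default_window_start
```

The key property is that a single good full-production deploy resets the override, so the check only pins the baseline while the last known state is bad.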

Change 403574 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Scap canary: cache last good deploy time

https://gerrit.wikimedia.org/r/403574

thcipriani triaged this task as Medium priority.
thcipriani moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

@thcipriani @greg In light of yesterday's incident, gentle reminder for T121597.

The canary/logstash check can catch a wide range of errors from a wide range of sources, and it has been impressive in that regard. I imagine its way of measuring relative change will help a lot with ensuring we keep lowering our logspam from MediaWiki without regressing. And we're doing pretty well there.

On the other hand, even after a dozen fatal incidents, each leading to incremental improvements of the canary/logstash checker, it still doesn't reliably catch the big fatals, which don't require a wide net to catch. A simple version of the pre-promote check would go a long way toward preventing incidents. It's easy to implement, and easy to get right.

I've updated T121597 with the latest details. I'm open to helping with the details if needed (plan, implementation, code review, etc.). Let me know :)

Change 403574 abandoned by Thcipriani:
Scap canary: cache last good deploy time

Reason:
Moving most of this logic into scap, will add a command line flag for baseline timestamp to logstash_checker

https://gerrit.wikimedia.org/r/403574

dancy changed the task status from Open to In Progress. Jun 14 2024, 9:57 PM
dancy claimed this task.
dancy lowered the priority of this task from Medium to Low.

As of scap 4.89.0, the canary error-rate threshold is set by canary_threshold, which defaults to 10; based on analysis of the last 90 days of canary error rates, we expect that value to be suitable for production. You can run scap analyze-logstash to get a fresh recommendation.
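One plausible shape for such a recommendation, sketched here purely for illustration (this is not scap's actual analyze-logstash algorithm), is to pick a threshold comfortably above the worst benign error-rate spike ratio observed during known-good deploys in the sample window:

```python
def recommend_threshold(historical_ratios, safety_margin=1.5):
    """Suggest a canary threshold from canary/baseline error-rate
    ratios observed during known-good deploys.

    historical_ratios: ratios seen over e.g. the last 90 days.
    Returns the worst benign spike times a safety margin, or the
    shipped default of 10 when there is no history to analyze.
    """
    if not historical_ratios:
        return 10.0  # fall back to scap's shipped default
    return max(historical_ratios) * safety_margin
```

A threshold chosen this way trades sensitivity for stability: the larger the margin, the fewer false alarms from routine noise, but the more a genuine regression must grow before the canary fails.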