
Scap canary has a shifting baseline
Open, NormalPublic

Description

The MediaWiki canary check performed by scap ensures that the error rate hasn't increased significantly (read: 10x) since the last deployment. The problem is that when a successful deployment (i.e., one that hits all of production) causes a significant increase in the error rate, subsequent deployments' error rates are judged against a bad baseline.
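In essence, the check described above behaves like the following minimal sketch (this is an illustration, not scap's actual implementation; the function and parameter names are hypothetical):

```python
def canary_check(current_error_rate, baseline_error_rate, threshold=10.0):
    """Pass the canary only if the error rate has not grown more than
    `threshold` times relative to the pre-deploy baseline.

    Hypothetical simplification of the scap/logstash canary check.
    If a bad deploy becomes the baseline, a later deploy with an equally
    bad error rate passes this check -- the shifting-baseline problem.
    """
    if baseline_error_rate == 0:
        # Avoid division by zero: on a clean baseline, any errors fail.
        return current_error_rate == 0
    return current_error_rate / baseline_error_rate < threshold
```

Note that `canary_check(50, 50)` passes even though 50 errors/minute may be far above normal, because the comparison is purely relative to the previous deploy.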

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 2 2018, 9:21 PM

Should we delay the baseline by 24 hours so that recent changes don't affect it?

@mmodell A dynamic query for 24h ago seems somewhat fragile, given that 1) traffic volume has strong seasonality (varying up to 4x on a regular basis), and 2) 24h ago could also coincidentally fall within a bad deploy.

Alternative idea: Default to the same as now, but conditionally override to a specific timestamp if the last deploy was a bad deploy. E.g. a timestamp value we store somewhere in case of a bad deploy, and if that value is set, check from before that time instead. If it isn't set (or has been cleared by a subsequent good deploy to all prod), then it'd check recent as usual.

I don't know if that is feasible, given we'd need a place to store it, but maybe we've got a place to store it already?
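The idea above could be sketched roughly as follows (a sketch only, under assumed names: the state-file path, `record_deploy`, and `baseline_timestamp` are all hypothetical, and where scap would actually persist this state is exactly the open question):

```python
import json
import os
import time

STATE_FILE = "/var/cache/scap/last_good_deploy.json"  # hypothetical location


def record_deploy(success, path=STATE_FILE):
    """After a full-production deploy, pin or clear the baseline marker."""
    if success:
        # A good deploy clears the override; subsequent checks compare
        # against the most recent deployment window again, as now.
        if os.path.exists(path):
            os.remove(path)
    else:
        # A bad deploy pins the baseline to just before it landed, but
        # only if an earlier bad deploy hasn't already pinned one.
        if not os.path.exists(path):
            with open(path, "w") as f:
                json.dump({"pinned_baseline": time.time()}, f)


def baseline_timestamp(default, path=STATE_FILE):
    """Return the pinned baseline timestamp if one is stored,
    else the default (e.g. the previous deployment's start time)."""
    try:
        with open(path) as f:
            return json.load(f)["pinned_baseline"]
    except (OSError, KeyError, ValueError):
        return default
```

The key property is that consecutive bad deploys all compare against the same pre-incident baseline, rather than against each other.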

Change 403574 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Scap canary: cache last good deploy time

https://gerrit.wikimedia.org/r/403574

thcipriani triaged this task as Normal priority.
thcipriani moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.
Krinkle added a subscriber: greg. · Jan 31 2018, 1:48 AM (edited)

@thcipriani @greg In light of yesterday's incident, gentle reminder for T121597.

The canary/logstash check can catch a wide range of errors from a wide range of sources, and it has been impressive in that regard. I imagine that its way of measuring relative change will help a lot with ensuring we keep lowering our logspam from MediaWiki and don't regress. And we're doing pretty well there.

On the other hand, even after a dozen fatal incidents, all leading to incremental improvements to the canary/logstash checker, it still doesn't reliably catch the big fatals, which don't require a wide net to catch. A simple version of the pre-promote check would go a long way toward preventing incidents. Such checks are easy to implement, and easy to get right.

I've updated T121597 with the latest details. I'm open to helping with the details if needed (plan, implementation, code review, etc.). Let me know :)

mmodell moved this task from Needs triage to Debt on the Scap board.Feb 1 2018, 12:21 AM

Change 403574 abandoned by Thcipriani:
Scap canary: cache last good deploy time

Reason:
Moving most of this logic into scap, will add a command line flag for baseline timestamp to logstash_checker

https://gerrit.wikimedia.org/r/403574
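The command-line flag mentioned in the abandon reason could look something like the sketch below. This is not logstash_checker's actual interface; the `--baseline-ts` flag name is an assumption, shown only to illustrate how scap could hand the checker a pinned baseline timestamp:

```python
import argparse


def parse_args(argv=None):
    """Hypothetical argument parsing for a logstash_checker-style tool,
    letting the caller (e.g. scap) override the comparison baseline."""
    parser = argparse.ArgumentParser(description="MediaWiki canary check")
    parser.add_argument(
        "--baseline-ts",  # assumed flag name, not the real interface
        type=float,
        default=None,
        help="Unix timestamp to use as the error-rate baseline; "
             "defaults to the window before the current deploy.",
    )
    return parser.parse_args(argv)
```

With this split, scap owns the decision of *which* baseline to use (last good deploy vs. last deploy), while the checker stays a stateless comparison tool.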

thcipriani removed thcipriani as the assignee of this task.Jun 25 2018, 4:56 PM
thcipriani moved this task from In-progress to Backlog on the Release-Engineering-Team (Kanban) board.