
Scap canary has a shifting baseline
Closed, Resolved · Public

Description

The MediaWiki canary check performed by scap ensures that the error rate hasn't increased significantly (read: 10x) since the last deployment. The problem is that after a successful deployment (i.e., one that hits all of production) that causes a significant increase in the error rate, subsequent deployments' error rates are judged against a bad baseline.
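The shifting-baseline problem can be seen in a minimal sketch of the check described above (function and parameter names here are hypothetical; scap's real logic lives in its logstash checker):

```python
def canary_check(baseline_errors_per_min: float,
                 canary_errors_per_min: float,
                 threshold: float = 10.0) -> bool:
    """Pass the canary if the error rate has not grown by more than
    `threshold` times relative to the baseline window.

    The baseline is sampled from just before the current deployment,
    which is exactly why a bad (but completed) deploy shifts it: the
    next deploy's baseline already includes the elevated error rate.
    """
    # Avoid division by zero when the baseline window is clean.
    baseline = max(baseline_errors_per_min, 1e-9)
    return canary_errors_per_min / baseline < threshold
```

With a clean baseline of 2 errors/min, a canary at 50 errors/min fails; but if a bad deploy already pushed the baseline to 50, an equally bad follow-up deploy passes.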

Event Timeline

Should we delay the baseline by 24 hours to avoid recent changes from affecting it?

@mmodell A dynamic query for 24h ago seems somewhat fragile, given that 1) traffic volume has strong seasonality (varying up to 4x on a regular basis), and 2) 24h ago could coincidentally fall during another bad deploy.

Alternative idea: Default to the same behavior as now, but conditionally override to a specific timestamp if the last deploy was a bad deploy. E.g., store a timestamp value somewhere when a deploy goes bad; if that value is set, compute the baseline from before that time instead. If it isn't set (or has been cleared by a subsequent good deploy to all of prod), it would check recent data as usual.

I don't know if that is feasible, given we'd need a place to store it, but maybe we've got a place to store it already?
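The proposal above can be sketched as follows. This is an illustration only: the state-file location, names, and JSON shape are all hypothetical, and where scap would actually persist such a marker is exactly the open question.

```python
import json
import os
import time

# Hypothetical location; the real storage place is undecided in the task.
STATE_FILE = "/tmp/last_bad_deploy.json"


def record_deploy(success: bool, now: float = None) -> None:
    """After a full-production deploy, update the bad-deploy marker.

    A good deploy clears any stored marker; a bad deploy records its
    time so later baseline queries can reach back before it.
    """
    now = time.time() if now is None else now
    if success:
        if os.path.exists(STATE_FILE):
            os.remove(STATE_FILE)
    else:
        with open(STATE_FILE, "w") as f:
            json.dump({"bad_deploy_at": now}, f)


def baseline_start(default_window_start: float) -> float:
    """Timestamp the baseline query should start from: before the last
    bad deploy if one is recorded, otherwise the default recent window."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            bad_at = json.load(f)["bad_deploy_at"]
        return min(default_window_start, bad_at)
    return default_window_start
```

The key property is that a single good full-production deploy resets the override, so the check only pins the baseline while the last known state is bad.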

Change 403574 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Scap canary: cache last good deploy time

https://gerrit.wikimedia.org/r/403574

thcipriani triaged this task as Medium priority.
thcipriani moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

@thcipriani @greg In light of yesterday's incident, gentle reminder for T121597.

The canary/logstash check can catch a wide range of errors from a wide range of sources, and it has been impressive in that regard. I imagine its way of measuring relative change will help a lot with ensuring we keep lowering our logspam from MediaWiki without regressing. And we're doing pretty well there.

On the other hand, even after a dozen fatal incidents, each leading to incremental improvements of the canary/logstash checker, it still doesn't reliably catch the big fatals, which don't require a wide net to catch. A simple version of the pre-promote check would go a long way toward preventing incidents. It's easy to implement, and easy to get right.

I've updated T121597 with the latest details. I'm open to helping with the details if needed (plan, implementation, code review, etc.). Let me know :)

Change 403574 abandoned by Thcipriani:
Scap canary: cache last good deploy time

Reason:
Moving most of this logic into scap, will add a command line flag for baseline timestamp to logstash_checker

https://gerrit.wikimedia.org/r/403574

dancy changed the task status from Open to In Progress. Jun 14 2024, 9:57 PM
dancy claimed this task.
dancy lowered the priority of this task from Medium to Low.

As of scap 4.89.0, the canary error-rate threshold is set by canary_threshold, which defaults to 10; based on analysis of the last 90 days of canary error rates, we expect that value to be suitable for production. You can run scap analyze-logstash to get a fresh recommendation.
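One plausible shape for such a recommendation, sketched here purely for illustration (this is not scap's actual analyze-logstash algorithm), is to pick a threshold comfortably above the worst benign error-rate spike ratio observed during known-good deploys in the sample window:

```python
def recommend_threshold(historical_ratios, safety_margin=1.5):
    """Suggest a canary threshold from canary/baseline error-rate
    ratios observed during known-good deploys.

    historical_ratios: ratios seen over e.g. the last 90 days.
    Returns the worst benign spike times a safety margin, or the
    shipped default of 10 when there is no history to analyze.
    """
    if not historical_ratios:
        return 10.0  # fall back to scap's shipped default
    return max(historical_ratios) * safety_margin
```

A threshold chosen this way trades sensitivity for stability: the larger the margin, the fewer false alarms from routine noise, but the more a genuine regression must grow before the canary fails.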