Investigate deployment that caused high error-rate and was prevented from going past canaries by Scap
Closed, Resolved (Public)

Authored By

	Krinkle
	Jan 2 2018, 6:48 PM

Description

See T178942#3867158.

This caused a high spike in Logstash for type:mediawiki channel:error (and presumably in type:hhvm as well) on every MediaWiki index.php request.

Related Objects

Mentioned In: T121597: Implement MediaWiki pre-promote checks
Mentioned Here: T183999: Scap canary has a shifting baseline
T173146: Scap MediaWiki canaries should prompt to continue
T178942: Switch existing wikis from high density logos to SVG

Event Timeline

Krinkle created this task. Jan 2 2018, 6:48 PM

Restricted Application added a subscriber: Aklapper. Jan 2 2018, 6:48 PM

Krinkle updated the task description. Jan 2 2018, 6:49 PM

See T173146

Logstash graph based on the query looks like it should have caught it: https://logstash.wikimedia.org/goto/800d886d5e05b8e1f9d11454717cf183

Couple of options:

the spike happened outside the 20 second wait period
scap's query doesn't match the logstash dashboard (https://github.com/wikimedia/puppet/blob/production/modules/service/files/logstash_checker.py#L115)
internal scap logic for handling this failed somewhere

Well.

Scap does seem to have failed along with that error spike: http://tools.wmflabs.org/sal/log/AWC3SqMzwg13V6286YVJ

scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details)
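
For readers unfamiliar with the check, the comparison that message describes can be sketched roughly as follows. This is an illustration only and not the logic in logstash_checker.py: the function name `failing_canaries`, the per-host data layout, and the floor of one error per sample are assumptions, and the 10x default is taken from the wording of the error message.

```python
from statistics import mean


def failing_canaries(before, after, threshold=10.0):
    """Return the canary hosts whose mean error rate grew by `threshold` times or more.

    `before` and `after` map a host name to a list of error counts sampled
    before and after the deploy (hypothetical layout, for illustration).
    """
    failed = []
    for host, pre_samples in before.items():
        post_samples = after.get(host, [])
        if not pre_samples or not post_samples:
            continue  # no data for this host, nothing to compare
        baseline = mean(pre_samples)
        current = mean(post_samples)
        # Assumption: floor the baseline at 1.0 so an idle host does not
        # trip the check on a single stray error.
        if current >= threshold * max(baseline, 1.0):
            failed.append(host)
    return sorted(failed)
```

A deploy tool built on this sketch would abort when the returned list is non-empty, or when it exceeds some share of the canary pool.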

In T183952#3868684, @thcipriani wrote:
Well.

Scap does seem to have failed along with that error spike: http://tools.wmflabs.org/sal/log/AWC3SqMzwg13V6286YVJ
scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details)

cc @zeljkofilipin :)

I filed T183999: Scap canary has a shifting baseline to address the main problem I see here: when a deployment spikes the error rate, fails its canary check, and is then redeployed, the canary check for the redeploy runs against a new baseline that already includes the spike.

Maybe the baseline should be 24 hours in the past?
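
A toy illustration of the shifting baseline, assuming the check compares a window of samples after the deploy with a window of the same size just before it. The function name, window size, and sample numbers are all made up for the example and are not taken from Scap or logstash_checker.py.

```python
from statistics import mean


def error_rate_increase(samples, deploy_index, window):
    """Ratio of the mean error rate after `deploy_index` to the mean just before it."""
    baseline = samples[max(0, deploy_index - window):deploy_index]
    current = samples[deploy_index:deploy_index + window]
    # Floor the baseline at 1.0 to avoid dividing by zero (assumption).
    return mean(current) / max(mean(baseline), 1.0)


# Hypothetical per-interval error counts on one canary.
# Index 3: the bad deploy lands. Index 6: the same bad code is deployed again.
samples = [2, 2, 2, 200, 200, 200, 200, 200, 200]

first = error_rate_increase(samples, deploy_index=3, window=3)   # 100.0
second = error_rate_increase(samples, deploy_index=6, window=3)  # 1.0
```

The first check sees a 100x increase and would fail. The second sees no increase at all, because its baseline window is already full of errors from the first attempt. A baseline taken from well before the first deploy would not have this problem.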

greg moved this task from INBOX to Kanban on the Release-Engineering-Team board. Jan 3 2018, 12:32 AM

greg edited projects, added Release-Engineering-Team (Kanban); removed Release-Engineering-Team.

greg moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

Scap did fail during deployment. Since the commit that caused the failure was already merged, I had to revert it and deploy the revert. Is there something else I should have done?

In T183952#3878049, @zeljkofilipin wrote:

Scap did fail during deployment. Since the commit that caused the failure was already merged, I had to revert it and deploy the revert. Is there something else I should have done?

Nope, as I'm looking at it, it looks like you did the right thing and this deploy didn't make it any further than the canary servers: all the affected servers are in /etc/dsh/group/mediawiki-ap{i,pserver}-canaries.

thcipriani renamed this task from Investigate deployment that caused high error-rate but wasn't prevented by Scap to Investigate deployment that caused high error-rate and was prevented from going past canaries by Scap. Jan 8 2018, 11:04 PM

thcipriani updated the task description.

Dug a little deeper on this today and realized that this didn't actually make it out to production afaict (thanks to @zeljkofilipin), although the error rate on the canaries was high enough to spike the overall error rate. The task I filed a few comments back is still a thing we need to do. I'm going to call this investigation complete though.

MZMcBride subscribed. Jan 8 2018, 11:10 PM

Krinkle mentioned this in T121597: Implement MediaWiki pre-promote checks. Jan 31 2018, 1:37 AM

	F12210688: Screen Shot 2018-01-02 at 18.48.57.png
	Jan 2 2018, 6:49 PM

	F12210690: Screen Shot 2018-01-02 at 18.49.11.png
	Jan 2 2018, 6:49 PM
