
Scap MediaWiki canaries should prompt to continue
Closed, Resolved (Public)

Description

scap didn't stop after error rate hit 89%

21:04:08 Started sync-check-canaries
check-canaries: 100% (ok: 11; fail: 0; left: 0)                                 
21:06:55 Finished sync-check-canaries (duration: 02m 46s)
21:06:55 Waiting for canary traffic...
21:07:15 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
21:07:15 Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 89% OVER_THRESHOLD (Avg. Error rate: Before: 0.87, After: 84.50, Threshold: 8.68)
21:07:15 Started sync-proxies
sync-proxies:  87% (ok: 7; fail: 0; left: 1)

And then it kept going. It probably should have prompted me to roll back?
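For reference, the 89% in the failed check is consistent with reading OVER_THRESHOLD as the share of the post-deploy error rate that lies above the threshold, truncated to a whole percent. This is inferred only from the numbers in the log line above, not taken from logstash_checker.py, and the function name is hypothetical:

```python
def percent_over_threshold(after: float, threshold: float) -> int:
    """Share of the 'After' error rate that lies above the threshold.

    Assumption: this formula is reverse-engineered from the log output
    (After: 84.50, Threshold: 8.68 -> 89%); it is not copied from
    logstash_checker.py.
    """
    if after <= threshold:
        return 0
    # int() truncates, so 89.7 is reported as 89.
    return int((after - threshold) / after * 100)


# Numbers from the mw1263 check above.
print(percent_over_threshold(after=84.50, threshold=8.68))
```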

Event Timeline

thcipriani moved this task from Needs triage to Debt on the Scap board.
thcipriani added a subscriber: thcipriani.

This is expected behavior as of scap 3.6.0 (https://github.com/wikimedia/scap/blob/master/scap/main.py#L108-L109): the canary check now fails only when 2 instances are above the error threshold, rather than 1. I did this to bias scap toward deploying. Too often, deploys were aborted by a single host's logspam spike unrelated to any deployment. If deployers make a habit of using --force for deploys, then a canary is useless.
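A minimal sketch of the rule described above, assuming the check results arrive as a mapping of host name to pass/fail. The function and parameter names are hypothetical and are not scap's actual code:

```python
def should_abort_deploy(check_results, failures_to_abort=2):
    """Return True when enough canaries failed to stop the deploy.

    check_results: dict mapping canary host name -> True if its check passed.
    failures_to_abort: hypothetical parameter; 2 mirrors the behavior
    described for scap 3.6.0, 1 would be the older single-canary rule.
    """
    failed = [host for host, passed in check_results.items() if not passed]
    return len(failed) >= failures_to_abort


# One failing canary out of three: the deploy continues under the 2-host rule.
results = {
    "mw1261.eqiad.wmnet": True,
    "mw1262.eqiad.wmnet": True,
    "mw1263.eqiad.wmnet": False,
}
print(should_abort_deploy(results))                       # 2-host rule
print(should_abort_deploy(results, failures_to_abort=1))  # single-canary rule
```

With the 2-host rule the single mw1263 failure does not stop the sync, which matches what the deployer saw.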

In this instance the error message on mw1263 was:

Notice: Undefined variable: wmgUseTimeless in /srv/mediawiki/wmf-config/CommonSettings.php on line 653

I'm unclear why it was only mw1263 that had a spike in that error message, but I would assume it has something to do with rsync timing.

I'm open to changing canary behavior for MediaWiki; however, I think reverting to failing a deployment based on a single canary must be paired with a change to the failure thresholds in the logstash checker (https://github.com/wikimedia/puppet/blob/production/modules/service/files/logstash_checker.py). Suggestions are welcome.

Canary checks based on error rates have proven not to be as deterministic as I would like. Modifications to this system have been very whack-a-mole-y. A more deterministic system I think would involve modifications to the error thresholds in logstash_checker.py as well as finally getting T136839: Create a script to run test requests for the MediaWiki service out the door.

That all makes sense now that I'm not staring at a terminal wondering why scap kept going. I (and the others in the room I showed the terminal to) expected a prompt asking whether to continue. In this case the error didn't have any user impact, so I would have kept going, but I think the option to roll back would have been nice had it been another, potentially user-facing error.

thcipriani renamed this task from scap didn't stop after error rate hit 89% to Scap mediawiki canaries should prompt to continue. Sep 29 2017, 4:02 PM
thcipriani renamed this task from Scap mediawiki canaries should prompt to continue to Scap MediaWiki canaries should prompt to continue.
thcipriani updated the task description.

I suffered a similar issue:

WARNING: Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00)

The log is currently so verbose that this warning scrolled out of my buffer quickly, which alarmed me. Can I suggest a log line saying "Continuing because only ${max_failed_canaries} failed to apply changes" as a compromise?
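To make the suggestion concrete, here is a rough sketch of how such a summary line could be built once all canary checks have returned. The function name, wording, and the failures_to_abort parameter are all hypothetical; nothing here is existing scap code:

```python
def canary_summary(failed_hosts, total, failures_to_abort=2):
    """Build one summary line to log after the canary checks finish.

    failed_hosts: list of canary host names whose check failed.
    total: number of canaries checked.
    failures_to_abort: hypothetical setting; 2 matches the behavior
    described earlier in this task.
    """
    failed = len(failed_hosts)
    if failed == 0:
        return "All %d canaries passed" % total
    hosts = ", ".join(sorted(failed_hosts))
    if failed < failures_to_abort:
        return (
            "Continuing because only %d of %d canaries failed "
            "(abort at %d): %s" % (failed, total, failures_to_abort, hosts)
        )
    return "Aborting: %d of %d canaries failed: %s" % (failed, total, hosts)


print(canary_summary(["mw1276.eqiad.wmnet"], total=11))
```

Logging this once, after the per-host check output, would leave the decision visible at the bottom of the buffer even when the individual warnings have scrolled away.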