Scap MediaWiki canaries should prompt to continue
scap didn't stop after error rate hit 89%

21:04:08 Started sync-check-canaries
check-canaries: 100% (ok: 11; fail: 0; left: 0)                                 
21:06:55 Finished sync-check-canaries (duration: 02m 46s)
21:06:55 Waiting for canary traffic...
21:07:15 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
21:07:15 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
21:07:15 Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 89% OVER_THRESHOLD (Avg. Error rate: Before: 0.87, After: 84.50, Threshold: 8.68)
21:07:15 Started sync-proxies
sync-proxies:  87% (ok: 7; fail: 0; left: 1)

And then it kept going. It probably should have prompted me to roll back?

This is expected behavior as of scap 3.6.0: The canary check now relies on 2 instances being above an error threshold rather than 1. I did this to bias scap toward letting a deployment proceed. Too often deploys would be aborted by a single host's logspam spike unrelated to any deployment. If deployers make a habit of using --force for deploys, then a canary is useless.
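To make the rule concrete, here is a minimal sketch of the decision described above. It is not scap's actual code: the function name `should_abort_deploy` and the parameter `required_failures` are hypothetical, and the only thing taken from this thread is that 2 canaries over threshold are needed rather than 1.

```python
from typing import Dict


def should_abort_deploy(over_threshold: Dict[str, bool],
                        required_failures: int = 2) -> bool:
    """Return True when enough canaries are over their error threshold.

    over_threshold maps a canary hostname to whether its error-rate
    check failed. required_failures=2 mirrors the behavior described
    for scap 3.6.0; the name itself is an assumption for illustration.
    """
    failed = [host for host, is_over in over_threshold.items() if is_over]
    return len(failed) >= required_failures


# One noisy canary out of several does not stop the deploy.
one_failure = {
    "mw1263.eqiad.wmnet": True,
    "mw1276.eqiad.wmnet": False,
    "mw1279.eqiad.wmnet": False,
}
two_failures = dict(one_failure, **{"mw1276.eqiad.wmnet": True})
```

With `one_failure` the function returns False, which matches the run in this task: mw1263 failed alone, so scap went on to sync-proxies.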

In this instance the error message on mw1263 was:

Notice: Undefined variable: wmgUseTimeless in /srv/mediawiki/wmf-config/CommonSettings.php on line 653

I'm unclear why it was only mw1263 that had a spike in that error message, but I would assume it has something to do with rsync timing.

I'm open to changing canary behavior for MediaWiki; however, reverting to failing a deployment based on a single canary must be paired with a change in the logstash checker to change failure thresholds, I think. Suggestions are welcome.

Canary checks based on error rates have proven not to be as deterministic as I would like. Modifications to this system have been very whack-a-mole-y. A more deterministic system, I think, would involve modifications to the error thresholds as well as finally getting T136839: Create a script to run test requests for the MediaWiki service out the door.

That all makes sense now that I'm not staring at a terminal wondering why scap kept going. I (and the others in the room I showed the terminal to) was expecting to get a prompt asking whether to continue or not. In this case the error didn't have any user impact so I would have kept going, but I think the option to roll back would have been nice had it been another, potentially user-facing error.

I suffered a similar issue:

WARNING: Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00)
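For anyone reading these warnings: the percentages in both log lines in this task are consistent with (After - Threshold) / After, truncated to a whole number. That is inferred from the two data points here, not taken from the checker's source, so treat the sketch below as an assumption. The function name is hypothetical.

```python
def percent_over_threshold(after: float, threshold: float) -> int:
    """Share of the post-deploy error rate that lies above the threshold.

    Assumed formula, reverse-engineered from the two warnings quoted
    in this task; the real checker may compute it differently.
    """
    if after <= 0 or after <= threshold:
        return 0
    # int() truncates toward zero, which matches the logged values.
    return int((after - threshold) / after * 100)


mw1263 = percent_over_threshold(after=84.50, threshold=8.68)
mw1276 = percent_over_threshold(after=2.00, threshold=1.00)
```

These give 89 and 50, the same figures as in the two warnings.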

The warning alarmed me, and the current verbosity of the log made it disappear from my buffer quickly. Can I suggest a log line saying "Continuing because only ${max_failed_canaries} failed to apply changes" as a compromise?
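A rough sketch of what such a summary line could look like, so the outcome is stated once after the per-host checks. Everything here is hypothetical: the function name, the `required_failures` parameter, and the exact wording are illustrations, not existing scap output.

```python
from typing import List


def canary_summary(failed_hosts: List[str], total: int,
                   required_failures: int = 2) -> str:
    """Build a one-line summary of the canary checks (hypothetical wording)."""
    count = len(failed_hosts)
    if count == 0:
        return "All %d canaries passed" % total
    hosts = ", ".join(sorted(failed_hosts))
    if count < required_failures:
        return ("Continuing because only %d of %d canaries failed (%s); "
                "%d needed to abort" % (count, total, hosts, required_failures))
    return "Aborting: %d of %d canaries failed (%s)" % (count, total, hosts)


message = canary_summary(["mw1263.eqiad.wmnet"], total=11)
```

Logged at WARNING level right before the next stage starts, a line like this would stay visible even when the individual check lines scroll away.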

