Page MenuHomePhabricator

RuntimeError: scap failed: average error rate on 4/11 canaries increased by 10x
Closed, DeclinedPublic

Description

Today during EU SWAT while deploying 465283 scap has failed:

[zfilipin@deploy1001 mediawiki-staging (master u=)]$ scap sync-file wmf-config/throttle.php 'SWAT: [[gerrit:465283|Add two throttle rules and remove outdated rule (T206408 T206914)]]'
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

11:26:11 Checking for new runtime errors locally
11:26:11 Copying to deploy1001.eqiad.wmnet from deploy1001.eqiad.wmnet
11:26:11 Started rsync common
11:26:12 Finished rsync common (duration: 00m 00s)
11:26:12 Started cache_git_info
11:26:17 Finished cache_git_info (duration: 00m 05s)
11:26:17 Started sync-masters
sync-masters: 100% (ok: 1; fail: 0; left: 0)                                    
11:26:24 Finished sync-masters (duration: 00m 07s)
11:26:24 Started sync-pull-masters
sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)                               
11:26:26 Finished sync-pull-masters (duration: 00m 01s)
11:26:26 Started sync-check-canaries
check-canaries: 100% (ok: 11; fail: 0; left: 0)                                 
11:26:29 Finished Canaries Synced (duration: 00m 02s)
11:26:29 Executing check 'Check endpoints for mw1279.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1276.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1261.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1264.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mwdebug1002.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mwdebug1001.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1263.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1262.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1278.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1277.eqiad.wmnet'
11:26:29 Executing check 'Check endpoints for mw1265.eqiad.wmnet'
11:26:32 Finished Canary Endpoint Check Complete (duration: 00m 02s)
11:26:32 Waiting for canary traffic...
11:26:49 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
11:26:49 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
11:26:49 Check 'Logstash Error rate for mw1279.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00)

11:26:49 Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 71% OVER_THRESHOLD (Avg. Error rate: Before: 0.05, After: 3.50, Threshold: 1.00)

11:26:49 Check 'Logstash Error rate for mw1277.eqiad.wmnet' failed: ERROR: 66% OVER_THRESHOLD (Avg. Error rate: Before: 0.05, After: 3.00, Threshold: 1.00)

11:26:49 Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 33% OVER_THRESHOLD (Avg. Error rate: Before: 0.04, After: 1.50, Threshold: 1.00)

11:26:49 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
11:26:49 Finished sync-check-canaries (duration: 00m 23s)
11:26:49 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 342, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 716, in main
    return super(SyncFile, self).main(*extra_args)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 95, in main
    self.canary_checks(canaries, timer)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 451, in canary_checks
    raise RuntimeError(canary_fail_msg)
RuntimeError: scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
11:26:49 sync-file failed: <RuntimeError> scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)

Event Timeline

The period of exceptions in logstash

https://logstash.wikimedia.org/goto/79a26df78d4623a8f5c1f373598fbde1
https://logstash.wikimedia.org/goto/4d6b1ec854cf50d1f3f89086a2f8b693

It looks like there was a short burt in DBPerformance logs for commonswiki.

Probably didn't relate to the patch

thcipriani subscribed.

The period of exceptions in logstash

https://logstash.wikimedia.org/goto/79a26df78d4623a8f5c1f373598fbde1
https://logstash.wikimedia.org/goto/4d6b1ec854cf50d1f3f89086a2f8b693

It looks like there was a short burt in DBPerformance logs for commonswiki.

Probably didn't relate to the patch

I agree with this assessment.

Looking at the specifics from scap output:

11:26:49 Check 'Logstash Error rate for mw1279.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00)

11:26:49 Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 71% OVER_THRESHOLD (Avg. Error rate: Before: 0.05, After: 3.50, Threshold: 1.00)

11:26:49 Check 'Logstash Error rate for mw1277.eqiad.wmnet' failed: ERROR: 66% OVER_THRESHOLD (Avg. Error rate: Before: 0.05, After: 3.00, Threshold: 1.00)

11:26:49 Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 33% OVER_THRESHOLD (Avg. Error rate: Before: 0.04, After: 1.50, Threshold: 1.00)

The "After" average 10-second error rates for each machine that failed are still pretty small overall (2, 3.5, 3, and 1.5 average errors/10 seconds after deployment). The threshold at which scap will stop deployment is max($preDeploymentErrorRate * 10.0, 1.0) average errors per 10 second period. This has worked OK for a while, but I am open to revisiting this if tasks like this begin to happen with any regularity.

I also realized the patch wasn't actually reverted on the canaries but the logs quieted down so the particular patch being deployed didn't seem to actually affect the logs on the canaries one way or another. We try to control for log anomalies on particular servers by ensuring that > 1 canary is failing before actually failing a deployment in scap; however, there are still edgecases like this one.

As this is the first occurrence of a false positive that I've seen in quite a while I'm going to decline this task for the time being without making any tuning changes to scap. If this recurs please reopen or create a new task and link and we can tinker with error thresholds.

Sorry for scap noise and thanks for stopping SWAT to investigate.