
scap did not catch `Notice: Undefined variable: wmgRelatedArticlesShowInSidebar in /srv/mediawiki/wmf-config/CommonSettings.php on line 2893`
Closed, Invalid · Public

Description

Gerrit change 351664 introduced this error message:

Notice: Undefined variable: wmgRelatedArticlesShowInSidebar in /srv/mediawiki/wmf-config/CommonSettings.php on line 2893

It was fixed in change 352596.

I expected scap to prevent me from deploying a change that causes this much log spam.

Event Timeline

thcipriani triaged this task as Medium priority.May 8 2017, 3:36 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.
Qse24h closed this task as a duplicate of T164723: New git repository: <repo name>.

tl;dr: The timestamps of when the error-rate spiked on the canary servers and when the deployment started don't make sense.

I'm having a hard time piecing together this timeline.

This explains why logstash_checker.py didn't think there was a significant shift in error rate over the 20-second period after it ran at 2017-05-08T13:43:55.

I wrote a script that shows what logstash_checker.py thought the appropriate error rates should have been at the exact moment that the checker ran:
https://gist.github.com/thcipriani/033c71de8ad6e3057b25078a7d812442

mean 10s error rate before deploy: 74.6834532374
mean 10s error rate after deploy: 304.0
Failure threshold mean 10s error rate: 746.834532374
Over threshold: False
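
The gist's numbers imply a simple comparison: the failure threshold is ten times the pre-deploy mean, and the check only fails when the post-deploy mean exceeds it. A minimal sketch of that comparison, reconstructed from the output above (the 10x multiplier and the function name are assumptions for illustration, not scap's actual code):

```python
def over_threshold(mean_before, mean_after, multiplier=10.0):
    """Return True if the post-deploy error rate exceeds the failure threshold.

    The threshold is assumed to be `multiplier` times the pre-deploy mean,
    which matches the 74.68 -> 746.83 relationship in the gist output.
    """
    threshold = mean_before * multiplier
    return mean_after > threshold

# Numbers from the gist: 304.0 < 746.83, so the checker reports no failure.
print(over_threshold(74.6834532374, 304.0))  # False
```

With a noisy pre-deploy baseline of ~75 errors per 10 seconds, even a 4x jump to 304 stays comfortably under the 10x threshold, which is exactly the problem discussed below.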

The dates on tin and a random canary seem to be the same, so I have no idea what could have caused an 8-minute difference in timestamps between the canary servers and tin.

I don't know what could have caused this.

there may be some delay in getting events into logstash but I guess that shouldn't screw up the timestamps either way?

... also, shouldn't a 4x increase in errors be enough to trigger the threshold?

When I'm doing a deployment and manually watching logstash I don't tolerate even a 2x increase in error rate.

> there may be some delay in getting events into logstash but I guess that shouldn't screw up the timestamps either way?

I do recall @bd808 mentioning a potential delay in getting events into logstash, but this looks like the opposite: events arriving early. Also: an 8-minute difference between tin and all 6 canary servers is weird.

> ... also, shouldn't a 4x increase in errors be enough to trigger the threshold?
>
> When I'm doing a deployment and manually watching logstash I don't tolerate even a 2x increase in error rate.

This gets tricky. If the mean 10-second error rate before deployment is 0.1, and after it's 0.2 or 0.4 or even 1 (the current minimum tolerance), cosmic rays could account for that kind of difference. It may be worthwhile to vary the multiplier inversely with the pre-deploy error rate, i.e. the larger the pre-deploy error rate, the smaller the multiplier needed to halt a deploy, but that's some fiddling for another task :)

The early timestamps are in the raw syslog logs on mwlog1001, so I don't think logstash is doing anything funky here:

$ grep wmgRelatedArticlesUseCirrusSearch hhvm.log-20170509|awk '{print $3, $4}'|head -500|sort|uniq -c
     28 13:35:41 mw1261:
     38 13:35:41 mw1262:
     32 13:35:41 mw1263:
     40 13:35:41 mw1265:
    172 13:35:41 mw1276:
     40 13:35:41 mw1277:
     30 13:35:41 mw1278:
    120 13:35:41 mw1279:

At 2017-05-08T13:35:41Z, scap was running and changing files on those hosts:

$ grep 13:35:41 scap.log-20170509|wc -l
44
$ grep 13:35:41 scap.log-20170509|head -1|jq .
{
  "script": "/usr/bin/scap",
  "process": 9869,
  "@version": 1,
  "@timestamp": "2017-05-08T13:35:41Z",
  "module": "tasks",
  "funcName": "sync_common",
  "host": "mw1262",
  "pathname": "/usr/lib/python2.7/dist-packages/scap/tasks.py",
  "user": "mwdeploy",
  "message": "Copying to mw1262.eqiad.wmnet from tin.eqiad.wmnet",
  "filename": "tasks.py",
  "type": "scap",
  "levelname": "INFO",
  "channel": "sync_common",
  "lineno": 401
}

These were canary syncs that started at 2017-05-08T13:35:39Z:

{
  "script": "/usr/bin/scap",
  "process": 1751,
  "@version": 1,
  "@timestamp": "2017-05-08T13:35:39Z",
  "module": "log",
  "funcName": "__enter__",
  "host": "tin",
  "pathname": "/usr/lib/python2.7/dist-packages/scap/log.py",
  "user": "zfilipin",
  "message": "Started sync-check-canaries",
  "filename": "log.py",
  "type": "scap",
  "levelname": "INFO",
  "channel": "timer",
  "lineno": 724
}

The triggering scap command started at 2017-05-08T13:35:30Z as far as I can tell. The record immediately before that was timestamped 2017-05-08T13:32:19Z from mwdebug1002.

{
  "script": "/usr/bin/scap",
  "process": 1751,
  "@version": 1,
  "@timestamp": "2017-05-08T13:35:30Z",
  "module": "tasks",
  "funcName": "check_valid_syntax",
  "host": "tin",
  "pathname": "/usr/lib/python2.7/dist-packages/scap/tasks.py",
  "user": "zfilipin",
  "message": "Running command: `find -O2 '/srv/mediawiki-staging/wmf-config/CommonSettings.php' -not -type d -name '*.php' -not -name 'autoload_static.php'  -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null`",
  "filename": "tasks.py",
  "type": "scap",
  "levelname": "DEBUG",
  "channel": "check_php_syntax",
  "lineno": 149
}

This was apparently related to the sync-file of wmf-config/InitialiseSettings.php that finished at 2017-05-08T13:44:04Z. Digging in further, though, I found something revealing:

$ grep T13:36 scap.log-20170509|grep -v mwdebug1002|jq '. | .["@timestamp"] +" "+ .channel +" "+ .host +" "+ .message'
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 44% OVER_THRESHOLD (Avg. Error rate: Before: 5.80, After: 105.00, Threshold: 58.00)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1261.eqiad.wmnet' failed: ERROR: 91% OVER_THRESHOLD (Avg. Error rate: Before: 0.52, After: 59.50, Threshold: 5.19)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 87% OVER_THRESHOLD (Avg. Error rate: Before: 1.80, After: 145.00, Threshold: 17.95)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 94% OVER_THRESHOLD (Avg. Error rate: Before: 0.41, After: 81.00, Threshold: 4.10)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1262.eqiad.wmnet' failed: ERROR: 96% OVER_THRESHOLD (Avg. Error rate: Before: 0.31, After: 95.00, Threshold: 3.05)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1277.eqiad.wmnet' failed: ERROR: 94% OVER_THRESHOLD (Avg. Error rate: Before: 1.42, After: 252.00, Threshold: 14.18)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1265.eqiad.wmnet' failed: ERROR: 58% OVER_THRESHOLD (Avg. Error rate: Before: 2.21, After: 53.50, Threshold: 22.08)\n"
"2017-05-08T13:36:02Z sync-file tin Unhandled error:"
"2017-05-08T13:36:02Z sync-file tin sync-file failed: <RuntimeError> 7 test canaries had check failures (rerun with --force to override this check)"
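
For what it's worth, the OVER_THRESHOLD percentages in these check messages are consistent with measuring how far the post-deploy rate exceeds the threshold, relative to the post-deploy rate. This is a reconstruction from the logged numbers, not taken from scap's source:

```python
def over_threshold_pct(after, threshold):
    """Percentage by which `after` exceeds `threshold`, relative to `after`.

    Reconstructed to match the log lines above; assumes truncation to int.
    """
    return int((after - threshold) / after * 100)

# Matches the failures logged above:
print(over_threshold_pct(105.00, 58.00))  # 44 (mw1276)
print(over_threshold_pct(59.50, 5.19))    # 91 (mw1261)
print(over_threshold_pct(95.00, 3.05))    # 96 (mw1262)
```
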

So the canary check worked and halted the original sync, but then at 2017-05-08T13:37:44Z the sync-file was run again:

"2017-05-08T13:37:44Z check_php_syntax tin Running command: `find -O2 '/srv/mediawiki-staging/wmf-config/InitialiseSettings.php' -not -type d -name '*.php' -not -name 'autoload_static.php'  -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null`"
"2017-05-08T13:37:45Z sync_common tin Copying to tin.eqiad.wmnet from tin.eqiad.wmnet"
"2017-05-08T13:37:45Z sync_common tin Running rsync command: `sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/cache/l10n/*.cdb --exclude=*.swp --no-perms --exclude=**/.git --include=/wmf-config --include=/wmf-config/InitialiseSettings.php --include=/php-*/cache/gitinfo --exclude=* tin.eqiad.wmnet::common /srv/mediawiki`"

And it failed again due to canary error rate:

$ grep T13:38 scap.log-20170509|grep -v mwdebug1002|jq '. | .["@timestamp"] +" "+ .channel +" "+ .host +" "+ .message'
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Check 'Logstash Error rate for mw1261.eqiad.wmnet' failed: ERROR: 15% OVER_THRESHOLD (Avg. Error rate: Before: 6.31, After: 74.50, Threshold: 63.07)\n"
"2017-05-08T13:38:16Z sync-file tin Unhandled error:"
"2017-05-08T13:38:16Z sync-file tin sync-file failed: <RuntimeError> 1 test canaries had check failures (rerun with --force to override this check)"

Finally it was run a third time starting at 2017-05-08T13:43:24Z:

$ grep T13:4[34] scap.log-20170509|grep -v mwdebug1002|jq '. | .["@timestamp"] +" "+ .channel +" "+ .host +" "+ .message'
"2017-05-08T13:43:24Z check_php_syntax tin Running command: `find -O2 '/srv/mediawiki-staging/wmf-config/InitialiseSettings.php' -not -type d -name '*.php' -not -name 'autoload_static.php'  -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null`"
"2017-05-08T13:43:24Z sync_common tin Copying to tin.eqiad.wmnet from tin.eqiad.wmnet"
"2017-05-08T13:43:24Z sync_common tin Running rsync command: `sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/cache/l10n/*.cdb --exclude=*.swp --no-perms --exclude=**/.git --include=/wmf-config --include=/wmf-config/InitialiseSettings.php --include=/php-*/cache/gitinfo --exclude=* tin.eqiad.wmnet::common /srv/mediawiki`"
[...snip...]
"2017-05-08T13:43:34Z sync-file tin Waiting for canary traffic..."
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'"
"2017-05-08T13:43:55Z timer tin Started sync-proxies"
[...snip...]
"2017-05-08T13:44:04Z timer tin Finished sync-apaches (duration: 00m 06s)"
"2017-05-08T13:44:04Z scap.announce tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:351664|Wikivoyage should show related pages in footer of skin (T164391)]] (duration: 00m 39s)"
[...snip...]

This third attempt finally succeeded, likely because of the error-rate math explained by @thcipriani in T164754#3249647.

Recommendations:

  • scap should always announce when it starts changing the cluster state.
  • scap should always announce when it halts a sync due to error rate.

Random note: It would be much easier to thread through scap logs if each scap ... action generated a unique id that was added to every log message.
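
The unique-id idea could be done with a stock logging.Filter that stamps every record with an id generated once per invocation. A minimal sketch; the class and field names are hypothetical, not scap's actual logging setup:

```python
# Sketch: stamp every log record from one scap run with a shared run id,
# so all messages from a single action can be grepped together.
import logging
import uuid

class RunIdFilter(logging.Filter):
    """Attach a per-invocation run id to every record passing through."""

    def __init__(self, run_id=None):
        super().__init__()
        # Generate the id once at startup; every record gets the same value.
        self.run_id = run_id or uuid.uuid4().hex[:12]

    def filter(self, record):
        record.run_id = self.run_id  # exposed to formatters as %(run_id)s
        return True

logger = logging.getLogger("scap")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(run_id)s %(name)s %(message)s"))
handler.addFilter(RunIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Started sync-check-canaries")
```

Since filters are applied before formatting, the same id would land in the JSON records shipped to logstash as well, making the kind of grepping done above a single-field query.
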

> Recommendations:
>
>   • scap should always announce when it starts changing the cluster state.
>   • scap should always announce when it halts a sync due to error rate.
>
> Random note: It would be much easier to thread through scap logs if each scap ... action generated a unique id that was added to every log message.

+1000

@bd808: I apologize for my involvement in sending you down that rabbit hole. I am not sorry though, that you came back with a rabbit! Nicely done I must say.

@bd808 thank you for the masterful grepping! I was really confused about how this could have happened.

Closing this as invalid, as scap did attempt to halt the deploy. The sliding scale for the error-rate calculation will continue to be a problem until a MediaWiki deploy can roll back canaries to their previously deployed known-good state.

> Recommendations:
>
>   • scap should always announce when it starts changing the cluster state.
>   • scap should always announce when it halts a sync due to error rate.
>
> Random note: It would be much easier to thread through scap logs if each scap ... action generated a unique id that was added to every log message.

Filed as T164980, T164981, T164982.

> This third attempt finally succeeded, likely because of error rate math as explained by @thcipriani in T164754#3249647

Oops. In my defense, when trying to deploy the first time, scap said mw1264 is down. I have pinged moritzm and he depooled it. I have tried to deploy again and it worked. I do not remember deploying the same thing three times, but I guess logs don't lie.

13:42 moritzm: depooled mw1264 (set to inactive), since the host is down (T164725)

https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2017-05-08
https://wm-bot.wmflabs.org/logs/%23wikimedia-operations/20170508.txt

I guess I was so confused by mw1264 being down that I did not notice the log spam. I did deploy to mwdebug1002, so maybe that counts as the third deployment.