Description
Related Objects
- Mentioned In
  - T164981: scap should always announce when it halts a sync due to error rate
  - T164982: scap: investigate adding a per-deploy message ID to log messages
- Mentioned Here
  - T164725: mw1264 inaccessible after reboot
  - T164980: scap should always announce when it starts changing the cluster state
  - T164981: scap should always announce when it halts a sync due to error rate
  - T164982: scap: investigate adding a per-deploy message ID to log messages
Event Timeline
tl;dr: The timestamps of when the error-rate spiked on the canary servers and when the deployment started don't make sense.
I'm having a hard time piecing together this timeline.
- 13:44:XX: The SAL records the final sync: 13:44 zfilipin@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: Wikivoyage should show related pages in footer of skin (T164391) (duration: 00m 39s)
- 13:43:55: Scap checks the logstash error rate
- 13:43:24: Scap's logstash shows you started the sync-file at 2017-05-08T13:43:24: https://logstash.wikimedia.org/goto/b7db812cf2bc874b1306d1fb50ccbeea
- 13:35:43: The error rate on the canary hosts spiked due to a missing variable in a file that hadn't been sync'd yet: https://logstash.wikimedia.org/goto/32e7760188fc8679353d9583bb82de1f
This explains why logstash_checker.py didn't think there was a significant shift in error rate over the 20-second period after it ran at 2017-05-08T13:43:55.
I wrote a script that shows what logstash_checker.py thought the appropriate error rates should have been at the exact moment that the checker ran:
https://gist.github.com/thcipriani/033c71de8ad6e3057b25078a7d812442
mean 10s error rate before deploy: 74.6834532374
mean 10s error rate after deploy: 304.0
Failure threshold mean 10s error rate: 746.834532374
Over threshold: False
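From those numbers, the checker's decision appears to reduce to a simple threshold test: fail only when the post-deploy mean exceeds the pre-deploy mean times a fixed multiplier (10x, judging by the output above, with the minimum tolerance of 1 mentioned later in this thread). This is a sketch of that apparent logic, not the actual logstash_checker.py code:

```python
def check_error_rate(before_mean, after_mean, multiplier=10.0, minimum=1.0):
    """Sketch of the apparent threshold test: the failure threshold is
    the pre-deploy mean error rate times a fixed multiplier, floored at
    a minimum tolerance. Returns (over_threshold, percent_over)."""
    threshold = max(before_mean * multiplier, minimum)
    if after_mean <= threshold:
        return False, 0.0
    return True, (after_mean - threshold) / after_mean * 100.0
```

Plugging in the numbers above: the threshold is 10 x 74.68 = 746.8, and 304.0 is below that, so the check passes even though errors roughly quadrupled. The earlier error spike had already inflated the pre-deploy mean, which inflated the threshold along with it.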
The date on tin and a random canary seem to be the same, so I have no idea what could have caused an 8 minute difference in the timestamps between the canary servers and tin.
I don't know what could have caused this.
there may be some delay in getting events into logstash but I guess that shouldn't screw up the timestamps either way?
... also, shouldn't a 4x increase in errors be enough to trigger the threshold?
When I'm doing a deployment and manually watching logstash I don't tolerate even a 2x increase in error rate.
I do recall @bd808 mentioning a potential delay in getting events to logstash, but this looks like the opposite: events arriving ahead of the deploy. Also, an 8-minute difference between tin and all 6 canary servers is weird.
> ... also, shouldn't a 4x increase in errors be enough to trigger the threshold?
> When I'm doing a deployment and manually watching logstash I don't tolerate even a 2x increase in error rate.
This gets tricky. If the mean 10-second error rate before deployment is 0.1 and after it's 0.2, 0.4, or even 1 (the current minimum tolerance), cosmic rays could account for that kind of difference. It may be worthwhile to inversely vary the multiplier with the pre-deploy error rate, i.e. the larger the pre-deploy error rate, the smaller the multiplier needed to halt a deploy, but that's some fiddling for another task :)
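One way that idea might be sketched (the decay curve and all constants here are invented for illustration; nothing like this exists in scap today):

```python
import math

def adaptive_multiplier(before_mean, max_mult=10.0, min_mult=2.0):
    """Illustrative only: shrink the failure multiplier as the pre-deploy
    error rate grows, so already-noisy periods are judged more strictly
    in relative terms. The log1p decay is an arbitrary choice."""
    return min_mult + (max_mult - min_mult) / (1.0 + math.log1p(before_mean))
```

A quiet service (pre-deploy mean near 0) keeps the full 10x headroom, while a service already logging heavily would be halted on a much smaller relative jump.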
The early timestamps are in the raw syslog logs on mwlog1001, so I don't think logstash is doing anything funky here:
$ grep wmgRelatedArticlesUseCirrusSearch hhvm.log-20170509 | awk '{print $3, $4}' | head -500 | sort | uniq -c
     28 13:35:41 mw1261:
     38 13:35:41 mw1262:
     32 13:35:41 mw1263:
     40 13:35:41 mw1265:
    172 13:35:41 mw1276:
     40 13:35:41 mw1277:
     30 13:35:41 mw1278:
    120 13:35:41 mw1279:
At 2017-05-08T13:35:41Z, scap was running and changing files on those hosts:
$ grep 13:35:41 scap.log-20170509 | wc -l
44
$ grep 13:35:41 scap.log-20170509 | head -1 | jq .
{
  "script": "/usr/bin/scap",
  "process": 9869,
  "@version": 1,
  "@timestamp": "2017-05-08T13:35:41Z",
  "module": "tasks",
  "funcName": "sync_common",
  "host": "mw1262",
  "pathname": "/usr/lib/python2.7/dist-packages/scap/tasks.py",
  "user": "mwdeploy",
  "message": "Copying to mw1262.eqiad.wmnet from tin.eqiad.wmnet",
  "filename": "tasks.py",
  "type": "scap",
  "levelname": "INFO",
  "channel": "sync_common",
  "lineno": 401
}
These were canary syncs that started at 2017-05-08T13:35:39Z:
{
  "script": "/usr/bin/scap",
  "process": 1751,
  "@version": 1,
  "@timestamp": "2017-05-08T13:35:39Z",
  "module": "log",
  "funcName": "__enter__",
  "host": "tin",
  "pathname": "/usr/lib/python2.7/dist-packages/scap/log.py",
  "user": "zfilipin",
  "message": "Started sync-check-canaries",
  "filename": "log.py",
  "type": "scap",
  "levelname": "INFO",
  "channel": "timer",
  "lineno": 724
}
The triggering scap command started at 2017-05-08T13:35:30Z as far as I can tell. The record immediately before that was timestamped 2017-05-08T13:32:19Z from mwdebug1002.
{
  "script": "/usr/bin/scap",
  "process": 1751,
  "@version": 1,
  "@timestamp": "2017-05-08T13:35:30Z",
  "module": "tasks",
  "funcName": "check_valid_syntax",
  "host": "tin",
  "pathname": "/usr/lib/python2.7/dist-packages/scap/tasks.py",
  "user": "zfilipin",
  "message": "Running command: `find -O2 '/srv/mediawiki-staging/wmf-config/CommonSettings.php' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null`",
  "filename": "tasks.py",
  "type": "scap",
  "levelname": "DEBUG",
  "channel": "check_php_syntax",
  "lineno": 149
}
This was apparently related to the sync-file of wmf-config/InitialiseSettings.php that finished at 2017-05-08T13:44:04Z. I started digging in further though and found something revealing:
$ grep T13:36 scap.log-20170509 | grep -v mwdebug1002 | jq '. | .["@timestamp"] +" "+ .channel +" "+ .host +" "+ .message'
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 44% OVER_THRESHOLD (Avg. Error rate: Before: 5.80, After: 105.00, Threshold: 58.00)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1261.eqiad.wmnet' failed: ERROR: 91% OVER_THRESHOLD (Avg. Error rate: Before: 0.52, After: 59.50, Threshold: 5.19)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 87% OVER_THRESHOLD (Avg. Error rate: Before: 1.80, After: 145.00, Threshold: 17.95)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 94% OVER_THRESHOLD (Avg. Error rate: Before: 0.41, After: 81.00, Threshold: 4.10)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1262.eqiad.wmnet' failed: ERROR: 96% OVER_THRESHOLD (Avg. Error rate: Before: 0.31, After: 95.00, Threshold: 3.05)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1277.eqiad.wmnet' failed: ERROR: 94% OVER_THRESHOLD (Avg. Error rate: Before: 1.42, After: 252.00, Threshold: 14.18)\n"
"2017-05-08T13:36:02Z root tin Check 'Logstash Error rate for mw1265.eqiad.wmnet' failed: ERROR: 58% OVER_THRESHOLD (Avg. Error rate: Before: 2.21, After: 53.50, Threshold: 22.08)\n"
"2017-05-08T13:36:02Z sync-file tin Unhandled error:"
"2017-05-08T13:36:02Z sync-file tin sync-file failed: <RuntimeError> 7 test canaries had check failures (rerun with --force to override this check)"
So the canary check worked and halted the original sync, but then at 2017-05-08T13:37:44Z the sync-file was run again:
"2017-05-08T13:37:44Z check_php_syntax tin Running command: `find -O2 '/srv/mediawiki-staging/wmf-config/InitialiseSettings.php' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null`"
"2017-05-08T13:37:45Z sync_common tin Copying to tin.eqiad.wmnet from tin.eqiad.wmnet"
"2017-05-08T13:37:45Z sync_common tin Running rsync command: `sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/cache/l10n/*.cdb --exclude=*.swp --no-perms --exclude=**/.git --include=/wmf-config --include=/wmf-config/InitialiseSettings.php --include=/php-*/cache/gitinfo --exclude=* tin.eqiad.wmnet::common /srv/mediawiki`"
And it failed again due to canary error rate:
$ grep T13:38 scap.log-20170509 | grep -v mwdebug1002 | jq '. | .["@timestamp"] +" "+ .channel +" "+ .host +" "+ .message'
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'"
"2017-05-08T13:38:16Z root tin Check 'Logstash Error rate for mw1261.eqiad.wmnet' failed: ERROR: 15% OVER_THRESHOLD (Avg. Error rate: Before: 6.31, After: 74.50, Threshold: 63.07)\n"
"2017-05-08T13:38:16Z sync-file tin Unhandled error:"
"2017-05-08T13:38:16Z sync-file tin sync-file failed: <RuntimeError> 1 test canaries had check failures (rerun with --force to override this check)"
Finally it was run a third time starting at 2017-05-08T13:43:24Z:
$ grep T13:4[34] scap.log-20170509 | grep -v mwdebug1002 | jq '. | .["@timestamp"] +" "+ .channel +" "+ .host +" "+ .message'
"2017-05-08T13:43:24Z check_php_syntax tin Running command: `find -O2 '/srv/mediawiki-staging/wmf-config/InitialiseSettings.php' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P12 -exec php -l >/dev/null`"
"2017-05-08T13:43:24Z sync_common tin Copying to tin.eqiad.wmnet from tin.eqiad.wmnet"
"2017-05-08T13:43:24Z sync_common tin Running rsync command: `sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/cache/l10n/*.cdb --exclude=*.swp --no-perms --exclude=**/.git --include=/wmf-config --include=/wmf-config/InitialiseSettings.php --include=/php-*/cache/gitinfo --exclude=* tin.eqiad.wmnet::common /srv/mediawiki`"
[...snip...]
"2017-05-08T13:43:34Z sync-file tin Waiting for canary traffic..."
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'"
"2017-05-08T13:43:55Z root tin Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'"
"2017-05-08T13:43:55Z timer tin Started sync-proxies"
[...snip...]
"2017-05-08T13:44:04Z timer tin Finished sync-apaches (duration: 00m 06s)"
"2017-05-08T13:44:04Z scap.announce tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:351664|Wikivoyage should show related pages in footer of skin (T164391)]] (duration: 00m 39s)"
[...snip...]
This third attempt finally succeeded, likely because of the error-rate math explained by @thcipriani in T164754#3249647.
Recommendations:
- scap should always announce when it starts changing the cluster state.
- scap should always announce when it halts a sync due to error rate.
Random note: it would be much easier to thread through scap logs if each scap ... action generated a unique ID that was added to every log message.
+1000
@bd808: I apologize for my involvement in sending you down that rabbit hole. I am not sorry, though, that you came back with a rabbit! Nicely done, I must say.
@bd808 thank you for the masterful grepping! I was really confused about how this could have happened.
Closing this as invalid since scap did attempt to halt the deploy. The sliding scale for the error-rate calculation will continue to be a problem until MediaWiki deploys can roll back canaries to their previously deployed known-good state.
Oops. In my defense, when trying to deploy the first time, scap said mw1264 was down. I pinged moritzm and he depooled it. I then tried to deploy again and it worked. I do not remember deploying the same thing three times, but I guess logs don't lie.
13:42 moritzm: depooled mw1264 (set to inactive), since the host is down (T164725)
https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2017-05-08
https://wm-bot.wmflabs.org/logs/%23wikimedia-operations/20170508.txt
I guess I was so confused by mw1264 being down that I did not notice the log spam. I did deploy to mwdebug1002, so maybe that counts as the third deployment.