As part of https://phabricator.wikimedia.org/T123675, I was trying to isolate rdb1003 from the JobQueue pool so that it could be re-imaged with Debian Jessie.
Code review for the change: https://gerrit.wikimedia.org/r/#/c/274411/ (followed by an immediate revert)
The idea was to remove rdb1003 completely from wmf-config without failing over to its slave, to avoid running a shard without its backup. The next steps (as described in the Phab task) would have been to let the jobrunners drain the job queues and eventually to remove rdb1003 from hiera too (jobrunners config).
The plan failed miserably. Right after the sync on tin I observed a flood of NOTICE events in logstash (hhvm), including:
Notice: Undefined index: 0 in /srv/mediawiki/php-1.27.0wmf.14/includes/search/SearchExactMatchRescorer.php on line 44
Notice: JobQueueGroup::__destruct: 1 buffered job(s) of type(s) EnqueueJob never inserted. in /srv/mediawiki/php-1.27.0-wmf.14/includes/jobqueue/JobQueueGroup.php on line 421
From: https://wikitech.wikimedia.org/wiki/Server_Admin_Log (UTC):
12:07 logmsgbot: elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Remove rdb1003 from the Redis JobQueue pool for maintenance (duration: 00m 32s)
12:12 logmsgbot: elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Revert - Remove rdb1003 from the Redis JobQueue pool for maintenance (duration: 00m 28s)
Another event triggered by this issue:
12:11 <icinga-wm> PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1138 bytes in 0.056 second response time
12:16 <icinga-wm> RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1657 bytes in 0.152 second response time
I am terribly sorry to have caused this incident. I am going to pause my task until we have a clear idea of how to de-pool and re-pool rdb1003 safely (note that this task is a blocker for the codfw switchover).