rdb1002 is a slave used by OCG and needs to be handled separately; please do not consider it for this task.
rdb1001/1003/1004 are still on precise.
rdb1005/1006 are still on trusty.
Reinstall them with Jessie.
Status | Assigned | Task
---|---|---
Invalid | None | T125673 Switch over from Eqiad to Codfw as primary datacentre
Resolved | jcrespo | T124670 Figure out and document the datacenter switchover process
Resolved | Krinkle | T124673 Figure out how to migrate the jobqueues
Resolved | Dzahn | T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved | elukey | T123675 Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!)
Resolved | elukey | T128730 Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance
Good conversation to save in this task as FYI:
<elukey> is there a known procedure to pool/de-pool rdbXXXX hosts from the jobrunner queue pools? (I am reading https://phabricator.wikimedia.org/T123675)
10:51 <elukey> I mean, without causing an outage like I do usually
10:52 <ori> elukey: comment it out in jobqueue-eqiad.php and (to be polite) wait for the jobs to drain
10:52 <_joe_> ori: is that really enough?
10:53 <_joe_> or the jobrunners would stop consuming from those queues?
10:53 <_joe_> once you comment them out
10:54 <ori> it is really enough -- MediaWiki doesn't read jobs back; it just writes them. So if the hashing scheme is perturbed, it doesn't matter; the important thing is that the load is balanced
10:54 <ori> and the jobrunners would not stop consuming from those queues; they are configured elsewhere
10:54 <_joe_> ori: uhm right
10:54 <_joe_> just checked
10:54 <ori> hieradata/eqiad/mediawiki/jobrunner.yaml
10:54 <_joe_> yeah we were smart enough to separate that
10:54 <_joe_> ok I completely forgot it
10:54 <_joe_> thanks :)
So MediaWiki writes jobs to the rdbXXXX hosts listed in jobqueue-eqiad.php, while the jobrunners consume jobs to execute from the hosts listed in hieradata/eqiad/mediawiki/jobrunner.yaml.
Each host needs to be re-imaged separately, de-pooling it from the two configuration files in the following order: first comment it out in jobqueue-eqiad.php and wait for its job queues to drain, then remove it from hieradata/eqiad/mediawiki/jobrunner.yaml.
The same procedure in reverse needs to be used to put the host back in service.
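Before re-imaging a host it is worth double-checking that it really is gone from both places. A minimal sketch, assuming local clones of operations/mediawiki-config and of the puppet repository that carries the hieradata (the checkout paths below are illustrative):

```
# Illustrative checkout locations; adjust to wherever the repos live locally.
HOST=rdb1003

# 1) MediaWiki side: the shard entries should be commented out or removed,
#    otherwise MediaWiki keeps enqueueing jobs on this host.
cd ~/mediawiki-config
git grep -n "$HOST" -- wmf-config/jobqueue-eqiad.php

# 2) Job runner side: the host should no longer be listed,
#    otherwise the jobrunners keep trying to consume from it.
cd ~/puppet
git grep -n "$HOST" -- hieradata/eqiad/mediawiki/jobrunner.yaml
```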
Slave summary:
rdb1008 slaveof rdb1007
rdb1004 slaveof rdb1003
rdb1006 slaveof rdb1005
rdb1002 slaveof rdb1001 (but 1002 will not be touched in this task)
elukey@neodymium:~$ sudo -i salt -t 120 rdb100* cmd.run 'egrep "^slaveof" /etc/redis/*.conf'
rdb1007.eqiad.wmnet:
rdb1008.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1007 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1007 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1007 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1007 6381
rdb1005.eqiad.wmnet:
rdb1004.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1003 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1003 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1003 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1003 6381
rdb1003.eqiad.wmnet:
rdb1006.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1005 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1005 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1005 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1005 6381
rdb1001.eqiad.wmnet:
rdb1002.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1001 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1001 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1001 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1001 6381
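After a slave comes back from re-imaging, replication can be checked directly with redis-cli; a rough sketch, run on the slave itself (the ports are the ones above, and authentication, if the instances require it, is omitted here):

```
# Run on the re-imaged slave (e.g. rdb1004); each instance listens on its own port.
# If the instances require a password, add "-a <password>" (omitted here).
for port in 6378 6379 6380 6381; do
    echo "== port $port =="
    redis-cli -p "$port" info replication | egrep 'role|master_host|master_link_status'
done
```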
MediaWiki Job queue config:
https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/jobqueue-eqiad.php
All the slaves are commented out:
~/WikimediaSource/mediawiki-config git grep rdb100[8462]
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1002.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1002.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1002.eqiad.wmnet:6381', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1004.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1004.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1004.eqiad.wmnet:6381', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1008.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1008.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1008.eqiad.wmnet:6381', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1006.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1006.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1006.eqiad.wmnet:6381', # slave
The slaves are also not mentioned in the job runner configs.
@elukey let's start with the following:
Once we have successfully reimaged the first slave, we should immediately remove the corresponding shards from mediawiki-config, so that we can drain the job queues (and evaluate how long that takes).
Does this sound like a plan?
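To see how fast a de-pooled master actually drains, polling the key count per instance should be enough as a rough indicator; a sketch (the hostname is illustrative, key counts are only a proxy for pending jobs, and auth is again omitted):

```
# Rough drain check against a de-pooled master (hostname is illustrative).
# DBSIZE only counts keys, so treat it as a trend indicator rather than an
# exact measure of pending jobs.
HOST=rdb1003.eqiad.wmnet
while true; do
    for port in 6378 6379 6380 6381; do
        printf '%s port %s: ' "$(date +%H:%M:%S)" "$port"
        redis-cli -h "$HOST" -p "$port" dbsize
    done
    sleep 300
done
```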
Change 274350 had a related patch set uploaded (by Elukey):
Add Debian Jessie PXE support for rdb servers (MW Job Queues).
Change 274350 merged by Elukey:
Add Debian Jessie PXE support for rdb servers (MW Job Queues).
Done:
rdb1004 (precise)
rdb1006 (trusty)
Remaining:
rdb1003
rdb1005 after we've confirmed the replication to 1006 works
rdb1002 after moving out ocg (I am on it)
rdb1001
Change 274411 had a related patch set uploaded (by Elukey):
Remove rdb1003 from the job queue pool for maintenance.
Change 274411 merged by Elukey:
Remove rdb1003 from the job queue pool for maintenance.
Change 276105 had a related patch set uploaded (by Giuseppe Lavagetto):
Reroute jobqueue writes from rdb1003 to rdb1005
Change 276105 merged by Giuseppe Lavagetto:
Reroute jobqueue writes from rdb1003 to rdb1005
Change 276452 had a related patch set uploaded (by Elukey):
Remove rdb1001 from the Redis Job Queues for maintenance.
Change 276452 abandoned by Elukey:
Remove rdb1001 from the Redis Job Queues for maintenance.
Reason:
Will go for rdb1005 first
Change 278244 had a related patch set uploaded (by Elukey):
Remove rdb1005 from the Job Queue pool for maintenance.
Change 278245 had a related patch set uploaded (by Elukey):
Remove rdb1005 from the Job Runner configs for maintenance.
Change 278245 abandoned by Elukey:
Remove rdb1005 from the Job Runner configs for maintenance.
Change 278246 had a related patch set uploaded (by Elukey):
Remove rdb1005 from the Job Runners config for maintenance.
Change 278244 merged by Elukey:
Remove rdb1005 from the Job Queue pool for maintenance.
Change 278246 merged by Elukey:
Remove rdb1005 from the Job Runners config for maintenance.
Change 278888 had a related patch set uploaded (by Elukey):
Remove rdb1001 from the Media Wiki Job Queue pool for maintenance.
Change 278889 had a related patch set uploaded (by Elukey):
Remove rdb1001 from the Job runners config for maintenance.
Change 278888 merged by Elukey:
Remove rdb1001 from the Media Wiki Job Queue pool for maintenance.
Change 278889 merged by Elukey:
Remove rdb1001 from the Job runners config for maintenance.
Change 278931 had a related patch set uploaded (by Elukey):
Notify Jobchron when jobrunner.conf changes.