
Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!)
Closed, ResolvedPublic

Description

rdb1002 is a slave used by OCG and needs to be handled separately; please do not consider it for this task.

rdb1001/1003/1004 are still on precise.
rdb1005/1006 are still on trusty.

Reinstall them with Jessie.

Event Timeline

Dzahn updated the task description.
Dzahn set the priority of this task to Needs Triage.
Dzahn added a project: Operations.
elukey added a subscriber: elukey. Edited Feb 26 2016, 10:04 AM

Good conversation to save in this task as FYI:

<elukey> is there a known procedure to pool/de-pool rdbXXXX hosts from the jobrunner queue pools? (I am reading https://phabricator.wikimedia.org/T123675)
10:51  <elukey> I mean, without causing an outage like I do usually
10:52  <ori> elukey: comment it out in jobqueue-eqiad.php and (to be polite) wait for the jobs to drain
10:52  <_joe_> ori: is that really enough?
10:53  <_joe_> or the jobrunners would stop consuming from those queues?
10:53  <_joe_> once you comment them out
10:54  <ori> it is really enough -- MediaWiki doesn't read jobs back; it just writes them. So if the hashing scheme is perturbed, it doesn't matter; the important thing is that the load is balanced
10:54  <ori> and the jobrunners would not stop consuming from those queues; they are configured elsewhere
10:54  <_joe_> ori: uhm right
10:54  <_joe_> just checked
10:54  <ori> hieradata/eqiad/mediawiki/jobrunner.yaml
10:54  <_joe_> yeah we were smart enough to separate that
10:54  <_joe_> ok I completely forgot it
10:54  <_joe_> thanks :)

So MediaWiki writes jobs to the rdbXXXX hosts listed in jobqueue-eqiad.php, while the jobrunners consume the jobs to execute from the hosts listed in hieradata/eqiad/mediawiki/jobrunner.yaml.
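
To make ori's point about the hashing scheme concrete, here is a toy model in Python. It is only a sketch: the md5-modulo scheme and the `pick_server` helper are assumptions made for illustration, not MediaWiki's actual partitioning code, and the host list is just the set of masters named in this task. It shows that dropping one host from the write list moves many jobs to a different server, but since writers never look jobs up again, the only thing that matters is that the remaining hosts still share the load evenly.

```python
import hashlib
from collections import Counter

def pick_server(job_key, servers):
    """Toy partitioner (assumed, not MediaWiki's): hash the key, take it modulo the pool size."""
    digest = hashlib.md5(job_key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def distribution(job_keys, servers):
    """Count how many keys land on each server."""
    return Counter(pick_server(key, servers) for key in job_keys)

# Illustrative write pools: before and after commenting out rdb1003.
full_pool = ["rdb1001", "rdb1003", "rdb1005", "rdb1007"]
reduced_pool = ["rdb1001", "rdb1005", "rdb1007"]

keys = [f"job-{i}" for i in range(10000)]
before = distribution(keys, full_pool)
after = distribution(keys, reduced_pool)

# Many keys change server when the pool shrinks...
moved = sum(pick_server(k, full_pool) != pick_server(k, reduced_pool) for k in keys)

# ...but every new write still lands on one of the remaining hosts.
print("before:", dict(before))
print("after: ", dict(after))
print("keys that changed server:", moved)
```

In this model rdb1003 simply stops receiving new writes, while jobs already stored on it stay there until the jobrunners consume them.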

Each host needs to be re-imaged separately, de-pooling it from the two configuration files in the following order:

  1. Comment the host out in jobqueue-eqiad.php, so that MediaWiki stops publishing jobs to its queues.
  2. Wait for all the queued jobs to be completed.
  3. Remove the host from hieradata/eqiad/mediawiki/jobrunner.yaml.

The same procedure in reverse needs to be used to put the host back in service.
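
The ordering above can be written down as a small checklist generator. This is a minimal sketch: the function names are hypothetical, the steps are plain strings rather than real automation, and the repool order (consumers first, then writers) is my reading of "the same procedure in reverse".

```python
# The two configuration files named in this task.
MW_WRITE_CONFIG = "jobqueue-eqiad.php"
JOBRUNNER_CONFIG = "hieradata/eqiad/mediawiki/jobrunner.yaml"

def depool_steps(host):
    """Ordered steps to take a Redis job queue host out of service."""
    return [
        f"comment out {host} in {MW_WRITE_CONFIG} (MediaWiki stops writing jobs)",
        f"wait for the jobs already queued on {host} to drain",
        f"remove {host} from {JOBRUNNER_CONFIG} (jobrunners stop consuming)",
    ]

def repool_steps(host):
    """Reverse order: restore the consumers first, then re-enable the writes."""
    return [
        f"add {host} back to {JOBRUNNER_CONFIG} (jobrunners consume again)",
        f"uncomment {host} in {MW_WRITE_CONFIG} (MediaWiki writes jobs again)",
    ]

for number, step in enumerate(depool_steps("rdb1003"), start=1):
    print(f"{number}. {step}")
```

Keeping the jobrunner removal as the last depool step means no queued job is stranded on a host that nothing consumes from.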

elukey renamed this task from "reinstall redis servers with jessie" to "Reinstall redis servers (Job queues) with Jessie". Feb 26 2016, 11:54 AM
elukey triaged this task as Normal priority. Feb 29 2016, 12:53 PM
elukey claimed this task. Mar 1 2016, 6:18 PM
elukey renamed this task from "Reinstall redis servers (Job queues) with Jessie" to "Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!)". Mar 1 2016, 6:21 PM
elukey updated the task description.
elukey added a comment. Mar 2 2016, 7:48 AM

Slave summary:

rdb1008 slaveof rdb1007
rdb1004 slaveof rdb1003
rdb1006 slaveof rdb1005
rdb1002 slaveof rdb1001 (but 1002 will not be touched in this task)

elukey@neodymium:~$ sudo -i salt -t 120 rdb100* cmd.run 'egrep "^slaveof" /etc/redis/*.conf'
rdb1007.eqiad.wmnet:
rdb1008.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1007 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1007 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1007 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1007 6381
rdb1005.eqiad.wmnet:
rdb1004.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1003 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1003 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1003 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1003 6381
rdb1003.eqiad.wmnet:
rdb1006.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1005 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1005 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1005 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1005 6381
rdb1001.eqiad.wmnet:
rdb1002.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1001 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1001 6379
    /etc/redis/tcp_6380.conf:slaveof rdb1001 6380
    /etc/redis/tcp_6381.conf:slaveof rdb1001 6381
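
The slave summary can be derived mechanically from output shaped like the salt run above. The following Python sketch assumes exactly that layout (an unindented `host:` header per minion, followed by indented `slaveof` lines); `parse_slaveof` is a hypothetical helper, not part of salt.

```python
import re

def parse_slaveof(salt_output: str) -> dict[str, set[str]]:
    """Map each slave host to the set of masters named in its slaveof lines."""
    replication: dict[str, set[str]] = {}
    current_host = None
    for line in salt_output.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # Minion header such as "rdb1008.eqiad.wmnet:"; keep the short hostname.
            current_host = line.strip().rstrip(":").split(".")[0]
            continue
        match = re.search(r"slaveof\s+(\S+)\s+(\d+)", line)
        if match and current_host:
            replication.setdefault(current_host, set()).add(match.group(1))
    return replication

# Shortened excerpt of the output above.
SAMPLE = """\
rdb1007.eqiad.wmnet:
rdb1008.eqiad.wmnet:
    /etc/redis/tcp_6378.conf:slaveof rdb1007 6378
    /etc/redis/tcp_6379.conf:slaveof rdb1007 6379
rdb1003.eqiad.wmnet:
"""

print(parse_slaveof(SAMPLE))  # {'rdb1008': {'rdb1007'}}
```

Hosts that return no `slaveof` lines (the masters) simply do not appear in the resulting map.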
elukey added a comment. Mar 2 2016, 7:55 AM

MediaWiki Job queue config:

https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/jobqueue-eqiad.php

All the slaves are commented out:

~/WikimediaSource/mediawiki-config git grep rdb100[8462]
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1002.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1002.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1002.eqiad.wmnet:6381', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1004.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1004.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1004.eqiad.wmnet:6381', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1008.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1008.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1008.eqiad.wmnet:6381', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1006.eqiad.wmnet:6379', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1006.eqiad.wmnet:6380', # slave
wmf-config/jobqueue-eqiad.php: #'redisServer' => 'rdb1006.eqiad.wmnet:6381', # slave
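
A check like this can also be scripted. The Python sketch below classifies `'redisServer'` entries as active or commented out. It assumes entries look like the grep output above and that comments use `#` only. The uncommented sample line is a hypothetical illustration, since the active entries are not quoted in this task.

```python
import re

# Matches lines such as:  #'redisServer' => 'rdb1004.eqiad.wmnet:6379', # slave
SERVER_RE = re.compile(r"^\s*(#)?\s*'redisServer'\s*=>\s*'([^':]+):(\d+)'")

def classify_servers(config_text):
    """Return (active_hosts, commented_hosts) as sets of short hostnames."""
    active, commented = set(), set()
    for line in config_text.splitlines():
        match = SERVER_RE.match(line)
        if not match:
            continue
        host = match.group(2).split(".")[0]
        (commented if match.group(1) else active).add(host)
    return active, commented

SAMPLE_CONFIG = """
 'redisServer' => 'rdb1003.eqiad.wmnet:6379',   # hypothetical active entry
 #'redisServer' => 'rdb1004.eqiad.wmnet:6379', # slave
 #'redisServer' => 'rdb1004.eqiad.wmnet:6380', # slave
"""

active, commented = classify_servers(SAMPLE_CONFIG)
slaves = {"rdb1002", "rdb1004", "rdb1006", "rdb1008"}
print("slaves still receiving writes:", sorted(active & slaves))
```

An empty list in the final line would confirm that MediaWiki writes to none of the slaves.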

The slaves are also not mentioned in the job runner config:

https://github.com/wikimedia/operations-puppet/blob/production/hieradata/eqiad/mediawiki/jobrunner.yaml

Joe added a subscriber: Joe. Edited Mar 2 2016, 8:07 AM

@elukey let's start with the following:

  1. rdb1004 (precise)
  2. rdb1006 (trusty)
  3. rdb1003
  4. rdb1005 after we've confirmed the replication to 1006 works
  5. rdb1002 after moving out ocg (I am on it)
  6. rdb1001

Once we have successfully reimaged the first slave, we should immediately remove the corresponding shards from mediawiki-config, in order to be able to drain the job queues (and evaluate how long it takes).

Does this sound like a plan?

Change 274350 had a related patch set uploaded (by Elukey):
Add Debian Jessie PXE support for rdb servers (MW Job Queues).

https://gerrit.wikimedia.org/r/274350

Change 274350 merged by Elukey:
Add Debian Jessie PXE support for rdb servers (MW Job Queues).

https://gerrit.wikimedia.org/r/274350

elukey added a comment. Mar 2 2016, 3:38 PM

Done:

rdb1004 (precise)
rdb1006 (trusty)

Remaining:

rdb1003
rdb1005 after we've confirmed the replication to 1006 works
rdb1002 after moving out ocg (I am on it)
rdb1001

Change 274411 had a related patch set uploaded (by Elukey):
Remove rdb1003 from the job queue pool for maintenance.

https://gerrit.wikimedia.org/r/274411

Change 274411 merged by Elukey:
Remove rdb1003 from the job queue pool for maintenance.

https://gerrit.wikimedia.org/r/274411

Change 276105 had a related patch set uploaded (by Giuseppe Lavagetto):
Reroute jobqueue writes from rdb1003 to rdb1005

https://gerrit.wikimedia.org/r/276105

Change 276105 merged by Giuseppe Lavagetto:
Reroute jobqueue writes from rdb1003 to rdb1005

https://gerrit.wikimedia.org/r/276105

Change 276452 had a related patch set uploaded (by Elukey):
Remove rdb1001 from the Redis Job Queues for maintenance.

https://gerrit.wikimedia.org/r/276452

Change 276452 abandoned by Elukey:
Remove rdb1001 from the Redis Job Queues for maintenance.

Reason:
Will go for rdb1005 first

https://gerrit.wikimedia.org/r/276452

Change 278244 had a related patch set uploaded (by Elukey):
Remove rdb1005 from the Job Queue pool for maintenance.

https://gerrit.wikimedia.org/r/278244

Change 278245 had a related patch set uploaded (by Elukey):
Remove rdb1005 from the Job Runner configs for maintenance.

https://gerrit.wikimedia.org/r/278245

Change 278245 abandoned by Elukey:
Remove rdb1005 from the Job Runner configs for maintenance.

https://gerrit.wikimedia.org/r/278245

Change 278246 had a related patch set uploaded (by Elukey):
Remove rdb1005 from the Job Runners config for maintenance.

https://gerrit.wikimedia.org/r/278246

Change 278244 merged by Elukey:
Remove rdb1005 from the Job Queue pool for maintenance.

https://gerrit.wikimedia.org/r/278244

Change 278246 merged by Elukey:
Remove rdb1005 from the Job Runners config for maintenance.

https://gerrit.wikimedia.org/r/278246

Change 278888 had a related patch set uploaded (by Elukey):
Remove rdb1001 from the Media Wiki Job Queue pool for maintenance.

https://gerrit.wikimedia.org/r/278888

Change 278889 had a related patch set uploaded (by Elukey):
Remove rdb1001 from the Job runners config for maintenance.

https://gerrit.wikimedia.org/r/278889

Change 278888 merged by Elukey:
Remove rdb1001 from the Media Wiki Job Queue pool for maintenance.

https://gerrit.wikimedia.org/r/278888

Change 278889 merged by Elukey:
Remove rdb1001 from the Job runners config for maintenance.

https://gerrit.wikimedia.org/r/278889

Change 278931 had a related patch set uploaded (by Elukey):
Notify Jobchron when jobrunner.conf changes.

https://gerrit.wikimedia.org/r/278931

All the Redis Job Queue hosts are now on Debian!

elukey closed this task as Resolved. Mar 22 2016, 4:50 PM

Change 278931 merged by Elukey:
Notify Jobchron when jobrunner.conf changes.

https://gerrit.wikimedia.org/r/278931