
Figure out how to migrate the jobqueues
Closed, Resolved · Public

Description

At the moment, jobs get written to the redis clusters in eqiad. When we do the switchover, new jobs will be written to the redis cluster in codfw, so we need to do the following:

  1. Understand what the preferred configuration for the jobqueues is (active/active or active/passive? prevent the non-primary datacenter from consuming it?)
  2. Verify the resources available for the jobrunners (the eqiad cluster was expanded)
  3. Figure out how to carry out the switchover process (including how to drain the eqiad queue in case of need)
  4. Fix/check the configs in codfw to match what we have in eqiad (see the sketch after this list)
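
A minimal sketch of the active/passive shape asked about above, written in Python purely for illustration (mediawiki-config itself is PHP); all hostnames and names here are placeholders, not the real configuration:

```python
# Hypothetical sketch: every datacenter enqueues to its own local redis
# servers, but only the primary datacenter runs jobrunners (active/passive).
LOCAL_JOBQUEUE_REDIS = {
    'eqiad': ['rdb1001.eqiad.wmnet:6379', 'rdb1003.eqiad.wmnet:6379'],  # placeholders
    'codfw': ['rdb2001.codfw.wmnet:6379', 'rdb2003.codfw.wmnet:6379'],  # placeholders
}
PRIMARY_DC = 'eqiad'  # flipped to 'codfw' at switchover time

def jobqueue_servers(current_dc: str) -> list:
    """MediaWiki enqueues to the redis servers in its own datacenter."""
    return LOCAL_JOBQUEUE_REDIS[current_dc]

def should_run_jobrunners(current_dc: str) -> bool:
    """Only the primary datacenter consumes jobs."""
    return current_dc == PRIMARY_DC
```

Under this shape, point 1 amounts to deciding whether should_run_jobrunners may ever be true in more than one datacenter at once.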

Event Timeline

Joe created this task.Jan 25 2016, 4:59 PM
Joe raised the priority of this task from to Medium.
Joe updated the task description. (Show Details)
Joe added subscribers: faidon, Aklapper, Joe.
Joe updated the task description. (Show Details)Jan 25 2016, 5:02 PM
Joe set Security to None.
Joe added a comment. Edited Mar 8 2016, 4:42 PM

Recapitulating what I *understand* to be the best thing to do:

Prep work that still needs to be completed:

  1. Reimage the four eqiad redis masters to jessie (this should be discussed first)
  2. Add two new servers to the codfw cluster and set up encryption and replication between the datacenters (see the replication check sketched after this list)
  3. Update mediawiki-config to point to the local resources in each datacenter
  4. Ensure that the jobrunners in the non-active datacenters are not started
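
To verify point 2 once the new codfw servers replicate from eqiad, a rough check like the following could be used; this is only a sketch assuming redis-py, and the hostnames, ports and absence of authentication are assumptions rather than the production values:

```python
# Ask each codfw shard what it reports in INFO replication; a healthy replica
# should show role=slave and master_link_status=up, pointing at eqiad.
import redis

CODFW_REPLICAS = [
    ('rdb2001.codfw.wmnet', 6379),  # placeholder hosts/ports
    ('rdb2003.codfw.wmnet', 6379),
]

def check_replication(replicas=CODFW_REPLICAS):
    for host, port in replicas:
        info = redis.Redis(host=host, port=port).info('replication')
        role = info.get('role')
        master = (info.get('master_host'), info.get('master_port'))
        link = info.get('master_link_status')
        status = 'OK' if role == 'slave' and link == 'up' else 'CHECK ME'
        print(f'{host}:{port} role={role} master={master} link={link} [{status}]')

if __name__ == '__main__':
    check_replication()
```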

The switchover should follow this procedure:

  1. Jobrunners in eqiad get stopped
  2. mediawiki goes read-only - this should ensure no new job gets enqueued, right?
  3. mediawiki primary gets switched in mediawiki-config and in puppet
    1. As a consequence, the redis replication will be inverted
  4. mediawiki is set read-write in codfw
  5. We start the jobrunners in codfw and they will consume the jobs left over from eqiad (a drain-check sketch follows the list)
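
For the last step, a drain check along these lines could confirm that the leftover eqiad jobs are actually being consumed; this is only a sketch, and both the hostnames and the l-unclaimed key pattern are assumptions about how JobQueueRedis names its keys:

```python
# Sum the lengths of the unclaimed-job lists on each old-primary shard and
# report whether anything is still waiting to be consumed.
import redis

EQIAD_SHARDS = ['rdb1001.eqiad.wmnet', 'rdb1003.eqiad.wmnet']  # placeholders
UNCLAIMED_PATTERN = '*:jobqueue:*:l-unclaimed'  # assumed key naming scheme

def unclaimed_jobs(host, port=6379):
    r = redis.Redis(host=host, port=port)
    return sum(r.llen(key) for key in r.scan_iter(match=UNCLAIMED_PATTERN))

def queue_is_drained(shards=EQIAD_SHARDS):
    remaining = {host: unclaimed_jobs(host) for host in shards}
    for host, count in remaining.items():
        print(f'{host}: {count} unclaimed jobs')
    return all(count == 0 for count in remaining.values())
```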

While complex and requiring a series of manual steps, this should be manageable. I'd like others to comment on this plan: the main uncertainty for me is how shards are calculated: if anywhere we use hostname/port instead of a label for sharding, it might be a problem.

Apart from that, we should be sure that jobs will be correctly executed when restarted.
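
To make the sharding concern concrete, here is a toy example (not MediaWiki's actual sharding code): if the shard ring is built from hostname:port strings, the same queue can hash to different shards in eqiad and codfw, whereas a ring built from the stable labels maps identically in both datacenters. All hostnames are placeholders.

```python
import hashlib

def shard_for(queue_name, shard_identifiers):
    """Toy consistent hashing: pick the first ring point at or above the
    hashed queue name, wrapping around to the start of the ring."""
    def h(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)
    ring = sorted((h(ident), idx) for idx, ident in enumerate(shard_identifiers))
    target = h(queue_name)
    for point, idx in ring:
        if point >= target:
            return idx
    return ring[0][1]

labels = ['rdb1', 'rdb2', 'rdb3', 'rdb4']  # stable labels, identical in both DCs
eqiad = ['rdb1001.eqiad.wmnet:6379', 'rdb1003.eqiad.wmnet:6379',
         'rdb1005.eqiad.wmnet:6379', 'rdb1007.eqiad.wmnet:6379']  # placeholders
codfw = ['rdb2001.codfw.wmnet:6379', 'rdb2003.codfw.wmnet:6379',
         'rdb2005.codfw.wmnet:6379', 'rdb2007.codfw.wmnet:6379']  # placeholders

queue = 'enwiki:jobqueue:refreshLinks'
print(shard_for(queue, labels))                           # same index in either DC
print(shard_for(queue, eqiad), shard_for(queue, codfw))   # these two may disagree
```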

> mediawiki goes read-only - this should ensure no new job gets enqueued, right?

That is what I would expect, but I could not verify it the last time I did this, as I did not stop the queues at that time (so I could not distinguish traffic from new jobs vs. jobs already enqueued).

ori added subscribers: ori, aaron.
Joe changed the status of subtask T129317: Dedicate 1/2 codfw jobrunners to gwtoolset jobs from Open to Stalled.Mar 14 2016, 6:40 PM
aaron added a comment.Mar 17 2016, 8:38 PM

See also T128730 for comments on doing single-DC maintenance.

aaron added a comment.Mar 17 2016, 8:47 PM

> While complex and requiring a series of manual steps, this should be manageable. I'd like others to comment on this plan: the main uncertainty for me is how shards are calculated: if anywhere we use hostname/port instead of a label for sharding, it might be a problem.
>
> Apart from that, we should be sure that jobs will be correctly executed when restarted.

The steps above make sense. As for sharding, the 'rdb*' tag names in config are meant to have servers with the same data in each DC for simplicity. So rdb1 in eqiad would have the same data as rdb1 in codfw, whatever the actual host names are.

Strictly speaking, even the tag names don't have to match, as long as all 4 eqiad servers have codfw "masters" with the same data. So you could have rdb0-3 in one DC and rdb1-4 in another, though that would be needlessly confusing.
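
Pictured as a hypothetical sketch (hostnames are placeholders), the layout described above means every rdb* tag resolves to a different host per datacenter but always refers to the same data set, so anything keyed on the tag survives a switchover unchanged:

```python
# Same tag, same data, different host per datacenter.
SHARD_TAGS = {
    'rdb1': {'eqiad': 'rdb1001.eqiad.wmnet', 'codfw': 'rdb2001.codfw.wmnet'},
    'rdb2': {'eqiad': 'rdb1003.eqiad.wmnet', 'codfw': 'rdb2003.codfw.wmnet'},
    'rdb3': {'eqiad': 'rdb1005.eqiad.wmnet', 'codfw': 'rdb2005.codfw.wmnet'},
    'rdb4': {'eqiad': 'rdb1007.eqiad.wmnet', 'codfw': 'rdb2007.codfw.wmnet'},
}

def host_for(tag: str, dc: str) -> str:
    """Resolve a shard tag to the concrete redis host for one datacenter."""
    return SHARD_TAGS[tag][dc]

# 'rdb1' keeps meaning the same shard, whichever datacenter is primary.
print(host_for('rdb1', 'eqiad'), host_for('rdb1', 'codfw'))
```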

Joe added a comment.Mar 23 2016, 3:41 PM

All the blockers I listed earlier have been removed. Resolving and moving documentation to https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Job_queue

Joe added a comment.Mar 24 2016, 8:55 AM

The procedure is simpler than expected: redis can detect circular replication and will simply refuse to communicate with its slave if that slave is also a master. So we should be safe during the transition even if the puppet runs are not all in sync.

Joe moved this task from Backlog to Done on the codfw-rollout-Jan-Mar-2016 board.Mar 31 2016, 11:37 AM
Krinkle closed this task as Resolved.Apr 21 2016, 3:18 PM
Krinkle claimed this task.