
Rationalize our jobqueues redis topology
Closed, Declined · Public

Description

Right now our redis infrastructure for the jobqueues is as follows:

  • a set of 4 master machines, each running 4 redis instances, in the currently active DC
  • a set of 4 more machines in the same DC that just replicate from the active ones
  • a set of 3 machines in the inactive DC replicating from the masters; one of those hosts 8 redis instances
  • a set of 3 machines in the inactive DC replicating from the local masters (a sketch for auditing this layout follows)
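
For concreteness, here is a minimal sketch (in PHP with phpredis, not the actual WMF tooling) of how one could audit this layout by asking each instance for its replication role; the hostnames and ports are illustrative placeholders, not the real production hosts.

```php
<?php
// Ask each redis instance who it is and who it replicates from.
// Hostnames/ports below are hypothetical placeholders.
$instances = [
    'rdb-eqiad-master1.example:6379', // active-DC master (one of 4x4 instances)
    'rdb-eqiad-slave1.example:6379',  // active-DC slave
    'rdb-codfw1.example:6379',        // inactive-DC slave
];

foreach ( $instances as $addr ) {
    list( $host, $port ) = explode( ':', $addr );
    $redis = new Redis();
    $redis->connect( $host, (int)$port, 1.0 );
    $info = $redis->info( 'replication' ); // INFO replication
    $master = ( $info['role'] === 'slave' )
        ? "{$info['master_host']}:{$info['master_port']}"
        : '-';
    printf( "%-35s role=%-6s replicates-from=%s\n", $addr, $info['role'], $master );
    $redis->close();
}
```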

So we're actively using only 4 of the 8 machines in the active DC (4 of the 14 overall), and those 4 are overloaded from time to time.

Given that the jobqueue redis data contains no PII, my proposal would be to do as follows:

  • Stop all the intra-DC replication
  • Promote 2 of the current slaves in eqiad to be masters as well, to relieve some of the pressure on the current ones
  • Move the other 2 slaves to the spares pool, or use them as a generic redis master-slave pair
  • Promote the 3 slaves in codfw to act as masters as well, replicating from the masters in eqiad (see the sketch after this list)
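
On the redis side these steps boil down to SLAVEOF commands; a minimal sketch with phpredis, using placeholder hostnames:

```php
<?php
// Sketch of the proposed reconfiguration; hostnames are hypothetical.

// Promote one of the current eqiad slaves to a master, so it can take
// some write load off the existing masters:
$eqiad = new Redis();
$eqiad->connect( 'rdb-eqiad-slave1.example', 6379 );
$eqiad->slaveof(); // no arguments = SLAVEOF NO ONE, i.e. become a master

// Have a codfw machine replicate cross-DC from an eqiad master:
$codfw = new Redis();
$codfw->connect( 'rdb-codfw1.example', 6379 );
$codfw->slaveof( 'rdb-eqiad-master1.example', 6379 );
```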

In case of a failure, we can configure MediaWiki to talk to the slave in the other DC without much harm at the end of the day; this will both simplify our infrastructure and allow for better performance.
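
Before flipping MediaWiki over, it would be prudent to confirm the other-DC copy is actually in sync; a minimal sketch of such a check (placeholder hostname):

```php
<?php
// Check that the codfw copy has a healthy replication link before
// pointing MediaWiki at it. Hostname is a hypothetical placeholder.
$replica = new Redis();
$replica->connect( 'rdb-codfw1.example', 6379 );
$info = $replica->info( 'replication' );

if ( $info['role'] === 'slave' && $info['master_link_status'] === 'up' ) {
    echo "In sync with {$info['master_host']}; safe to fail over.\n";
} else {
    echo "Replication link down (or host is not a slave); investigate first.\n";
}
```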

Event Timeline

Joe triaged this task as Medium priority. May 12 2016, 12:05 PM

Yes, I think that makes sense. Thanks for suggesting it.

Seems sensible to me. We should document that on a manual fail-over (say, of a server in eqiad) we need to:
a) switch that logical partition over to the other DC
b) lower the "weight" in the jobqueue.php config for that partition, to avoid hitting latency too much on push()/pop() while keeping it active for existing jobs (a sketch follows)
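
For illustration, assuming the JobQueueFederated setup that MediaWiki core provides (per-partition weights plus per-partition redis config), steps (a) and (b) could look roughly like this in jobqueue.php; partition names, hosts and weights are made up, not the real production values:

```php
<?php
// Hypothetical jobqueue.php excerpt; values are illustrative only.
$wgJobQueueConf['default'] = [
    'class' => 'JobQueueFederated',
    'sectionsByWiki' => [],
    'partitionsBySection' => [
        'default' => [
            'rdb1' => 40,
            'rdb2' => 40,
            'rdb3' => 5, // (b) lowered weight: few new jobs get pushed here,
                         // but the partition stays active for existing jobs
        ],
    ],
    'configByPartition' => [
        'rdb1' => [ /* ... healthy eqiad partition ... */ ],
        'rdb2' => [ /* ... */ ],
        'rdb3' => [
            'class' => 'JobQueueRedis',
            // (a) partition switched over to its codfw copy after the
            // eqiad master failed:
            'redisServer' => 'rdb-codfw3.example:6379',
            'redisConfig' => [],
        ],
    ],
];
```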