Right now our redis infrastructure for the jobqueues is as follows:
- a set of 4 master machines in the currently active DC, each running 4 redis instances
- a set of 4 more machines in the same DC that just replicate from the active ones
- a set of 3 machines in the inactive DC, replicating from the masters in the active DC; one of those hosts 8 redis instances
- a set of 3 machines replicating from the local masters, in the inactive DC
So we're currently using 4 machines out of 8 in the active DC and 4 out of 14 overall; those 4 machines are overloaded from time to time.
Given that the jobqueue redis data contains no PII, my proposal is to do the following:
- Stop all the intra-dc replication
- Promote 2 of the current slaves in eqiad to be masters as well, to relieve some of the pressure on the current ones
- Move the other 2 slaves to the spares pool, or use them as a generic redis master-slave pair
- Promote the 3 slaves in codfw to be masters as well, and have them replicate from the masters in eqiad
In case of a failure, we can configure mediawiki to talk to the slave in the other DC without much harm. At the end of the day, this will both simplify our infrastructure and allow better performance.
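
To make the numbers concrete, here is a rough sketch that tallies machines per role before and after the change. The machine counts for the current layout come from the list above; the proposed layout is my reading of the plan (6 masters and 2 spares in eqiad, and all 6 codfw machines replicating directly from eqiad), so treat that part as an assumption. The role labels and the `summarize` helper are made up for illustration only.

```python
# Machine counts per (datacenter state, role), taken from the description above.
CURRENT = {
    ("active", "master"): 4,
    ("active", "replica"): 4,
    ("inactive", "replica of active masters"): 3,
    ("inactive", "replica of local masters"): 3,
}

# Proposed layout. ASSUMPTION: after the change all 6 machines in the
# inactive DC replicate directly from the 6 masters in the active DC,
# and the 2 remaining active-DC machines go to the spares pool.
PROPOSED = {
    ("active", "master"): 6,
    ("active", "spare"): 2,
    ("inactive", "replica of active masters"): 6,
}


def summarize(layout):
    """Return (machines acting as masters in the active DC, total machines)."""
    total = sum(layout.values())
    masters = sum(
        count
        for (dc, role), count in layout.items()
        if dc == "active" and role == "master"
    )
    return masters, total


if __name__ == "__main__":
    for name, layout in (("current", CURRENT), ("proposed", PROPOSED)):
        masters, total = summarize(layout)
        print(f"{name}: {masters} of {total} machines take jobqueue traffic")
```

Under these assumptions, the number of machines taking jobqueue traffic goes from 4 of 14 to 6 of 14, with no new hardware.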