We are currently using a very redundant replication scheme for the Job Queues: each shard is a Redis master instance that replicates to an intra-DC replica and to an inter-DC replica in codfw.
For example, rdb1001 runs four Redis instances:
- redis-instance-tcp_6378
- redis-instance-tcp_6379
- redis-instance-tcp_6380
- redis-instance-tcp_6381
Each instance is replicated to rdb1002 and rdb2001 via the Redis replication protocol (which is currently under investigation in T163337 to eliminate inconsistencies for some data structures).
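For reference, attaching these replicas boils down to issuing SLAVEOF on each replica instance. A minimal sketch with redis-py, assuming the usual .wmnet FQDNs and the ports listed above (in production this is of course driven by configuration management, not run by hand):

```
import redis

MASTER_HOST = "rdb1001.eqiad.wmnet"            # active master from the example
REPLICA_HOSTS = ["rdb1002.eqiad.wmnet",        # intra-DC replica
                 "rdb2001.codfw.wmnet"]        # inter-DC replica
PORTS = [6378, 6379, 6380, 6381]               # one Redis instance per port

for host in REPLICA_HOSTS:
    for port in PORTS:
        # SLAVEOF <master_host> <port>: start replicating this instance.
        redis.Redis(host=host, port=port).slaveof(MASTER_HOST, port)
```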
So to generalize: rdb100[1357] run 4 Redis instances each (active masters), which replicate to rdb100[2468] and rdb200[135] (rdb2007 does not exist; rdb2005 runs 8 Redis instances). If one of the hosts running active master instances fails, this is the failover procedure:
- MediaWiki should automatically detect the failure and submit the jobs to another shard (not a permanent remapping, but a try/catch fallback each time a shard is unreachable).
- Ops/Core-Deployers should modify ProductionServices.php in mw-config to permanently remove the shards that are no longer available, pointing them to the local slave replicas (which are rw-enabled, so capable of receiving writes and not only reads); see the sanity check sketched after this list.
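Before the shard map in ProductionServices.php (a PHP file; the sketch below is illustrative Python only) is repointed, one could verify that the local replica really is rw-enabled, i.e. that slave-read-only is off and a write actually succeeds:

```
import redis

# Hypothetical check before pointing the "rdb1001" shards at rdb1002:
# the replica must accept writes to be able to take over.
r = redis.Redis(host="rdb1002.eqiad.wmnet", port=6379)

ro = r.config_get("slave-read-only").get("slave-read-only")
assert ro == "no", "replica is read-only and cannot take over writes"

r.set("jobqueue:failover-probe", "ok")    # an actual write must succeed
print(r.info("replication")["role"])      # still reports "slave", but writable
```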
At any given time (considering eqiad as the active DC) only 4 Redis hosts are actively serving traffic, while the following hosts are doing basically nothing except replicating:
- 4 local-DC replicas in eqiad
- 3 inter-DC replicas in codfw (considering that rdb2005 runs 8 replica instances because rdb2007 does not exist)
- 3 codfw replicas of the inter-DC replicas (for example, rdb2005 is a replica of rdb1005 and rdb2006 is a replica of rdb2005).
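To make the chain concrete, each host's role and upstream master can be read from INFO replication. A sketch (redis-py, assumed FQDNs, port chosen arbitrarily) that walks the rdb1005 → rdb2005 → rdb2006 chain:

```
import redis

# Walk one replication chain for a single instance (port assumed).
for host in ("rdb1005.eqiad.wmnet",    # active master
             "rdb2005.codfw.wmnet",    # inter-DC replica of rdb1005
             "rdb2006.codfw.wmnet"):   # codfw replica of rdb2005
    info = redis.Redis(host=host, port=6379).info("replication")
    upstream = info.get("master_host", "-")   # masters have no upstream
    print(f"{host}: role={info['role']} upstream={upstream}")
```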
This model ensures that we can sustain a complete failure of the whole active job queue cluster, but it may be overkill considering the amount of resources that sit basically idle all the time. Add to the mix that some hardware (like rdb100[78]) is really old and needs to be decommissioned.
This task has been opened to investigate whether a more efficient replication strategy could allow better hardware usage at any given time.
An alternative could be to remove the local-DC replicas and define a new failure scenario for when a rdb100X host fails, for example:
- force the jobrunners to consume jobs directly from rdb200X (this would need an IPsec tunnel mesh between each mw/rdb host pair).
- use a spare rdb host to replicate on the fly from rdb200X and point the jobrunners to consume temporarily from it (see the sketch after this list).
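For the second option, warming up a spare host is cheap in principle: attach it as a replica of the surviving codfw instance and wait for the initial sync before repointing the jobrunners. A sketch under the same assumptions (redis-py; the spare hostname is made up):

```
import time
import redis

# Hypothetical spare host warming up from the surviving codfw replica.
spare = redis.Redis(host="rdb1009.eqiad.wmnet", port=6379)  # spare (hypothetical)
spare.slaveof("rdb2001.codfw.wmnet", 6379)

# Block until the replication link is up and the initial sync is done.
while spare.info("replication").get("master_link_status") != "up":
    time.sleep(1)

print("spare is in sync; jobrunners can temporarily consume from it")
```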
Of course the above options are not as robust as the current model, but we'd need to give up something to reach a good compromise between redundancy and efficient hardware usage.