We are currently using a very redundant replication scheme for the Job Queues: each shard is a Redis master instance that replicates to an intra-DC replica and to an inter-DC replica in codfw.
For example, rdb1001 runs four Redis instances:
- redis-instance-tcp_6378
- redis-instance-tcp_6379
- redis-instance-tcp_6380
- redis-instance-tcp_6381
Each instance is replicated to rdb1002 and rdb2001 via the Redis replication protocol (which is currently under investigation in T163337 to eliminate inconsistencies for some data structures).
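For reference, attaching these replicas boils down to issuing SLAVEOF on each replica instance. A minimal sketch with redis-py, assuming the usual .wmnet FQDNs and the ports listed above (in production this is of course driven by configuration management, not run by hand):

```
import redis

MASTER_HOST = "rdb1001.eqiad.wmnet"            # active master from the example
REPLICA_HOSTS = ["rdb1002.eqiad.wmnet",        # intra-DC replica
                 "rdb2001.codfw.wmnet"]        # inter-DC replica
PORTS = [6378, 6379, 6380, 6381]               # one Redis instance per port

for host in REPLICA_HOSTS:
    for port in PORTS:
        # SLAVEOF <master_host> <port>: start replicating this instance.
        redis.Redis(host=host, port=port).slaveof(MASTER_HOST, port)
```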
So to generalize: rdb100[1357] run 4 Redis instances each (active masters), which replicate to rdb100[2468] and rdb200[135] (rdb2007 does not exist; rdb2005 runs 8 Redis instances). If one of the hosts running active master instances fails, this is the failover procedure:
- MediaWiki should automatically detect the failure and submit the jobs to another shard (not a permanent remapping, but a try/catch fallback each time a shard is unreachable).
- Ops/Core-Deployers should modify ProductionServices.php in mw-config to permanently remove the shards that are no longer available, pointing them to the local slave replicas (which are rw-enabled, so capable of receiving writes and not only reads); see the sanity check sketched after this list.
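Before the shard map in ProductionServices.php (a PHP file; the sketch below is illustrative Python only) is repointed, one could verify that the local replica really is rw-enabled, i.e. that slave-read-only is off and a write actually succeeds:

```
import redis

# Hypothetical check before pointing the "rdb1001" shards at rdb1002:
# the replica must accept writes to be able to take over.
r = redis.Redis(host="rdb1002.eqiad.wmnet", port=6379)

ro = r.config_get("slave-read-only").get("slave-read-only")
assert ro == "no", "replica is read-only and cannot take over writes"

r.set("jobqueue:failover-probe", "ok")    # an actual write must succeed
print(r.info("replication")["role"])      # still reports "slave", but writable
```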
At any given time (considering eqiad as the active DC) only 4 Redis hosts are actively serving traffic, while the following hosts are doing basically nothing except replicating:
- 4 local-DC replicas in eqiad
- 3 inter-DC replicas in codfw (considering that rdb2005 runs 8 replica instances because rdb2007 does not exist)
- 3 codfw replicas of the inter-DC replicas (for example, rdb2005 is a replica of rdb1005 and rdb2006 is a replica of rdb2005).
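To make the chain concrete, each host's role and upstream master can be read from INFO replication. A sketch (redis-py, assumed FQDNs, port chosen arbitrarily) that walks the rdb1005 → rdb2005 → rdb2006 chain:

```
import redis

# Walk one replication chain for a single instance (port assumed).
for host in ("rdb1005.eqiad.wmnet",    # active master
             "rdb2005.codfw.wmnet",    # inter-DC replica of rdb1005
             "rdb2006.codfw.wmnet"):   # codfw replica of rdb2005
    info = redis.Redis(host=host, port=6379).info("replication")
    upstream = info.get("master_host", "-")   # masters have no upstream
    print(f"{host}: role={info['role']} upstream={upstream}")
```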
This model ensures that we can sustain a complete failure of the whole active job queue cluster, but it may be overkill considering the amount of resources that sit basically idle all the time. Add to the mix that some hardware (like rdb100[78]) is really old and needs to be decommissioned.
This task has been opened to investigate whether a more efficient replication strategy could allow better hardware usage at any given time.
An alternative could be to remove the local-DC replicas and define a new failure scenario for when a rdb100X host fails, for example:
- force the jobrunners to consume jobs directly from rdb200X (this would need an IPsec tunnel mesh between each mw/rdb host pair).
- use a spare rdb host to replicate on the fly from rdb200X and point the jobrunners to consume temporarily from it (see the sketch after this list).
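For the second option, warming up a spare host is cheap in principle: attach it as a replica of the surviving codfw instance and wait for the initial sync before repointing the jobrunners. A sketch under the same assumptions (redis-py; the spare hostname is made up):

```
import time
import redis

# Hypothetical spare host warming up from the surviving codfw replica.
spare = redis.Redis(host="rdb1009.eqiad.wmnet", port=6379)  # spare (hypothetical)
spare.slaveof("rdb2001.codfw.wmnet", 6379)

# Block until the replication link is up and the initial sync is done.
while spare.info("replication").get("master_link_status") != "up":
    time.sleep(1)

print("spare is in sync; jobrunners can temporarily consume from it")
```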
Of course the above options are not as robust as the current model, but we'd need to give up something to reach a good compromise between redundancy and efficient hardware usage.