Rationalize our jobqueues redis topology
Closed, Declined
Public

Assigned To

None

Authored By

	Joe
	May 12 2016, 11:09 AM

Description

Right now our redis infrastructure for the jobqueues is as follows:

a set of 4 master machines each with 4 redis instances running in the currently active DC
a set of 4 more machines in the same DC that just replicate from the active ones
a set of 3 machines replicating from the masters, but located in the inactive DC; one of those hosts 8 redis instances
a set of 3 machines replicating from the local masters, in the inactive DC

So we're currently using 4 machines out of 8 in the active DC and 4 out of 14 overall; those machines are overloaded from time to time.
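The utilization figures above can be checked with a short tally. This is only a sketch of the arithmetic: the group labels and the `in_use` flag are names made up for illustration, and the machine counts are the ones listed in the description.

```python
# Machine groups as listed in the description. Labels are hypothetical;
# "in_use" marks the machines that currently act as masters.
groups = [
    {"label": "masters", "dc": "active", "machines": 4, "in_use": True},
    {"label": "local replicas", "dc": "active", "machines": 4, "in_use": False},
    {"label": "replicas of the active masters", "dc": "inactive", "machines": 3, "in_use": False},
    {"label": "replicas of the local masters", "dc": "inactive", "machines": 3, "in_use": False},
]

in_use = sum(g["machines"] for g in groups if g["in_use"])
active_dc_total = sum(g["machines"] for g in groups if g["dc"] == "active")
overall_total = sum(g["machines"] for g in groups)

print(f"{in_use} of {active_dc_total} machines in the active DC are in use")
print(f"{in_use} of {overall_total} machines overall are in use")
```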

Given that the jobqueue redis data contains no PII, my proposal is as follows:

Stop all the intra-DC replication
Promote 2 of the current slaves in eqiad to masters as well, to relieve some of the pressure on the current ones
Move the other 2 slaves to the spares pool, or use them as a generic redis master-slave pair
Promote the 3 slaves in codfw to masters as well, and have them replicate from the masters in eqiad

In case of a failure, we can configure MediaWiki to talk to the slave in the other DC without much harm. This will both simplify our infrastructure and allow better performance.
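As a rough illustration of how much pressure the promotion in eqiad might relieve, here is a back-of-the-envelope sketch. It assumes load is spread evenly across masters and that the promoted machines are equivalent to the current ones; neither assumption is stated in the task, and the variable names are hypothetical.

```python
from fractions import Fraction

masters_before = 4   # current masters in the active DC
promoted = 2         # slaves promoted to masters under the proposal
masters_after = masters_before + promoted

# Assumption: load is split evenly across all masters.
share_before = Fraction(1, masters_before)
share_after = Fraction(1, masters_after)

# Relative reduction in the load each master has to carry.
relief = 1 - share_after / share_before

print(f"per-master share: {share_before} -> {share_after} ({relief} less)")
```

Under those assumptions each master would carry about a third less load; the real effect depends on how partitions are actually sharded across hosts.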

Related Objects

Mentioned Here: T206016: Create a service for session storage
T267581: Phase out "redis_sessions" cluster and away from memcached cluster
T280582: Reduce number of shards in redis_sessions cluster
T198220: Stop and remove old job runners

Event Timeline

Joe created this task. May 12 2016, 11:09 AM

Restricted Application added subscribers: Zppix, Aklapper. May 12 2016, 11:09 AM

Joe added subscribers: aaron, ori, faidon. May 12 2016, 11:10 AM

Joe triaged this task as Medium priority. May 12 2016, 12:05 PM

Yes, I think that makes sense. Thanks for suggesting it.

Seems sensible to me. We should document that, on manual fail-over (say, of a server in eqiad), the steps are to:
a) switch that logical partition over to the other DC
b) lower the "weight" in the jobqueue.php config for that partition, to avoid hitting latency too much on push()/pop(), but keep it active for existing jobs
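To illustrate step b), here is a minimal sketch of weighted partition selection. It is a generic weighted random pick, not MediaWiki's actual implementation, and the partition names, weights and helper functions are all hypothetical.

```python
import random

# Hypothetical partition names and weights; not the real jobqueue.php values.
weights = {"rdb1": 50, "rdb2": 50, "rdb3": 50, "rdb4": 50}


def push_share(weights, partition):
    """Expected fraction of new pushes that land on `partition`."""
    return weights[partition] / sum(weights.values())


def pick_partition(weights, rng=random):
    """Pick a partition for a new push, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


before = push_share(weights, "rdb1")

# After failing rdb1 over to the other DC, lower its weight instead of
# removing it, so that jobs already queued there can still be popped.
weights["rdb1"] = 10
after = push_share(weights, "rdb1")

print(f"share of new pushes to rdb1: {before:.4f} -> {after:.4f}")
```

Keeping a small non-zero weight means the partition stays in rotation while most new pushes avoid the cross-DC latency.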

MZMcBride subscribed. May 22 2016, 12:48 AM

Krinkle removed a project: MediaWiki-Core-JobQueue. Jul 7 2017, 10:26 PM

Krinkle moved this task from Untriaged to Legacy infra on the WMF-JobQueue board. Jul 11 2018, 3:03 AM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:40 PM

Is this ticket obsolete now that T198220: Stop and remove old job runners is resolved?