Decommission the "session redis" cluster
Closed, Duplicate
Public

Assigned To

Authored By

	Joe
	Jan 23 2020, 3:26 PM

Description

We're currently storing sessions and echo last seen messages both in kask/cassandra and the "redis sessions" cluster.

For the switchover, we want to avoid switching the replication direction of Redis, which is risky and generally loses some data. Thus we need to configure MediaWiki to use only echostore and sessionstore instead of the legacy Redis cluster.
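To make the intended change concrete, here is a minimal sketch of moving from dual writes to a single backend. It is not MediaWiki configuration: the class names `DictStore` and `MultiWriteStore`, the keys, and the in-memory backends are all hypothetical stand-ins, and the task does not say how the dual write is actually implemented.

```python
class DictStore:
    """In-memory stand-in for a key-value backend (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class MultiWriteStore:
    """Writes go to every backend; reads return the first value found."""

    def __init__(self, backends):
        self.backends = list(backends)

    def set(self, key, value):
        for backend in self.backends:
            backend.set(key, value)

    def get(self, key):
        for backend in self.backends:
            value = backend.get(key)
            if value is not None:
                return value
        return None


# Hypothetical backends standing in for sessionstore (Kask) and legacy Redis.
kask = DictStore("sessionstore")
redis = DictStore("redis_sessions")

# Current state described in the task: every write lands in both stores.
dual = MultiWriteStore([kask, redis])
dual.set("session:abc", {"user": 1})

# Target state: only the Cassandra-backed store is written to, so the
# Redis replication direction no longer matters for session data.
single = MultiWriteStore([kask])
single.set("session:def", {"user": 2})
```

Once the second configuration is live, nothing new reaches the legacy cluster, which is the precondition for not having to flip its replication during a switchover.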

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		RLazarus	T243314 FY2020-2021 Q1 DC switchover and switchback
		Duplicate		RLazarus	T243520 Decommission the "session redis" cluster

Event Timeline

Joe created this task. Jan 23 2020, 3:26 PM

If this is what MediaWiki's MainStash is using, then this is also used by chronology protector. We'd have to move it to something else. Pinging @aaron for that.

WDoranWMF added a project: Performance-Team. Feb 10 2020, 4:04 PM

WDoranWMF moved this task from Inbox to Tracking/Watching on the Platform Engineering board.

Krinkle moved this task from Inbox, needs triage to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board. Feb 10 2020, 9:15 PM

There's a Google Doc about other thoughts and use cases around Main Stash (WMF restricted):

https://docs.google.com/document/d/1tX8ekiYb3xYgpNJsmA1SiKqzkWc0F-_E4SGx6BI72vA/edit

In a nutshell:

Sessions moved from replicated Redis (Main Stash) to (new) Cassandra-based store (Kask).
Echo moved from replicated Redis to another (new) Cassandra-based store (EchoStore).
Still remaining use cases include chronology protector.
Chronology Protector is highly latency-sensitive (at least as sensitive as sessions, perhaps more so) as it is unconditionally involved in all user-facing web requests even if no session data needs to be read.
Aside from latency requirements, ChronologyProtector can't use sql-objectcache in the main database, because it exists to track and wait for db replication of that very same database.
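The constraint in the last two points can be illustrated with a toy model. This is only a sketch of the general idea (remember the primary position a client last wrote at, and make later reads wait until a replica has reached it); the names `Replica`, `PositionStash`, `handle_write` and `handle_read` are hypothetical and are not MediaWiki's actual classes or API.

```python
class Replica:
    """Toy database replica that tracks how far replication has applied."""

    def __init__(self):
        self.applied_pos = 0

    def catch_up_to(self, pos):
        # Stand-in for blocking until replication reaches `pos`.
        self.applied_pos = max(self.applied_pos, pos)


class PositionStash:
    """Toy store for per-client positions; must live outside the database."""

    def __init__(self):
        self._positions = {}

    def save(self, client_id, pos):
        self._positions[client_id] = pos

    def load(self, client_id):
        return self._positions.get(client_id, 0)


def handle_write(client_id, primary_pos, stash):
    """After a write, remember the primary position the client has seen."""
    stash.save(client_id, primary_pos)


def handle_read(client_id, replica, stash):
    """Before reading, wait for the replica to reach the remembered position.

    Returns True if the request had to wait for replication.
    """
    wanted = stash.load(client_id)
    waited = replica.applied_pos < wanted
    if waited:
        replica.catch_up_to(wanted)
    return waited
```

In this model `stash.load` runs on every read request, before the replica is known to be fresh, which is why the lookup has to be fast. If the stash itself lived in the replicated database, the position could be missing from the very replica that is lagging, so the check would defeat itself.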

Some of the options I'm aware of:

Replace replicated-redis with a third (new) Cassandra-based store for generic Main Stash use cases, including ChronologyProtector.
Replace replicated-redis with a (new) replicating-memcached cluster (powered by Mcrouter), for Main Stash, including ChronologyProtector.
Migrate ChronologyProtector to (new) replicating-memcached cluster (with Mcrouter). And configure MediaWiki to fold Main Stash into the generic db-replicated BagOStuff. MediaWiki has an objectcache table in production that is currently rarely used, and would make a good fit at low-maintenance/low-cost for whatever misc stuff uses Main Stash still. Looking at "WMF deployed" code search there are currently no call sites aside from Chronology Protector, so this would only be to satisfy the interface requirement.

Krinkle added a subtask: T212129: Move MainStash out of Redis to a simpler multi-dc aware solution. Jun 2 2020, 5:21 PM

Tentatively adding T212129 as a subtask, but I think this task is trying to be two things at once, only one of which is likely intended.

(Task title) Decom "redis session" (aka mainstash) cluster.

Literally speaking, that's blocked on T212129 and is more about mainstash than Sessionstore/Echostore at this point.

(Task description and parent) Finish migration of Sessionstore and Echostore to Cassandra, so that we can do switchovers more simply.

Echo is done. Session store is tracked at T206016, and is still a work in progress. Once that is done, I think the intent of this task is possibly resolved, since the current Redis configuration works fine for MainStash afaik. It does need the replication direction flipped, though. But limited data loss would be fine there.

Should this task also be blocked on moving MainStash itself to a different backend (T212129)?

Krinkle mentioned this in T254634: Determine and implement multi-dc strategy for ChronologyProtector. Jun 6 2020, 3:42 AM

Krinkle changed the status of subtask T212129: Move MainStash out of Redis to a simpler multi-dc aware solution from Open to Stalled. Jul 6 2020, 8:36 PM

BPirkle mentioned this in T267270: Determine multi-dc strategy for CentralAuth. Nov 4 2020, 9:19 PM

jijiki added a subtask: T254634: Determine and implement multi-dc strategy for ChronologyProtector. Dec 10 2020, 11:52 AM

jijiki changed the status of subtask T212129: Move MainStash out of Redis to a simpler multi-dc aware solution from Stalled to Open. Dec 14 2020, 1:14 PM

RhinosF1 subscribed. Jan 12 2021, 11:22 AM

Krinkle closed subtask T254634: Determine and implement multi-dc strategy for ChronologyProtector as Resolved. Mar 19 2021, 8:29 PM

jijiki closed this task as a duplicate of T267581: Phase out "redis_sessions" cluster and away from memcached cluster. Apr 19 2021, 6:41 PM

tstarling closed subtask T212129: Move MainStash out of Redis to a simpler multi-dc aware solution as Resolved. Jun 17 2022, 1:38 AM

Krinkle removed a subtask: T254634: Determine and implement multi-dc strategy for ChronologyProtector. Aug 23 2022, 2:14 PM

Krinkle removed a subtask: T212129: Move MainStash out of Redis to a simpler multi-dc aware solution.