
Decommission the "session redis" cluster
Open, Needs Triage, Public

Description

We currently store sessions and Echo last-seen messages in both Kask/Cassandra and the "redis sessions" cluster.

For the switchover, we want to avoid switching the replication direction of Redis, which is risky and generally loses data. We therefore need to configure MediaWiki to use only echostore and sessionstore instead of the legacy Redis cluster.

Related Objects

Status | Subtype | Assigned
Open | | None
Open | | None
Open | | Krinkle
Resolved | | RLazarus
Open | | RLazarus
Open | | aaron
Resolved | | Eevans
Resolved | | aaron
Open | | None
Resolved | | Papaul
Open | | None
Declined | | None
Resolved | | Marostegui
Resolved | | Jclark-ctr
Resolved | | Marostegui
Resolved | | Marostegui
Resolved | Request | wiki_willy
Open | | Marostegui
Open | | Trizek-WMF
Resolved | | Cmjohnson
Open | | Jclark-ctr
Open | | Marostegui
Open | | None
Open | | BPirkle
Open | | Marostegui
Open | | None

Event Timeline

Joe created this task. Jan 23 2020, 3:26 PM

If this is what MediaWiki's MainStash is using, then this is also used by chronology protector. We'd have to move it to something else. Pinging @aaron for that.

There's a Google Doc about other thoughts and use cases around Main Stash (WMF restricted):

https://docs.google.com/document/d/1tX8ekiYb3xYgpNJsmA1SiKqzkWc0F-_E4SGx6BI72vA/edit

In a nutshell:

  • Sessions moved from replicated Redis (Main Stash) to (new) Cassandra-based store (Kask).
  • Echo moved from replicated Redis to another (new) Cassandra-based store (EchoStore).
  • Still remaining use cases include chronology protector.
  • Chronology Protector is highly latency-sensitive (at least as sensitive as sessions, perhaps more so) as it is unconditionally involved in all user-facing web requests, even if no session data needs to be read.
  • Aside from latency requirements, ChronologyProtector can't use sql-objectcache in the main database, because it exists to track and wait for db replication of that very same database.
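To make the last two points concrete, here is a minimal Python sketch of the idea behind ChronologyProtector as described above. It is not MediaWiki's implementation: the class names (`PositionStash`, `Replica`, `ChronologyGuard`) and the integer replication positions are hypothetical simplifications.

```python
import time


class PositionStash:
    """Hypothetical stand-in for the Main Stash: a small key-value store
    that lives outside the main database."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class Replica:
    """Hypothetical replica whose applied position may lag the primary."""

    def __init__(self):
        self.applied_pos = 0


class ChronologyGuard:
    """Remembers the primary position of a client's last write and makes
    later reads wait until the chosen replica has caught up to it."""

    def __init__(self, stash):
        self.stash = stash

    def record_write(self, client_id, primary_pos):
        self.stash.set(client_id, primary_pos)

    def wait_for_replica(self, client_id, replica, timeout=1.0, poll=0.01):
        # The stash is consulted on every request, even when no session
        # data is read, which is why its latency matters so much.
        wanted = self.stash.get(client_id)
        if wanted is None:
            return True  # no recent write by this client, nothing to wait for
        deadline = time.monotonic() + timeout
        while replica.applied_pos < wanted:
            if time.monotonic() >= deadline:
                return False  # replica is still behind after the timeout
            time.sleep(poll)
        return True
```

If the positions were kept in a table in the main database instead of a separate stash, reading them from a lagging replica could itself return stale data, which is the circular dependency the last bullet describes.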

Some of the options I'm aware of:

  • Replace replicated-redis with a third (new) Cassandra-based store for generic Main Stash use cases, including ChronologyProtector.
  • Replace replicated-redis with a (new) replicating-memcached cluster (powered by Mcrouter) for Main Stash, including ChronologyProtector.
  • Migrate ChronologyProtector to a (new) replicating-memcached cluster (with Mcrouter), and configure MediaWiki to fold Main Stash into the generic db-replicated BagOStuff. MediaWiki has an objectcache table in production that is currently rarely used, and it would be a good low-maintenance, low-cost fit for whatever misc stuff still uses Main Stash. Looking at "WMF deployed" code search, there are currently no call sites aside from Chronology Protector, so this would only be to satisfy the interface requirement.
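The difference between the three options is which backend each consumer ends up pointing at. The Python sketch below illustrates only that wiring. Everything in it is hypothetical: `KeyValueStore`, `InMemoryStore`, `build_stores`, and the option and backend labels are placeholders, and the in-memory dictionary merely stands in for Cassandra, memcached, or the objectcache table.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical minimal interface that both consumers depend on."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Placeholder backend; the label records which real backend it stands for."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def build_stores(option: str) -> Dict[str, KeyValueStore]:
    """Return the store used by each consumer under a given option."""
    if option == "cassandra":
        # Option 1: one new Cassandra-based store for everything.
        shared = InMemoryStore("cassandra-main-stash")
        return {"main_stash": shared, "chronology_protector": shared}
    if option == "memcached":
        # Option 2: one replicating-memcached cluster for everything.
        shared = InMemoryStore("replicating-memcached")
        return {"main_stash": shared, "chronology_protector": shared}
    if option == "split":
        # Option 3: ChronologyProtector on memcached, Main Stash on the
        # db-replicated store that mostly exists to satisfy the interface.
        return {
            "main_stash": InMemoryStore("db-replicated"),
            "chronology_protector": InMemoryStore("replicating-memcached"),
        }
    raise ValueError(f"unknown option: {option}")
```

Under the third option the two consumers no longer share a backend, so the latency requirements of ChronologyProtector stop constraining where the remaining Main Stash data lives.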

Tentatively adding T212129 as a subtask, but I think this task is trying to be two things at once, only one of which is likely intended.

  1. (Task title) Decom "redis session" (aka mainstash) cluster.

Literally speaking, that's blocked on T212129 and is more about mainstash than Sessionstore/Echostore at this point.

  2. (Task description and parent) Finish migration of Sessionstore and Echostore to Cassandra, so that we can do switchovers more simply.

Echo is done. Session store is tracked at T206016 and is still a work in progress. Once that is done, I think the intent of this task is possibly resolved, since the current Redis configuration works fine for MainStash as far as I know. It does still need its replication direction flipped, but limited data loss would be fine there.

Should this task also be blocked on moving MainStash itself to a different backend (T212129)?