
Determine multi-dc strategy for ChronologyProtector
Open, Medium, Public

Description

What

ChronologyProtector is responsible for making sure that after users make changes that result in database writes (e.g. editing a page, changing their user preferences, etc.), future page loads read from a database that reflects these changes. This mainly accommodates the fact that replication from the db-master to the pool of db-replicas takes time.

The CP code is fairly small and is part of the wikimedia/rdbms library in MediaWiki.

Why
  • After saving an edit and going to the history page, your edit should always be listed there. (Even in the edge case where the article content may be lagged for parser-stampede/Michael Jackson reasons.)
  • When changing your skin from Vector to Minerva, the confirmation page seen thereafter and the next article you view should use Minerva.
How

Changes are made to the db-master. Information is read from a pool of multiple db replicas (for scale).

In general, a random replica is picked from the pool to read your preferences, render wiki content, and so on.

If you've recently submitted a form or otherwise done something that resulted in a db-master write, then ChronologyProtector will have observed the current "position" of the db-master and added it to a list of positions internally associated with your connection (ip/user-agent). A temporary cookie is set so that MW knows to look for this data, and which offset to expect. This information is then used to decide which db-replica to pick from the pool, and if needed to try another or wait briefly for it to catch up, etc.
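A minimal sketch of that flow, in Python-style pseudocode (the real code is PHP in wikimedia/rdbms; all names here are illustrative, not the actual API):

```python
import hashlib

POSITION_TTL = 10  # seconds; CP data is short-lived (see "Numbers" below)

def client_key(ip: str, user_agent: str) -> str:
    # CP keys are tied to the connection (ip/user-agent), not to a session.
    return "cp:" + hashlib.sha1(f"{ip}|{user_agent}".encode()).hexdigest()

def record_master_position(store, ip, ua, master) -> None:
    # Called at the end of a request that wrote to the db-master: remember
    # how far along the master was at that point (e.g. a binlog/GTID position).
    positions = store.get(client_key(ip, ua)) or {}
    positions[master.cluster] = master.current_position()
    store.set(client_key(ip, ua), positions, ttl=POSITION_TTL)
    # A temporary cookie is then set so MW knows to look for this key
    # (and which write index to expect) on the next request.

def pick_replica(store, ip, ua, pool, timeout=1.0):
    # Called at the start of a later request: pick a replica and, if we have
    # a stored position for its cluster, wait briefly for it to catch up
    # (or try another replica if it doesn't within the timeout).
    positions = store.get(client_key(ip, ua)) or {}
    replica = pool.pick_random()
    wanted = positions.get(replica.cluster)
    if wanted is not None:
        replica.wait_for_position(wanted, timeout)
    return replica
```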

So

This information used to be kept in the session store (both logically and physically). It used the MW interface for adding data to the current session in a subkey of the session object, much like any other session-related feature does. In 2015, CP was logically decoupled from the session store (T111264, 85c0f85e92), so that it could also be used without requiring a user session. Services such as RESTBase, ChangeProp, and JobQueue, as well as external API consumers, can use their cpPosIndex and enjoy the same guarantees, e.g. that a job will execute knowing it can read back the information from the event that spawned it. Decoupling from sessions also makes it easier to use CP across domains and in the context of db writes to shared or external databases.
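As an illustration of that job-queue guarantee (hypothetical names, not the actual JobQueue API):

```python
def enqueue_job(queue, store, master, job) -> None:
    # Store the master's current position under a fresh index, and pass
    # that index (the cpPosIndex) along with the job payload.
    pos_index = store.append_position(master.current_position())
    queue.push({"job": job, "cpPosIndex": pos_index})

def run_job(queue, store, pool) -> None:
    payload = queue.pop()
    # Before reading, wait for a replica to reach the position referenced
    # by cpPosIndex, so the job can read back the writes of the event
    # that spawned it.
    position = store.get_position(payload["cpPosIndex"])
    replica = pool.pick_random()
    replica.wait_for_position(position, timeout=1.0)
    payload["job"].execute(replica)
```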

The decoupling meant that instead of using the session store interface, it used the MainStash interface directly, and thus created its own top-level keys rather than adding data to the user's main session key-value. But physically, the session store and MainStash used the same Redis backends (until recently).

Now what
  • Determine the guarantees that ChronologyProtector needs.
  • Decide where to store it.
  • Make it happen.
  • Do it in Beta.
  • Do it in prod.
Priority
  • Solving this is a blocker for decommissioning the Redis cluster. – T243520
    • The Redis cluster decommission is scheduled to complete ahead of the FY2020-2021 Q1 DC-switchover, as it would significantly reduce the amount of work and risk involved in a switchover. – T243314
  • Solving this is a blocker for moving MainStash away from Redis to a new, simpler backend (e.g. db-replicated), because ChronologyProtector cannot work on the simpler guarantees of the (new) MainStash.
  • Solving this is a blocker for migrating MainStash, because ChronologyProtector (after SessionStore and Echo) is responsible for the largest amount of remaining Redis traffic, which would make MainStash more expensive to support. Its data is small and short-lived, but it has high reads-per-second and low-latency needs. – T229062#6186134
  • Solving this is a blocker for multi-dc (long term). – T88445
Numbers

About 14K operations per second. See also T212129#6190251.

Event Timeline

Krinkle created this task. Jun 6 2020, 2:22 AM
Restricted Application added a project: Platform Engineering. Jun 6 2020, 2:22 AM
Restricted Application added a subscriber: Masumrezarock100.

The ChronologyProtector keys likely could not be accommodated in the (TBD) weaker semantics of a multi-dc MainStash (e.g. one-way replication like Redis with dc-master writes, or bi-directional db-replicated like the parser cache, ..). Incurring cross-dc writes would be undesirable for this. Plus, there is the complexity it brings of doing cross-dc "waiting". It would be nice to avoid this, since CP itself exists to facilitate waiting for SQL. Waiting for a thing to help you wait for another thing would be awkward. I think we can do better than that!

Do we need the highest and most complex guarantees of dc-local writes with bi-directional replication, like the session store and Echo store? I hope not, but maybe we do. I think the code is currently written in a way that expects all that (due to the legacy givens of the session store). On the other hand, even for the new multi-dc session store (Kask) we don't currently have support for cross-dc write quorum. I don't know if that's planned, but again, if it is at all possible to work with something simpler, that should be considered for sure.

I think it makes sense to evaluate CP's actual needs from "first principles", ignoring its current implementation. Get a good understanding of what exactly is needed, what for, and what can and can't easily be improved or changed.

As a strawman proposal: could we use some kind of "simple" dc-local persistent store? (No cross-dc coordination, leveraging the cdncache=false/sticky-dc=eqiad cookies for the same duration as the CP cookie.)

I assume not. But I hope this can help as a way to frame the what and why. One use case is cross-domain redirects and login/CentralAuth, but let's get this written down in more detail and see what our options are.

Misc thoughts:

  • I seem to recall that we currently write code in MW as if we need to support arbitrary and regular jumping between DCs. I don't know if that's the case, but if so, based on more recent chats with SRE it seems this may be unrealistic and something we don't (currently) need to concern ourselves with; it should instead be a future goal, separate from multi-dc itself.
  • On the other hand, even if we assume that a given geographic location is always tied in whole (no balancing) to a single application DC, one would still need to deal with the fact that a POST/write request will always go to the primary DC (jump 1), and that after things have settled, one continues on the initial DC (jump 2). Jump 2 seems easy to handle, given it naturally happens after the DCs have caught up. Jump 1 might be the harder one to solve. Or is it? If we assume pre-send DB writes only happen on POST requests, and other requests are routed to the primary DC, then there would naturally be no need for cross-dc chronology, given temporary dc-stickiness. But again, the cross-domain use case comes up here, as you might redirect from a POST action to a GET on another domain where the dc-sticky cookie isn't applied. That is currently solved with a query parameter (cpPosIndex), which informs MW in "the other DC" that it should read a key from MainStash, wait for the key if it doesn't exist yet, and after that, find or wait for the relevant DB replica (sketched below).
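A rough sketch of that cross-domain handoff (illustrative names only; the real parameter is cpPosIndex, everything else is assumed):

```python
def redirect_after_post(store, master, target_url: str) -> str:
    # On the POST side: write the master position to MainStash (async) and
    # append the resulting index to the cross-domain redirect URL.
    pos_index = store.append_position_async(master.current_position())
    return f"{target_url}?cpPosIndex={pos_index}"

def handle_cross_domain_get(store, pool, request):
    # On the GET side, possibly in "the other DC": the key may not have
    # replicated yet, so poll briefly for it to arrive, then find or wait
    # for a caught-up replica.
    pos_index = request.query.get("cpPosIndex")
    if pos_index is None:
        return pool.pick_random()
    position = store.wait_for_position(pos_index, timeout=2.0)
    replica = pool.pick_random()
    replica.wait_for_position(position, timeout=1.0)
    return replica
```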
Krinkle updated the task description. Jun 6 2020, 3:42 AM
Krinkle updated the task description. Jun 6 2020, 3:48 AM

Ignoring all the complications above, and whether it needs replication, for a second: perhaps there's a simpler way forward here...

Talking purely about the traffic: this is too much for a simpler main stash, but naturally in the same ballpark as sessions (14K/s).
The data is small and needs to be written from either DC.

Perhaps it would be simplest to store this in the session store, just like it used to be, and how it effectively was until now. This is quite literally tied to the concept of a session, just with a different key. It would be tied to the same BagOStuff instance by reference (kask-sessions), not a separate store like Echo.

In terms of traffic, that would mean a 2x increase in req/s for kask-session. Same ballpark, but not insignificant either.

TODO for later: intuitively I would expect ChronologyProtector req/s to be a subset of session req/s, given that any write pretty much always involves a session, and that not every sessioned web request involves a write that CP needs to store or a CP cookie that needs to be applied. So why is it this high?

Krinkle claimed this task. Jun 8 2020, 8:10 PM
Krinkle removed Krinkle as the assignee of this task.
Krinkle moved this task from Inbox to Doing on the Performance-Team board.
eprodromou added a subscriber: eprodromou.

Glad to see movement on this and we'll be monitoring the progress.

Joe added a comment. Jun 10 2020, 11:52 AM

I thought about this, and I think we will need to direct people who would set the CP session to the main datacenter anyway: the risk that none of the replicas in the read-only DC have caught up to an acceptable lag will be non-negligible, so we will need to set a cookie on the client so that subsequent reads are sent to the main datacenter for a period of time.

In this scenario, we could get away with a dc-local storage:

  • User makes a POST request, say for an edit. This is directed to the primary datacenter.
  • MediaWiki will both set CP data in its store and set a cookie, read=consistent or something like that, expiring after the MAXLAG of the replica datacenter.
  • On subsequent requests, as long as that cookie is defined, the traffic layer will direct the user to the primary datacenter.

So we could either decide to just use a non-replicated Redis for this, or sessionstore - either should be ok, really.
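A rough sketch of that routing idea (Python pseudocode; the cookie name and lag value are placeholders from the comment above, not an actual traffic-layer API):

```python
PRIMARY_DC = "eqiad"
REPLICA_MAX_LAG = 10  # seconds; cookie lifetime matches the expected max lag

def route_request(request) -> str:
    # At the traffic layer: while the consistency cookie is present,
    # pin the client to the primary datacenter.
    if request.cookies.get("read") == "consistent":
        return PRIMARY_DC
    return request.nearest_dc

def after_post_response(response) -> None:
    # MediaWiki has set CP data in its dc-local store; also set the cookie
    # so subsequent reads go to the primary DC until replicas catch up.
    response.set_cookie("read", "consistent", max_age=REPLICA_MAX_LAG)
```

The appeal of this design is that CP storage never needs to replicate: any request that could observe the new data is routed to the DC that wrote it.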

Krinkle updated the task description. Jun 10 2020, 2:42 PM
Krinkle updated the task description.
Krinkle triaged this task as Medium priority. Jun 10 2020, 3:14 PM
Krinkle updated the task description.
Krinkle added a comment. Edited Jun 10 2020, 3:35 PM

Notes from today's meeting:

  • Data in ChronologyProtector needs to be strongly persisted, but only for a fixed, short duration of time (under 10 seconds).
  • For the 99% use case, MediaWiki only expects to read CP data that was written by it from within the same DC.
  • The main exception is the account creation process. When signing up on enwiki, MW creates your user in the enwiki database and in the CentralAuth database. It then redirects to loginwiki, where a local user is created as well. This GET request to loginwiki needs to be able to read data that was just written to the CentralAuth master db from the enwiki signup/POST request.
    • The way this currently works: 1) the enwiki POST request performs an async write to MainStash and appends a special query parameter to the loginwiki redirect url. 2) On loginwiki, CP sees the query param, polls MainStash for the relevant key to arrive, and then selects/awaits the replica DB.
    • Two ways this could be made to work instead:
      • A) Let the enwiki POST request do a synchronous cross-dc quorum write. This comes at the cost of user-visible latency, and requires additional guarantees from the underlying store to expose that.
      • B) Let the relevant loginwiki GET requests be treated the same as POST requests and naturally route to the primary DC. This reduces user-visible latency and avoids exposing cross-dc writes. It is also likely needed regardless, since these loginwiki GET requests make database writes of their own, which means we would want to route them to the primary DC with or without ChronologyProtector. (See also T91820)
    • Option B seems preferred (see the sketch below).
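A sketch of option B's routing rule (illustrative only; how the "relevant" loginwiki GETs are detected is hand-waved here, and the hostname is an assumption):

```python
PRIMARY_DC = "eqiad"

def route_request(request) -> str:
    # Write requests always go to the primary DC.
    if request.method == "POST":
        return PRIMARY_DC
    # Option B: treat the CentralAuth login-completion GETs on loginwiki
    # the same way, since they also perform database writes and must read
    # data just written by the enwiki signup POST.
    if request.host == "login.wikimedia.org" and "cpPosIndex" in request.query:
        return PRIMARY_DC
    return request.nearest_dc
```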

Requirements:

  • 12-14K req/s (about 0.8x of current session traffic).
  • DC-local reads and writes only.

History:

  • The CP data used to be a built-in part of the user session object in MW, and was thus naturally in the session store.
  • In 2015 it was promoted to be a sibling of the session keys rather than a child. As its own top-level key, this allowed CP to become stateless across domains and wikis, and thus no longer dependent on session cookies.
  • Continuing to store these in the session store (Redis previously, now kask-sessionstore) seems to make the most sense operationally.
jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. Aug 17 2020, 11:45 PM
jijiki added a subscriber: jijiki.

From the Multi-DC sync meeting on 31 Aug 2020. Attending: Gilles D, Timo T, Even E, Effie, Giuseppe L

  • Historically, ChronologyProtector (CP) consumed the session store within MW, as one of the various bits of data stored in the session object. In 2015, this was refactored to use the session store directly, with its own key in the session store and without requiring a session cookie etc. Keeping it on Redis or Kask is preferred for latency. Moving to the new main stash is not feasible. The TTL is under a minute, around 10-30 seconds only.
  • It is relatively cheap to spin up additional Kask proxies with their own key spaces and TTLs, they can share existing Cassandra clusters.
  • And writes to Kask are fast. But reads can be less so. Currently Kask is conservative and disallows per-key TTLs, to discourage re-use for unplanned purposes (often a different TTL implies a different purpose). Short TTL values would cause a lot of compaction in Cassandra. Cassandra is not the best choice for something so short-lived.
  • So Redis (without replication) seems like the better option.
  • In the future, we could perhaps transition to a small dedicated memcached pool. That would allow better use of our experience with failover and the gutter pool using Mcrouter. But we would need to make sure it won't generally be affected by LRU eviction. It is more expensive to set up; for now, Redis might be the simplest option and good enough to keep going.

Either way, from a MW perspective this means CP will use a simple dc-local BagOStuff.

jijiki removed a subscriber: User-jijiki.