
Determine and implement multi-dc strategy for ChronologyProtector
Closed, Resolved · Public

Description

What

ChronologyProtector is responsible for making sure that after users make changes that result in database writes (e.g. editing a page, changing your user preferences, etc.), future page loads reflect these changes. This is mainly to accommodate the fact that replication from the db-master to the pool of db-replicas takes time.

The CP code is fairly small and is part of the wikimedia/rdbms library in MediaWiki.

Why
  • After saving an edit and going to the history page, your edit should always be listed there. (Even in the edge case where the article content may be lagged for parser-stampede/Michael Jackson reasons.)
  • When changing your skin from Vector to Minerva, the confirmation page seen thereafter and the next article you view should use Minerva.
How

Changes are made to the db-master. Information is read from a pool of multiple db replicas (for scale).

In general, a random replica is picked from the pool to read your preferences, render wiki content, etc.

If you've recently submitted a form or otherwise did something that resulted in a db-master write, then ChronologyProtector will have observed the current "position" of the db-master and added it to a list of positions internally associated with your connection (IP/User-Agent). A temporary cookie is set so that MW knows to look for this list, and which offset to expect. This information is then used to decide which db-replica to pick from the pool, and if needed to try another or wait briefly for it to catch up, etc.
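To make that concrete, here is a minimal sketch of the idea in PHP. This is not the actual rdbms implementation (which lives in Wikimedia\Rdbms\ChronologyProtector): the class, method names, and key scheme below are made up for illustration, while BagOStuff, DBMasterPos, and ILoadBalancer::waitFor() are real MediaWiki interfaces.

```php
<?php
use Wikimedia\Rdbms\DBMasterPos;
use Wikimedia\Rdbms\ILoadBalancer;

// Hypothetical class, for illustration only.
class ChronologyPositionStore {
	/** @var BagOStuff */
	private $store;

	public function __construct( BagOStuff $store ) {
		$this->store = $store;
	}

	/** After a db-master write: remember how far the master had gotten. */
	public function recordPosition( string $clientId, DBMasterPos $pos ) {
		// $clientId is derived from the client's IP + User-Agent; the
		// temporary cookie only tells MW that a key is worth looking up.
		$key = $this->store->makeGlobalKey( 'cp-positions', $clientId );
		$this->store->set( $key, $pos, 10 ); // short TTL, ~10 seconds
	}

	/** On a later read: pick/await a replica that has caught up. */
	public function applyPosition( string $clientId, ILoadBalancer $lb ) {
		$key = $this->store->makeGlobalKey( 'cp-positions', $clientId );
		$pos = $this->store->get( $key );
		if ( $pos instanceof DBMasterPos ) {
			$lb->waitFor( $pos ); // replica selection will honour $pos
		}
	}
}
```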

So

This information used to be kept in the session store (both logically and physically). It used the MW interface for adding data to the current session in a subkey of the session object, much like any other session-related feature does. In 2015, CP was logically decoupled from the session store (T111264, 85c0f85e92), so that it could also be used without requiring a user session. Services such as RESTBase, ChangeProp, and JobQueue, as well as external API consumers, can use their cpPosIndex and enjoy the same guarantees, e.g. that a job will execute knowing it can read back the information from the event that spawned it. Decoupling from sessions also makes it easier to use CP across domains and in the context of db writes to shared or external databases.

The decoupling meant that instead of using the session store interface, CP used the MainStash interface directly, and thus created its own top-level keys rather than adding data to the user's main session key-value. But physically, the session store and MainStash used the same Redis backends (until recently).

Now what
  • Determine the guarantees that ChronologyProtector needs.
  • Decide where to store it.
  • Audit callers of getChronologyProtectorTouched() with regard to single-DC storage.
  • Make it happen.
  • Do it in Beta.
  • Do it in prod.
Priority
  • Solving this is a blocker for decommissioning the Redis cluster. – T243520
    • Redis cluster decom is scheduled to complete ahead of the FY2020-2021 Q1 DC switchover, as it would significantly reduce the amount of work and risk required for a switchover. – T243314
  • Solving this is a blocker for moving MainStash away from Redis to a new, simpler backend (e.g. db-replicated), because ChronologyProtector cannot work with the simpler guarantees of the (new) MainStash.
  • Solving this is a blocker for migrating MainStash, because ChronologyProtector (after SessionStore and Echo) is responsible for the largest amount of remaining Redis traffic, which would make MainStash more expensive to support. Its data is small and short-lived, but it has high reads-per-second and low-latency needs. – T229062#6186134
  • Solving this is a blocker for multi-dc (long term). – T88445
Numbers

About 14K operations per second. See also T212129#6190251.

Related Objects

Status · Assigned
Resolved · aaron
Resolved · jijiki
Resolved · aaron
Resolved · Krinkle
Resolved · Krinkle
Resolved · aaron
Resolved · tstarling
Declined · aaron
Resolved · aaron
Resolved · Eevans
Resolved · aaron
Resolved · Krinkle
Resolved · Papaul
Resolved · Marostegui
Resolved · aaron
Resolved · Krinkle
Resolved · tstarling
Resolved · tstarling
Resolved (PRODUCTION ERROR) · jcrespo
Resolved · tstarling
Resolved · Marostegui
Resolved · tstarling

Event Timeline

Restricted Application added a subscriber: Masumrezarock100.

The ChronologyProtector keys likely could not be accommodated by the (TBD) weaker semantics of a multi-dc MainStash (e.g. one-way replication like Redis with dc-master writes, or bi-directional db-replicated like the parser cache). Incurring cross-dc writes would be undesirable for this. Plus, there is the complexity it brings of doing cross-dc "waiting". It would be nice to avoid this, since CP itself exists to facilitate waiting for SQL. Waiting for a thing to help you wait for another thing would be awkward. I think we can do better than that!

Do we need the highest and most complex guarantees of dc-local writes with bi-directional replication, like the session store and Echo store? I hope not, but maybe we do. I think the code is currently written in a way that expects all that (due to the legacy givens of the session store). But on the other hand, even for the new multi-dc session store (Kask) we don't currently have support for cross-dc write quorum. I don't know if that's planned, but again, if it is at all possible to work with something simpler, that should be considered for sure.

I think it makes sense to evaluate CP's actual needs from "first principles", ignoring its current implementation. Get a good understanding of what exactly is needed, what for, and what can and can't easily be improved or changed.

As a strawman proposal: could we use some kind of "simple" dc-local persistent store? (No cross-dc coordination, leveraging the cdncache=false/sticky-dc=eqiad cookies for the same duration as the CP cookie.)

I assume not. But I hope this can help as a way to frame the what and why. One use case is cross-domain redirects and login/CentralAuth, but let's get this written down in more detail and see what our options are.

Misc thoughts:

  • I seem to recall that we currently write code in MW as if we need to support arbitrary and regular jumping between DCs. I don't know if that's the case, but if so, based on more recent chats with SRE it seems this may be unrealistic and something we don't (currently) need to concern ourselves with; it should instead be a future goal, separate from multi-dc itself.
  • On the other hand, even if we assume that a given geographic location is always tied in whole (no balancing) to a single application DC, one would still need to deal with the fact that a POST/write request will always go to the primary DC (jump 1) and that, after things settle, one continues in the initial one (jump 2). Jump 2 seems easy to handle, given it naturally happens after the DCs have caught up. Jump 1 might be the harder one to solve. Or is it? If we assume pre-send DB writes only happen on POST requests and other requests routed to the primary DC, then there would naturally be no need for cross-dc chronology, given temporary dc-stickiness. But again, the cross-domain use case comes up here, as you might redirect from a POST action to a GET on another domain where the dc-sticky cookie isn't applied. That is currently solved with a query parameter (cpPosIndex), which informs MW in "the other DC" that it should read a key from MainStash and, if it doesn't exist yet, wait for it. And after that, to find or wait for the relevant DB replica. (A rough sketch of this handoff follows below.)
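A rough sketch of that cpPosIndex handoff, purely for illustration. The $stash/$key variables, the writeIndex field, and the two-second poll budget are assumptions here, not the actual ChronologyProtector code; wfAppendQuery() is a real MediaWiki helper.

```php
// Wiki A (POST): positions are written to the stash asynchronously,
// and the redirect URL carries an index so the receiving wiki knows
// a stash key exists and which write generation to expect.
$cpPosIndex = 3; // hypothetical per-client write counter
$redirectUrl = wfAppendQuery( $loginWikiUrl, [ 'cpPosIndex' => $cpPosIndex ] );

// Wiki B (GET): the asynchronous stash write may not have landed yet,
// so poll briefly for the expected index before selecting a replica.
$deadline = microtime( true ) + 2.0; // hypothetical poll budget
do {
	// e.g. [ 'writeIndex' => int, 'positions' => DBMasterPos[] ]
	$data = $stash->get( $key );
	if ( is_array( $data ) && $data['writeIndex'] >= $cpPosIndex ) {
		break; // stash write arrived; now wait on the DB replica itself
	}
	usleep( 100000 ); // 100 ms between polls
} while ( microtime( true ) < $deadline );
```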

Ignoring for a second all the complications above, and whether it needs replication, perhaps there's a simpler way forward here...

Talking purely about the traffic: 14K/s is too much for a simpler main stash, but naturally in the same ballpark as sessions.
The data is small and needs to be written from either DC.

Perhaps it would be simplest to store this in the session store, just like it used to be, and how it effectively was until now. This is quite literally tied to the concept of a session, just with a different key. It would be tied to the same BagOStuff instance by reference (kask-sessions), not a separate store like Echo.

In terms of traffic, that would mean a 2X increase in req/s for kask-sessions. Same ballpark, but not insignificant either.

TODO for later: intuitively I would expect ChronologyProtector req/s to be a subset of session req/s, given that any write pretty much always involves a session, and that not every sessioned web request involves a write that CP needs to store or a CP cookie that needs to be applied. So why is it this high?

Krinkle removed Krinkle as the assignee of this task.
Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.
eprodromou subscribed.

Glad to see movement on this and we'll be monitoring the progress.

I thought about this, and I think we will need to direct people who would set the CP session to the main datacenter anyway: the risk that none of the replicas in the read-only DC have caught up to an acceptable lag will be non-negligible, so we will need to set some cookie on the client so that subsequent reads are sent to the main datacenter for a period of time.

In this scenario, we could get away with a dc-local storage:

  • User makes a POST request, say for an edit. This is directed to the primary datacenter.
  • MediaWiki will both set CP in its store and set a cookie, read=consistent or something like that, expiring at the MAXLAG of the replica datacenter.
  • On subsequent requests, as long as that cookie is defined, the traffic layer will direct the user to the primary datacenter.

So we could either decide to just use a non-replicated Redis for this, or sessionstore; either should be OK, really.
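A minimal sketch of what the MediaWiki half of this scheme could look like. The cookie name/value ('UseDC'/'master') and the fixed 10-second window are placeholders, and the actual routing rule would live in the traffic layer (Varnish/ATS), not here; wasPosted(), hasOrMadeRecentMasterChanges(), and setCookie() are real MediaWiki methods.

```php
// After a request that wrote to the db-master, pin the client to the
// primary DC for roughly the replica-DC MAXLAG window.
if ( $request->wasPosted() && $lb->hasOrMadeRecentMasterChanges() ) {
	$request->response()->setCookie(
		'UseDC',    // hypothetical sticky-routing cookie name
		'master',
		time() + 10 // placeholder for the MAXLAG-based expiry
	);
}
```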

Krinkle updated the task description.
Krinkle triaged this task as Medium priority. Jun 10 2020, 3:14 PM
Krinkle updated the task description.

Notes from today's meeting:

  • Data in ChronologyProtector needs to be strongly persisted, but only for a fixed, short duration of time (under 10 seconds).
  • For the 99% use case, MediaWiki only expects to read CP data that was written by it from within the same DC.
  • The main exception is account creation. When signing up on enwiki, MW creates your user in the enwiki database and in the CentralAuth database. It then redirects to loginwiki, where a local user is created as well. This GET request to loginwiki needs to be able to read data that was just written to the CentralAuth master DB from the enwiki signup/POST request.
    • The way this currently works: 1) The enwiki POST request performs an async write to MainStash, and appends a special query parameter to the loginwiki redirect url. 2) On loginwiki, CP sees the query param, polls MainStash for the relevant key to arrive, and then selects/awaits the replica DB.
    • Two ways this could be made to work instead:
      • A) Let the enwiki POST request do a synchronous cross-dc quorum write. This comes at the cost of user-visible latency and additional guarantees on the underlying store to expose that.
      • B) Let the relevant loginwiki GET requests be treated the same as POST requests and naturally route to the primary DC. This reduces user-visible latency and avoids exposing cross-dc writes. And it is likely needed regardless, since these loginwiki GET requests also make database writes, which means we would want to route them to the primary DC regardless of ChronologyProtector. (See also T91820)
    • Option B seems preferred (see the sketch below).
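As a sketch of option B, the classification could look something like the following. The function and the entry-point pattern are illustrative assumptions, not an exhaustive or real list, and the actual rule would be enforced at the traffic layer rather than inside MediaWiki.

```php
// Hypothetical: decide whether a request must be routed to the
// primary DC, treating known write-performing GET entry points
// the same as POST requests.
function requestNeedsPrimaryDc( WebRequest $request ): bool {
	if ( $request->wasPosted() ) {
		return true; // all POST requests go to the primary DC
	}
	// GET entry points known to perform DB writes, e.g. the loginwiki
	// leg of signup/login (example pattern, not a complete list).
	$title = $request->getVal( 'title' ) ?? '';
	return (bool)preg_match( '/^Special:CentralLogin/', $title );
}
```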

Requirements:

  • 12-14K req/s (about 0.8x of current session traffic).
  • DC-local reads and writes only.

History:

  • The CP data used to be a built-in part of the user session object in MW, and was thus naturally in the session store.
  • In 2015, the CP keys were promoted to be siblings of the session keys rather than children. As its own top-level key, this allowed CP to become stateless across domains and wikis, and thus no longer dependent on session cookies.
  • Continuing to store these in the session store (Redis previously, now kask-sessionstore) seems to make the most sense operationally.

From the Multi-DC sync meeting on 31 Aug 2020. Attending: Gilles D, Timo T, Eric E, Effie, Giuseppe L.

  • Historically, ChronologyProtector (CP) consumed the session store within MW, as one of the various bits of data stored in the session object. In 2015, this was refactored to use the session store directly and have its own key in the session store, without requiring a session cookie etc. Keeping it on Redis or Kask is preferred for latency. Moving to the new main stash is not feasible. TTL is under a minute, around 10-30 seconds only.
  • It is relatively cheap to spin up additional Kask proxies with their own key spaces and TTLs; they can share existing Cassandra clusters.
  • Writes to Kask are fast, but reads can be less so. Currently Kask is conservative and disallows per-key TTLs, to discourage re-use for unplanned purposes; often a different TTL means a different purpose. Short TTL values would cause a lot of compaction in Cassandra. Cassandra is not the best choice for something so short-lived.
  • So Redis (without replication) seems like the better option.
  • In the future, we could perhaps transition to a small dedicated memcached pool. That would allow better use of our experience with failover and gutter pools using mcrouter. But we would need to make sure it won't be affected by LRU eviction generally. It is more expensive to set up; for now Redis might be simplest and good enough to keep going.

Either way, from a MW perspective this means CP will use a simple dc-local BagOStuff.
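For illustration, the eventual wiring could look roughly like this in wmf-config, assuming the $wgChronologyProtectorStash setting (added by a later patch on this task). The cache name, server address, and log group are placeholders, not the actual production values.

```php
// Hypothetical dc-local Redis backend for ChronologyProtector;
// no cross-dc replication, per the meeting conclusion above.
$wgObjectCaches['cp-redis'] = [
	'class' => RedisBagOStuff::class,
	'servers' => [ '10.2.0.1:6379' ], // placeholder dc-local address
	'loggroup' => 'redis',
];
$wgChronologyProtectorStash = 'cp-redis';
```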

@Krinkle do you think it would be possible to schedule this for Q3? Either move to the 'redis_misc' cluster or, ideally, switch the backend to memcached. As we have discussed, if there is time and if it is possible, switching to memcached will provide better reliability and it supports multi-dc. Thank you!

> @Krinkle do you think it would be possible to schedule this for Q3? Either move to the 'redis_misc' cluster or, ideally, switch the backend to memcached. As we have discussed, if there is time and if it is possible, switching to memcached will provide better reliability and it supports multi-dc. Thank you!

This is the first I've heard about possibly using redis_misc. It was my understanding from the Multi-DC meetings we had in past months that the only immediate need was to not have to worry about multi-dc replication of the main stash, which is being solved by moving that to MySQL (T212129). The use of ChronologyProtector was going to remain on its own Redis cluster for now, either the cluster formerly known as redis_sessions or a reduced or moved-out version of it.

Since this is not a cache, I would worry about it being mixed in with unrelated applications on their Redis cluster. It is not expected that evictions happen here for anything other than TTL or extraordinary circumstances.

I was planning to resolve this task once the following two are done:

  • wmf-config is updated to send MW CP to redis_session directly (instead of via the MainStash interface, which is moving to CP-incompatible MySQL in T212129).
  • Source code and docs for MW CP are updated to reflect WMF's multi-dc strategy and the outcome of this task, specifically that it does not need replication, that we require POST routing, and that we rely on short-lived sticky router cookies, etc.

Moving to a different Redis cluster can happen in its own task, and myself or PET can help evaluate the requirements for that and schedule it into the quarter. Or, if it satisfies all current requirements, then it can probably also happen directly in wmf-config without our involvement, the same way that redis_sessions IPs were changed in the past without coordination. That's fine.

>> @Krinkle do you think it would be possible to schedule this for Q3? Either move to the 'redis_misc' cluster or, ideally, switch the backend to memcached. As we have discussed, if there is time and if it is possible, switching to memcached will provide better reliability and it supports multi-dc. Thank you!

> This is the first I've heard about possibly using redis_misc. It was my understanding from the Multi-DC meetings we had in past months that the only immediate need was to not have to worry about multi-dc replication of the main stash, which is being solved by moving that to MySQL (T212129). The use of ChronologyProtector was going to remain on its own Redis cluster for now, either the cluster formerly known as redis_sessions or a reduced or moved-out version of it.

Moving to dedicated instances on the redis_misc cluster would be just that: having a couple of dedicated Redis instances per datacenter, with no cross-dc replication.

> Since this is not a cache, I would worry about it being mixed in with unrelated applications on their Redis cluster. It is not expected that evictions happen here for anything other than TTL or extraordinary circumstances.

> I was planning to resolve this task once the following two are done:
>   • wmf-config is updated to send MW CP to redis_session directly (instead of via the MainStash interface, which is moving to CP-incompatible MySQL in T212129).
>   • Source code and docs for MW CP are updated to reflect WMF's multi-dc strategy and the outcome of this task, specifically that it does not need replication, that we require POST routing, and that we rely on short-lived sticky router cookies, etc.
> Moving to a different Redis cluster can happen in its own task, and myself or PET can help evaluate the requirements for that and schedule it into the quarter. Or, if it satisfies all current requirements, then it can probably also happen directly in wmf-config without our involvement, the same way that redis_sessions IPs were changed in the past without coordination. That's fine.

Nothing would be shared with another application, with more space available to CP than we have now, and yes, we can manage that independently once we move CP away from the main stash. We would, however, need to get away from using nutcracker for this, which is an additional complication.

Krinkle renamed this task from "Determine multi-dc strategy for ChronologyProtector" to "Determine and implement multi-dc strategy for ChronologyProtector". Dec 16 2020, 2:13 AM

Change 649765 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] rdbms: Use PSR-3 for ChronologyProtector shutdown debug message

https://gerrit.wikimedia.org/r/649765

Change 649765 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Use PSR-3 for ChronologyProtector shutdown debug message

https://gerrit.wikimedia.org/r/649765

The only things that currently update the replication positions on HTTP GET, and that are not easily spotted entrypoints like rollback/createaccount/login (which can be routed like HTTP POST), are:

  • UpdateHitCountWatcher from AbuseFilter on edit form views (this should use the main stash or at least the job queue); does not need ChronologyProtector
  • CentralAuthUser->lazyImportLocalNames(); this should be solved by the migration script (T150506)
  • SpecialContentTranslation setting global preferences just to store "user has seen X" state; this should use the main stash (or the job queue at least)
  • ShortUrlHooks on page views; this will be fixed by T256993

This is based on logstash:

+channel:DBPerformance +writes -by:"MediaWiki::restInPeace" +http_method:GET

Change 664044 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] rdbms: Remove cross-dc sync support for ChronologyProtector

https://gerrit.wikimedia.org/r/664044

Change 664415 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add $wgChronologyProtectorStash settings and improve $wgMainStash comments

https://gerrit.wikimedia.org/r/664415

> The only things that currently update the replication positions on HTTP GET […] are:
>   • UpdateHitCountWatcher from AbuseFilter on edit form views (this should use the main stash or at least the job queue); does not need ChronologyProtector

Done, in T274455.

>   • CentralAuthUser->lazyImportLocalNames(); this should be solved by the migration script (T150506)

TODO.

>   • SpecialContentTranslation setting global preferences just to store "user has seen X" state; this should use the main stash (or the job queue at least)

Is there a task for this?

>   • ShortUrlHooks on page views; this will be fixed by T256993

TODO. We should ping PET, I guess.

Change 664415 merged by jenkins-bot:
[mediawiki/core@master] Add $wgChronologyProtectorStash and improve $wgMainStash comments

https://gerrit.wikimedia.org/r/664415

Change 664044 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Improve ChronologyProtector documentation

https://gerrit.wikimedia.org/r/664044

Krinkle updated the task description.