
Determine and implement multi-dc strategy for ChronologyProtector
Closed, Resolved · Public

Description

What

ChronologyProtector is responsible for making sure that after users make changes that result in database writes (e.g. editing a page, changing your user preferences, etc.), future page loads reflect these changes. This is mainly to accommodate the fact that replication from the db-master to the pool of db-replicas takes time.

The CP code is fairly small and is part of the wikimedia/rdbms library in MediaWiki.

Why
  • After saving an edit and going to the history page, your edit should always be listed there. (Even in the edge case where the article content may be lagged for parser-stampede/Michael Jackson reasons.)
  • When changing your skin from Vector to Minerva, the confirmation page seen thereafter and the next article you view should use Minerva.
How

Changes are made to the db-master. Information is read from a pool of multiple db replicas (for scale).

In general, a random replica is picked from the pool to read your preferences, render wiki content, etc.

If you've recently submitted a form or otherwise did something that resulted in a db-master write, then ChronologyProtector will have observed the current "position" of the db-master and added it to a list of positions internally associated with your connection (IP/User-Agent). A temporary cookie is set so that MW knows to look for this list, and which offset to expect. This information is then used to decide which db-replica to pick from the pool, and if needed to try another or wait briefly for it to catch up, etc.
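To make that concrete, here is a minimal sketch of the idea in PHP. This is not the actual rdbms implementation (which lives in Wikimedia\Rdbms\ChronologyProtector): the class, method names, and key scheme below are made up for illustration, while BagOStuff, DBMasterPos, and ILoadBalancer::waitFor() are real MediaWiki interfaces.

```php
<?php
use Wikimedia\Rdbms\DBMasterPos;
use Wikimedia\Rdbms\ILoadBalancer;

// Hypothetical class, for illustration only.
class ChronologyPositionStore {
	/** @var BagOStuff */
	private $store;

	public function __construct( BagOStuff $store ) {
		$this->store = $store;
	}

	/** After a db-master write: remember how far the master had gotten. */
	public function recordPosition( string $clientId, DBMasterPos $pos ) {
		// $clientId is derived from the client's IP + User-Agent; the
		// temporary cookie only tells MW that a key is worth looking up.
		$key = $this->store->makeGlobalKey( 'cp-positions', $clientId );
		$this->store->set( $key, $pos, 10 ); // short TTL, ~10 seconds
	}

	/** On a later read: pick/await a replica that has caught up. */
	public function applyPosition( string $clientId, ILoadBalancer $lb ) {
		$key = $this->store->makeGlobalKey( 'cp-positions', $clientId );
		$pos = $this->store->get( $key );
		if ( $pos instanceof DBMasterPos ) {
			$lb->waitFor( $pos ); // replica selection will honour $pos
		}
	}
}
```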

So

This information used to be kept in the session store (both logically and physically). It used the MW interface for adding data to the current session in a subkey of the session object, much like any other session-related feature does. In 2015, CP was logically decoupled from the session store (T111264, 85c0f85e92), so that it could also be used without requiring a user session. Services such as RESTBase, ChangeProp, and JobQueue, as well as external API consumers, can use their cpPosIndex and enjoy the same guarantees, e.g. that a job will execute knowing it can read back the information from the event that spawned it. Decoupling from sessions also makes it easier to use CP across domains and in the context of db writes to shared or external databases.

The decoupling meant that instead of using the session store interface, CP used the MainStash interface directly, and thus created its own top-level keys rather than adding data to the user's main session key-value. But physically, the session store and MainStash used the same Redis backends (until recently).

Now what
  • Determine the guarantees that ChronologyProtector needs.
  • Decide where to store it.
  • Audit callers of getChronologyProtectorTouched() with regard to single-DC storage.
  • Make it happen.
  • Do it in Beta.
  • Do it in prod.
Priority
  • Solving this is a blocker for decommissioning the Redis cluster. – T243520
    • Redis cluster decom is scheduled to complete ahead of the FY2020-2021 Q1 DC switchover, as it would significantly reduce the amount of work and risk required for a switchover. – T243314
  • Solving this is a blocker for moving MainStash away from Redis to a new, simpler backend (e.g. db-replicated), because ChronologyProtector cannot work with the simpler guarantees of the (new) MainStash.
  • Solving this is a blocker for migrating MainStash, because ChronologyProtector (after SessionStore and Echo) is responsible for the largest amount of remaining Redis traffic, which would make MainStash more expensive to support. Its data is small and short-lived, but it has high reads-per-second and low-latency needs. – T229062#6186134
  • Solving this is a blocker for multi-dc (long term). – T88445
Numbers

About 14K operations per second. See also T212129#6190251.

Related Objects

Status · Assigned
Resolved · aaron
Resolved · jijiki
Resolved · aaron
Resolved · Krinkle
Resolved · Krinkle
Resolved · aaron
Resolved · tstarling
Declined · aaron
Resolved · aaron
Resolved · Eevans
Resolved · aaron
Resolved · Krinkle
Resolved · Papaul
Resolved · Marostegui
Resolved · aaron
Resolved · Krinkle
Resolved · tstarling
Resolved · tstarling
Resolved (PRODUCTION ERROR) · jcrespo
Resolved · tstarling
Resolved · Marostegui
Resolved · tstarling

Event Timeline

Restricted Application added a subscriber: Masumrezarock100.

The ChronologyProtector keys likely could not be accommodated by the (TBD) weaker semantics of a multi-dc MainStash (e.g. one-way replication like Redis with dc-master writes, or bi-directional db-replicated like the parser cache). Incurring cross-dc writes would be undesirable for this. Plus, there is the complexity it brings of doing cross-dc "waiting". It would be nice to avoid this, since CP itself exists to facilitate waiting for SQL. Waiting for a thing to help you wait for another thing would be awkward. I think we can do better than that!

Do we need the highest and most complex guarantees of dc-local writes with bi-directional replication, like the session store and Echo store? I hope not, but maybe we do. I think the code is currently written in a way that expects all that (due to the legacy givens of the session store). But on the other hand, even for the new multi-dc session store (Kask) we don't currently have support for cross-dc write quorum. I don't know if that's planned, but again, if it is at all possible to work with something simpler, that should be considered for sure.

I think it makes sense to evaluate CP's actual needs from "first principles", ignoring its current implementation. Get a good understanding of what exactly is needed, what for, and what can and can't easily be improved or changed.

As a strawman proposal: could we use some kind of "simple" dc-local persistent store? (No cross-dc coordination, leveraging the cdncache=false/sticky-dc=eqiad cookies for the same duration as the CP cookie.)

I assume not. But I hope this can help as a way to frame the what and why. One use case is cross-domain redirects and login/CentralAuth, but let's get this written down in more detail and see what our options are.

Misc thoughts:

  • I seem to recall that we currently write code in MW as if we need to support arbitrary and regular jumping between DCs. I don't know if that's the case, but if so, based on more recent chats with SRE it seems this may be unrealistic and something we don't (currently) need to concern ourselves with; it should instead be a future goal, separate from multi-dc itself.
  • On the other hand, even if we assume that a given geographic location is always tied in whole (no balancing) to a single application DC, one would still need to deal with the fact that a POST/write request will always go to the primary DC (jump 1) and that, after things settle, one continues in the initial one (jump 2). Jump 2 seems easy to handle, given it naturally happens after the DCs have caught up. Jump 1 might be the harder one to solve. Or is it? If we assume pre-send DB writes only happen on POST requests and other requests routed to the primary DC, then there would naturally be no need for cross-dc chronology, given temporary dc-stickiness. But again, the cross-domain use case comes up here, as you might redirect from a POST action to a GET on another domain where the dc-sticky cookie isn't applied. That is currently solved with a query parameter (cpPosIndex), which informs MW in "the other DC" that it should read a key from MainStash and, if it doesn't exist yet, wait for it. And after that, to find or wait for the relevant DB replica. (A rough sketch of this handoff follows below.)
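A rough sketch of that cpPosIndex handoff, purely for illustration. The $stash/$key variables, the writeIndex field, and the two-second poll budget are assumptions here, not the actual ChronologyProtector code; wfAppendQuery() is a real MediaWiki helper.

```php
// Wiki A (POST): positions are written to the stash asynchronously,
// and the redirect URL carries an index so the receiving wiki knows
// a stash key exists and which write generation to expect.
$cpPosIndex = 3; // hypothetical per-client write counter
$redirectUrl = wfAppendQuery( $loginWikiUrl, [ 'cpPosIndex' => $cpPosIndex ] );

// Wiki B (GET): the asynchronous stash write may not have landed yet,
// so poll briefly for the expected index before selecting a replica.
$deadline = microtime( true ) + 2.0; // hypothetical poll budget
do {
	// e.g. [ 'writeIndex' => int, 'positions' => DBMasterPos[] ]
	$data = $stash->get( $key );
	if ( is_array( $data ) && $data['writeIndex'] >= $cpPosIndex ) {
		break; // stash write arrived; now wait on the DB replica itself
	}
	usleep( 100000 ); // 100 ms between polls
} while ( microtime( true ) < $deadline );
```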

Ignoring for a second all the complications above, and whether it needs replication, perhaps there's a simpler way forward here...

Talking purely about the traffic: 14K/s is too much for a simpler main stash, but naturally in the same ballpark as sessions.
The data is small and needs to be written from either DC.

Perhaps it would be simplest to store this in the session store, just like it used to be, and how it effectively was until now. This is quite literally tied to the concept of a session, just with a different key. It would be tied to the same BagOStuff instance by reference (kask-sessions), not a separate store like Echo.

In terms of traffic, that would mean a 2X increase in req/s for kask-sessions. Same ballpark, but not insignificant either.

TODO for later: intuitively I would expect ChronologyProtector req/s to be a subset of session req/s, given that any write pretty much always involves a session, and that not every sessioned web request involves a write that CP needs to store or a CP cookie that needs to be applied. So why is it this high?

Krinkle removed Krinkle as the assignee of this task.
Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.
eprodromou subscribed.

Glad to see movement on this and we'll be monitoring the progress.

I thought about this, and I think we will need to direct people who would set the CP session to the main datacenter anyway: the risk that none of the replicas in the read-only DC have caught up to an acceptable lag will be non-negligible, so we will need to set some cookie on the client so that subsequent reads are sent to the main datacenter for a period of time.

In this scenario, we could get away with a dc-local storage:

  • User makes a POST request, say for an edit. This is directed to the primary datacenter.
  • MediaWiki will both set CP in its store and set a cookie, read=consistent or something like that, expiring at the MAXLAG of the replica datacenter.
  • On subsequent requests, as long as that cookie is defined, the traffic layer will direct the user to the primary datacenter.

So we could either decide to just use a non-replicated Redis for this, or sessionstore; either should be OK, really.
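A minimal sketch of what the MediaWiki half of this scheme could look like. The cookie name/value ('UseDC'/'master') and the fixed 10-second window are placeholders, and the actual routing rule would live in the traffic layer (Varnish/ATS), not here; wasPosted(), hasOrMadeRecentMasterChanges(), and setCookie() are real MediaWiki methods.

```php
// After a request that wrote to the db-master, pin the client to the
// primary DC for roughly the replica-DC MAXLAG window.
if ( $request->wasPosted() && $lb->hasOrMadeRecentMasterChanges() ) {
	$request->response()->setCookie(
		'UseDC',    // hypothetical sticky-routing cookie name
		'master',
		time() + 10 // placeholder for the MAXLAG-based expiry
	);
}
```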

Krinkle updated the task description.
Krinkle triaged this task as Medium priority. Jun 10 2020, 3:14 PM
Krinkle updated the task description.

Notes from today's meeting:

  • Data in ChronologyProtector needs to be strongly persisted, but only for a fixed, short duration of time (under 10 seconds).
  • For the 99% use case, MediaWiki only expects to read CP data that was written by it from within the same DC.
  • The main exception is account creation. When signing up on enwiki, MW creates your user in the enwiki database and in the CentralAuth database. It then redirects to loginwiki, where a local user is created as well. This GET request to loginwiki needs to be able to read data that was just written to the CentralAuth master DB from the enwiki signup/POST request.
    • The way this currently works: 1) The enwiki POST request performs an async write to MainStash, and appends a special query parameter to the loginwiki redirect url. 2) On loginwiki, CP sees the query param, polls MainStash for the relevant key to arrive, and then selects/awaits the replica DB.
    • Two ways this could be made to work instead:
      • A) Let the enwiki POST request do a synchronous cross-dc quorum write. This comes at the cost of user-visible latency and additional guarantees on the underlying store to expose that.
      • B) Let the relevant loginwiki GET requests be treated the same as POST requests and naturally route to the primary DC. This reduces user-visible latency and avoids exposing cross-dc writes. And it is likely needed regardless, since these loginwiki GET requests also make database writes, which means we would want to route them to the primary DC regardless of ChronologyProtector. (See also T91820)
    • Option B seems preferred (see the sketch below).
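As a sketch of option B, the classification could look something like the following. The function and the entry-point pattern are illustrative assumptions, not an exhaustive or real list, and the actual rule would be enforced at the traffic layer rather than inside MediaWiki.

```php
// Hypothetical: decide whether a request must be routed to the
// primary DC, treating known write-performing GET entry points
// the same as POST requests.
function requestNeedsPrimaryDc( WebRequest $request ): bool {
	if ( $request->wasPosted() ) {
		return true; // all POST requests go to the primary DC
	}
	// GET entry points known to perform DB writes, e.g. the loginwiki
	// leg of signup/login (example pattern, not a complete list).
	$title = $request->getVal( 'title' ) ?? '';
	return (bool)preg_match( '/^Special:CentralLogin/', $title );
}
```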

Requirements:

  • 12-14K req/s (about 0.8x of current session traffic).
  • DC-local reads and writes only.

History:

  • The CP data used to be a built-in part of the user session object in MW, and was thus naturally in the session store.
  • In 2015, the CP keys were promoted to be siblings of the session keys rather than children. As its own top-level key, this allowed CP to become stateless across domains and wikis, and thus no longer dependent on session cookies.
  • Continuing to store these in the session store (Redis previously, now kask-sessionstore) seems to make the most sense operationally.

From the Multi-DC sync meeting on 31 Aug 2020. Attending: Gilles D, Timo T, Eric E, Effie, Giuseppe L.

  • Historically, ChronologyProtector (CP) consumed the session store within MW, as one of the various bits of data stored in the session object. In 2015, this was refactored to use the session store directly and have its own key in the session store, without requiring a session cookie etc. Keeping it on Redis or Kask is preferred for latency. Moving to the new main stash is not feasible. TTL is under a minute, around 10-30 seconds only.
  • It is relatively cheap to spin up additional Kask proxies with their own key spaces and TTLs; they can share existing Cassandra clusters.
  • Writes to Kask are fast, but reads can be less so. Currently Kask is conservative and disallows per-key TTLs, to discourage re-use for unplanned purposes; often a different TTL means a different purpose. Short TTL values would cause a lot of compaction in Cassandra. Cassandra is not the best choice for something so short-lived.
  • So Redis (without replication) seems like the better option.
  • In the future, we could perhaps transition to a small dedicated memcached pool. That would allow better use of our experience with failover and gutter pools using mcrouter. But we would need to make sure it won't be affected by LRU eviction generally. It is more expensive to set up; for now Redis might be simplest and good enough to keep going.

Either way, from a MW perspective this means CP will use a simple dc-local BagOStuff.
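For illustration, the eventual wiring could look roughly like this in wmf-config, assuming the $wgChronologyProtectorStash setting (added by a later patch on this task). The cache name, server address, and log group are placeholders, not the actual production values.

```php
// Hypothetical dc-local Redis backend for ChronologyProtector;
// no cross-dc replication, per the meeting conclusion above.
$wgObjectCaches['cp-redis'] = [
	'class' => RedisBagOStuff::class,
	'servers' => [ '10.2.0.1:6379' ], // placeholder dc-local address
	'loggroup' => 'redis',
];
$wgChronologyProtectorStash = 'cp-redis';
```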

@Krinkle do you think it would be possible to schedule this for Q3? Either move to the 'redis_misc' cluster or, ideally, switch the backend to memcached. As we have discussed, if there is time and if it is possible, switching to memcached will provide better reliability and it supports multi-dc. Thank you!

> @Krinkle do you think it would be possible to schedule this for Q3? Either move to the 'redis_misc' cluster or, ideally, switch the backend to memcached. As we have discussed, if there is time and if it is possible, switching to memcached will provide better reliability and it supports multi-dc. Thank you!

This is the first I've heard about possibly using redis_misc. It was my understanding from the Multi-DC meetings we had in past months that the only immediate need was to not have to worry about multi-dc replication of the main stash, which is being solved by moving that to MySQL (T212129). The use of ChronologyProtector was going to remain on its own Redis cluster for now, either the cluster formerly known as redis_sessions or a reduced or moved-out version of it.

Since this is not a cache, I would worry about it being mixed in with unrelated applications on their Redis cluster. It is not expected that evictions happen here for anything other than TTL or extraordinary circumstances.

I was planning to resolve this task once the following two are done:

  • wmf-config is updated to send MW CP to redis_session directly (instead of via the MainStash interface, which is moving to CP-incompatible MySQL in T212129).
  • Source code and docs for MW CP are updated to reflect WMF's multi-dc strategy and the outcome of this task, specifically that it does not need replication, that we require POST routing, and that we rely on short-lived sticky router cookies, etc.

Moving to a different Redis cluster can happen in its own task, and myself or PET can help evaluate the requirements for that and schedule it into the quarter. Or, if it satisfies all current requirements, then it can probably also happen directly in wmf-config without our involvement, the same way that redis_sessions IPs were changed in the past without coordination. That's fine.

>> @Krinkle do you think it would be possible to schedule this for Q3? Either move to the 'redis_misc' cluster or, ideally, switch the backend to memcached. As we have discussed, if there is time and if it is possible, switching to memcached will provide better reliability and it supports multi-dc. Thank you!

> This is the first I've heard about possibly using redis_misc. It was my understanding from the Multi-DC meetings we had in past months that the only immediate need was to not have to worry about multi-dc replication of the main stash, which is being solved by moving that to MySQL (T212129). The use of ChronologyProtector was going to remain on its own Redis cluster for now, either the cluster formerly known as redis_sessions or a reduced or moved-out version of it.

Moving to dedicated instances on the redis_misc cluster would be just that: having a couple of dedicated Redis instances per datacenter, with no cross-dc replication.

> Since this is not a cache, I would worry about it being mixed in with unrelated applications on their Redis cluster. It is not expected that evictions happen here for anything other than TTL or extraordinary circumstances.

> I was planning to resolve this task once the following two are done:
>   • wmf-config is updated to send MW CP to redis_session directly (instead of via the MainStash interface, which is moving to CP-incompatible MySQL in T212129).
>   • Source code and docs for MW CP are updated to reflect WMF's multi-dc strategy and the outcome of this task, specifically that it does not need replication, that we require POST routing, and that we rely on short-lived sticky router cookies, etc.
> Moving to a different Redis cluster can happen in its own task, and myself or PET can help evaluate the requirements for that and schedule it into the quarter. Or, if it satisfies all current requirements, then it can probably also happen directly in wmf-config without our involvement, the same way that redis_sessions IPs were changed in the past without coordination. That's fine.

Nothing would be shared with another application, with more space available to CP than we have now, and yes, we can manage that independently once we move CP away from the main stash. We would, however, need to get away from using nutcracker for this, which is an additional complication.

Krinkle renamed this task from "Determine multi-dc strategy for ChronologyProtector" to "Determine and implement multi-dc strategy for ChronologyProtector". Dec 16 2020, 2:13 AM

Change 649765 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] rdbms: Use PSR-3 for ChronologyProtector shutdown debug message

https://gerrit.wikimedia.org/r/649765

Change 649765 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Use PSR-3 for ChronologyProtector shutdown debug message

https://gerrit.wikimedia.org/r/649765

The only things that currently update the replication positions on HTTP GET, and that are not easily spotted entrypoints like rollback/createaccount/login (which can be routed like HTTP POST), are:

  • UpdateHitCountWatcher from AbuseFilter on edit form views (this should use the main stash or at least the job queue); does not need ChronologyProtector
  • CentralAuthUser->lazyImportLocalNames(); this should be solved by the migration script (T150506)
  • SpecialContentTranslation setting global preferences just to store "user has seen X" state; this should use the main stash (or the job queue at least)
  • ShortUrlHooks on page views; this will be fixed by T256993

This is based on logstash:

+channel:DBPerformance +writes -by:"MediaWiki::restInPeace" +http_method:GET

Change 664044 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] rdbms: Remove cross-dc sync support for ChronologyProtector

https://gerrit.wikimedia.org/r/664044

Change 664415 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add $wgChronologyProtectorStash settings and improve $wgMainStash comments

https://gerrit.wikimedia.org/r/664415

> The only things that currently update the replication positions on HTTP GET […] are:
>   • UpdateHitCountWatcher from AbuseFilter on edit form views (this should use the main stash or at least the job queue); does not need ChronologyProtector

Done, in T274455.

>   • CentralAuthUser->lazyImportLocalNames(); this should be solved by the migration script (T150506)

TODO.

>   • SpecialContentTranslation setting global preferences just to store "user has seen X" state; this should use the main stash (or the job queue at least)

Is there a task for this?

>   • ShortUrlHooks on page views; this will be fixed by T256993

TODO. We should ping PET, I guess.

Change 664415 merged by jenkins-bot:
[mediawiki/core@master] Add $wgChronologyProtectorStash and improve $wgMainStash comments

https://gerrit.wikimedia.org/r/664415

Change 664044 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Improve ChronologyProtector documentation

https://gerrit.wikimedia.org/r/664044

Krinkle updated the task description.