Finish session storage to actually meet multi-DC requirements
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Krinkle
	Dec 16 2020, 3:14 AM

Description

Left to do:

Decide, finish, and document how to perform session writes.
- MediaWiki\Session\SessionBackend passes WRITE_SYNC with a comment " write to all datacenters". However this is at odds with RESTBagOStuff and RFC: Session Storage API which state that only a local quorum is expected. We instead rely on "sticky DC" cookies to pin a userthe few seconds around session replication, and thus don't need that to be immediate in the same web request. Either this parameter needs to be removed, or otherwise the other two parts reconciled with a change in direction.
Decide, finish, and document how to perform session deletions.
- RESTBagOStuff::doDelete() states "TODO: respect WRITE_SYNC (e.g. EACH_QUORUM)". Either do this, or decide otherwise. See RFC: Session Storage API in which it states that Kask /delete is synchronous across DCs.
- MediaWiki\Session\SessionBackend uses store->delete() in two places: "unpresist" and "resetId". Neither of these actually pass WRITE_SYNC currently, which is at odds with the previous point. I suspect that we do need sync deletes, but that we expected Kask to handle this transparently and thus didn't bother having MW specify this in its call. Parameter must be added then, or otherwise reconciled with the decision of the previous point.
- If deletes require waited-for cross-dc replication, document this in the SessionManager class doc as a requirement of the storage backend for future reference. Both for developers to discover this, as well as for sysadmins (e.g. third parties) to know what and how to configure their own session backend.
RESTBagOStuff::doAdd and ::doIncr states "TODO: respect WRITE_SYNC (e.g. EACH_QUORUM)". Either do, or explain why not and remove.
- Update SessionManager class doc as-needed.
Resolve T206016: Create a service for session storage which currently has one unresolved subtask:
- T222990: Audit session storage to determine max age of un-GC'd sessions

Details

Subject	Repo	Branch	Lines +/-
objectcache: Remove unused WRITE_SYNC flag	mediawiki/core	master	+29 -92
session: Remove unrealized claims about WRITE_SYNC requirements	mediawiki/core	master	+63 -24
Remove redundant use of WRITE_SYNC	mediawiki/extensions/CentralAuth	master	+3 -4
SessionManager: Document expectations for storage backend	mediawiki/core	master	+34 -2

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	Krinkle	T270223 FY2021-2022: Enable basic Multi-DC operations for read traffic (tracking)
Resolved	Krinkle	T270225 Finish session storage to actually meet multi-DC requirements
Resolved	Eevans	T222990 Audit session storage to determine max age of un-GC'd sessions

Event Timeline

Krinkle created this task.Dec 16 2020, 3:14 AM

Change 649766 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] SessionManager: Document expectations for storage backend

https://gerrit.wikimedia.org/r/649766

gerritbot added a project: Patch-For-Review.Dec 16 2020, 3:25 AM

Krinkle updated the task description. (Show Details)Dec 17 2020, 2:02 AM

Krinkle added subscribers: BPirkle, WDoranWMF.

Krinkle added a subscriber: • Pchelolo.

Krinkle added a subtask: T222990: Audit session storage to determine max age of un-GC'd sessions.Dec 17 2020, 2:04 AM

• Clarakosi triaged this task as Medium priority.Dec 22 2020, 9:12 PM

• Clarakosi edited projects, added Platform Team Workboards (External Code Reviews); removed Platform Engineering.

RhinosF1 subscribed.Jan 25 2021, 8:27 PM

Blocking task for Multidc

I'm not sure I properly parse/grok all of this but, I can confirm that Kask does not directly expose Cassandra consistency to callers; The assumption is that all reads, and all writes, will utilize a quorum of nodes in the local datacenter. Writes are still replicated to the remote datacenter(s), but asynchronously, so all the usual caveats apply. Deletes utilize a quorum of nodes from each datacenter, so a successful operation does guarantee that subsequent reads will fail.

I'm opposed to altering these semantics, and AFAICS, that makes implementing the complete BagOStuff contract intractable. Not sure what the best course of action would be in that case.

Ultimately, I think this is a case of over generalization, all the way down. We started from the assumption that what we were building for sessions was close-enough to an opaque key/value store that we would generalize it as such, and that we would likewise generalize it on the MediaWiki side by utilizing RESTBagOStuff (instead of - for example - a KaskBagOStuff). And of course, BagOStuff itself is a pathological worst-case scenario for over generalization. While it is not exactly sessions-specific per say, we did ultimately stop short of making Kask another everything-and-the-kitchen-sink store (which I think was the right move), but that creates some impedance mismatch with BagOStuff.

Perhaps we should decide whether we want/need a RESTBagOStuff versus a KaskBagOStuff (where "KaskBagOStuff" here could just mean that we treat RESTBagOStuff as if it was Kask-specific). If it's the former, then someone should think through what a REST interface would look like that supports BagOStuff. If it's the latter, we can treat what we have as special-cased and just adjust the documentation accordingly.

Krinkle updated the task description. (Show Details)Feb 6 2021, 6:10 AM

@Eevans

RE: writes and deletes:
Going with dc-local-consensus for writes and wait-for-crossdc-consensus for deletes makes sense to me. I've updated the task description with what I'd expect to happen in that case, which is for the session code in MW to not claim/ask the things that we've decided not to to, eg. remove/unset those options and update or remove the inline docs that claim or state needs to the contrary.

The code I'm referring to is not in BagOStuff. BagOStuff has various modes and flags for guruantees, some of which we won't be able to meet at WMF with Kask, but that's fine. The issue I have is with the consumption of those flags specifically in code for Kask and for SessionManager. There is recently-written code in RestBagOStuff, and code in SessionManager, which state needs that we don't currently meet.

RE: BagOStuff.

I think it's fine to keep RestBagOStuff as-is. If you think there's significant potential for confusion or misuse, I suppose we can document in its class comment that some of the advanced BagOStuff flags that could be passed to its methods are not currently supported. I'm not sure how useful that would be since this could be said of all BagOStuff implementations. The issue right now I think is that SessionManager is currently actively passing some of those unsupported options, which doesn't make sense to me. If we don't want MW to rely on those promises and we don't plan to meet them at WMF, we shouldn't set them in our own code.

To be clear, I am referring minimum requirements (not the implementation of the default backend or WMF's backend. For example a simple stock install with single server MySQL as backend will have much higher guruantees etc but that overdelivered simplicy is not a promise and not what we'd document as interface behaviour).

Developers need to know what they can rely on when writing core code and extensions.

Site admins need to know what kind of backend they can plug into session storage. I've submitted a draft above for what such documentation might look like. Feel free to use that as starting point for you/your team to document what those requirements and promises are.

RE: add/incr

See task description. I don't know the right answer here, but either we need to meet the needs, or document them as intended short-comings that we believe we don't need for use via SessionStorage, and thus intentionally don't promise to core and extension code interacting with SessionStorage. Whether a third party implemetnation does or doesn't meet those needs isn't relevant. What matters to me is what we guarantee from an interface perspective, which is currently unclear to me from reading the code.

Krinkle mentioned this in T267270: Determine multi-dc strategy for CentralAuth.Feb 6 2021, 6:50 AM

WDoranWMF edited projects, added Platform Engineering Roadmap; removed Platform Engineering Roadmap Decision Making.Feb 11 2021, 6:19 PM

WDoranWMF moved this task from Later (future inbox) to Next on the Platform Engineering Roadmap board.

WDoranWMF set the point value for this task to 1.Mar 1 2021, 7:01 PM

We instead rely on "sticky DC" cookies to pin a user the few seconds around session replication

As I wrote on T267270, a UseDC cookie is sent on login by accident, due to database writes which occur with our particular configuration. This feature is linked to ChronologyProtector. SessionManager has no awareness of sticky DC cookies and is not sending them.

Since session storage is imagined as being multi-master, really you want to stick to the DC where you wrote the data, you don't want to stick to the primary DC. ChronologyProtector wants to stick the client to the primary DC because it has the freshest data, but that will conflict with the session storage. It will be necessary to guess what URLs will cause session writes and send them all to the primary DC in advance.

Imagine if we did an automatic login on a random page view. A local session gets written to whatever DC served the read, and core or CentralAuth will simultaneously do a cross-DC connection to write to the MySQL master. So then neither DC will be able to serve the subsequent request correctly.

Honestly, it would be simpler to just put sessions in MySQL.

Decide, finish, and document how to perform session deletions.

After reading T206010, T206015 and the comments above, I don't understand why session deletions are different to session writes. Is it just a requirement imported from WANObjectCache? In WANObjectCache, the recommended API is getWithSetCallback(), and you call delete() when the cache is invalidated by database writes. It's necessary for delete() to be synchronous so that getWithSetCallback() doesn't write stale data back to the cache with a new expiry. I don't see any similar thing in sessions.

Looking at the git blame, I think these quoted WRITE_SYNC comments were added by @aaron as part of his WANObjectCache work, and are not related to the requirements of session storage.

In T270225#6927436, @tstarling wrote:

[…] Imagine if we did an automatic login on a random page view. A local session gets written to whatever DC served the read, and core or CentralAuth will simultaneously do a cross-DC connection to write to the MySQL master. So then neither DC will be able to serve the subsequent request correctly.

The below is based on multi-dc strategy meeting notes (restricted) between SRE/PET/Perf, from a few months ago.

For the login request, which starts at a local wiki and is a local POST request, I believe the thinking was that the GET redirect to login.wikimedia.org works fine because we would pin that domain in its entirety to the primary DC ( (T91820; based on the assumption requests to this domain are not used in the critical path except for request that we'd want to route to the primary DC regardless). At the end of that redirect chain, it would end up on the local wiki. For db writes that would initially seem problematic as being a GET request, but the thinking was that the sticky DC request would last long enough to cover this chain, and so given it starts from a POST, we're good.

For background autologin from page views (not via login form), we uses a redirect chain of Ajax requests, which similarly go through login.wikimedia.org. That chain, however, also ends up at the local wiki domain touch, via a GET request, and I'm not exactly sure what we thought about that. If I had to improvise now, I'd say we could have the centralauth JS code set a shortlived sticky DC cookie proactively on the local domain at the start of the chain. However, this would be a new idea and previously it didn't feel like our strategy was incomplete, so I perhaps @aaron thought of another way already and I've just forgotten it?

Honestly, it would be simpler to just put sessions in MySQL.

I don't disagree.

Decide, finish, and document how to perform session deletions.

After reading T206010, T206015 and the comments above, I don't understand why session deletions are different to session writes.

Me neither. I vaguely recall hearing in a multi-dc strategy meeting that session writes mostly or only happen during POST web requests, which made us not look too closely at the fact that the new Session store does not offer synchronous writes cross-dc.

But, the MediaWiki code has indeed expected for many years to perform SYNC_WRITES, and afaik that is what @aaron preferred we keep. But when the Session storage RFC (T206010) was writtne, it proposed to reduce this to local-dc quorum with eventual consistency, except for deletes. I'm not sure why, and I'm unable to find a discussion about this. I suppose @Eevans might know it this was a trade-off and intentional change.

Is it just a requirement imported from WANObjectCache? In WANObjectCache, the recommended API is getWithSetCallback(), and you call delete() when the cache is invalidated by database writes. It's necessary for delete() to be synchronous so that getWithSetCallback() doesn't write stale data back to the cache with a new expiry.

I'm not aware of WAN and sessions being related, other than that both are major pieces of MW that underwent refactoring in 2015 in prep for Multi DC.

SessionBackend.php save()

$flags |= CachedBagOStuff::WRITE_SYNC; // write to all datacenters
$this->store->set( … );

Afaik the reason for WRITES_SYNC for sessions is that otherwise we have no mechanism indeed for subsequent reads to see the same data from another DC, and the latency was presumably thought to be acceptable if they generally only happen on POST requests or post-send during request shutdown. I believe session syncing originally happend always post-send, but this was changed to satisy rapid automation for T214471. That does bring the latency pre-send, but of course still limited to requests where the data changed.

Do we have common cases where session data is written GET requests that under T91820 we would currently route to the secondary DC?

For the login request [...]

Special:CentralLogin and Special:CentralAutoLogin I wrote about at T267270, and I think they can both be made to work as I outlined there. The pathological case, not discussed there, is automatic creation of a local user from a central session, i.e. AuthManager::autoCreateUser(). It can be called from Setup.php on any request. This case was discussed in the earliest design stages of the multi-DC proposal, from the point of view of DB writes, and the decision was to allow rare cross-DC write requests. ChronologyProtector will get the real master position and will wait for it on a subsequent request to the same DC.

The Cassandra equivalent of a cross-DC write to the master is EACH_QUORUM.

But, the MediaWiki code has indeed expected for many years to perform SYNC_WRITES, and afaik that is what @aaron preferred we keep. But when the Session storage RFC (T206010) was writtne, it proposed to reduce this to local-dc quorum with eventual consistency, except for deletes. I'm not sure why, and I'm unable to find a discussion about this. I suppose @Eevans might know it this was a trade-off and intentional change.

Aside from the lack of rationale, there is also the problem that it won't work.

Do we have common cases where session data is written GET requests that under T91820 we would currently route to the secondary DC?

Yes, CSRF tokens.

Here's a review of Session::set() callers that I started working on before you wrote your comment:

CentralAuthSession::setCentralSession(): autologin or explicit login/logout
CheckUser\TokenManager::getSessionKey(): Implements paging by writing the current query offset in one request and reading it in the next request. Maybe not quite as pathological as it looks since maybe it uses POST requests and does database writes each time. See T239680.
ApiAuthManagerHelper::formatAuthenticationResponse(): Temporary cross-request serialization of AuthenticationRequest objects, can use GET. The accidental DB writes that help us out with normal login probably don't happen here.
AuthManager::autoCreateUser(): This is the classic auto-login on page view that I was talking about above. It writes to the database and the session simultaneously, with no detectable pattern in the URL.
AuthManager::setSessionDataForUser(): Normal login
DefaultPreferencesFactory::submitForm(), SpecialPreferences: A temporary UI token set on POST and read on the next request to show the result. It will work as long as a DB write is done.
Session::getToken(): Most CSRF tokens use this. For example, a token may be set on action=edit (a GET request with no DB writes) and read on the subsequent POST request a few seconds later.
SpecialUserrights: A session-based success message on the same model as Special:Preferences
User::loadFromSession(), SessionManager::logPotentialSessionLeakage(): probably tolerant

There are also some callers of WebRequest::setSessionData(), but we have above a good sampling of the toughest cases for consistency purposes.

In T270225#6927520, @tstarling wrote:

The Cassandra equivalent of a cross-DC write to the master is EACH_QUORUM.

But, the MediaWiki code has indeed expected for many years to perform SYNC_WRITES, and afaik that is what @aaron preferred we keep. But when the Session storage RFC (T206010) was writtne, it proposed to reduce this to local-dc quorum with eventual consistency, except for deletes. I'm not sure why, and I'm unable to find a discussion about this. […]

Aside from the lack of rationale, there is also the problem that it won't work. […]

"respect EACH_QUORUM" is one of the todo comments in RESTBagOStuff that I flagged. I wasn't 100% sure it was the same thing, but okay, we're on the same page now.

Request flow A

POST writes session
GET reads session

Example: Preferences
Cassandra analysis: If a UseDC cookie is sent then we don't need EACH_QUORUM, because the subsequent GET request will be stuck to the same DC as the session write. In that case we only need LOCAL_QUORUM. But it's not guaranteed that such a cookie will be sent.
MySQL analysis: ChronologyProtector knows what to do if sessions are written via LBFactory.

Request flow B

GET writes session
POST reads session

Example: CSRF token
Cassandra analysis: If the DC used for the GET is the primary DC, we don't need EACH_QUORUM. We can predict that the next request will go to the same DC.
MySQL analysis: Cross-DC write to master.
Ideally, the session write would be queued locally, and the POST request would wait for it to arrive, like ChronologyProtector but with secondary->primary data flow, to give us a better chance of hiding the latency from the user. The inter-request time is substantial in these use cases.

Request flow C

GET writes session
GET reads session

Example: auto-create
Cassandra analysis: as long as both requests go to the same local DC, we don't need EACH_QUORUM. But we would have to be very sure no POST request will be done, or have some way to wait if a POST request is done.
MySQL analysis: High latency is incurred. The first request does a cross-DC write, and the second request to waits for it to be replicated to the local DC.

Request flow D

POST writes session
POST reads session

Example: this doesn't really happen, but the effect of both requests being forced to go to the primary DC is what I'm proposing for Special:CentralLogin on T267270. I'm just proposing a URL match instead of a method match.
Cassandra analysis: LOCAL_QUORUM is sufficient on the face of it, but if the session needs to be read by subsequent GET requests without a UseDC cookie, that implies waiting.
MySQL analysis: Essentially the legacy situation as in RF(A).

More complex flows

Login has a CSRF token and so starts as RF(B). Normal core login is like RF(A). Then the login.wikimedia.org handshake is like RF(D).

tstarling mentioned this in T277834: Two-level session storage and the consistency problem with serialized blob stores.Mar 19 2021, 4:45 AM

In T270225#6927504, @Krinkle wrote:

In T270225#6927436, @tstarling wrote:

[ ... ]
After reading T206010, T206015 and the comments above, I don't understand why session deletions are different to session writes.

Me neither. I vaguely recall hearing in a multi-dc strategy meeting that session writes mostly or only happen during POST web requests, which made us not look too closely at the fact that the new Session store does not offer synchronous writes cross-dc.

But, the MediaWiki code has indeed expected for many years to perform SYNC_WRITES, and afaik that is what @aaron preferred we keep. But when the Session storage RFC (T206010) was writtne, it proposed to reduce this to local-dc quorum with eventual consistency, except for deletes. I'm not sure why, and I'm unable to find a discussion about this. I suppose @Eevans might know it this was a trade-off and intentional change.

This was all a long time ago, and what follows is from memory; I haven't (yet) gone back to look at the full history of what was discussed (which presumably spans tickets, wiki pages, & meeting notes), but...

The semantics as implemented were based on the idea that anything that wrote state was the result of an HTTP request with a corresponding verb (POST, PUT, DELETE, ...), whereas anything that read state was the result of a verb corresponding to reads (GET, HEAD, ...). I don't know if this was meant to explain how things actually were at the time, or if it was what we were working toward. What I do recall was that SRE (traffic?) was adamant that routing any more complex than this would quickly become intractable.

As reads and writes (and deletes) would be blocking, they would be restricted to a quorum of local DC nodes to minimize latency. Deletes were to be each quorum (thus incurring intra-DC latency) because it was deemed possible that a racing request (or networking partition, etc) could be exploited to use session data from one DC that had been invalidated in the other (and something like a network partition that resulted in a failure to reach quorum on delete, could at least be handled to inform the user their logout may not have succeeded). Again, I don't know how much of this was implied to be a description of how things worked, versus how they were meant to.

In T270225#6929751, @Eevans wrote:

The semantics as implemented were based on the idea that anything that wrote state was the result of an HTTP request with a corresponding verb (POST, PUT, DELETE, ...), whereas anything that read state was the result of a verb corresponding to reads (GET, HEAD, ...). I don't know if this was meant to explain how things actually were at the time, or if it was what we were working toward.

I remember this being discussed in the context of database writes, and it is a reasonable first approximation in that case, but there is no clear relationship between HTTP method and session writes as I have documented above.

What I do recall was that SRE (traffic?) was adamant that routing any more complex than this would quickly become intractable.

Yes, I remember that. The contention is that we should rewrite MediaWiki and the way it interacts with browsers, because five extra lines of Lua code is intolerable complexity, despite the proposed lines being closely analogous to existing special cases. If that's still the policy, I'd be happy to declare this project blocked.