Page MenuHomePhabricator

SessionBackend seems to store session changes too often
Open, Needs TriagePublic

Description

SessionBackend has a dirty flag to keep track of session changes, to avoid having to write to the store every time Session::set() is called. Based on some of the investigation in {T390514}, this doesn't seem to work reliably. We should test it and see if there's an easy fix.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
In T390514#10727936, @Tgr wrote:

Looking at the code, I think one issue might be that SessionBackend::persist() always saves, even when there is nothing to persist (the cookie exists and doesn't need to be changed), via renew() -> autosave() -> save(). Renew should probably not autosave when the session isn't near expiry?

T362324: Disable PHPSessionHandler in Wikimedia production might help. At the very least, it would make the control flow for session operations much more understandable.

Eevans added a parent task: Restricted Task.May 2 2025, 9:17 PM

Change #1143960 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] [DNM] Add stack traces to session write logs

https://gerrit.wikimedia.org/r/1143960

On a local CentralAuth setup, session writes involved in visiting the login page but not doing anything else look like this:

  • SUL2: one write
    • triggered by the Session::persist() call on LoginSignupSpecialPage, on a previously unpersisted session
  • SUL3: five writes (P75894)
    • triggered by the Session::persist() call on LoginSignupSpecialPage, on a previously unpersisted session
    • triggered by the WebRequest::getSession() call at the end of AuthManager::beginAuthentication(). I am not fully sure what's going on here.
      • One aspect of it is that the previous write was initiated through SessionManager::getGlobalSession(). Usually that just falls back to calling getSession() on the global context's WebRequest, which would then be cached, so this getSession() wouldn't do anything. But in some code paths it doesn't, and apparently one of those code paths is used.
      • The other hald if that SessionManager::getSessionFromInfo() will be called to create the new session object, that calls delaySave(), and that eventually calls save(). I'm not really sure but think it should be calling autosave() instead.
    • triggered by the Session::persist() call just two lines later. persist() calls renew(), and renew() calls autosave() even if the session isn't near expiry. That seems like a bug.
    • another WebRequest::getSession() call that isn't cached. This one is triggered by obtaining the CSRF token (that was supposed to be fixed by rMW3168669776d3: LoginSignupSpecialPage: Get a login token before persisting the session, but it seems like that didn't work?) at the point where LoginSignupSpecialPage generates $returnToUrl. It is called on the DerivativeRequest that has replaced the normal request as a way of doing a simulated POST on the local login page to skip it and immediately go to the remote login page (see AuthManagerSpecialPage::setRequest() and LoginSignupSpecialPage::getAuthForm()), but DerivativeRequest just returns the session of the base request object, so that doesn't explain why there is no cached Session object on the request.
    • another Session::persist() call on LoginSignupSpecialPage on a previously unpersisted session, now on the central domain.

There is also some weirdness with SessionBackend::checkPHPSession() which results in SessionBackend::save() getting called from inside another SessionBackend::save(). I don't think that changes the outcome, but it's certainly confusing.

Another weirdness is that the session ID for the first write is different from the IDs of the other writes on that domain.

Usually that just falls back to calling getSession() on the global context's WebRequest, which would then be cached

Apparently I misunderstood/misremembered how that works: the session ID is cached on the WebRequest but the Session object itself is not. WebRequest::getSession() calls SessionManager::getSessionById() every time, that calls SessionManager::getSessionFromInfo() every time (assuming the SessionBackend for the given ID is loaded already), and that will do things with the SessionBackend that result in a write (as long as the session is persisted). So calling RequestContext::getMain()->getRequest()->getSession() ten times in a row will result in ten Cassandra writes. Not great.
(In actual testing, this doesn't quite seem to be the case, probably due to the vagaries of garbage collection. But adding a bunch of no-op getSession() calls definitely increases the number of saves, even if the ratio is not 1:1.)

There is a save "loop" where the first SessionBackend::save() that causes cookies to be written calls CentralAuthSessionProvider::persistSession(), that calls SessionBackend::setProviderMetadata(), and that does an (auto)save. Then there's another loop inside that where save() calls checkPHPSession() which through some complex interplay between SessionManager and native PHP session handling calls SessionManager::getSessionFromInfo() again. Fun. Probably not harmful since calling save() twice in a row will be a noop the second time, but very confusing.

I had high hopes of T362324: Disable PHPSessionHandler in Wikimedia production but it doesn't seem to reduce the number of writes, although it perturbs the exact timing and call path somewhat: the first write is from the getSession() call triggered by FilteredRequestTracker::saveState() (the other four are the same). So this seems even more confusing than before (why is one of the persist() calls causing a save, but the other not?). But gets rid of the weird loops of triggering a save from within a save, at least. Probably worth it for the reduced mental complexity, if it doesn't break anything else.

So fundamentally I think the problem is that a some things call save() and rely on it generally being a no-op: WebRequest::getSession() via the delaySave() call in SessionManager::getSessionFromInfo(); and Session::persist() via the renew() call. But save() is only a noop when there is no change; if anything changed the session data, it will do a write.

So e.g. this snippet:

$this->getRequest()->getSession()->persist();
$this->getRequest()->getSession()->set( 'x', 1 );
$this->getRequest()->getSession()->set( 'x', 2 );
$this->getRequest()->getSession()->set( 'x', 3 );
$this->getRequest()->getSession()->set( 'x', 4 );
$this->getRequest()->getSession()->set( 'x', 5 );
$this->getRequest()->getSession()->set( 'x', 6 );
$this->getRequest()->getSession()->set( 'x', 7 );
$this->getRequest()->getSession()->set( 'x', 8 );
$this->getRequest()->getSession()->set( 'x', 9 );

will result in ten writes even though it should only result in two (one when the session goes from unpersisted -> persisted, and one on shutdown because the data is dirty).

Change #1147154 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] Do not save on Session::renew() when there's nothing to renew

https://gerrit.wikimedia.org/r/1147154

Change #1147155 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] Don't save after Session::delaySave() when there's no delayed save

https://gerrit.wikimedia.org/r/1147155

Change #1147155 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@master] Don't save after Session::delaySave() when there's no delayed save

https://gerrit.wikimedia.org/r/1147155

Your finding contradict some of the assumptions we generally hold, which is that a session is read at the start of a request, potentially modified any number of times, and then saved only at the end of the request. E.g. as described in T277834: Two-level session storage and the consistency problem with serialized blob stores. This change moves us back in that direction, which is useful I suppose, but it's interesting to know that this hasn't been the case for a while.

Change #1147155 merged by jenkins-bot:

[mediawiki/core@master] Don't save after Session::delaySave() when there's no delayed save

https://gerrit.wikimedia.org/r/1147155

Change #1147154 merged by jenkins-bot:

[mediawiki/core@master] Do not save on Session::renew() when there's nothing to renew

https://gerrit.wikimedia.org/r/1147154

Change #1149822 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@wmf/1.45.0-wmf.2] Do not save on Session::renew() when there's nothing to renew

https://gerrit.wikimedia.org/r/1149822

Change #1149823 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/core@wmf/1.45.0-wmf.2] Don't save after Session::delaySave() when there's no delayed save

https://gerrit.wikimedia.org/r/1149823

The two merged patches should reduce the number of writes when visiting the login page from 5 to 3. (So that's a 40% decrease of writes for login page visits, which is about half of all writes; so at least 20% in total. Probably more, because the changes aren't specific to the login page, and should reduce writes for all kinds of logged-in workflows too, but I have no idea of the extent of that.) We should backport them to make it easier to measure the effect, and also because SessionBackend is hard to reason about and maybe there will be unexpected bugs / side effects.

The theoretical minimum of writes on a login page visit is 2 (since there are two separate domains involved). Unless there's a very compelling reason I'd not spend time on going 3->2, that seems more risky. (The reason for the extra write is that there is normally one on GET when the login page is shown, the session gets started, and the CSRF token is set up; and another one on POST when the (first step of) the login page is submitted and AuthManager stores a bunch of state in the session. But on the local domain, when using a redirect-based login provider, those two steps are kind of merged together.) We could try to wrap the entire login process in a delaySave() call but I'd rather not do that. T394075: Investigate using different stores for different kinds of sessions might offer more flexible way to improve sessionstore reliability.

Change #1149822 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.2] Do not save on Session::renew() when there's nothing to renew

https://gerrit.wikimedia.org/r/1149822

Change #1149823 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.2] Don't save after Session::delaySave() when there's no delayed save

https://gerrit.wikimedia.org/r/1149823

Mentioned in SAL (#wikimedia-operations) [2025-05-27T08:16:28Z] <hashar@deploy1003> Started scap sync-world: Backport for [[gerrit:1149822|Do not save on Session::renew() when there's nothing to renew (T392251)]], [[gerrit:1149823|Don't save after Session::delaySave() when there's no delayed save (T392251)]]

Mentioned in SAL (#wikimedia-operations) [2025-05-27T08:18:30Z] <hashar@deploy1003> hashar, tgr: Backport for [[gerrit:1149822|Do not save on Session::renew() when there's nothing to renew (T392251)]], [[gerrit:1149823|Don't save after Session::delaySave() when there's no delayed save (T392251)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-05-27T08:27:57Z] <hashar@deploy1003> Finished scap sync-world: Backport for [[gerrit:1149822|Do not save on Session::renew() when there's nothing to renew (T392251)]], [[gerrit:1149823|Don't save after Session::delaySave() when there's no delayed save (T392251)]] (duration: 11m 29s)

The patches were deployed on the 27th around 8AM UTC. There's a ~40% drop at the time, but with the amount of random fluctuation session writes have, it's hard to say for sure if that was related. And the log cap mentioned in T393038#10810826 means I can't manually check the number of writes in Logstash :(

Screenshot from 2025-06-01 18-54-47.png (1×3 px, 393 KB)
Screenshot from 2025-06-01 18-52-27.png (1×3 px, 348 KB)

In the session Logstash dashboard filtered to login pages (without filtering it's not very useful because of the caps), the change can be seen more clearly:

Screenshot Capture - 2025-06-01 - 19-13-57.png (660×1 px, 148 KB)

writes are down from ~250K per hour to ~100K per hour, and the composition is changed ("other", which mostly means writes because of the session data being dirty, is gone). That's a bigger decrease than I would have expected - it seems more like down to 2 writes from 5 rather than to 3. Not sure what's up with that.

We should check manually via WikimediaDebug how many writes now happen per login page view. That is blocked on T394402: Reduce noisy auth logs since the log cap makes WikimediaDebug unusable for noisy channels.

In the session Logstash dashboard filtered to login pages (without filtering it's not very useful because of the caps), the change can be seen more clearly:

Screenshot Capture - 2025-06-01 - 19-13-57.png (660×1 px, 148 KB)

writes are down from ~250K per hour to ~100K per hour, and the composition is changed ("other", which mostly means writes because of the session data being dirty, is gone). That's a bigger decrease than I would have expected - it seems more like down to 2 writes from 5 rather than to 3. Not sure what's up with that.

This graph is storage utilization. I went back to the beginning of the month just to show how stable it has been. There definitely seems to be an increase starting the 27th (in fact the lowest point between the 26th & 28th is 08:00 on the 27th). See: https://grafana.wikimedia.org/goto/2f1_wCfHR?orgId=1.

image.png (672×1 px, 200 KB)

The total number of writes started an upward trend at the same time.

image.png (811×1 px, 151 KB)

The timing is suspect, but I would also have zero problem believing that the timing is a coincidence, and that something unrelated is responsible for the increase... but how do we determine that? Hint: I think we continue to have an observability problem. :(

Storage utilization is in a way an aggregate of metric, so session logic changes can have complex effects on it, but for total write volume, it's hard for a code change to start a trend. It's just going to be a one-time drop (or increase if it works very differently than expected) as the number of writes per pageview changes, so basically post-deployment values on the graph are multiplied by some (hopefully smaller than 1) constant.

I guess a change that goes very wrong could do something like cause occasional session loss so the affected users would feel the need to log in again and that would increase login page traffic and that increase would have a gradual ramp-up (as more active users would notice sooner) but given that ~99% of login page traffic is scrapers, it wouldn't really have much effect.

In any case you can see a big uptick in authmetrics in captcha displays without any change in failed / successful captcha submissions, which is a sign of increased scraper traffic.

Screenshot Capture - 2025-06-03 - 17-30-09.png (616×1 px, 88 KB)
Screenshot Capture - 2025-06-03 - 17-38-25.png (618×1 px, 154 KB)

Login and signup rates are also unchanged, so these are just pageviews:

Screenshot Capture - 2025-06-03 - 17-41-00.png (610×1 px, 142 KB)
Screenshot Capture - 2025-06-03 - 17-41-49.png (610×1 px, 122 KB)

(though we apparently broke the logging of failed signup attempts at some point :/ )

Here's "Special:UserLogin" traffic from turnilo:

Screenshot Capture - 2025-06-03 - 17-59-17.png (1×2 px, 198 KB)

So yeah I'd chalk it up to coincidence.

Filtering for a single user agent seems like a workable way of excluding most scraper traffic. Here is session write logs for latest two Chromes only:

Screenshot Capture - 2025-06-03 - 18-16-28.png (525×1 px, 77 KB)

hard to gauge the extent because of the huge spike (that was T395579, a Logstash infrastructure issue) but there was definitely some decrease.

Hint: I think we continue to have an observability problem. :(

I will add sampling to the logs to make T395899: Some subset of MediaWiki Logstash events are capped to 100/s break them in a less annoying way, but beyond that, I am not sure what else we could do. If we could identify scraper spikes, we wouldn't allow them in the first place.

I thought that the reduced write rate would also be visible on this chart: https://grafana.wikimedia.org/d/000000377/host-overview?var-server=sessionstore1004&viewPanel=panel-12&from=now-14d&to=now (from the parent task, T390514)

image.png (848×1 px, 72 KB)

I expected the slope of the chart to decrease. But it doesn't, and if anything it seems to have increased. Is this explained by the extra traffic, or some config change? Or was that a wrong assumption to make?

The rise in authentication page pageviews is maybe related to T395934: NEW BUG REPORT: Investigate rise in May 2025 Reader metrics? The two turnilo graphs match in timing, at least.

In any case, whatever drove login traffic seems to be subsiding (although not quite back to pre-June-25th levels yet):

Screenshot Capture - 2025-06-10 - 15-48-39.png (1×2 px, 253 KB)
Screenshot Capture - 2025-06-10 - 15-46-42.png (616×1 px, 110 KB)
Screenshot Capture - 2025-06-10 - 15-57-08.png (448×1 px, 76 KB)

Sessionstore disk space utilization recovered somewhat, but only somewhat:

Screenshot Capture - 2025-06-10 - 16-23-29.png (468×1 px, 276 KB)

One unsettling thing is that there was a more-or-less corresponding spike in central session write rates, but central sessions are only used for logged-in traffic:

Screenshot Capture - 2025-06-10 - 15-46-57.png (684×1 px, 155 KB)

Maybe another coincidence, but it's weird.

In any case, not much left to do here. I'll manually verify the number of writes per visit once T395967: Allow sampling of Logstash events is fully deployed (that will fix our observability problems too, to the extent they are fixable), then close this task. T394075: Investigate using different stores for different kinds of sessions will hopefully give us more flexibility in how we respond to increases in certain kinds of session events.

Change #1143960 abandoned by Gergő Tisza:

[mediawiki/core@master] [DNM] Add stack traces to session write logs

https://gerrit.wikimedia.org/r/1143960