Page MenuHomePhabricator

NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems
Closed, ResolvedPublicPRODUCTION ERROR

Description

I was browsing through channel:session logs looking for a cause of T372702. I didn't find that, but I found this instead.

We have ~5000 warnings logged per day, starting on 2024-08-14, that look like this:
Session "[100]MediaWiki\Extension\NetworkSession\NetworkSessionProvider<+:0:CirrusSearch Streaming Updater>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX": the session store entry is for an anonymous user, but the session metadata indicates a non-anonynmous user

https://logstash.wikimedia.org/goto/4c457e78f5b591a9d45d3999b8a01731

image.png (514×2 px, 83 KB)

Either the NetworkSession extension needs to tweak its session not to trigger this warning, or MediaWiki needs to not complain about if it it's expected. I don't know which is right.

After fixing this, someone should also fix the typo in "anonynmous".

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I was also looking for a cause of T372702 in Grafana dashboards, looked at the one for Sessionstore (Kask): https://grafana.wikimedia.org/d/000001590/sessionstore and I noticed that traffic has greatly increased, and most of the metrics like latency etc. have worsened, at the same time on 2024-08-14:

eqiad
image.png (440×3 px, 134 KB)
codfw
image.png (440×3 px, 180 KB)

I looked in the Server Admin Log around that time, and there were cirrus-streaming-updater deployments around 22:00 on 2024-08-14: https://wikitech.wikimedia.org/wiki/Server_Admin_Log/Archive_84#2024-08-14

It could be a coincidence, but it looks suspicious.

matmarex renamed this task from Log spam from NetworkSessionProvider / CirrusSearch Streaming Updater: "session store entry is for an anonymous user, but the session metadata indicates a non-anonynmous user" to NetworkSessionProvider / CirrusSearch Streaming Updater causing 'session' log spam and possibly Sessionstore (Kask) problems.Sep 2 2024, 11:45 PM

Change #1070344 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] NetworkSession: Only enable for private wikis

https://gerrit.wikimedia.org/r/1070344

Change #1070344 merged by jenkins-bot:

[operations/mediawiki-config@master] NetworkSession: Only enable for private wikis

https://gerrit.wikimedia.org/r/1070344

Mentioned in SAL (#wikimedia-operations) [2024-09-04T20:46:01Z] <dancy@deploy1003> Started scap sync-world: Backport for [[gerrit:1070344|NetworkSession: Only enable for private wikis (T373826)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-04T20:48:02Z] <dancy@deploy1003> ebernhardson, dancy: Backport for [[gerrit:1070344|NetworkSession: Only enable for private wikis (T373826)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-04T20:52:36Z] <dancy@deploy1003> Finished scap sync-world: Backport for [[gerrit:1070344|NetworkSession: Only enable for private wikis (T373826)]] (duration: 06m 34s)

That change has completely resolved the log spam:
(this is more than I expected… maybe it really was all about AbuseFilters and T373778 is not just related but the exact same problem, since private wikis don't usually have filters)
https://logstash.wikimedia.org/goto/4c457e78f5b591a9d45d3999b8a01731

image.png (535×2 px, 115 KB)

…and significantly improved the Kask metrics, although it's hard to say just yet if they're completely back to normal:
https://grafana.wikimedia.org/d/000001590/sessionstore

image.png (431×3 px, 148 KB)

the session store entry is for an anonymous user, but the session metadata indicates a non-anonynmous user means that the session provider returned a SessionInfo indicating this is a logged-in user, but the data in the session store (Kask) is anonymous. The NetworkSession provider will always return a non-anonymous SessionInfo as long as the header is there, so somehow an anonymous session is saved, and then since for a given Authorization header value NetworkSession will always generate the same session ID, this warning results.

NetworkSession returns persist => false which should prevent SessionBackend from writing to the session store, but it still writes to the in-process cache for the session store (cf. the $flags = $this->persist ? 0 : CachedBagOStuff::WRITE_CACHE_ONLY line in SessionBackend), so it's still possible to get a conflict within the same request. (This matches the Kask stats where disabling had a big effect on reads but no effect on writes.) I guess the session save is initiated by SessionManager::getSessionFromInfo(), which would be invoked whenever something checks whether who the request user is? And at that point the SessionInfo object is not the one returned by NetworkSession but an anonymous SessionInfo coming from a failed getSessionById()? (The only code path that can result in an anonymous SessionInfo with a conflicting ID, I think?) But I'm not sure where or why that would get invoked.

matmarex claimed this task.

From my perspective this looks fixed, the log spam is gone and the Kask metrics are back to what they were. It looks like it hasn't helped with T372702: editors are repeatedly getting logged out (August 2024), but at least we know that there must be a different cause now.

Some follow-up work is happening on T373778: NetworkSession and AbuseFilter may be spammy, so I'll close this one and keep that one open.

Thanks for the quick fix @EBernhardson!

I wouldn't really consider this fixed, all we've done is deploy this to less places to reduce the frequency of the errors. Ideally we would like to get this turned back on everywhere. The problem is that with NetworkSession being used for such a small fraction of our requests (update rate on private wikis is very low) it will be easy to not notice that something went wrong with NetworkSession and that updates are not properly flowing to the private wikis. Additionally we would like to, at some point in the future, transition the indexing of archived titles into our external updater process as well. That will require additional read access rights beyond what the anonymous user has.

Ideally I'd like to see us resolve the root cause identified by tgr, that sessions are being persisted even though we set persist => false.

matmarex removed matmarex as the assignee of this task.

If you want to work on that, that's fine by me.

T373778: NetworkSession and AbuseFilter may be spammy fixes the problem AbuseFilter causes for NetworkSession (that it prevents an authenticated session). It also fixes the logspam as I don't think anything is logged when the specified user account is available. The mismatch between the stored and current session was caused because the autocreation attempt changes the session to anonymous when it fails, so "get SessionInfo -> autocreate -> save session" (in Setup.php) resulted in an anonymous session getting stored, and the next time something tried to get a session the SessionInfo, it reuslted in a mismatch.

I don't think the AbuseFilter fix has anything to do with Kask load, the number of session reads/writes won't be any different when autocreation succeeds or the user already exists. So to the extent the increased load is a problem (not sure it is - just by eyeballing the charts, it was a 30% increase in reads, no change in writes, and no visible effect on error rates or latencies), we'd have to change something in the session handling logic or in Kask.

Ideally I'd like to see us resolve the root cause identified by tgr, that sessions are being persisted even though we set persist => false.

Given that there was an increase in reads but not writes, I'm not sure they are actually persisted. (Granted it would only happen on a few wikis with anomalous abuse filters, so maybe that's rare enough to not be visible on the charts. But then if that's the case, it's also not problematic.)

I would at this point just reenable and see if that causes any problems. Maybe ask SREs first.

But also, the change in load was around 600 requests per second. That can't be the expected behavior of cirrus-streaming-updater, right?

Given that there was an increase in reads but not writes, I'm not sure they are actually persisted. (Granted it would only happen on a few wikis with anomalous abuse filters, so maybe that's rare enough to not be visible on the charts. But then if that's the case, it's also not problematic.)

I would at this point just reenable and see if that causes any problems. Maybe ask SREs first.

I guess i didn't realize those errors were only coming from the abuse filter bits, for some reason i was thinking it was a more general problem. We can roll it back out at monitor the dashboards to verify.

But also, the change in load was around 600 requests per second. That can't be the expected behavior of cirrus-streaming-updater, right?

That is about right, there are three clusters being updated by three separate updaters. Each updater is running on average around 200 updates/sec, most of those come from rerenders. The breakdown is something like:

mediawiki based updates: ~150/s
old doc rerender (specifically the <cluster>.old lines, rolling 16 week rerender of all pages): 40-50/s

three updaters at 200/s gets 600 requests/s. The requests land in different datacenters though, 400 to eqiad and 200 in codfw. We had pondered combining the ones in eqiad, the problem is that cloudelastic can have slowdowns due to having a fraction of the hardware (6 hosts vs 50) and we don't want to allow problems in cloudelastic to cause production problems.

Restricted Application changed the subtype of this task from "Task" to "Production Error". · View Herald TranscriptAug 26 2025, 8:31 AM

Change #1182088 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] Revert "NetworkSession: Only enable for private wikis"

https://gerrit.wikimedia.org/r/1182088

I have a use-case for which I would need this extension enabled everywhere, @Tgr @Eevans would you have objections trying to enable it again to see if the impact on kask is acceptable?

The AbuseFilter issue should have been fixed by rEABF65c10f5fa06f: Skip auth checks when autocreate is allowed by provider and we now have a session writes dashboard with reliable stats, so IMO good to go.

Change #1182088 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "NetworkSession: Only enable for private wikis"

https://gerrit.wikimedia.org/r/1182088

Mentioned in SAL (#wikimedia-operations) [2025-08-26T13:42:46Z] <dcausse@deploy1003> Started scap sync-world: Backport for [[gerrit:1182088|Revert "NetworkSession: Only enable for private wikis" (T373826)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-26T13:48:42Z] <dcausse@deploy1003> dcausse: Backport for [[gerrit:1182088|Revert "NetworkSession: Only enable for private wikis" (T373826)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-26T13:57:02Z] <dcausse@deploy1003> Finished scap sync-world: Backport for [[gerrit:1182088|Revert "NetworkSession: Only enable for private wikis" (T373826)]] (duration: 13m 49s)

After deploying: the session write dashboard remains almost empty, sessionstore rps increased but unless I'm not reading the dashboard properly latencies appear to be fine, only the buckets "<1m" and "1-2.5ms" increased.

Unfortunately the the session store entry is for an anonymous user, but the session metadata indicates a non-anonymous user are back, it's not from all the wikis, e.g. I don't see such warnings from enwiki.

After deploying: the session write dashboard remains almost empty, sessionstore rps increased but unless I'm not reading the dashboard properly latencies appear to be fine, only the buckets "<1m" and "1-2.5ms" increased.

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

After deploying: the session write dashboard remains almost empty, sessionstore rps increased but unless I'm not reading the dashboard properly latencies appear to be fine, only the buckets "<1m" and "1-2.5ms" increased.

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

The user using the NetworkSession auth mechanism can make a lot of requests, see T373826#10141552, I'm not clear if there's a need to actually check the session store since they should not be persisted. I can revert this again if this is not acceptable.

After deploying: the session write dashboard remains almost empty, sessionstore rps increased but unless I'm not reading the dashboard properly latencies appear to be fine, only the buckets "<1m" and "1-2.5ms" increased.

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

The user using the NetworkSession auth mechanism can make a lot of requests, see T373826#10141552,

Out of curiosity, how were these requests being satisfied before NetworkSession was re-enabled?

I'm not clear if there's a need to actually check the session store since they should not be persisted.

There was a small(ish) bump in 404s after the deploy, which seems to have settled down since (relative to the new baseline). So I'm guessing they are being persisted, and are being looked up. If that should not be happening, do we know why it is?

I can revert this again if this is not acceptable.

I'm kind of torn here. One the one hand, the increase isn't causing an issue of any kind; We're well within our capacity. On the other, it's already the case that we do a lot of superfluous requests. We have to provide resources for those, while maintaining sufficient headspace beyond that to deal with the unexpected (read: there is a cost to be paid). So while I definitely don't want to be in the business of optimizing every single request, if you're saying that this 33% increase shouldn't even be hitting sessionstore, that seems worth flagging.

Out of curiosity, how were these requests being satisfied before NetworkSession was re-enabled?

The API they hit is public and the authorization header was simply ignored apparently when this extension was disabled.
In T401220 I found the need to identify such requests and sadly came up with an approach that assumed that this extension was enabled everywhere.
In the future we may want to use this extension to access non-public information on public wikis (indexing archived page to power search on Special:Undelete) and for this we will need this NetworkSession to allow non-MW internal components to access some restricted APIs.

I'm not clear if there's a need to actually check the session store since they should not be persisted.

There was a small(ish) bump in 404s after the deploy, which seems to have settled down since (relative to the new baseline). So I'm guessing they are being persisted, and are being looked up. If that should not be happening, do we know why it is?

Yes I think you are right so something is not working as expected yes, this extension declares itself as persistsSessionId() -> false. So something is not honoring this.
I can see a couple log lines like: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2025.08.26?id=TEUJ55gBAlpnrixJf0U6

I can revert this again if this is not acceptable.

I'm kind of torn here. One the one hand, the increase isn't causing an issue of any kind; We're well within our capacity. On the other, it's already the case that we do a lot of superfluous requests. We have to provide resources for those, while maintaining sufficient headspace beyond that to deal with the unexpected (read: there is a cost to be paid). So while I definitely don't want to be in the business of optimizing every single request, if you're saying that this 33% increase shouldn't even be hitting sessionstore, that seems worth flagging.

I agree with you, something is definitely not working as we expect, @Tgr do you have an idea on what could cause the sessions to still be persisted? And additionally if there's a need to actually check the store for a session that's constructed by a provider that is not supposed to store sessions?

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

Yeah, it happened the last time the updater was enabled, and nothing changed since then. The AbuseFilter issue we fixed was causing bogus writes, and those are mostly gone (we still have a few thousand, not sure why, but that's a drop in the bucket); it didn't affect reads.

SessionManager does a session store read at the beginning of every authenticated session, these are authenticated sessions; the same would happen if CirrusSearch did API requests in any other way, using OAuth or cookies or whatever. There isn't any way for a session provider to indicate that it doesn't need session store lookups (conceptually, session providers are responsible for session tokens in the web request and response, not for communication with the store; that is uniform across all session types).

It's also strictly speaking not true that the session store is never used meaningfully; there are some very fringe situations in which we store data there and read it back when using NetworkSession. User autocreation errors are cached in the session to avoid repeating potentially expensive user creation attempts on every request.

If we wanted to avoid reads (which, per above, isn't necessarily a good idea), once multi-session backends land, a hacky but simple solution would be to direct all session store reads/writes with a NetworkSession provider to a fake session store (HashBagOStuff). Or maybe a real but very cheap, non-replicated cache like APC.

Specifically, it looks like GET rate has increased by ~33%. Is this...expected?

Yeah, it happened the last time the updater was enabled, and nothing changed since then. The AbuseFilter issue we fixed was causing bogus writes, and those are mostly gone (we still have a few thousand, not sure why, but that's a drop in the bucket); it didn't affect reads.

SessionManager does a session store read at the beginning of every authenticated session, these are authenticated sessions; the same would happen if CirrusSearch did API requests in any other way, using OAuth or cookies or whatever. There isn't any way for a session provider to indicate that it doesn't need session store lookups (conceptually, session providers are responsible for session tokens in the web request and response, not for communication with the store; that is uniform across all session types).

It's also strictly speaking not true that the session store is never used meaningfully; there are some very fringe situations in which we store data there and read it back when using NetworkSession. User autocreation errors are cached in the session to avoid repeating potentially expensive user creation attempts on every request.

If we wanted to avoid reads (which, per above, isn't necessarily a good idea), once multi-session backends land, a hacky but simple solution would be to direct all session store reads/writes with a NetworkSession provider to a fake session store (HashBagOStuff). Or maybe a real but very cheap, non-replicated cache like APC.

Thanks for this context, it helps!

Wrt the various flags, persistsSessionId() means that the session provider is able to update the client on what session ID to use, so that's irrelvant here. persist => false isn't very well defined for immutable session providers; conceptually, you could read it as "the session hasn't been stored anywhere", except, how would the session provider tell whether the session is persisted or not? Immutable providers like NetworkSession can't access any stateful information on their own and have to rely on the session store for that.

Currently the way it works is that, because NetworkSession returns persist => false, the session is not stored in the backend by default; but if something calls $session->persist() (as various things like autocreation or the token API might do), the session does get written to the store, and on the next request it will be loaded back, and seeing that there is stored data, SessionManager will adjust the persisted flag to true.

I guess the design assumption here is that an extension like NetworkSession can't know what kind of requests will be authenticated with the mechanism it provides, and certain kinds of requests need to store session data. So it doesn't make sense for a session provider to have a "will never need to store session data" flag.

In practice, MediaWiki isn't using session data much for non-cookie-based sessions. If T126257: The API should not require CSRF tokens for an OAuth request gets fixed, the last major use case for session data in OAuth, NetworkSession etc. will be gone, and only edge cases like autocreation errors or chunked file uploads remain which could probably be moved to a different storage mechanism. So maybe we should rethink that. It's not a simple change though.

Thanks for all the context on this!
Regarding the the session store entry is for an anonymous user, but the session metadata indicates a non-anonynmous user, I checked a few sessions from the store and they indeed mention AuthManager::AutoCreateBlacklist with an error message centralauth-account-exists. So I suspect that once this session is gone the auto-creation will fail again for the same reason?

I am not sure why that error is happening in the first place. If there is an existing user account, in theory NetworkSession should just provide access to it. Maybe it's some kind of race condition, with multiple NetworkSession requests trying to autocreate a non-existent account at the same time? In theory autocreation handles race conditions but CentralAuth might mess that up. If so, then this error will go away once the account exists everywhere. (The blacklist is short-lived, five minutes IIRC.)

dcausse assigned this task to Tgr.

Tentatively closing, I no longer see this warning message since September 16.
I do see a some limited logspam with Session data is authenticated, should not be an anonymous user since sept 27.
Please feel free to re-open if this ticket is useful to track some work.

I do see a some limited logspam with Session data is authenticated, should not be an anonymous user since sept 27.

Related to T402808: Deploy separate anonymous session backend to Wikimedia production, in log-only mode.

We had about 60K warnings since the code emitting it was deployed to group 1 a couple days ago, 40K from NetworkSessionProvider, 20K from CentralAuthSessionProvider, 2K from CentralAuthApiSessionProvider. Maybe that just reflects how much each provider is used? (Although e.g. the OAuth provider is conspicuously missing.)

Anyway, will look into that at the other task.

T405633: Session data is authenticated, should not be an anonymous user

Unrelated to that, maybe NetworkSession should be limited to private wikis for now, to preserve resources? It's not causing any problems, but I don't think there is any benefit to being authenticated on public wikis, either? And limiting the extension to the private dblist is straightforward.

T405633: Session data is authenticated, should not be an anonymous user

Unrelated to that, maybe NetworkSession should be limited to private wikis for now, to preserve resources? It's not causing any problems, but I don't think there is any benefit to being authenticated on public wikis, either? And limiting the extension to the private dblist is straightforward.

CirrusSearch may be able to work on public wikis without it. The reason we enabled it everywhere was mainly because the update rate on private wikis is so low that it's "easy" for it to silently break.
More recently we started to rely on it to escape a new PoolCounter we added to protect against T401220 (but we added a work-around in case we had to disable NetworkSession again).
In the future we might want this extension everywhere in order for Cirrus to access deleted pages in order to support its archive index (used in Special:Undelete).
I think we could certainly disable it to preserve resources but ideally we'd like to keep the option of being able to re-enable it everywhere at some point.