
sessionstore workload observability
Open, Needs Triage, Public

Description

Incidents/2025-03-31_sessionstore_unavailability laid bare just how opaque the sessionstore workload is. Traffic levels had increased preceding the incident, but the knock-on effect on storage utilization suggests it wasn’t simply more of the same; something about the nature of the traffic changed. We can infer that one of the things that changed was the proportion of session overwrites, but even there we can’t say anything more (for example, by how much, or anything about the sessions being overwritten).

Progress on T392170: sessionstorage namespacing would go some way toward improving this. For example: had that been in place prior to the incident, we would have been able to ascertain whether the general increase could be attributed to any one type of session (a namespace). And if namespacing by group or wiki, identifying a regression in the SUL3 rollout would have been more straightforward: not only would metrics correlate with the phase of a rollout, but a modest increase would be easier to spot relative to a lower-throughput "group".

It would still leave a lot to be desired, though; case in point: it wouldn't be enough to expose the source of the aforementioned overwrites.

Ideally we'd also have:

  • Session ID write frequency (topN) ...same session being rewritten
  • Session ID write frequency, same value (topN) ...same session, no change in value
  • Write frequency by user, ID, IP (topN) ...possibly different sessions, same "user"
  • ...
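To make the above concrete, here is a minimal sketch of how such top-N views could be computed from a stream of per-write events. The event fields (`session_id`, `value_hash`, `user`, `ip`) are illustrative assumptions, not the actual sessionstore schema; a real pipeline would hash the session token and value rather than log them raw.

```python
from collections import Counter
from hashlib import sha256

# Hypothetical stream of session-write events (field names are assumptions,
# not the real sessionstore schema). Values are represented by a hash so the
# pipeline can detect "same value" without retaining session contents.
events = [
    {"session_id": "abc", "value_hash": sha256(b"v1").hexdigest(), "user": "u1", "ip": "198.51.100.7"},
    {"session_id": "abc", "value_hash": sha256(b"v1").hexdigest(), "user": "u1", "ip": "198.51.100.7"},
    {"session_id": "def", "value_hash": sha256(b"v2").hexdigest(), "user": "u2", "ip": "203.0.113.9"},
]

def top_n(events, key, n=10):
    """Top-N write frequency over an arbitrary grouping key."""
    return Counter(key(e) for e in events).most_common(n)

# Session ID write frequency: same session being rewritten.
by_session = top_n(events, lambda e: e["session_id"])

# Same session, no change in value: group by (id, value hash).
by_session_value = top_n(events, lambda e: (e["session_id"], e["value_hash"]))

# Possibly different sessions, same "user": group by (user, IP).
by_actor = top_n(events, lambda e: (e["user"], e["ip"]))
```

The same grouping function generalizes to any of the bullets above; only the key changes.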

See also:

T390630: Alert when disk space utilization on sessionstore nodes is trending high

Event Timeline

Eevans updated the task description.

I'm not sure how much information can be obtained on the sessionstore side. Some typical scenarios that might cause workload problems:

  • A scraper or a DDoS attempt generates unusual levels of load via login page visits. To identify what's going on, you'll probably need information about the clients making the MediaWiki web or API requests that lead to sessionstore requests (and to notice that they come from the same IP, share a user agent, or such).
  • A broken bot generates a huge number of (successful or unsuccessful) login attempts. What you'd probably want to know in this situation is that all the requests are related to the same user account (although for a well-behaved client, the UA should also suffice to identify it).
  • The number of session saves went up due to the rollout of a major feature. Ideally, we'd want some sort of feature flag to easily differentiate between the old and new behavior. This is not readily available today but will be available in MediaWiki Logstash once T142313: Add global information to debug logger context is done.
  • The number of session saves for a given operation went up for some unexpected reason (a code change that was not supposed to do that). There is probably no easy way to identify this via logs/metrics, but being able to filter saves by MediaWiki endpoint (special page, API module, etc.) would at least narrow down where the change happened. Again, this is not readily available today but will be once T142313: Add global information to debug logger context is done.
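The last two scenarios reduce to counting saves keyed by endpoint and a feature flag. A minimal stand-in (a plain counter playing the role of, say, a labelled Prometheus counter; all names here are illustrative) shows how a regression localized to one endpoint would surface:

```python
from collections import Counter

# Stand-in for a labelled metrics counter with "endpoint" and "feature"
# labels. Endpoint and feature names below are hypothetical examples.
session_saves = Counter()

def record_save(endpoint: str, feature: str = "legacy") -> None:
    """Count one session save, keyed by the MediaWiki endpoint that caused
    it and an (assumed) feature flag distinguishing old vs. new behaviour."""
    session_saves[(endpoint, feature)] += 1

# A change that doubles saves from one endpoint stands out immediately:
for _ in range(2):
    record_save("Special:UserLogin", feature="sul3")
record_save("ApiLogin")

top = session_saves.most_common(1)[0]  # (("Special:UserLogin", "sul3"), 2)
```

With real labels, the same aggregation falls out of a standard `topk`-style dashboard query instead of application code.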

So while better observability at any point in the system is a good thing, IMO what would be most useful here is logging every session write on the MediaWiki side. This is technically already done in RESTBagOStuff, but currently those log events are sent to null, they don't contain the IP/UA, and they are lower-level than what we'd want (BagOStuff knows nothing about the user; it isn't even part of the storage key). So IMO the ideal places to log to Logstash (and maybe Prometheus if needed; Logstash is great for investigations but not so much for a quick overview, and using it for dashboards is rather painful) would be the AuthManager and SessionBackend classes (for core sessions) and CentralAuthSessionProvider (for central sessions).
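As a sketch of what such a per-write event might carry (every field name here is an assumption, not the actual MediaWiki/Logstash schema), note that the session token itself should be hashed so the log never contains a usable credential:

```python
import hashlib
import json
import logging

logger = logging.getLogger("session.write")

def log_session_write(session_token: str, user: str, ip: str,
                      user_agent: str, provider: str) -> str:
    """Emit one structured record per session write. Field names are
    illustrative only; the raw token is replaced by a truncated hash."""
    event = {
        "event": "session_write",
        "session_id_hash": hashlib.sha256(session_token.encode()).hexdigest()[:16],
        "user": user,
        "ip": ip,
        "user_agent": user_agent,
        # e.g. core SessionBackend vs. CentralAuthSessionProvider:
        "provider": provider,
    }
    payload = json.dumps(event)
    logger.info(payload)
    return payload

record = log_session_write("deadbeef", "ExampleUser", "198.51.100.7",
                           "ExampleBot/1.0", "CentralAuthSessionProvider")
```

An event with these fields would cover the scraper/broken-bot scenarios above (group by `ip`, `user_agent`, or `user`) while remaining cheap enough to emit on every write.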