[[ https://wikitech.wikimedia.org/wiki/Incidents/2025-03-31_sessionstore_unavailability | Incidents/2025-03-31_sessionstore_unavailability ]] laid bare just how opaque the sessionstore workload is. Preceding the incident, traffic levels had increased, but the knock-on effect on storage utilization suggests that it wasn’t simply //more of the same//; something about the nature of the traffic had changed. We are able to infer that one of the things that changed was the proportion of session overwrites, but even here we can’t say anything more (for example, //by how much//, or anything about the sessions being overwritten).
Progress on {T392170} would go some way toward improving this. For example, had that been in place prior to the incident, we would have been able to ascertain whether the general increase could be attributed to any one type of session (a namespace). And if sessions were namespaced by group or wiki, identifying a regression in the [[ https://phabricator.wikimedia.org/T384219 | SUL3 rollout ]] would have been more straightforward (not only would metrics correlate with the phase of the rollout, but it would be easier to spot a modest increase relative to a lower-throughput "group").
It would still leave a lot to be desired, though. Case in point: it wouldn't be enough to expose the source of the aforementioned overwrites. Ideally, we would also have:
- Breakdown by kind of session (see: T390630)
  - central auth
  - group ...or
  - wiki
- Session ID write frequency (topN) //...same session being rewritten//
- Session ID write frequency, same value (topN) //...same session, no change in value//
- Write frequency by user, ID, IP (topN) //...possibly different sessions, same "user"//
- ...
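
To make the topN items above concrete, here is a minimal sketch of the bookkeeping they imply. It is illustrative only and is not how sessionstore works today: `WriteEvent`, its fields, and `top_n_writes` are hypothetical names, and it assumes write events (namespace, session ID, value) could be observed somewhere in the write path. Exact in-memory counting like this would not scale to production traffic; a sampled or approximate counter would likely be needed.

```python
import hashlib
from collections import Counter
from typing import Iterable, NamedTuple


class WriteEvent(NamedTuple):
    """Hypothetical record of a single session write (not an existing type)."""
    namespace: str   # assumed kind of session, e.g. "centralauth" or a wiki
    session_id: str
    value: bytes     # serialized session payload


def top_n_writes(events: Iterable[WriteEvent], n: int = 10):
    """Return (most-written sessions, most-rewritten-with-same-value sessions)."""
    writes = Counter()      # total writes per (namespace, session ID)
    unchanged = Counter()   # writes whose value matched the previous write
    last_digest = {}        # digest of the last value seen per session

    for ev in events:
        key = (ev.namespace, ev.session_id)
        digest = hashlib.sha256(ev.value).digest()
        writes[key] += 1
        if last_digest.get(key) == digest:
            unchanged[key] += 1
        last_digest[key] = digest

    return writes.most_common(n), unchanged.most_common(n)
```

Keying on `(namespace, session_id)` keeps the per-namespace breakdown and the per-session counts in the same pass; the same pattern would apply to a user or IP key if those were available on the event.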
----
See also:
{T390630}