[[ https://wikitech.wikimedia.org/wiki/Incidents/2025-03-31_sessionstore_unavailability | Incidents/2025-03-31_sessionstore_unavailability ]] laid bare just how opaque the sessionstore workload is. Traffic levels had increased ahead of the incident, but the knock-on effect on storage utilization suggests that it wasn’t simply //more of the same//; something about the nature of the traffic changed. We can infer that one of the things that changed was the proportion of session overwrites, but even here we can’t say anything more (for example, //by how much//, or anything about the sessions being overwritten).
Progress on {T392170} would go some way toward improving this. For example, had that been in place beforehand, we would have been able to ascertain whether the general increase could be attributed to any one type of session (a namespace). And, if namespacing by group or wiki, identifying a regression in the [[ https://phabricator.wikimedia.org/T384219 | SUL3 rollout ]] would have been more straightforward (not only would the namespaces correspond to the rollout groups, but a modest increase would be easier to spot against the baseline of a lower throughput "group").
That still leaves gaps, however. Ideally, we'd also have histograms of write frequency by unique session ID (a topN), distinct values, value sizes, etc.
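
To make the wish list above concrete, here is a minimal sketch of the kind of summary we have in mind. It assumes a sampled stream of write events is available as `(session_id, value_bytes)` pairs; sessionstore does not expose such a stream today, and the function name, the event format, and the 256-byte bucket width are all hypothetical, chosen only for illustration.

```python
from collections import Counter
import hashlib


def summarize_writes(events, top_n=10, size_bucket=256):
    """Summarize a sample of session writes.

    `events` is an iterable of (session_id, value_bytes) pairs. This
    format is an assumption made for this sketch, not something
    sessionstore provides.
    """
    writes_per_session = Counter()  # session ID -> number of writes seen
    distinct_values = set()         # digests of the values written
    size_histogram = Counter()      # bucket lower bound (bytes) -> writes

    for session_id, value in events:
        writes_per_session[session_id] += 1
        # Store a digest rather than the value, so that session contents
        # are not retained by the analysis.
        distinct_values.add(hashlib.sha256(value).digest())
        size_histogram[(len(value) // size_bucket) * size_bucket] += 1

    total = sum(writes_per_session.values())
    return {
        "total_writes": total,
        "unique_sessions": len(writes_per_session),
        # Writes beyond the first one seen for a session ID. This is a
        # lower bound: the first write in the sample may itself have
        # overwritten a session created before sampling began.
        "overwrites": total - len(writes_per_session),
        # The topN: session IDs with the most writes.
        "top_sessions": writes_per_session.most_common(top_n),
        # Write-frequency histogram: writes per session -> session count.
        "write_frequency": dict(sorted(Counter(writes_per_session.values()).items())),
        "distinct_values": len(distinct_values),
        "size_histogram": dict(sorted(size_histogram.items())),
    }
```

Keeping exact per-session counts and a set of digests is only practical for a bounded sample; at production volume, approximate structures (a heavy-hitters sketch for the topN, a cardinality estimator for distinct values) would likely be the better fit.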