Our documentation for Kask and Sessionstore as a service in general is extremely light. Given that this service is fundamental to the critical path, we should expand upon how Kask works, how to monitor it, how it fails and how to mitigate issues.
Dashboard with all of the metrics relevant to the incident that triggered this ticket: https://grafana-rw.wikimedia.org/d/p_bmgVS4k/hnowlan-sessionstore-health
I guess this task falls in the "serviceops" area, but @Eevans probably has the most experience, being one of the original authors?
Yes, happy to help (or to simply take this task), but I could probably benefit from some input (perhaps from @hnowlan since he opened this); I've had this ticket on my short-list for a while, and I've never been quite sure where to begin.
The documentation is light, but there is also not a whole lot to it. I suspect the problem is at least partially one of opacity (i.e. it isn't obvious that there isn't more to it/more to understand).
TL;DR: Is there anyone who isn't as close to this as I am and who has suggestions for https://www.mediawiki.org/wiki/Kask and/or https://wikitech.wikimedia.org/wiki/SessionStorage?
My 2 cents:
- A runbook to refer to in case of an emergency like this one.
- The service page needs a bit more information about what functions it serves and how critical it is. The software page is pretty fine, for what it's worth.
- Some links to important graphs to look at and correlate during an outage.
This is the main thing I had in mind when creating this ticket. Particularly useful would be graphs that have indicated failures in the past and that aren't necessarily Kask-specific but can be heavily affected by Kask outages (like centrallogin and session loss).
It would also be useful to spell out things like the fact that sessionstore has its own Cassandra cluster. It doesn't really fit under the umbrella of this ticket, but it'd be cool to have notes about the taints we have in place for Kask too.