Maniphest T320398

Expand upon Kask/Sessionstore documentation
Open, Medium, Public

Assigned To

None

Authored By

	hnowlan
	Oct 10 2022, 11:31 AM

Description

Our documentation for Kask and Sessionstore as a service in general is extremely light. Given that this service is fundamental to the critical path, we should expand upon how Kask works, how to monitor it, how it fails and how to mitigate issues.

Event Timeline

hnowlan created this task. Oct 10 2022, 11:31 AM

Restricted Application added a subscriber: Aklapper. Oct 10 2022, 11:31 AM

hnowlan claimed this task. Oct 10 2022, 11:36 AM

hnowlan added a subscriber: Eevans.

Dashboard for all of the relevant metrics to the incident that triggered this ticket: https://grafana-rw.wikimedia.org/d/p_bmgVS4k/hnowlan-sessionstore-health

Adding serviceops, removing SRE as the more specific team that can drive it forward.

Clement_Goubert triaged this task as Medium priority. Mar 15 2023, 11:57 AM

Clement_Goubert moved this task from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.

I guess this task is surely in the "serviceops" area, but probably @Eevans has the most experience being one of the original authors?

In T320398#8710536, @Joe wrote:

I guess this task is surely in the "serviceops" area, but probably @Eevans has the most experience being one of the original authors?

Yes, happy to help (or to simply take this task), but I could probably benefit from some input (perhaps from @hnowlan since he opened this); I've had this ticket on my short-list for a while, and I've never been quite sure where to begin.

The documentation is light, but there is also not a whole lot to it. I suspect the problem is at least partially one of opacity (i.e. it isn't obvious that there isn't more to it/more to understand).

TL;DR Is there someone(s) —who isn't as close to this as I am— who has suggestions for https://www.mediawiki.org/wiki/Kask and/or https://wikitech.wikimedia.org/wiki/SessionStorage?

In T320398#8711722, @Eevans wrote:

TL;DR Is there someone(s) —who isn't as close to this as I am— who has suggestions for https://www.mediawiki.org/wiki/Kask and/or https://wikitech.wikimedia.org/wiki/SessionStorage?

my 2 cents

A runbook to refer to in case of an emergency like this one.
The service page requires a bit more information as to what functions it serves and how critical it is. The software page is pretty fine for what it's worth.
Some links to important graphs to look at and correlate when in an outage.

In T320398#8718719, @akosiaris wrote:

Some links to important graphs to look at and correlate when in an outage.

This would be the main thing I was thinking of when creating this ticket. Particularly useful would be graphs that have been indicative of failures in the past and that aren't necessarily Kask-specific but can be heavily affected by Kask outages (like centrallogin and session loss).

Also useful would be stuff like specifying that there's a dedicated sessionstore Cassandra cluster. It doesn't really fit under the umbrella of this ticket, but it'd be cool to have notes about the taints we have in place for Kask too.

hnowlan removed hnowlan as the assignee of this task. Oct 14 2024, 1:50 PM
