
Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
Open, Needs Triage, Public

Description

As an SRE operating the k8s cluster, I want clear runbooks related to the WDQS Streaming Updater so that I can act on the various components this application needs without negatively impacting users of this service.

The WDQS Streaming Updater moves a good part of the WDQS update process to a Flink application running on the k8s services cluster. We should have proper runbooks/cookbooks to handle cases such as:

  • a long maintenance window (or long unexpected downtime) on a dependent k8s cluster
  • a version upgrade of a dependent k8s cluster
  • cleanup of old Flink configmaps (related to JobManager leader election); see the sketch below
  • cleanup of old/unused savepoints/checkpoints

Moreover, we need to update our SRE cookbooks and/or create new ones to handle the above cases.
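
For the configmap cleanup case, a cookbook could start from something like the sketch below. This is a minimal, hypothetical sketch only: the namespace, the `app=flink` label selector, and the rule for deciding which ConfigMaps are stale are all assumptions that need to be checked against the actual helm chart and deployment before this becomes a real cookbook.

```python
"""Hypothetical sketch: clean up leftover Flink HA/leader-election ConfigMaps.

Assumptions (not verified): the Flink cluster runs in the
`rdf-streaming-updater` namespace, its ConfigMaps carry an `app=flink`
label, and any leader-election ConfigMap that does not reference the
currently running job id is a leftover.
"""
import json
import subprocess

NAMESPACE = "rdf-streaming-updater"  # assumption
SELECTOR = "app=flink"               # assumption


def list_flink_configmaps(namespace: str, selector: str) -> list[str]:
    """Return the names of the Flink-managed ConfigMaps in the namespace."""
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "configmaps",
         "-l", selector, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [item["metadata"]["name"] for item in json.loads(out)["items"]]


def delete_configmap(namespace: str, name: str, dry_run: bool = True) -> None:
    """Delete a single ConfigMap, printing the command instead when dry_run is set."""
    cmd = ["kubectl", "-n", namespace, "delete", "configmap", name]
    if dry_run:
        print("would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example job id taken from the staging logs quoted later in this task.
    current_job_id = "095b671d83457ebf4c59166fda7a7055"
    for cm in list_flink_configmaps(NAMESPACE, SELECTOR):
        if "leader" in cm and current_job_id not in cm:
            delete_configmap(NAMESPACE, cm, dry_run=True)
```

A real cookbook would also need to make sure the job is stopped (or that the ConfigMaps really belong to an old job) before deleting anything, since Flink relies on these ConfigMaps to find its latest checkpoint.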

Current runbooks have been compiled here: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater

AC:

  • new runbooks are written or existing ones are adapted
  • new cookbooks are written or existing ones are adapted

Event Timeline

jijiki renamed this task from "Write and adapt Runbooks related to the WDQS Streaming Updater and kubernetes" to "Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes". Oct 13 2021, 10:30 AM
jijiki added a project: serviceops.
jijiki updated the task description.

@dcausse IIRC we said that "something in the area of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?
As part of T251305 we will re-create the helm release of Flink in both datacenters (one after the other, of course), which means Flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those Flink-managed configmaps), so it should survive purging and recreating the helm release.
@Jelto has already done that for the staging Flink release. If you have the chance, it would be nice if you could double-check that it is still working as expected.

Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s, and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

> @dcausse IIRC we said that "something in the area of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?

We are targeting an SLO with an update lag below 10 minutes for 99% of the time. We are still learning what the operational cost of this is and are happy to discuss/re-adjust all of this depending on your constraints.

> As part of T251305 we will re-create the helm release of Flink in both datacenters (one after the other, of course), which means Flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those Flink-managed configmaps), so it should survive purging and recreating the helm release.

Yes, if the configmaps are kept, Flink will just restart on its own. Regarding lag, I'm not worried, as Flink already restarts on its own from time to time without affecting the 10-minute lag SLO.
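
For the record, a quick way to double-check this after a release has been purged and recreated could be to look for the checkpoint-restore line in the JobManager logs (the same kind of line quoted below). This is only a sketch; the namespace and the deployment name are assumptions and depend on the chart:

```python
import subprocess

NAMESPACE = "rdf-streaming-updater"              # assumption
JOBMANAGER = "deployment/flink-session-cluster"  # assumption, depends on the chart

# After the helm release has been recreated, the JobManager log should show a
# "Restoring job <id> from Checkpoint <n>" line, meaning the job picked up its
# latest checkpoint from the surviving ConfigMaps.
logs = subprocess.run(
    ["kubectl", "-n", NAMESPACE, "logs", JOBMANAGER, "--since=1h"],
    check=True, capture_output=True, text=True,
).stdout
restores = [line for line in logs.splitlines() if "Restoring job" in line]
print("\n".join(restores) or "no checkpoint restore found in the last hour")
```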

> @Jelto has already done that for the staging Flink release. If you have the chance, it would be nice if you could double-check that it is still working as expected.

Checking the logs, I see 2 restarts in the last 7 days, and both restarts properly restored the job:

Nov 3, 2021 @ 15:44:33.739	syslog	kubestage1002	Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 106609 @ 1635954210959 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-106609.

Nov 4, 2021 @ 13:36:35.097	syslog	kubestage1002	Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 109216 @ 1636032918483 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-109216.

So, if one of these restarts corresponds to the helm 3 upgrade, then I can confirm that it will work properly on the production clusters.
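
Related to the savepoint/checkpoint cleanup item in the description: the checkpoints live under the swift paths visible in the log lines above, so a cleanup pass could roughly look like the sketch below (using python-swiftclient). The container name and prefix are my reading of the swift:// URIs above, the 30-day retention window is an arbitrary placeholder, and credentials handling is left out; nothing here deletes anything, it only lists candidates.

```python
"""Hypothetical sketch: list stale checkpoint objects in thanos-swift.

Assumptions: v1-style swift auth via the usual ST_AUTH/ST_USER/ST_KEY
environment variables, container `rdf-streaming-updater-staging` with the
`wikidata/checkpoints/` prefix (as read from the swift:// URIs above), and a
30-day retention window still to be agreed on.
"""
import os
from datetime import datetime, timedelta, timezone

from swiftclient.client import Connection

CONTAINER = "rdf-streaming-updater-staging"  # inferred from the swift:// URI above
PREFIX = "wikidata/checkpoints/"
MAX_AGE = timedelta(days=30)                 # placeholder retention window

conn = Connection(
    authurl=os.environ["ST_AUTH"],
    user=os.environ["ST_USER"],
    key=os.environ["ST_KEY"],
)

cutoff = datetime.now(timezone.utc) - MAX_AGE
_, objects = conn.get_container(CONTAINER, prefix=PREFIX, full_listing=True)
for obj in objects:
    # swift reports last_modified as an ISO timestamp in UTC
    last_modified = datetime.fromisoformat(obj["last_modified"]).replace(tzinfo=timezone.utc)
    if last_modified < cutoff:
        # dry run only: a real cookbook would also check that this is not the
        # latest checkpoint of a running job before calling delete_object()
        print("stale checkpoint object:", obj["name"], obj["last_modified"])
```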

> Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s, and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

Certainly, this task is all about clarifying all this.