
Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
Open, HighPublic8 Estimated Story Points

Description

As an SRE operating the k8s cluster, I want clear runbooks related to the WDQS Streaming Updater so that I can act on the various components needed by this application without negatively impacting users of this service.

The WDQS Streaming Updater is moving a good part of the WDQS update process to a flink application running on the k8s service cluster. We should have proper runbooks/cookbooks to handle cases such as:

  • long maintenance (or long unexpected downtime) procedure on a dependent k8s cluster
  • version upgrade of a dependent k8s cluster
  • cleanup of old flink configmaps (jobmanager leader election related; see the sketch after this list)
  • cleanup of old/unused savepoints/checkpoints
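
A minimal sketch of where a configmap-cleanup cookbook could start, assuming Flink's native Kubernetes HA labels its leader-election ConfigMaps with configmap-type=high-availability and that the namespace is rdf-streaming-updater; both are assumptions to verify against the deployed chart and Flink version:

```python
#!/usr/bin/env python3
"""Sketch: list the Flink HA (jobmanager leader election) ConfigMaps
that a cleanup cookbook would have to consider.

Assumptions (verify before use): the namespace name and the
`configmap-type=high-availability` label used by Flink's native k8s HA.
"""
import subprocess

NAMESPACE = "rdf-streaming-updater"             # assumption
HA_LABEL = "configmap-type=high-availability"   # Flink native k8s HA label; may differ per version

def list_ha_configmaps() -> list[str]:
    """Return the names of the Flink HA ConfigMaps in the namespace."""
    out = subprocess.run(
        ["kubectl", "get", "configmap", "-n", NAMESPACE, "-l", HA_LABEL,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()

if __name__ == "__main__":
    # Only list; deletion should stay a manual, reviewed step.
    for name in list_ha_configmaps():
        print(name)
```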

Current runbooks have been compiled here: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater

There are two different SRE use cases for cookbooks currently:

Temporarily stop/disable rdf-streaming-updater (flink) in one DC ("depool rdf-streaming-updater")

Used in case an SRE has to stop/disable rdf-streaming-updater, take down a kubernetes cluster (without deleting etcd), or do something similar.

Actions that need to be taken to depool:

  • Downtime rdf-streaming-updater alerts (list)
  • Downtime WDQS and WCQS alerts in the affected DC (RdfStreamingUpdaterHighConsumerUpdateLag should be part of the alerts above, but it seems not to be working at the moment, see T316882)
  • Depool dnsdisc=wdqs in that same DC
  • Depool dnsdisc=wcqs in that same DC

This would ensure users still query an up-to-date dataset (served from the other DC). A sketch of the dnsdisc depool step follows below.
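
A minimal sketch of that depool step, assuming the standard confctl discovery syntax and that the discovery objects are named wdqs and wcqs; the alert-downtime step is intentionally left out here:

```python
#!/usr/bin/env python3
"""Sketch: depool the wdqs/wcqs discovery records in one DC.

Assumptions (verify before use): the conftool discovery objects are named
`wdqs` and `wcqs`, and `set/pooled=false` is the correct confctl action.
"""
import subprocess

def depool_discovery(service: str, datacenter: str) -> None:
    """Set pooled=false on the dnsdisc record for `service` in `datacenter`."""
    selector = f"dnsdisc={service},name={datacenter}"
    subprocess.run(
        ["confctl", "--object-type", "discovery",
         "select", selector, "set/pooled=false"],
        check=True,
    )

if __name__ == "__main__":
    for svc in ("wdqs", "wcqs"):        # assumed discovery record names
        depool_discovery(svc, "codfw")  # example DC
```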

Re-Init a kubernetes cluster

Used in case an SRE updates/reinstalls a kubernetes cluster, losing data in etcd (flink's dynamic config maps).

Graceful depool/re-pool

Actions that need to be taken:

To restore:

Hard depool/re-pool (not fully tested)

Actions that need to be taken:

  • [service-ops] depool rdf-streaming-updater
  • [service-ops] undeploy the rdf-streaming-updater service using helm destroy; this will force flink to stop abruptly
  • [service-ops] Dump all dynamic ConfigMaps in the rdf-streaming-updater namespace, as they should contain savepoint data (see the sketch after this list)
    • The job artifact will remain in the swift object storage
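
As a sketch of the dump step only, something along these lines could back up the namespace's ConfigMaps to a single YAML file before the helm destroy (dumping everything in the namespace rather than only Flink's HA ConfigMaps, and the output file name, are assumptions):

```python
#!/usr/bin/env python3
"""Sketch: dump all ConfigMaps of the rdf-streaming-updater namespace to one file.

This only illustrates the "dump dynamic ConfigMaps" bullet of the hard depool;
it dumps every ConfigMap in the namespace, which is an assumption rather than
the confirmed procedure.
"""
import subprocess
from pathlib import Path

NAMESPACE = "rdf-streaming-updater"

def dump_configmaps(dest: Path) -> None:
    """Write every ConfigMap in the namespace to `dest` as YAML."""
    out = subprocess.run(
        ["kubectl", "get", "configmap", "-n", NAMESPACE, "-o", "yaml"],
        check=True, capture_output=True, text=True,
    ).stdout
    dest.write_text(out)

if __name__ == "__main__":
    dump_configmaps(Path("rdf-streaming-updater-configmaps.yaml"))
```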

To restore:

  • [service-ops] Restore all dumped config maps into the rdf-streaming-updater namespace (see the sketch after this list)
  • [service-ops] helmfile deploy rdf-streaming-updater
    • The WDQS and WCQS jobs should resume (the alerts should resolve; if not, the Search team should investigate and recover the jobs manually)
    • The job artifact jar is stored in the swift object storage and no specific action has to be taken here
  • Wait for the lag to catch up (https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1)
  • [service-ops] Repool WDQS/WCQS read traffic
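
A sketch of the two restore steps (ConfigMap restore plus redeploy), assuming the dump file from the sketch above and that "helmfile deploy" corresponds to helmfile -e <dc> apply run from the service's deployment-charts directory; the path and environment name are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: restore the dumped ConfigMaps and redeploy rdf-streaming-updater.

Assumptions: the dump file produced by the sketch above, a placeholder
deployment-charts path, and `helmfile -e <dc> apply` as the deploy command.
"""
import subprocess

NAMESPACE = "rdf-streaming-updater"

def restore_configmaps(dump_file: str) -> None:
    """Re-create the ConfigMaps saved before the cluster re-init."""
    subprocess.run(["kubectl", "apply", "-n", NAMESPACE, "-f", dump_file], check=True)

def redeploy(datacenter: str) -> None:
    """Redeploy the release; the flink jobs should then resume from the
    savepoint/checkpoint data referenced by the restored ConfigMaps."""
    subprocess.run(
        ["helmfile", "-e", datacenter, "apply"],
        check=True,
        cwd="helmfile.d/services/rdf-streaming-updater",  # placeholder path
    )

if __name__ == "__main__":
    restore_configmaps("rdf-streaming-updater-configmaps.yaml")
    redeploy("codfw")  # example DC
```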

Event Timeline

jijiki renamed this task from Write and adapt Runbooks related to the WDQS Streaming Updater and kubernetes to Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes. Oct 13 2021, 10:30 AM
jijiki added a project: serviceops.
jijiki updated the task description.

@dcausse IIRC we said that "something in the areas of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?
As part of T251305 we will re-create the helm release of flink in both datacenters (one after the other ofc.) and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps), so it should survive purging and recreating the helm release.
@Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double check that it is still working as expected.

Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

@dcausse IIRC we said that "something in the areas of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?

We are targeting an SLO with an update lag below 10 minutes for 99% of the time. We are still learning what the operational cost of this is, and we are happy to discuss/re-adjust all this depending on your constraints.
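
For a rough sense of scale (illustrative only, assuming a 30-day month), the budget that "below 10 minutes for 99% of the time" leaves for time spent above the lag threshold is:

```python
# Illustrative only: monthly budget implied by "lag below 10 minutes for 99% of the time".
MONTH_HOURS = 30 * 24                 # ~30-day month
budget_hours = MONTH_HOURS * 0.01     # 1% of the month may exceed the threshold
print(f"~{budget_hours:.1f} hours/month above the 10-minute lag threshold")  # ~7.2 hours
```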

As part of T251305 we will re-create the helm release of flink in both datacenters (one after the other ofc.) and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps), so it should survive purging and recreating the helm release.

Yes, if the configmaps are kept, flink will just restart on its own. Regarding lag, I'm not worried, as flink already restarts on its own from time to time without affecting the 10-minute lag SLO.

@Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double check that it is still working as expected.

Checking the logs, I see 2 restarts in the last 7 days, and both restarts properly restored the job:

Nov 3, 2021 @ 15:44:33.739	syslog	kubestage1002	Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 106609 @ 1635954210959 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-106609.

Nov 4, 2021 @ 13:36:35.097	syslog	kubestage1002	Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 109216 @ 1636032918483 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-109216.

So, if one of these restarts corresponds to the helm 3 upgrade, then I can confirm that it will work properly on the production clusters.
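
For future spot checks, the latest completed checkpoint can also be read from Flink's REST API instead of the logs; a sketch below, where the base URL is a placeholder (the jobmanager is not exposed like this, so a port-forward or similar is assumed) and the job id is taken from the log lines above:

```python
#!/usr/bin/env python3
"""Sketch: query Flink's REST API for a job's latest completed checkpoint.

The /jobs/<job-id>/checkpoints endpoint is part of Flink's standard REST API;
the base URL is a placeholder (assumes a port-forwarded jobmanager).
"""
import json
import urllib.request

BASE_URL = "http://localhost:8081"             # placeholder
JOB_ID = "095b671d83457ebf4c59166fda7a7055"    # job id from the log lines above

with urllib.request.urlopen(f"{BASE_URL}/jobs/{JOB_ID}/checkpoints") as resp:
    stats = json.load(resp)

latest = stats.get("latest", {}).get("completed")
if latest:
    print(latest["external_path"], latest["trigger_timestamp"])
else:
    print("no completed checkpoint reported yet")
```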

Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

Certainly, this task is all about clarifying all this.

MPhamWMF set the point value for this task to 8. Aug 15 2022, 3:54 PM

@JMeybohm thanks for the write-up! I added a few more notes.

Assigning to myself in the hopes that I work on this with ServiceOps when I join their team starting next week.

Gehel removed bking as the assignee of this task. Jan 5 2023, 2:39 PM

Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade and was wondering:
In "To restore:" section of "Alternate actions (not fully untested):" - do we need to start the job somehow as well, specifying which jar file to use? Or is that information part of the configmaps/safepoint and the job can start automatically without submitting a jar?

Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade and was wondering:
In "To restore:" section of "Alternate actions (not fully untested):" - do we need to start the job somehow as well, specifying which jar file to use? Or is that information part of the configmaps/safepoint and the job can start automatically without submitting a jar?

Hey, I clarified this a bit and renamed it to "Hard depool/re-pool". Yes, in this method the jobs should start right after the helm deploy; the jar is stored in swift, so there is no need to deploy it manually.

Hey, I clarified this a bit and renamed it to "Hard depool/re-pool". Yes, in this method the jobs should start right after the helm deploy; the jar is stored in swift, so there is no need to deploy it manually.

Cool, thanks. That would make it hands-off for anybody but SRE/serviceops, which ofc would be nice.

Anyhow, AIUI this process will be more or less the same for flink deployments managed by the flink operator. It would be nice if you could verify this during your tests with the operator (I'm happy to help/pair ofc.), or see whether there is maybe even a better option in flink-operator world.

Anyhow, AIUI this process will be more or less the same for flink deployments managed by the flink operator. It would be nice if you could verify this during your tests with the operator (I'm happy to help/pair ofc.), or see whether there is maybe even a better option in flink-operator world.

Yes, definitely! (I also have some hope that the k8s operator might be able to discover the latest valid checkpoint without having to save the config maps, but we'll see...) Thanks for the help!

Gehel triaged this task as High priority. Mar 16 2023, 2:08 PM