
Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes
Open, HighPublic8 Estimated Story Points

Description

As an SRE operating the k8s cluster, I want clear runbooks related to the WDQS Streaming Updater so that I can act on the various components needed by this application without negatively impacting users of this service.

The WDQS Streaming Updater is moving a good part of the WDQS update process to a flink application running on the k8s service cluster. We should have proper runbooks/cookbooks to handle cases such as:

  • long maintenance (or long unexpected downtime) procedure on a dependent k8s cluster
  • version upgrade of a dependent k8s cluster
  • cleanup of old flink configmaps (jobmanager leader election related; see the sketch after this list)
  • cleanup of old/unused savepoints/checkpoints
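
A minimal sketch of where a configmap-cleanup cookbook could start, assuming Flink's native Kubernetes HA labels its leader-election ConfigMaps with configmap-type=high-availability and that the namespace is rdf-streaming-updater; both are assumptions to verify against the deployed chart and Flink version:

```python
#!/usr/bin/env python3
"""Sketch: list the Flink HA (jobmanager leader election) ConfigMaps
that a cleanup cookbook would have to consider.

Assumptions (verify before use): the namespace name and the
`configmap-type=high-availability` label used by Flink's native k8s HA.
"""
import subprocess

NAMESPACE = "rdf-streaming-updater"             # assumption
HA_LABEL = "configmap-type=high-availability"   # Flink native k8s HA label; may differ per version

def list_ha_configmaps() -> list[str]:
    """Return the names of the Flink HA ConfigMaps in the namespace."""
    out = subprocess.run(
        ["kubectl", "get", "configmap", "-n", NAMESPACE, "-l", HA_LABEL,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()

if __name__ == "__main__":
    # Only list; deletion should stay a manual, reviewed step.
    for name in list_ha_configmaps():
        print(name)
```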

Current runbooks have been compiled here: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater

There are two different SRE use cases for cookbooks currently:

Temporarily stop/disable rdf-streaming-updater (flink) in one DC ("depool rdf-streaming-updater")

Used in case an SRE has to stop/disable rdf-streaming-updater, take down a kubernetes cluster (without deleting etcd), or do something similar.

Actions that need to be taken to depool:

  • Downtime rdf-streaming-updater alerts (list)
  • Downtime WDQS and WCQS alerts in the affected DC (RdfStreamingUpdaterHighConsumerUpdateLag should be part of the alerts above, but it seems not to be working at the moment, see T316882)
  • Depool dnsdisc=wdqs in that same DC
  • Depool dnsdisc=wcqs in that same DC

This would ensure users still query an up-to-date dataset (served from the other DC). A sketch of the dnsdisc depool step follows below.
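
A minimal sketch of that depool step, assuming the standard confctl discovery syntax and that the discovery objects are named wdqs and wcqs; the alert-downtime step is intentionally left out here:

```python
#!/usr/bin/env python3
"""Sketch: depool the wdqs/wcqs discovery records in one DC.

Assumptions (verify before use): the conftool discovery objects are named
`wdqs` and `wcqs`, and `set/pooled=false` is the correct confctl action.
"""
import subprocess

def depool_discovery(service: str, datacenter: str) -> None:
    """Set pooled=false on the dnsdisc record for `service` in `datacenter`."""
    selector = f"dnsdisc={service},name={datacenter}"
    subprocess.run(
        ["confctl", "--object-type", "discovery",
         "select", selector, "set/pooled=false"],
        check=True,
    )

if __name__ == "__main__":
    for svc in ("wdqs", "wcqs"):        # assumed discovery record names
        depool_discovery(svc, "codfw")  # example DC
```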

Re-Init a kubernetes cluster

Used in case an SRE updates/reinstalls a kubernetes cluster, losing data in etcd (flink's dynamic config maps).

Graceful depool/re-pool

Actions that need to be taken:

To restore:

Hard depool/re-pool (not fully tested)

Actions that need to be taken:

  • [service-ops] depool rdf-streaming-updater
  • [service-ops] undeploy the rdf-streaming-updater service using helm destroy; this will force flink to stop abruptly
  • [service-ops] Dump all dynamic ConfigMaps in the rdf-streaming-updater namespace, as they should contain savepoint data (see the sketch after this list)
    • The job artifact will remain in the swift object storage
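
As a sketch of the dump step only, something along these lines could back up the namespace's ConfigMaps to a single YAML file before the helm destroy (dumping everything in the namespace rather than only Flink's HA ConfigMaps, and the output file name, are assumptions):

```python
#!/usr/bin/env python3
"""Sketch: dump all ConfigMaps of the rdf-streaming-updater namespace to one file.

This only illustrates the "dump dynamic ConfigMaps" bullet of the hard depool;
it dumps every ConfigMap in the namespace, which is an assumption rather than
the confirmed procedure.
"""
import subprocess
from pathlib import Path

NAMESPACE = "rdf-streaming-updater"

def dump_configmaps(dest: Path) -> None:
    """Write every ConfigMap in the namespace to `dest` as YAML."""
    out = subprocess.run(
        ["kubectl", "get", "configmap", "-n", NAMESPACE, "-o", "yaml"],
        check=True, capture_output=True, text=True,
    ).stdout
    dest.write_text(out)

if __name__ == "__main__":
    dump_configmaps(Path("rdf-streaming-updater-configmaps.yaml"))
```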

To restore:

  • [service-ops] Restore all dumped config maps into the rdf-streaming-updater namespace (see the sketch after this list)
  • [service-ops] helmfile deploy rdf-streaming-updater
    • The WDQS and WCQS jobs should resume (the alerts should resolve; if not, the Search team should investigate and recover the jobs manually)
    • The job artifact jar is stored in the swift object storage and no specific action has to be taken here
  • Wait for the lag to catch up (https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1)
  • [service-ops] Repool WDQS/WCQS read traffic
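
A sketch of the two restore steps (ConfigMap restore plus redeploy), assuming the dump file from the sketch above and that "helmfile deploy" corresponds to helmfile -e <dc> apply run from the service's deployment-charts directory; the path and environment name are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: restore the dumped ConfigMaps and redeploy rdf-streaming-updater.

Assumptions: the dump file produced by the sketch above, a placeholder
deployment-charts path, and `helmfile -e <dc> apply` as the deploy command.
"""
import subprocess

NAMESPACE = "rdf-streaming-updater"

def restore_configmaps(dump_file: str) -> None:
    """Re-create the ConfigMaps saved before the cluster re-init."""
    subprocess.run(["kubectl", "apply", "-n", NAMESPACE, "-f", dump_file], check=True)

def redeploy(datacenter: str) -> None:
    """Redeploy the release; the flink jobs should then resume from the
    savepoint/checkpoint data referenced by the restored ConfigMaps."""
    subprocess.run(
        ["helmfile", "-e", datacenter, "apply"],
        check=True,
        cwd="helmfile.d/services/rdf-streaming-updater",  # placeholder path
    )

if __name__ == "__main__":
    restore_configmaps("rdf-streaming-updater-configmaps.yaml")
    redeploy("codfw")  # example DC
```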

Event Timeline

jijiki renamed this task from Write and adapt Runbooks related to the WDQS Streaming Updater and kubernetes to Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes. Oct 13 2021, 10:30 AM
jijiki added a project: serviceops.
jijiki updated the task description.

@dcausse IIRC we said that "something in the areas of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?
As part of T251305 we will re-create the helm release of flink in both datacenters (one after the other ofc.) and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps), so it should survive purging and recreating the helm release.
@Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double check that it is still working as expected.

Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

@dcausse IIRC we said that "something in the areas of hours" would be considered a "short maintenance" and thus would not need any additional actions to be carried out, right?

We are targeting an SLO with an update lag below 10 minutes for 99% of the time. We are still learning what the operational cost of this is, and we are happy to discuss/re-adjust all this depending on your constraints.
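
For a rough sense of scale (illustrative only, assuming a 30-day month), the budget that "below 10 minutes for 99% of the time" leaves for time spent above the lag threshold is:

```python
# Illustrative only: monthly budget implied by "lag below 10 minutes for 99% of the time".
MONTH_HOURS = 30 * 24                 # ~30-day month
budget_hours = MONTH_HOURS * 0.01     # 1% of the month may exceed the threshold
print(f"~{budget_hours:.1f} hours/month above the 10-minute lag threshold")  # ~7.2 hours
```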

As part of T251305 we will re-create the helm release of flink in both datacenters (one after the other ofc.) and that would mean flink will be down for a couple of minutes. If my memory and understanding are still intact, the checkpoint/tombstone metadata is not part of the helm release itself (it's in those flink-managed configmaps), so it should survive purging and recreating the helm release.

Yes, if the configmaps are kept, flink will just restart on its own. Regarding lag, I'm not worried, as flink already restarts on its own from time to time without affecting the 10-minute lag SLO.

@Jelto has already done that for the staging flink release. If you have the chance, it would be nice if you could double check that it is still working as expected.

Checking the logs, I see 2 restarts in the last 7 days, and both restarts properly restored the job:

Nov 3, 2021 @ 15:44:33.739	syslog	kubestage1002	Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 106609 @ 1635954210959 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-106609.

Nov 4, 2021 @ 13:36:35.097	syslog	kubestage1002	Restoring job 095b671d83457ebf4c59166fda7a7055 from Checkpoint 109216 @ 1636032918483 for 095b671d83457ebf4c59166fda7a7055 located at swift://rdf-streaming-updater-staging.thanos-swift/wikidata/checkpoints/095b671d83457ebf4c59166fda7a7055/chk-109216.

So, if one of these restarts corresponds to the helm 3 upgrade, then I can confirm that it will work properly on the production clusters.
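
For future spot checks, the latest completed checkpoint can also be read from Flink's REST API instead of the logs; a sketch below, where the base URL is a placeholder (the jobmanager is not exposed like this, so a port-forward or similar is assumed) and the job id is taken from the log lines above:

```python
#!/usr/bin/env python3
"""Sketch: query Flink's REST API for a job's latest completed checkpoint.

The /jobs/<job-id>/checkpoints endpoint is part of Flink's standard REST API;
the base URL is a placeholder (assumes a port-forwarded jobmanager).
"""
import json
import urllib.request

BASE_URL = "http://localhost:8081"             # placeholder
JOB_ID = "095b671d83457ebf4c59166fda7a7055"    # job id from the log lines above

with urllib.request.urlopen(f"{BASE_URL}/jobs/{JOB_ID}/checkpoints") as resp:
    stats = json.load(resp)

latest = stats.get("latest", {}).get("completed")
if latest:
    print(latest["external_path"], latest["trigger_timestamp"])
else:
    print("no completed checkpoint reported yet")
```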

Besides that, I tried to understand what would need to be done for a "longer downtime" of k8s and it's not exactly clear to me. Could we have a dedicated section for that on the wikitech page? IIRC that also needed a change to WDQS itself.

Certainly, this task is all about clarifying all this.

MPhamWMF set the point value for this task to 8. Aug 15 2022, 3:54 PM

@JMeybohm thanks for the write-up! I added a few more notes.

Assigning to myself in the hopes that I work on this with ServiceOps when I join their team starting next week.

Gehel removed bking as the assignee of this task. Jan 5 2023, 2:39 PM

Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade and was wondering:
In "To restore:" section of "Alternate actions (not fully untested):" - do we need to start the job somehow as well, specifying which jar file to use? Or is that information part of the configmaps/safepoint and the job can start automatically without submitting a jar?

Hey @dcausse, I'm reading this again because of the upcoming k8s 1.23 upgrade and was wondering:
In "To restore:" section of "Alternate actions (not fully untested):" - do we need to start the job somehow as well, specifying which jar file to use? Or is that information part of the configmaps/safepoint and the job can start automatically without submitting a jar?

Hey, I clarified this a bit and renamed it to "Hard depool/re-pool". Yes, in this method the jobs should start right after the helm deploy; the jar is stored in swift, so there is no need to deploy it manually.

Hey, I clarified this a bit and renamed it to "Hard depool/re-pool". Yes, in this method the jobs should start right after the helm deploy; the jar is stored in swift, so there is no need to deploy it manually.

Cool, thanks. That would make it hands-off for anybody but SRE/serviceops, which ofc would be nice.

Anyhow, AIUI this process will be more or less the same for flink deployments managed by the flink operator. It would be nice if you could verify this during your tests with the operator (I'm happy to help/pair ofc.), or see whether there is maybe even a better option in flink-operator world.

Anyhow, AIUI this process will be more or less the same for flink deployments managed by the flink operator. It would be nice if you could verify this during your tests with the operator (I'm happy to help/pair ofc.), or see whether there is maybe even a better option in flink-operator world.

Yes, definitely! (I also have some hope that the k8s operator might be able to discover the latest valid checkpoint without having to save the config maps, but we'll see...) Thanks for the help!

Gehel triaged this task as High priority. Mar 16 2023, 2:08 PM