As an SRE operating the k8s cluster, I want clear runbooks for the WDQS Streaming Updater so that I can act on the various components this application needs without negatively impacting users of the service.
The WDQS Streaming Updater moves a significant part of the WDQS update process to a flink application running on the k8s service cluster. We should have proper runbooks/cookbooks to handle cases such as:
- a long maintenance procedure (or long unexpected downtime) on a dependent k8s cluster
- a version upgrade of a dependent k8s cluster
- cleanup of old flink configmaps (related to jobmanager leader election); see the sketch below
- cleanup of old/unused savepoints/checkpoints
Current runbooks have been compiled at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater
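For the configmap cleanup case, a minimal sketch assuming flink's native Kubernetes HA services are in use (the `app` label value is the flink cluster-id and is a placeholder here; verify the labels match our deployment before deleting anything):

```
# List the HA configmaps (jobmanager leader election, checkpoint/savepoint
# pointers) that flink manages in the namespace.
kubectl -n rdf-streaming-updater get configmaps \
    -l configmap-type=high-availability

# Delete the HA configmaps of a cluster-id that is no longer deployed.
# "flink-session-cluster" is a placeholder; only delete after confirming
# no running job still uses these configmaps.
kubectl -n rdf-streaming-updater delete configmaps \
    -l 'app=flink-session-cluster,configmap-type=high-availability'
```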
There are currently two SRE use cases for cookbooks:
===== Temporarily stop/disable rdf-streaming-updater (flink) in one DC ("depool rdf-streaming-updater") =====
Used in case an SRE has to stop/disable rdf-streaming-updater, take down a kubernetes cluster (without deleting etcd), or similar.
Actions that need to be taken:
* Downtime rdf-streaming-updater alerts (**are there any?**)
* Downtime WDQS and WCQS alerts in the affected DC
* Depool dnsdisc=wdqs in that same DC
* Depool dnsdisc=wcqs in that same DC
This ensures users still query an up-to-date dataset (served from the other DC); see the command sketch below.
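A minimal sketch of these steps, assuming codfw is the affected DC (host patterns, duration and reason are illustrative):

```
# Downtime WDQS and WCQS alerts in the affected DC.
sudo cookbook sre.hosts.downtime --hours 4 -r "depool rdf-streaming-updater" 'wdqs2*'
sudo cookbook sre.hosts.downtime --hours 4 -r "depool rdf-streaming-updater" 'wcqs2*'

# Depool the discovery records so user traffic is served by the other DC.
confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
confctl --object-type discovery select 'dnsdisc=wcqs,name=codfw' set/pooled=false
```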
===== Re-Init a kubernetes cluster =====
Used in case an SRE updates/reinstalls a kubernetes cluster, losing the data in etcd (flink's dynamic configmaps).
Actions that need to be taken:
* Run the above cookbook to "depool rdf-streaming-updater"
* Stop all rdf-streaming-updater jobs with savepoints (https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Stop_the_job); see the sketch after this list
* Record the paths of the generated savepoints so they are available for the restore
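A sketch of stopping a single job with a savepoint via the flink CLI (job id and savepoint target are placeholders; the wikitech page linked above has the exact invocation for our setup):

```
# Stop the job, taking a savepoint first. The command prints the final
# savepoint path; record it, it is needed to restart the job later.
flink stop --savepointPath s3://rdf-streaming-updater/savepoints <job-id>
```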
To restore:
* Deploy rdf-streaming-updater
* Start all jobs from the savepoints; this needs to be done on a deploy host with access to the jobs' jar files (https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Start_the_job); see the sketch after this list
* Repool rdf-streaming-updater
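A sketch of restarting a job from its savepoint and repooling (jar name, savepoint path and DC are placeholders):

```
# On a deploy host, resume each job from its recorded savepoint.
flink run -s s3://rdf-streaming-updater/savepoints/savepoint-<id> \
    rdf-streaming-updater-job.jar

# Once the jobs have caught up, repool the discovery records.
confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true
confctl --object-type discovery select 'dnsdisc=wcqs,name=codfw' set/pooled=true
```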
Alternate actions (not fully tested):
* Run the above cookbook to "depool rdf-streaming-updater"
* Stop all rdf-streaming-updater jobs (https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Stop_the_job) without taking savepoints
** This could also be achieved by simply deleting the helm release, with no interaction with flink itself
* Dump all dynamic configmaps in the rdf-streaming-updater namespace (as they should contain savepoint data); see the sketch after this list
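A sketch of the helm-release teardown and the configmap dump (paths, environment name and output file are illustrative):

```
# Alternative to stopping jobs via flink: tear down the helm release.
cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e codfw destroy

# Dump every configmap in the namespace to one YAML file; the dynamic
# flink configmaps in it point at the latest checkpoints/savepoints.
kubectl -n rdf-streaming-updater get configmaps -o yaml \
    > rdf-streaming-updater-configmaps.yaml
```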
To restore:
* Restore all dumped configmaps into the rdf-streaming-updater namespace (see the sketch after this list)
* Deploy rdf-streaming-updater via helmfile
* Repool rdf-streaming-updater
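A sketch of the restore, assuming the dump from the previous step (you may need to strip server-managed metadata such as resourceVersion/uid from the dump before applying):

```
# Re-create the dumped configmaps before redeploying, so flink can
# recover its state from the recorded checkpoints/savepoints.
kubectl -n rdf-streaming-updater apply -f rdf-streaming-updater-configmaps.yaml

# Redeploy the service (same helmfile layout as above).
cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e codfw apply

# Repool with confctl as in the depool sketch, using set/pooled=true.
```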