As an SRE operating on the k8s cluster, I want to have clear runbooks related to the WDQS Streaming Updater so that I can act on the various components needed by this application without negatively impacting users of this service.
The WDQS Streaming Updater is moving a good part of the WDQS update process to a flink application running on the k8s service cluster. We should have proper runbooks/cookbooks to handle cases such as:
- a long maintenance procedure (or long unexpected downtime) on a dependent k8s cluster
- a version upgrade of a dependent k8s cluster
- cleanup of old flink configmaps (related to jobmanager leader election); see the sketch below
- cleanup of old/unused savepoints/checkpoints
Current runbooks have been compiled here: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater
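For the configmap cleanup case, a minimal sketch of what a cookbook could do, assuming flink's Kubernetes HA metadata lives in configmaps labeled configmap-type=high-availability (the label flink itself applies for its HA metadata, but the selector should be verified against the actual deployment):

```
# Hypothetical sketch: delete leftover flink HA (jobmanager leader election)
# configmaps. Assumes kubectl is configured against the affected cluster and
# that the flink HA configmaps carry the configmap-type=high-availability
# label; verify the selector before running for real.
import subprocess

NAMESPACE = "rdf-streaming-updater"
SELECTOR = "configmap-type=high-availability"  # assumption, verify on the cluster

def cleanup_flink_ha_configmaps(dry_run: bool = True) -> None:
    cmd = ["kubectl", "-n", NAMESPACE, "delete", "configmaps", "--selector", SELECTOR]
    if dry_run:
        cmd.append("--dry-run=client")  # only print what would be deleted
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    cleanup_flink_ha_configmaps()  # dry run first; rerun with dry_run=False
```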
There are two different SRE use cases for cookbooks currently:
Temporarily stop/disable rdf-streaming-updater (flink) in one DC ("depool rdf-streaming-updater")
Used in case an SRE has to stop/disable rdf-streaming-updater, take down a kubernetes cluster (without deleting etcd), or something similar.
Actions that need to be taken to depool:
- Downtime rdf-streaming-updater alerts (list)
- Downtime WDQS and WCQS alerts in the affected DC (RdfStreamingUpdaterHighConsumerUpdateLag should be part of the alerts above, but it seems to not be working at the moment, see T316882)
- Depool dnsdisc=wdqs in that same DC
- Depool dnsdisc=wcqs in that same DC
This would ensure users still query an up-to-date dataset (from the other DC).
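A minimal sketch of the depool commands (the confctl selector syntax is an assumption to double-check against the conftool documentation; "codfw" stands in for the affected DC):

```
# Hypothetical sketch: depool the wdqs/wcqs discovery records in one DC.
# Assumes confctl (conftool) is available on a cluster management host; the
# selector syntax should be verified against the conftool documentation.
import subprocess

def depool_dnsdisc(service: str, dc: str) -> None:
    subprocess.run(
        ["confctl", "--object-type", "discovery",
         "select", f"dnsdisc={service},name={dc}",
         "set/pooled=false"],
        check=True,
    )

for service in ("wdqs", "wcqs"):
    depool_dnsdisc(service, "codfw")  # "codfw" stands in for the affected DC
```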
Re-Init a kubernetes cluster
Used in case an SRE updates/reinstalls a kubernetes cluster, losing the data in etcd (flink's dynamic configmaps)
Graceful depool/re-pool
Actions that need to be taken:
- [service-ops] depool rdf-streaming-updater
- [search-team] Stop all rdf-streaming-updater jobs with savepoints (https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Stop_the_job); see the sketch after this list
- [search-team] Store the paths to the generated savepoints somewhere
- [service-ops] undeploy the rdf-streaming-updater chart & possibly delete all configmaps
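A minimal sketch of the stop-with-savepoint step (the job ids and the savepoint target directory are placeholders; the wikitech page linked above is the canonical procedure):

```
# Hypothetical sketch: stop each job while taking a savepoint and keep the
# savepoint paths for the restore. Assumes the flink CLI can reach the
# jobmanager; the job ids and target directory are placeholders.
import subprocess

JOB_IDS = ["<wdqs-job-id>", "<wcqs-job-id>"]  # placeholders
SAVEPOINT_DIR = "<savepoint-target-dir>"      # placeholder

for job_id in JOB_IDS:
    # "flink stop -p <dir> <job-id>" stops the job with a savepoint and
    # prints the savepoint path on stdout.
    result = subprocess.run(
        ["flink", "stop", "-p", SAVEPOINT_DIR, job_id],
        check=True, capture_output=True, text=True,
    )
    print(result.stdout)  # store this output: it contains the savepoint path
```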
To restore:
- [service-ops] Deploy rdf-streaming-updater
- [search-team] Start all jobs from the saved savepoints; this needs to be done on a deploy host with access to the jobs' jar files (https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Start_the_job); see the sketch after this list
- Alerts should resolve
- Wait for the lag to catch up (https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1)
- [service-ops] Repool WDQS/WCQS read traffic
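A minimal sketch of the restart-from-savepoint step (the jar path and savepoint paths are placeholders; real values come from the stop step and from the wikitech page linked above):

```
# Hypothetical sketch: resume each job from its savepoint on a deploy host.
# The jar path and savepoint paths are placeholders carried over from the
# stop-with-savepoint step.
import subprocess

JAR = "<path-to-streaming-updater-jar>"  # placeholder, on the deploy host
SAVEPOINTS = [
    "<savepoint-path-wdqs>",  # placeholders recorded during the stop step
    "<savepoint-path-wcqs>",
]

for savepoint in SAVEPOINTS:
    subprocess.run(
        ["flink", "run", "--fromSavepoint", savepoint, JAR],
        check=True,
    )
```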
Hard depool/re-pool (not fully tested)
Actions that need to be taken:
- [service-ops] depool rdf-streaming-updater
- [service-ops] undeploy the rdf-streaming-updater service using helmfile destroy; this will force flink to stop abruptly
- [service-ops] Dump all dynamic configmaps in the rdf-streaming-updater namespace (as they should contain savepoint data); see the sketch after this list
- The job artifact will remain in the swift object storage
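A minimal sketch of the configmap dump (the output file name is arbitrary):

```
# Hypothetical sketch: dump every configmap in the namespace to a YAML file
# so the flink metadata survives the cluster re-init.
import subprocess

NAMESPACE = "rdf-streaming-updater"
DUMP_FILE = "rdf-streaming-updater-configmaps.yaml"  # arbitrary name

with open(DUMP_FILE, "w") as out:
    subprocess.run(
        ["kubectl", "-n", NAMESPACE, "get", "configmaps", "-o", "yaml"],
        check=True, stdout=out,
    )
```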
To restore:
- [service-ops] Restore all dumped configmaps into the rdf-streaming-updater namespace; see the sketch at the end of this list
- [service-ops] deploy rdf-streaming-updater with helmfile
- The WDQS & WCQS jobs should resume (the alerts should resolve; if not, the search team should investigate and recover the jobs manually)
- The job artifact jar is stored in the swift object storage and no specific action has to be taken here
- Wait for the lag to catch up (https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1)
- [service-ops] Repool WDQS/WCQS read traffic
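A minimal sketch of the restore (the dump file comes from the depool step; the helmfile path and environment name are placeholders to adapt to the actual deployment layout):

```
# Hypothetical sketch: restore the dumped configmaps, then redeploy the chart.
import subprocess

NAMESPACE = "rdf-streaming-updater"
DUMP_FILE = "rdf-streaming-updater-configmaps.yaml"

# Re-create the configmaps before flink starts so it can find its metadata.
# Note: cluster-specific fields in the dump (resourceVersion, uid, etc.) may
# need to be stripped before this apply succeeds on a freshly built cluster.
subprocess.run(["kubectl", "-n", NAMESPACE, "apply", "-f", DUMP_FILE], check=True)

# Redeploy the service; flink should then resume the jobs on its own.
subprocess.run(
    ["helmfile", "-e", "<environment>", "apply"],  # placeholder environment
    cwd="<path-to-the-rdf-streaming-updater-helmfile>",  # placeholder path
    check=True,
)
```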