As part of a recent incident (T336134), we identified that the documentation for diagnosing and fixing the issue needs improvement. A runbook explains the steps necessary to understand and mitigate an ongoing incident so that operators can act quickly to restore service.
Note that the focus should be on restoring the service for our users. Remediation steps that can be taken at a later date can wait until an expert is available. In this case, we want the documentation needed to diagnose the issue (which dashboards to look at, how to interpret them, which logs to expect, etc.). We might also want documentation on how to restart the WDQS Streaming Updater, how to redeploy it, and how to change its memory limits.
AC:
- documentation is updated with the steps that were needed during this incident
- dashboards and logs used
- restart of the WDQS Streaming Updater
- memory configuration of the WDQS Streaming Updater
- flesh out https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#The_job_is_not_starting