As an SRE managing WDQS, I want a simple procedure to follow for the known issue of "killer queries" so that I can quickly resolve an outage and share that responsibility with other SREs.
We have now had two [[ https://wikitech.wikimedia.org/wiki/Incident_documentation/20200723-wdqs-outage | documented ]] [[ https://wikitech.wikimedia.org/wiki/Incident_documentation/20200902-wdqs-outage | incidents ]] where WDQS was taken down by specific queries. We need a [[ https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook | runbook ]] entry that documents how to identify problematic queries and ban problematic user agents. In particular, it should document which debug information needs to be collected during such an incident so that we can investigate and hopefully find a root cause (at least [[ https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Further_analysis | take a few stack traces ]] before restarting the servers).
AC:
[] documented procedure in the [[ https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook | WDQS Runbook ]]