As an SRE managing WDQS, I want a simple procedure to follow for the known issue of "killer queries", so that I can quickly resolve an outage and share that responsibility with other SREs.
We have now had two documented incidents where WDQS was taken down by specific queries. We need a runbook entry that documents how to identify problematic queries and how to ban the offending user agents. In particular, it should document what debug information needs to be collected during such an incident so that we can investigate afterwards and hopefully find a root cause (at a minimum, take a few stack traces before restarting the servers).
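As a rough illustration of the "take a few stack traces before restarting" step, here is a minimal sketch that captures several JVM thread dumps with the JDK's `jstack` tool. It assumes Blazegraph runs as a JVM process whose PID is already known, that `jstack` is on the PATH, and that the script runs as the same user as that process. The function names, the output directory, and the count and interval defaults are hypothetical and not taken from the runbook.

```python
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path


def dump_path(out_dir: Path, pid: int, index: int, now: datetime) -> Path:
    """Build a timestamped file name for one thread dump (hypothetical naming scheme)."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return out_dir / f"jstack-{pid}-{stamp}-{index}.txt"


def collect_thread_dumps(pid: int, out_dir: Path, count: int = 3,
                         interval_s: float = 10.0, runner=subprocess.run) -> list[Path]:
    """Take `count` thread dumps of the JVM with the given PID, `interval_s` seconds apart.

    `runner` is injectable so the function can be exercised without a live JVM.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        # `jstack <pid>` prints the stack trace of every thread in the target JVM.
        result = runner(["jstack", str(pid)], capture_output=True, text=True, check=True)
        path = dump_path(out_dir, pid, i, datetime.now(timezone.utc))
        path.write_text(result.stdout)
        written.append(path)
        # Space the dumps out so that threads stuck in the same place stand out.
        if i < count - 1:
            time.sleep(interval_s)
    return written
```

Calling something like `collect_thread_dumps(pid, Path("/tmp/wdqs-incident"))` before the restart would leave three dumps to compare afterwards; the PID lookup and the directory are placeholders to adapt to the actual hosts.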
AC:
- documented procedure in the WDQS Runbook
See https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Blazegraph_deadlock for the new documentation