
Add an entry in the WDQS Runbook on killer queries
Closed, Resolved · Public · 2 Estimated Story Points

Description

As an SRE managing WDQS, I want a simple procedure to follow for the known issue of "killer queries" so that I can quickly resolve an outage and share that responsibility with other SREs.

We have now had two documented incidents in which WDQS was taken down by specific queries. We need a runbook entry that documents how to identify problematic queries and ban problematic user agents. In particular, it should document which debug information to collect during such an incident so that we can investigate and hopefully find a root cause (at a minimum, take a few stack traces before restarting the servers).
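
As an illustration of the "take a few stack traces before restarting" step, here is a minimal sketch, not the procedure from the runbook. It assumes the JDK's `jstack` tool is available on the host and that the script runs as a user allowed to attach to the JVM. The function name, the file naming scheme, and the default count and interval are all hypothetical.

```python
import subprocess
import time
from pathlib import Path


def collect_thread_dumps(pid, out_dir, count=3, interval_s=5.0,
                         run=subprocess.run, sleep=time.sleep):
    """Save `count` JVM thread dumps for `pid`, spaced `interval_s` seconds apart.

    `run` and `sleep` are injectable so the function can be exercised
    without a real JVM; by default they are subprocess.run and time.sleep.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        # `jstack <pid>` prints a thread dump of the target JVM to stdout.
        result = run(["jstack", str(pid)],
                     capture_output=True, text=True, check=True)
        # Hypothetical naming scheme: one file per dump, numbered in order.
        path = out / f"threads-{pid}-{i}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        # Several dumps taken a few seconds apart help show which
        # threads stay stuck; skip the wait after the last one.
        if i < count - 1:
            sleep(interval_s)
    return paths
```

Calling `collect_thread_dumps(pid, "/tmp/wdqs-dumps")` with the PID of the affected Java process, before restarting it, would leave three numbered dump files for later investigation. The output directory here is only an example.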

AC:

See https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook#Blazegraph_deadlock for the new documentation

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. · Sep 4 2020, 8:00 AM
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.
Gehel updated the task description.
Gehel set the point value for this task to 2. · Sep 14 2020, 5:22 PM
Gehel claimed this task.