
Update WDQS Runbook following update lag incident
Closed, Resolved | Public | 3 Estimated Story Points

Description

As part of a recent incident (T336134), we identified that the documentation for diagnosing and fixing the issue needs improvement. A runbook should explain the steps necessary to understand and mitigate an ongoing incident so that operators can act quickly to restore service.

Note that the focus should be on restoring service to our users. Remediation steps that can be taken at a later date can wait until an expert is available. In this case, we want the documentation needed to diagnose the issue (which dashboards to look at, how to interpret them, which logs to expect, etc.). We might also want documentation on how to restart the WDQS Streaming Updater, how to redeploy it, and how to change its memory limits.

AC:

Event Timeline

Updated the Streaming Updater operations docs after today's pairing session with @dcausse. We'll continue to update the docs as we examine previous alerts in T336574.

Other action items:

  • Add a link to the new WDQS Superset dashboard to the WDQS runbook page.
  • Fix the dead Logstash link on the WDQS runbook page.
  • Better document the throttling behavior, as described in the rdf repo at blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/throttling/ThrottlingFilter.java.
  • Learn whether we have a "slow log" for SPARQL queries and how to access it quickly.
Gehel triaged this task as High priority. Jun 27 2023, 3:46 PM

Based on a quick read of the linked documentation and a small addition, I believe we have satisfied the requirements. Closing...

bking claimed this task.
bking moved this task from Backlog to Done on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board.