
Investigate WDQS & WCQS codfw 24h+ lag
Closed, Resolved (Public)

Description

Beginning at 1100 UTC on 2025-06-23, the RDF streaming updater for WDQS and WCQS failed. Lag continued to grow until we intervened around 1745 UTC on 2025-06-24.

Creating this ticket to:

  • Identify root cause (it was the k8s upgrade in T397148)
  • Create runbook for this failure scenario
  • Identify and implement any changes that would make this less likely to happen and/or easier to fix in the future (such as updating the Flink App dashboard)

Event Timeline

Change #1163434 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: point to last valid checkpoint for restore

https://gerrit.wikimedia.org/r/1163434

Change #1163434 merged by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: point to last valid checkpoint for restore

https://gerrit.wikimedia.org/r/1163434

RKemper renamed this task from Investigate WDQS codfw 24h+ lag to Investigate WDQS & WCQS codfw 24h+ lag. Jun 24 2025, 9:26 PM

Change #1163475 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] wcqs: restore from checkpoint

https://gerrit.wikimedia.org/r/1163475

Change #1163475 merged by jenkins-bot:

[operations/deployment-charts@master] wcqs: restore from checkpoint

https://gerrit.wikimedia.org/r/1163475

Change #1163478 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] wcqs: restore from checkpoint

https://gerrit.wikimedia.org/r/1163478

Change #1163478 merged by jenkins-bot:

[operations/deployment-charts@master] wcqs: restore from checkpoint

https://gerrit.wikimedia.org/r/1163478

bking changed the task status from Open to In Progress. Jun 25 2025, 3:38 PM
bking claimed this task.
bking triaged this task as High priority.
bking updated Other Assignee, added: RKemper.
bking updated the task description.

The acceptance criteria (AC) for this ticket have been met, so I am closing it out. Feel free to reopen if I missed something.