
wdqs2*** lagged for more than one day
Closed, Resolved · Public · 5 Estimated Story Points

Description

Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2023-05-05_wdqs_not_updating_in_codfw

See https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m

Events:

  • 2023-05-04T10:00: the streaming updater Flink job stopped functioning in codfw for both WDQS and WCQS
  • 2023-05-05T16:22: the problem is reported by Bovlb via https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search
  • 2023-05-05T19:00: the Flink jobmanager container is manually restarted and the jobs resume, but the WDQS one is very unstable (k8s is heavily throttling CPU usage and taskmanager memory usage grows quickly)
    • (assumption) because the job was backfilling one day of data, it required more resources than usual, though this is not the first time a backfill has happened (e.g. backfills during k8s cluster upgrades went well)
    • (assumption) because the job was resource-constrained, rocksdb compaction did not happen in a timely manner
  • 2023-05-05T21:00: the job fails again
  • 2023-05-06T10:00: the job resumes (unknown reasons)
  • 2023-05-06T19:00: the job fails again
    • The JVM throws OutOfMemoryError
    • The checkpoint it tries to recover from is abnormally large (6G instead of the usual 1.5G); the assumption is that rocksdb compaction did not occur properly
  • 2023-05-07T17:27: this ticket is created as UBN
  • 2023-05-08T16:00: wdqs in CODFW is depooled
    • user impact ends
  • 2023-05-09T14:00: increasing taskmanager memory from 1.9G to 2.5G did not help (see the config sketch after this timeline)
  • 2023-05-09T14:00: starting the job on YARN across 12 containers with 5G each did help
    • the job recovered and started to produce reasonable checkpoint sizes
  • 2023-05-10T00:00: lag is back to normal on all wdqs servers
  • 2023-05-10T10:30: the job is resumed from k8s@codfw
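
For context, the taskmanager memory knob mentioned above maps onto Flink's standard memory options. A minimal sketch of the relevant flink-conf.yaml properties, assuming the chart passes them through as-is (the exact values layout of the flink-session-cluster chart is an assumption):

```yaml
# Total memory of each taskmanager process; this is the 1.9G -> 2.5G
# knob from the timeline above.
taskmanager.memory.process.size: 2500m
# With the RocksDB state backend, block caches and write buffers are
# served from Flink's managed (off-heap) memory; if this share is
# squeezed, RocksDB can fall behind on compaction.
taskmanager.memory.managed.fraction: 0.4
```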

Remaining actions:

  • Repool WDQS in codfw

AC:

  • WDQS codfw cluster is pooled and running with up-to-date data
  • Incident report is created
  • Issue is communicated on the wikidata mailing list

Event Timeline

Bugreporter triaged this task as Unbreak Now! priority. May 7 2023, 3:27 PM
Gehel set the point value for this task to 5.

Icinga downtime and Alertmanager silence (ID=d5700f27-224a-464e-878a-4a5780b823f5) set by bking@cumin1001 for 4:00:00 on 19 host(s) and their services with reason: rebooting to help with lag

wdqs[2004-2022].codfw.wmnet

Icinga downtime and Alertmanager silence (ID=8eda2109-041e-49e8-9896-a24bb9b2645e) set by bking@cumin1001 for 4:00:00 on 1 host(s) and their services with reason: rebooting to help with lag

wdqs1004.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=40727833-628f-4d74-8179-b284df022778) set by bking@cumin1001 for 4:00:00 on 1 host(s) and their services with reason: rebooting to help with lag

wdqs2006.codfw.wmnet

Host rebooted by bking@cumin1001 with reason: None (three such reboots logged)

Change 917820 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] flink-session-cluster: enable rocksdb metrics and increase jvm heap

https://gerrit.wikimedia.org/r/917820

Change 917820 merged by jenkins-bot:

[operations/deployment-charts@master] flink-session-cluster: enable rocksdb metrics and increase jvm heap

https://gerrit.wikimedia.org/r/917820
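
The metrics enabled by this change presumably correspond to Flink's native RocksDB metric switches (which exact switches the patch flips is an assumption); they are all off by default and are the usual way to see whether compaction is keeping up:

```yaml
# Flink native RocksDB metrics (sketch; disabled by default,
# each carries a small runtime cost).
state.backend.rocksdb.metrics.estimate-pending-compaction-bytes: true
state.backend.rocksdb.metrics.num-running-compactions: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.block-cache-usage: true
# JVM heap bump for the session cluster (illustrative size, not the
# value from the patch).
jobmanager.memory.heap.size: 2048m
```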

Change 917935 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: Increase task manager memory alloc

https://gerrit.wikimedia.org/r/917935

dcausse lowered the priority of this task from Unbreak Now! to High. May 10 2023, 10:48 AM
dcausse updated the task description.

Change 917935 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updater: Increase task manager memory alloc

https://gerrit.wikimedia.org/r/917935
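
If raising the total allocation alone is not enough, Flink also allows the task manager memory to be split explicitly between the JVM heap and RocksDB's off-heap state. A purely illustrative sketch (these sizes are not the values from the patch):

```yaml
# Explicit split of the taskmanager process memory (illustrative).
taskmanager.memory.task.heap.size: 1024m       # JVM heap for user code
taskmanager.memory.managed.size: 1024m         # off-heap, used by RocksDB
taskmanager.memory.jvm-overhead.fraction: 0.1  # native/GC/thread overhead
```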