Incident report: https://wikitech.wikimedia.org/wiki/Incidents/2023-05-05_wdqs_not_updating_in_codfw
See https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m
Events:
- 2023-05-04T10:00: the streaming updater Flink job stopped functioning in codfw for both WDQS and WCQS
  - user impact starts: stale results are seen when using WDQS from a region that is routed to codfw
  - the likely cause is https://issues.apache.org/jira/browse/FLINK-22597
- 2023-05-05T16:22: the problem is reported by Bovlb via https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search
- 2023-05-05T19:00: the Flink jobmanager container is manually restarted and the jobs resume, but the WDQS one is very unstable (k8s is heavily throttling CPU usage and taskmanager memory usage grows quickly)
  - (assumption) because the job was backfilling 1 day of data it required more resources than usual, though this is not the first time a backfill has happened (e.g. k8s cluster upgrades went well)
  - (assumption) because the job was resource constrained, RocksDB compaction did not happen in a timely manner
- 2023-05-05T21:00: the job fails again
- 2023-05-06T10:00: the job resumes (unknown reasons)
- 2023-05-06T19:00: the job fails again
  - JVM OutOfMemoryError errors are seen
  - the checkpoint it tries to recover from is abnormally large (6G instead of the usual 1.5G); the assumption is that RocksDB compaction did not occur properly
- 2023-05-07T17:27: this ticket is created as UBN
- 2023-05-08T16:00: wdqs in CODFW is depooled
  - user impact ends
- 2023-05-09T14:00: increasing taskmanager memory from 1.9G to 2.5G did not help
- 2023-05-09T14:00: starting the job from YARN across 12 containers with 5G did help
  - the job recovered and started to produce reasonable checkpoint sizes
- 2023-05-10T00:00: lag is back to normal on all wdqs servers
- 2023-05-10T10:30: the job is resumed from k8s@codfw
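
For the incident report, the key durations can be derived directly from the timestamps in the timeline above. The following is a minimal sketch, assuming all timestamps share one time zone; the variable and function names are illustrative only and are not part of any WDQS tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (assumed to share one time zone).
JOB_STOPPED = datetime.fromisoformat("2023-05-04T10:00")  # user impact starts
REPORTED = datetime.fromisoformat("2023-05-05T16:22")     # problem reported on-wiki
DEPOOLED = datetime.fromisoformat("2023-05-08T16:00")     # user impact ends
LAG_NORMAL = datetime.fromisoformat("2023-05-10T00:00")   # lag back to normal


def hours_between(start, end):
    """Return the elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


time_to_report = hours_between(JOB_STOPPED, REPORTED)
user_impact = hours_between(JOB_STOPPED, DEPOOLED)
time_to_recover = hours_between(JOB_STOPPED, LAG_NORMAL)

# Checkpoint sizes from the timeline: 6G observed versus 1.5G usual.
checkpoint_growth = 6 / 1.5

print(f"time to report:    {time_to_report:.1f} h")
print(f"user impact:       {user_impact:.1f} h")
print(f"time to recover:   {time_to_recover:.1f} h")
print(f"checkpoint growth: {checkpoint_growth:.1f}x")
```

Under these assumptions, the failure went unreported for about 30.4 hours, user impact lasted 102 hours, lag was back to normal 134 hours after the job stopped, and the failing checkpoint was 4 times its usual size.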
Remaining actions:
- Repool WDQS in codfw
AC:
- WDQS codfw cluster is pooled and running with up to date data
- Incident report is created
- Issue is communicated on the Wikidata mailing list