The updater is misbehaving in codfw: it is processing an abnormally large number of reconciliations, which triggers a slow update mode. As a result it cannot keep up with the update rate, and the lag causes maxlag to throttle bot edits on Wikidata.
On April 14:
- 07:00: the wcqs streaming updater starts to fail checkpointing
- 09:00: the wdqs streaming updater shows the same failures
- 11:30: both jobs resume normal operation
The errors all relate to read timeouts between Flink and the object store: P60589
During this period the jobs restarted multiple times from the same set of Kafka offsets.
For reasons that are yet to be investigated, some events were considered late during these restarts.
We do not use transactional producers for these side-output streams, so the same late events were re-emitted multiple times:
- wcqs emitted around 200k late events during this crashloop
- wdqs emitted around 2M of these
Doing a simple deduplication by item ID:
- wcqs emitted 22k distinct item IDs
- wdqs emitted 85773k distinct item IDs
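For illustration only, a minimal Spark sketch of that deduplication count, assuming the late events were dumped to a path readable by Spark and carry an item_id field (the path and field name are assumptions, not the actual tooling used here):

```scala
import org.apache.spark.sql.SparkSession

object LateEventStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("late-event-stats").getOrCreate()

    // Late events collected from the side-output stream (path is hypothetical).
    val lateEvents = spark.read.json("/wmf/data/raw/late_events/2024-04-14")

    println(s"total late events: ${lateEvents.count()}")
    println(s"distinct item IDs: ${lateEvents.select("item_id").distinct().count()}")

    spark.stop()
  }
}
```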
The Spark job that analyses these late events to re-emit them as reconciliations was not designed to handle such a volume of events:
- no deduplication
- event time set too early, causing some of these reconcile events to be considered late again by the Flink job and thus reprocessed by the Spark job the next hour
All this caused the backlog of mutations to contain far too many reconcile events, which are slow to process in Blazegraph.
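A minimal sketch of what the two fixes tracked in the action items below (per-item deduplication and a fresher event time) might look like; this is not the actual patch, and the field names (item_id, revision_id, event_time), paths and the choice of emission time as event time are assumptions:

```scala
import java.time.Instant
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, max}

object EmitReconciliations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("emit-reconciliations").getOrCreate()

    // Late events collected from the side-output stream (path is hypothetical).
    val lateEvents = spark.read.json("/wmf/data/raw/late_events/2024-04-14")

    val reconciliations = lateEvents
      // one reconcile event per item, keeping the highest revision seen
      .groupBy("item_id")
      .agg(max("revision_id").alias("revision_id"))
      // stamp the event with the emission time rather than the original,
      // hours-old event time, so the Flink job does not flag it as late again
      .withColumn("event_time", lit(Instant.now().toString))

    reconciliations.write.mode("overwrite").json("/wmf/data/reconcile/2024-04-14")
    spark.stop()
  }
}
```

Stamping the reconcile events with the emission time keeps them inside the Flink job's lateness bound, at the cost of losing the original event time.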
There are two possible ways to recover:
- wait for the corrupted backlog to be absorbed; at around 8 reconciliations/sec this could take 70 hours (roughly ending on Friday 19 April, midday UTC)
- perform a data transfer from a sane eqiad host, skipping the corrupted backlog
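As a rough sanity check on that estimate, assuming the backlog is dominated by the ~2M reconcile events emitted for wdqs: 2,000,000 events / (8 events/sec × 3,600 sec/hour) ≈ 69 hours, i.e. roughly the 70 hours quoted above.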
Action items:
- depool wdqs@codfw to mitigate the impact on bot edits
- the Spark job scanning side outputs should deduplicate events (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1019813/)
- the Spark job emitting reconciliation events should set an event time that is less likely to be considered late (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1019813/)
- possibly increase the socket timeout of the S3 client from 5s (default) to 30s (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020263); see the config sketch after this list
- possibly understand why both wdqs and wcqs jobs failed with read timeouts between 7:00am and 11:00am on April 14
- possibly understand why the flink jobs considered many events as late during the crashloop
- repool wdqs@codfw once codfw nodes are sane
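For the S3 socket timeout item, a sketch of what the setting could look like, assuming the jobs use Flink's flink-s3-fs-presto filesystem (which forwards s3.* keys to the underlying presto S3 client and would be consistent with the 5s default mentioned above); the authoritative change is the deployment-charts patch linked above:

```yaml
# flink-conf.yaml fragment (illustrative only; the real change lives in the
# deployment-charts patch referenced in the action items)
s3.socket-timeout: 30s
```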