
WDQS updater misbehaving in codfw
Open, High, Public

Description

The updater is misbehaving in codfw: it is apparently processing too many reconciliations, which triggers a slow update mode. As a result it cannot keep up with the update rate, and maxlag ends up throttling bot edits on Wikidata.

On April 14:

  • 07:00: the wcqs streaming updater starts to fail checkpointing
  • 09:00: the wdqs streaming updater shows the same failures
  • 11:30: both jobs resume normal operation

The errors all relate to read timeouts between Flink and the object store: P60589

During this period the job restarted multiple times from the same set of Kafka offsets.
For some reason, some events were considered late during these restarts (this has yet to be investigated).
We do not use transactional producers for these side-output streams, so the same late events were re-emitted multiple times (see the sketch after the list below):

  • wcqs emitted around 200k late events during this crashloop
  • wdqs emitted around 2M of these
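
These side outputs are written by the Flink job's Kafka sink, so the sketch below is only a Python illustration (using confluent-kafka) of what transactional production would provide, not the updater's actual code: with a transactional producer, writes made between the last checkpoint and a crash are aborted instead of surfacing as duplicates after the restart.

```
# Illustrative sketch only (confluent-kafka): a transactional Kafka producer.
# The streaming updater's side outputs are actually written by Flink's Java
# Kafka sink; this only shows why non-transactional writes get re-emitted
# when the job restarts from the same offsets.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-main.example.org:9092",       # hypothetical broker
    "transactional.id": "rdf-streaming-updater-side-output",  # hypothetical id
})
producer.init_transactions()

producer.begin_transaction()
producer.produce("late-events-topic", value=b'{"item_id": "Q42"}')  # hypothetical topic/payload
# On a clean checkpoint the transaction is committed; after a crash the
# uncommitted writes are aborted rather than delivered again as duplicates.
producer.commit_transaction()
```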

Doing a simple deduplication by item ID:

  • wcqs emitted 22k distinct item IDs
  • wdqs emitted 85,773 distinct item IDs

The Spark job that analyses these late events and re-emits them as reconciliations was not designed to handle such a volume of events (a sketch of the eventual fix follows the list):

  • no deduplication
  • event time set too early, causing some of these reconcile events to be considered late again by the Flink job and thus reprocessed by the Spark job on the next hourly run
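
The fix eventually applied (patch #1019813 below) deduplicates the side-output events and re-stamps them with a recent event time before re-emitting them as reconciliations. A minimal PySpark sketch of that idea; the table and column names (item_id, dt, and the source/sink tables) are hypothetical, not the production job's schema:

```
# Minimal PySpark sketch: deduplicate late side-output events by item ID and
# re-stamp them with a recent event time so the resulting reconcile events are
# not considered late again by the Flink job. Table/column names are
# hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconcile-late-events").getOrCreate()

late_events = spark.read.table("discovery.late_side_output_events")  # hypothetical source

reconcile_events = (
    late_events
    .dropDuplicates(["item_id"])              # one reconcile event per item
    .withColumn("dt", F.current_timestamp())  # recent event time instead of the original (late) one
)

reconcile_events.write.mode("overwrite").saveAsTable("discovery.reconcile_events")  # hypothetical sink
```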

All of this caused the backlog of mutations to contain far too many reconcile events, which are slow to process in Blazegraph.

There are two possible ways to recover:

  • wait for the corrupted backlog to be absorbed; at around 8 reconciliations/sec this could take roughly 70 hours (ending around midday UTC on Friday, April 19; see the estimate below)
  • perform a data transfer from a sane eqiad host, forcing the affected host to skip the corrupted backlog
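
A back-of-the-envelope check of the 70-hour figure for the first option, assuming the backlog is dominated by the ~2M re-emitted wdqs events:

```
# Rough drain-time estimate for the reconcile backlog (assumed ~2M events
# at ~8 reconciliations/second, as stated above).
backlog_events = 2_000_000
rate_per_second = 8

seconds = backlog_events / rate_per_second  # 250,000 s
hours = seconds / 3600                      # ~69.4 h, i.e. roughly 70 hours
print(f"{hours:.1f} hours")                 # -> 69.4 hours
```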

Action items:

Event Timeline

Gehel triaged this task as High priority. Mon, Apr 15, 1:18 PM
Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.

Change #1019813 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Deduplicate side output events and send them with a recent event-time

https://gerrit.wikimedia.org/r/1019813

Mentioned in SAL (#wikimedia-operations) [2024-04-15T22:09:17Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 19 hosts with reason: T362508

Mentioned in SAL (#wikimedia-operations) [2024-04-15T22:09:48Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 19 hosts with reason: T362508

Change #1020263 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] rdf-streaming-updater: increase s3 socket-timeout to 30s

https://gerrit.wikimedia.org/r/1020263

Mentioned in SAL (#wikimedia-operations) [2024-04-17T22:10:56Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 19 hosts with reason: T362508

Mentioned in SAL (#wikimedia-operations) [2024-04-17T22:11:28Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 19 hosts with reason: T362508

Mentioned in SAL (#wikimedia-operations) [2024-04-18T20:42:44Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on wdqs2023.codfw.wmnet with reason: T362508

Mentioned in SAL (#wikimedia-operations) [2024-04-18T20:42:53Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs2023.codfw.wmnet with reason: T362508

Mentioned in SAL (#wikimedia-operations) [2024-04-18T21:11:50Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (T362508, excessive lag) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-04-18T22:31:40Z] <bking@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T362508, excessive lag) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards

Unfortunately, @RKemper noticed that our data-transfer cookbook seems to have some issues applying the correct Kafka offsets. We've created T362983 to work through the problem.
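
For context on what "applying the correct Kafka offsets" involves, here is a minimal, purely illustrative sketch (confluent-kafka) of pinning a consumer group to a known offset so the updater resumes from the position captured during the transfer; it is not the sre.wdqs.data-transfer cookbook's implementation, and the broker, topic, group, and offset values are hypothetical:

```
# Illustrative only: commit a known offset for a consumer group so the
# updater resumes from the position captured on the source host. NOT the
# cookbook's actual code; broker/topic/group/offset values are hypothetical.
# The consumer group must be inactive (updater stopped) for the commit to
# be accepted by the broker.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka-main.example.org:9092",  # hypothetical broker
    "group.id": "wdqs2023",                               # hypothetical consumer group
    "enable.auto.commit": False,
})
consumer.commit(
    offsets=[TopicPartition("rdf-streaming-updater.mutation", 0, 123456789)],
    asynchronous=False,
)
consumer.close()
```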

Mentioned in SAL (#wikimedia-operations) [2024-04-19T18:34:36Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (T362508, journal in uncertain state) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-04-19T19:56:36Z] <bking@cumin2002> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T362508, journal in uncertain state) xfer wikidata from wdqs2022.codfw.wmnet -> wdqs2023.codfw.wmnet w/ force delete existing files, repooling both afterwards

Change #1019813 merged by jenkins-bot:

[wikidata/query/rdf@master] Deduplicate side output events and send them with a recent event-time

https://gerrit.wikimedia.org/r/1019813

Mentioned in SAL (#wikimedia-operations) [2024-04-30T14:19:08Z] <dcausse@deploy1002> Started deploy [airflow-dags/search@ab19bcd]: wdqs: deduplicate side-output events (T362508)

Mentioned in SAL (#wikimedia-operations) [2024-04-30T14:19:37Z] <dcausse@deploy1002> Finished deploy [airflow-dags/search@ab19bcd]: wdqs: deduplicate side-output events (T362508) (duration: 00m 29s)

Change #1020263 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updater: increase s3 socket-timeout to 30s

https://gerrit.wikimedia.org/r/1020263

Change #1025850 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: increase s3 socket-timeout to 30s

https://gerrit.wikimedia.org/r/1025850

Change #1025850 merged by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: increase s3 socket-timeout to 30s

https://gerrit.wikimedia.org/r/1025850