The updater is misbehaving in codfw: it is processing an abnormally large number of reconciliations, which triggers a slow update mode. As a result it cannot keep up with the update rate, and the lag causes maxlag to throttle bot edits on Wikidata.
On April 14:
- 07:00: the wcqs streaming updater starts to fail checkpointing
- 09:00: the wdqs streaming updater shows the same failures
- 11:30: both jobs resume normal operation
The errors all relate to read timeouts between Flink and the object store: P60589
During this period the jobs restarted multiple times from the same set of Kafka offsets.
For reasons that are yet to be investigated, some events were considered late during these restarts.
We do not use transactional producers for these side-output streams, so the same late events were re-emitted multiple times:
- wcqs emitted around 200k late events during this crashloop
- wdqs emitted around 2M of these
Doing a simple deduplication by item ID:
- wcqs emitted 22k distinct item IDs
- wdqs emitted 85773k distinct item IDs
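For illustration only, a minimal Spark sketch of that deduplication count, assuming the late events were dumped to a path readable by Spark and carry an item_id field (the path and field name are assumptions, not the actual tooling used here):

```scala
import org.apache.spark.sql.SparkSession

object LateEventStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("late-event-stats").getOrCreate()

    // Late events collected from the side-output stream (path is hypothetical).
    val lateEvents = spark.read.json("/wmf/data/raw/late_events/2024-04-14")

    println(s"total late events: ${lateEvents.count()}")
    println(s"distinct item IDs: ${lateEvents.select("item_id").distinct().count()}")

    spark.stop()
  }
}
```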
The Spark job that analyses these late events to re-emit them as reconciliations was not designed to handle such a volume of events:
- no deduplication
- event time set too early, causing some of these reconcile events to be considered late again by the Flink job and thus reprocessed by the Spark job the next hour
All this caused the backlog of mutations to contain far too many reconcile events, which are slow to process in Blazegraph.
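A minimal sketch of what the two fixes tracked in the action items below (per-item deduplication and a fresher event time) might look like; this is not the actual patch, and the field names (item_id, revision_id, event_time), paths and the choice of emission time as event time are assumptions:

```scala
import java.time.Instant
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, max}

object EmitReconciliations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("emit-reconciliations").getOrCreate()

    // Late events collected from the side-output stream (path is hypothetical).
    val lateEvents = spark.read.json("/wmf/data/raw/late_events/2024-04-14")

    val reconciliations = lateEvents
      // one reconcile event per item, keeping the highest revision seen
      .groupBy("item_id")
      .agg(max("revision_id").alias("revision_id"))
      // stamp the event with the emission time rather than the original,
      // hours-old event time, so the Flink job does not flag it as late again
      .withColumn("event_time", lit(Instant.now().toString))

    reconciliations.write.mode("overwrite").json("/wmf/data/reconcile/2024-04-14")
    spark.stop()
  }
}
```

Stamping the reconcile events with the emission time keeps them inside the Flink job's lateness bound, at the cost of losing the original event time.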
There are two possible ways to recover:
- wait for the corrupted backlog to be absorbed; at around 8 reconciliations/sec this could take 70 hours (roughly ending on Friday 19 April, midday UTC)
- perform a data transfer from a sane eqiad host, skipping the corrupted backlog
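As a rough sanity check on that estimate, assuming the backlog is dominated by the ~2M reconcile events emitted for wdqs: 2,000,000 events / (8 events/sec × 3,600 sec/hour) ≈ 69 hours, i.e. roughly the 70 hours quoted above.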
Action items:
- depool wdqs@codfw to mitigate the impact on bot edits
- the Spark job scanning side outputs should deduplicate events (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1019813/)
- the Spark job emitting reconciliation events should set an event time that is less likely to be considered late (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1019813/)
- possibly increase the socket timeout of the S3 client from 5s (default) to 30s (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020263); see the config sketch after this list
- possibly understand why both wdqs and wcqs jobs failed with read timeouts between 7:00am and 11:00am on April 14
- possibly understand why the flink jobs considered many events as late during the crashloop
- repool wdqs@codfw once codfw nodes are sane
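For the S3 socket timeout item, a sketch of what the setting could look like, assuming the jobs use Flink's flink-s3-fs-presto filesystem (which forwards s3.* keys to the underlying presto S3 client and would be consistent with the 5s default mentioned above); the authoritative change is the deployment-charts patch linked above:

```yaml
# flink-conf.yaml fragment (illustrative only; the real change lives in the
# deployment-charts patch referenced in the action items)
s3.socket-timeout: 30s
```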