Root cause zhwiki namespace 10 article corruption in snapshot
Closed, Resolved | Public | 8 Estimated Story Points

Description

More than 2,500 articles in the zhwiki_namespace_10 snapshot are corrupted (unparseable). We need to understand the cause before attempting a fix.

Refer to Investigations/Investigation: Unparseable zhwiki articles in snapshots for details.

To do

  • Diagnose whether the snapshot process works correctly after it receives the JSON article. -> No problem here
  • Diagnose whether the basic Avro -> Golang struct instance -> JSON conversion works correctly. -> No problem here
  • Get a list of corrupted articles in several snapshots of zhwiki. Corrupted articles are the ones we cannot parse as JSON.
  • Try to find these articles in the Kafka topic. See if they are corrupted there as well.
  • Run bulk ingestion in dev for zhwiki namespace 10.
    • Root cause and fix the 0 consumer lag problem for articlebulk - after setting the auto.offset.reset config, it worked when running the DAG, but not when spawning a new articlebulk instance from the dashboard.
    • Root cause why bad messages are not skipped during unmarshaling in the snapshot service (if err := h.Stream.Unmarshal(ctx, msg.Value, art); err != nil {)
  • Create a snapshot in dev for zhwiki_namespace_10 and try to get the list of corrupted articles (not parseable as JSON).

Acceptance criteria

  • Root cause identified
  • Hypothesis of fix verified

Event Timeline

prabhat updated the task description.
prabhat set Final Story Points to 8.
prabhat set the point value for this task to 8.