More than 2,500 articles in the zhwiki_namespace_10 snapshot are corrupted (unparseable). We need to identify the root cause before attempting a fix.
Refer to Investigations/Investigation: Unparseable zhwiki articles in snapshots for details.
To do
- Diagnose whether the snapshot process works correctly after the JSON article is received. -> No problem here
- Diagnose whether the basic Avro -> Golang struct instance -> JSON process works correctly. -> No problem here
- Get a list of corrupted articles in several zhwiki snapshots. Corrupted articles are those that cannot be parsed as JSON.
- Try to find these articles in the Kafka topic and check whether they are already corrupted there.
- Run bulk ingestion in dev for zhwiki namespace 10.
- Root cause and fix the 0 consumer lag problem for articlebulk. After setting the auto.offset.reset config, it worked when running the DAG, but not when spawning a new articlebulk instance from the dashboard.
- Root cause why bad messages are not skipped during unmarshaling in the snapshot service (if err := h.Stream.Unmarshal(ctx, msg.Value, art); err != nil {)
- Create a snapshot in dev for zhwiki_namespace_10 and try to get the list of corrupted articles (not readable as JSON).
Acceptance criteria
- Root cause identified
- Hypothesis of fix verified