
Scraper aggregation step failing on YAML input
Closed, Resolved · Public

Description

The scraper crashes while making the final aggregate:

mix run pipeline.exs

15:05:02.499 [error] Failed: ./reports/all-wikis-20230601-summary.csv

Overall progress                      773/773 [≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡] 100%
** (YamlElixir.ParsingError) malformed yaml
    (yaml_elixir 2.9.0) lib/yaml_elixir.ex:22: YamlElixir.read_from_file!/2
    (scrape_wiki_dump 0.1.0) lib/pipeline.ex:96: anonymous fn/1 in Wiki.DumpPipeline.aggregate_across_wikis/1
    (elixir 1.14.3) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
    (elixir 1.14.3) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
    (scrape_wiki_dump 0.1.0) lib/checkpointer.ex:120: anonymous fn/3 in Wiki.Checkpointer.run_once/4
    (scrape_wiki_dump 0.1.0) lib/checkpointer.ex:134: Wiki.Checkpointer.report_progress/2
    (scrape_wiki_dump 0.1.0) lib/checkpointer.ex:116: Wiki.Checkpointer.run_once/4

Debugging shows that the arwikinews summary is the file causing the crash; however, the same YAML file is read correctly when the same library call is made from the command line:

iex -S mix

YamlElixir.read_from_file!("reports/arwikinews-20230601-summary.yaml")

Implementation

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/83

Event Timeline

awight updated the task description.

The problematic lines contain YAML special characters that appear without escaping:

- - :{{BASEPAGENAME}}/04
  - 1

Putting quotes around these lines allows the file to be parsed correctly. The "ymlr" library is responsible for writing the YAML here, so I'll file an upstream bug and swap in another library.
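For reference, here are the failing and the fixed forms side by side (a minimal YAML sketch of the summary's list-of-pairs shape):

```yaml
# Fails to parse: the parser trips over the plain scalar's leading ":{".
- - :{{BASEPAGENAME}}/04
  - 1

# Parses correctly: the same scalar, double-quoted.
- - ":{{BASEPAGENAME}}/04"
  - 1
```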

Filed upstream as https://github.com/ufirstgroup/ymlr/issues/140 , with a PR included. I'm not sure it will be accepted, however: the maintainer's philosophy is to keep quoting to a minimum, and this seems to be a mistake in the yamerl parser, which interprets the initial :{ as syntax. Changing that library is even less likely...

Fortunately for us, the value itself is suspicious, so maybe we can work around the problem by filtering it out. Template names don't begin with a colon, so this seems avoidable by changing our scrape parser. Unfortunately for us, that would require a complete re-run, so we might end up doing something hacky for this run.
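As a sketch of that workaround, a hypothetical pre-aggregation filter (module and function names are my own; it assumes the summaries are lists of [template_name, count] pairs, as in the snippet above):

```elixir
# Hypothetical filter: real template names never start with ":",
# so rows whose name does can be dropped before (re)encoding.
defmodule Wiki.SummaryFilter do
  def drop_suspicious(rows) do
    Enum.reject(rows, fn
      [name | _] when is_binary(name) -> String.starts_with?(name, ":")
      _ -> false
    end)
  end
end

Wiki.SummaryFilter.drop_suspicious([[":{{BASEPAGENAME}}/04", 1], ["Navbox", 7]])
# => [["Navbox", 7]]
```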

awight moved this task from Doing to Review on the WMDE-TechWish-Maintenance-2023 board.

But in the interest of making the scraper more robust to bad data, we can simply re-encode these files as JSON. Patch pushed for review.
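A minimal sketch of what such a re-encoding step could look like, assuming the project's yaml_elixir dep and a JSON encoder such as jason (module, function, and path names here are hypothetical):

```elixir
# Hypothetical one-shot converter: read each "*-summary.yaml" report
# and write the same data back out alongside it as "*-summary.json".
defmodule Wiki.Reencode do
  # Map "...-summary.yaml" to "...-summary.json".
  def json_path(yaml_path), do: String.replace_suffix(yaml_path, ".yaml", ".json")

  # Read the YAML summary and persist it as JSON.
  # Assumes the yaml_elixir and jason dependencies are available.
  def yaml_to_json(yaml_path) do
    data = YamlElixir.read_from_file!(yaml_path)
    out = json_path(yaml_path)
    File.write!(out, Jason.encode!(data))
    out
  end
end
```

With the summaries stored as JSON, the aggregate step no longer depends on the YAML emitter's quoting behavior.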