The scraper crashes while making the final aggregate:
mix run pipeline.exs

15:05:02.499 [error] Failed: ./reports/all-wikis-20230601-summary.csv
Overall progress 773/773 [≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡] 100%
** (YamlElixir.ParsingError) malformed yaml
    (yaml_elixir 2.9.0) lib/yaml_elixir.ex:22: YamlElixir.read_from_file!/2
    (scrape_wiki_dump 0.1.0) lib/pipeline.ex:96: anonymous fn/1 in Wiki.DumpPipeline.aggregate_across_wikis/1
    (elixir 1.14.3) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
    (elixir 1.14.3) lib/enum.ex:1658: Enum."-map/2-lists^map/1-0-"/2
    (scrape_wiki_dump 0.1.0) lib/checkpointer.ex:120: anonymous fn/3 in Wiki.Checkpointer.run_once/4
    (scrape_wiki_dump 0.1.0) lib/checkpointer.ex:134: Wiki.Checkpointer.report_progress/2
    (scrape_wiki_dump 0.1.0) lib/checkpointer.ex:116: Wiki.Checkpointer.run_once/4
Debugging shows that the arwikinews summary is the file causing the crash. However, the same YAML file is read correctly when the same library call is made from the command line:
iex -S mix

YamlElixir.read_from_file!("reports/arwikinews-20230601-summary.yaml")
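To confirm which summaries fail without crashing on the first one, a check along the following lines might help. This is only a sketch: the `SummaryCheck` module and the glob pattern are hypothetical (the pattern is guessed from the file name above), and it assumes the non-raising `YamlElixir.read_from_file/1` returns `{:ok, data}` or `{:error, reason}`.

```elixir
defmodule SummaryCheck do
  # Hypothetical helper, not part of the pipeline: tries to parse every
  # summary file and returns the ones that fail, instead of raising on the
  # first bad file. `reader` is any 1-arity function returning {:ok, data}
  # or {:error, reason}; YamlElixir.read_from_file/1 is assumed to have
  # that shape.
  def unreadable(paths, reader \\ &YamlElixir.read_from_file/1) do
    for path <- paths,
        # Only results matching {:error, reason} pass this pattern filter.
        {:error, reason} <- [reader.(path)] do
      {path, reason}
    end
  end
end

# Assumed glob, based on the arwikinews file name in this report.
"reports/*-20230601-summary.yaml"
|> Path.wildcard()
|> SummaryCheck.unreadable()
|> Enum.each(fn {path, reason} -> IO.puts("#{path}: #{inspect(reason)}") end)
```

Running this as a script with `mix run`, from the same working directory as the pipeline, rather than typing it into `iex`, would keep the conditions closer to the failing run.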
Implementation
Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/83