
Scraper: scan outputs for inconsistencies
Closed, Resolved · Public

Description

Before we destroy the instance, let's take the opportunity to check for errors:

  • Check that .json files can be parsed.
  • Check that each line of .ndjson can be parsed.
  • Check for duplicate entries in each page summary and mapdata output file, by building a set keyed on revid.
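The checks above can be sketched roughly as follows. This is a hypothetical Python illustration of the logic, not the actual implementation (which lives in `validate-outputs.exs` on the linked branch); the `revid` key and the `.json`/`.ndjson` split are taken from the task description, while the function name and error format are invented here.

```python
import json
from pathlib import Path

def validate_file(path: Path) -> list[str]:
    """Return a list of error descriptions for one output file.

    Sketch only: checks that .json files parse, that each .ndjson line
    parses, and that no revid appears twice (tracked via a set).
    """
    errors = []
    if path.suffix == ".json":
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError as e:
            errors.append(f"{path}: invalid JSON: {e}")
    elif path.suffix == ".ndjson":
        seen_revids = set()
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"{path}:{lineno}: invalid NDJSON line: {e}")
                continue
            revid = record.get("revid")
            if revid in seen_revids:
                errors.append(f"{path}:{lineno}: duplicate revid {revid}")
            seen_revids.add(revid)
    return errors
```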

Implementation: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/tree/validate-outputs?ref_type=heads

Related Objects

Event Timeline

The validator has been run on the outputs: we found no JSON parse errors, but a very large number of duplicates. A per-file tally of duplicates will be included in the reports as a file "duplicates.txt".
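A per-file tally like the one described could be produced along these lines. This is a hedged sketch, not the actual report generator; the `revid` key comes from the task description, and the output format of the real duplicates.txt may differ.

```python
from collections import Counter
import json

def tally_duplicates(ndjson_lines):
    """Return {revid: occurrence_count} for revids appearing more than once.

    Sketch: counts every revid, then keeps only those seen at least twice,
    which is the per-file tally a duplicates.txt report would summarize.
    """
    counts = Counter()
    for line in ndjson_lines:
        record = json.loads(line)
        counts[record["revid"]] += 1
    return {revid: n for revid, n in counts.items() if n > 1}
```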

As far as I can tell the only new file in that branch is https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/validate-outputs/validate-outputs.exs. Is that correct?

Can we see the output anywhere online? Specifically the duplicates file?

> Can we see the output anywhere online? Specifically the duplicates file?

Unfortunately not yet; this is tracked in T341751. The outputs and the duplicates file are currently on a closed WMCS instance, but members of our team have login access:

ssh runner.dump-references-processor.eqiad1.wikimedia.cloud
ls /srv/reports/duplicates.txt
ls /srv/reports/