
Scraper: scan outputs for inconsistencies
Closed, Resolved · Public

Description

Before we destroy the instance, let's take the opportunity to check for errors:

  • Check that .json files can be parsed.
  • Check that each line of .ndjson can be parsed.
  • Check for duplicate entries in each page summary and mapdata output file, by building a set keyed on revid.
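The checks above can be sketched roughly as follows. This is a hypothetical Python illustration of the logic, not the actual implementation (which lives in `validate-outputs.exs` on the linked branch); the `revid` key and the `.json`/`.ndjson` split are taken from the task description, while the function name and error format are invented here.

```python
import json
from pathlib import Path

def validate_file(path: Path) -> list[str]:
    """Return a list of error descriptions for one output file.

    Sketch only: checks that .json files parse, that each .ndjson line
    parses, and that no revid appears twice (tracked via a set).
    """
    errors = []
    if path.suffix == ".json":
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError as e:
            errors.append(f"{path}: invalid JSON: {e}")
    elif path.suffix == ".ndjson":
        seen_revids = set()
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"{path}:{lineno}: invalid NDJSON line: {e}")
                continue
            revid = record.get("revid")
            if revid in seen_revids:
                errors.append(f"{path}:{lineno}: duplicate revid {revid}")
            seen_revids.add(revid)
    return errors
```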

Implementation: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/tree/validate-outputs?ref_type=heads

Related Objects

Event Timeline

The validator has been run on the outputs: we found no JSON parse errors, but a very large number of duplicates. A per-file tally of duplicates will be included in the reports as a file "duplicates.txt".
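A per-file tally like the one described could be produced along these lines. This is a hedged sketch, not the actual report generator; the `revid` key comes from the task description, and the output format of the real duplicates.txt may differ.

```python
from collections import Counter
import json

def tally_duplicates(ndjson_lines):
    """Return {revid: occurrence_count} for revids appearing more than once.

    Sketch: counts every revid, then keeps only those seen at least twice,
    which is the per-file tally a duplicates.txt report would summarize.
    """
    counts = Counter()
    for line in ndjson_lines:
        record = json.loads(line)
        counts[record["revid"]] += 1
    return {revid: n for revid, n in counts.items() if n > 1}
```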

As far as I can tell the only new file in that branch is https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/validate-outputs/validate-outputs.exs. Is that correct?

Can we see the output anywhere online? Specifically the duplicates file?

> Can we see the output anywhere online? Specifically the duplicates file?

Unfortunately not yet; this is tracked in T341751. The outputs and the duplicates file are currently on a closed WMCS instance, but members of our team have login access:

ssh runner.dump-references-processor.eqiad1.wikimedia.cloud
ls /srv/reports/duplicates.txt
ls /srv/reports/