
Scraping enterprise dumps: investigate incomplete article lists
Closed, Resolved (Public)

Description

At least 1% of articles are missing from the HTML dumps when we compare the article count reported by the meta=siteinfo&siprop=statistics API to the count of unique pages found in the dumps. This task is done when we have an explanation for why the numbers differ.
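For reference, the API side of that comparison is the "articles" counter from siteinfo statistics; a call like the following returns it (dewiki used as an example, exact invocation not recorded here):

curl -s 'https://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' | jq '.query.statistics.articles'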

Hypothesis #1: the enterprise dump source code gets its article list by making repeated API calls to the list=allpages action API endpoint. This seems inherently unstable: the underlying query is re-run at unpredictable times, and the cursor works by row count rather than something stable like page ID.

Hypothesis #2: perhaps deleted articles are counted in the stats but not in the dump? The sitestats endpoint does exclude pages with page_is_redirect != 0, but it seems to count deleted pages.
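One way to sanity-check Hypothesis #2 would be to read the raw counter behind the "articles" statistic straight from the database, assuming the standard MediaWiki site_stats schema (a sketch, not something we ran):

analytics-mysql dewiki -B -e 'select ss_good_articles, ss_total_pages from site_stats'  # assumes the stock site_stats columns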

Event Timeline

ApiQueryAllPages uses the page title to carry continuation state, which is very reasonable! This would only be fooled by page renames happening during the dump interval, which is possible but not likely to add up to 1%. Hypothesis #1 is looking unlikely.
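For illustration, the title-based cursor is visible in the continuation block of an ordinary allpages request (dewiki example; the parameters here are assumptions, not necessarily what the enterprise dump code uses):

curl -s 'https://de.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&format=json' | jq -r '.continue.apcontinue'

The printed value is a page title, which gets passed back as apcontinue on the next request. That cursor survives concurrent inserts and deletes, but pages renamed across the continuation boundary can be skipped or repeated.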

Deleted articles shouldn't show up in either list, so Hypothesis #2 is also looking unlikely.

A good next step would be to list all article titles and spot-check which are missing from the dump.

analytics-mysql dewiki -B -e 'select page_title from page where page_namespace=0 and page_is_redirect=0' | sort > dewiki_all_pages_db.txt

gunzip -c /srv/published/datasets/one-off/html-dump-scraper-refs/20240201/dewiki-20240201-page-summary.ndjson.gz | jq -r '.title' | sed -e 's/ /_/g' | sort > dewiki-processed-normalized.txt
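The +/- listings below presumably came from a unified diff of the two sorted lists, i.e. something along these lines (the exact invocation wasn't recorded):

diff -u dewiki_all_pages_db.txt dewiki-processed-normalized.txt | less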

Oookay there are all kinds of things happening. Diffing the two lists, we can see that the scraper is still producing duplicates:

+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel

and there are lots of pages missing from the scraped dump:

 1943_B3
-1943_–_Kampf_um_das_Vaterland
-1943_–_Operation_Russland
 1943:_The_Battle_of_Midway
...
-Ådalens_IF
...
 A_Great_Day_in_Harlem
-A_Great_Day_in_Harlem_(Film)
-A_Great_Day_in_Harlem_(Foto)
+A_Great_Place_to_Call_Home
+A_Great_Place_to_Call_Home
+A_Great_Place_to_Call_Home
 A_Great_Place_to_Call_Home
awight renamed this task from "Enterprise dumps: investigate incomplete article lists" to "Scraping enterprise dumps: investigate incomplete article lists". Tue, Apr 16, 2:12 PM

Bringing this task into our sprint because it has data quality implications and probably blocks scraping for the moment.

Duplicates: each copy of a page comes with a different revid. Checking the final counts, we can see that our deduplication did catch the extra copies during the aggregation step:

uniq dewiki-processed-normalized.txt  | wc -l
2884601
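As a side note, a quick way to eyeball one of the duplicated titles and its differing revids would be something like this (a sketch; it assumes the summary records carry a revid field, as implied above):

gunzip -c /srv/published/datasets/one-off/html-dump-scraper-refs/20240201/dewiki-20240201-page-summary.ndjson.gz | jq -c 'select(.title == "1. Buch Samuel") | {pageid, revid}'  # revid field name assumed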

Interestingly, additional pages must have been rejected somewhere, because the final aggregate count is 2880242, lower than the 2884601 unique titles above.

Checking whether these can be explained by page renames, in which case we would have detected a duplicate by pageid:

gunzip -c /srv/published/datasets/one-off/html-dump-scraper-refs/20240201/dewiki-20240201-page-summary.ndjson.gz | jq -r '.pageid' | sort | uniq | wc -l
2880242

Yes, this matches exactly, so from our side the duplicates issue can be considered closed. There are still questions about why the enterprise dump includes these duplicate lines, of course...

awight moved this task from Doing to Done on the WMDE-TechWish-Sprint-2024-04-12 board.

Hmm, spot-checking is only turning up articles which were created or moved after the snapshot date.

And we realized that the API figure we were using as a checksum actually reflects the number of articles as of today's date, three months after the snapshot. It's expected that this would be about 1% higher.

The only remaining issue is with the dump duplicates, but we've already protected ourselves from that.

WMDE-Fisch claimed this task.