
Scraping enterprise dumps: investigate incomplete article lists
Closed, Resolved (Public)

Description

At least 1% of articles are missing from the HTML dumps when we compare the article count reported by the meta=siteinfo&siprop=statistics API to the count of unique pages found in the dumps. This task is done when we have an explanation for why the numbers differ.
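For reference, the API side of that comparison is the "articles" counter from siteinfo statistics; a call like the following returns it (dewiki used as an example, exact invocation not recorded here):

curl -s 'https://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' | jq '.query.statistics.articles'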

Hypothesis #1: the enterprise dump source code gets its article list by making repeated API calls to the list=allpages action API endpoint. This seems inherently unstable: the underlying query is re-run at unpredictable times, and the cursor works by row count rather than something stable like page ID.

Hypothesis #2: perhaps deleted articles are counted in the stats but not in the dump? The sitestats endpoint does exclude pages with page_is_redirect != 0, but it seems to count deleted pages.
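One way to sanity-check Hypothesis #2 would be to read the raw counter behind the "articles" statistic straight from the database, assuming the standard MediaWiki site_stats schema (a sketch, not something we ran):

analytics-mysql dewiki -B -e 'select ss_good_articles, ss_total_pages from site_stats'  # assumes the stock site_stats columns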

Event Timeline

ApiQueryAllPages uses the page title to carry continuation state, which is very reasonable! This would only be fooled by page renames happening during the dump interval, which is possible but not likely to add up to 1%. Hypothesis #1 is looking unlikely.
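For illustration, the title-based cursor is visible in the continuation block of an ordinary allpages request (dewiki example; the parameters here are assumptions, not necessarily what the enterprise dump code uses):

curl -s 'https://de.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&format=json' | jq -r '.continue.apcontinue'

The printed value is a page title, which gets passed back as apcontinue on the next request. That cursor survives concurrent inserts and deletes, but pages renamed across the continuation boundary can be skipped or repeated.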

Deleted articles shouldn't show up in either list, so Hypothesis #2 is also looking unlikely.

A good next step would be to list all article titles and spot-check which are missing from the dump.

analytics-mysql dewiki -B -e 'select page_title from page where page_namespace=0 and page_is_redirect=0' | sort > dewiki_all_pages_db.txt

gunzip -c /srv/published/datasets/one-off/html-dump-scraper-refs/20240201/dewiki-20240201-page-summary.ndjson.gz | jq -r '.title' | sed -e 's/ /_/g' | sort > dewiki-processed-normalized.txt
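The +/- listings below presumably came from a unified diff of the two sorted lists, i.e. something along these lines (the exact invocation wasn't recorded):

diff -u dewiki_all_pages_db.txt dewiki-processed-normalized.txt | less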

Oookay there are all kinds of things happening. Diffing the two lists, we can see that the scraper is still producing duplicates:

+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel

and there are lots of pages missing from the scraped dump:

 1943_B3
-1943_–_Kampf_um_das_Vaterland
-1943_–_Operation_Russland
 1943:_The_Battle_of_Midway
...
-Ådalens_IF
...
 A_Great_Day_in_Harlem
-A_Great_Day_in_Harlem_(Film)
-A_Great_Day_in_Harlem_(Foto)
+A_Great_Place_to_Call_Home
+A_Great_Place_to_Call_Home
+A_Great_Place_to_Call_Home
 A_Great_Place_to_Call_Home
awight renamed this task from "Enterprise dumps: investigate incomplete article lists" to "Scraping enterprise dumps: investigate incomplete article lists". Tue, Apr 16, 2:12 PM

Bringing this task into our sprint because it has data quality implications and probably blocks scraping for the moment.

Duplicates: each copy of a page comes with a different revid. Checking the final counts, we can see that our deduplication did catch the extra copies during the aggregation step:

uniq dewiki-processed-normalized.txt  | wc -l
2884601
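As a side note, a quick way to eyeball one of the duplicated titles and its differing revids would be something like this (a sketch; it assumes the summary records carry a revid field, as implied above):

gunzip -c /srv/published/datasets/one-off/html-dump-scraper-refs/20240201/dewiki-20240201-page-summary.ndjson.gz | jq -c 'select(.title == "1. Buch Samuel") | {pageid, revid}'  # revid field name assumed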

Interestingly, additional pages must have been rejected somewhere, because the final aggregate count is 2880242, lower than the 2884601 unique titles above.

Checking whether these can be explained by page renames, in which case we would have detected a duplicate by pageid:

gunzip -c /srv/published/datasets/one-off/html-dump-scraper-refs/20240201/dewiki-20240201-page-summary.ndjson.gz | jq -r '.pageid' | sort | uniq | wc -l
2880242

Yes, this matches exactly, so from our side the duplicates issue can be considered closed. There are still questions about why the enterprise dump includes these duplicate lines, of course...

awight moved this task from Doing to Done on the WMDE-TechWish-Sprint-2024-04-12 board.

Hmm, spot-checking is only turning up articles which were created or moved after the snapshot date.

And we realized that the API figure we were using as a checksum actually reflects the number of articles as of today's date, three months after the snapshot. It's expected that this would be about 1% higher.

The only remaining issue is with the dump duplicates, but we've already protected ourselves from that.

WMDE-Fisch claimed this task.