
[priority] Scraper: count number of sub-references (on dewiki)
Closed, Resolved (Public)

Description

This metric is split out to speed up information gathering ahead of reporting.

Top priority is to add this metric, test it locally, then run on dewiki:

  • total number of sub-references across a wiki

Implementation

Current status

We've finished a dewiki scrape, but the results still need to be verified. The scraper detected 1,245 pages with subrefs, whereas at the time of updating this task the category contains 5,023 pages. Either the analysis is missing pages or the category has grown abruptly since the snapshot was taken.

Nov 28 run outputs: https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs/2025-11-28/
dewiki details: https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs/2025-11-28/dewiki-summary.json
csv: https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs/2025-11-28/all-wikis-summary.csv

Number of pages on dewiki with subrefs (pages_with_subrefs_count): 1,245
Number of subrefs on dewiki (subrefs_sum): 14,548
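
As a quick sanity check on these figures, the arithmetic can be sketched as below. The numbers are the ones quoted in this task; the variable names are illustrative only, and the comparison assumes that the category count of 5,023 and the scraper output describe the same set of pages, which is exactly what still needs to be verified.

```python
# Figures quoted in this task; the variable names are illustrative.
pages_with_subrefs_count = 1_245  # pages the scraper detected on dewiki
subrefs_sum = 14_548              # sub-references the scraper counted on dewiki
category_page_count = 5_023       # pages in the category when the task was updated

# Share of category pages that the scraper actually found.
coverage = pages_with_subrefs_count / category_page_count
# Pages listed in the category but absent from the scraper output.
undetected_pages = category_page_count - pages_with_subrefs_count
# Mean number of sub-references on the pages the scraper did detect.
subrefs_per_detected_page = subrefs_sum / pages_with_subrefs_count

print(f"Scraper coverage of the category: {coverage:.1%}")
print(f"Category pages not detected: {undetected_pages}")
print(f"Mean subrefs per detected page: {subrefs_per_detected_page:.2f}")
```

With these inputs the scraper covers roughly a quarter of the category (24.8%), leaving 3,778 pages unaccounted for, at about 11.69 subrefs per detected page.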

Event Timeline

awight renamed this task from Scraper: count number of sub-references on dewiki to Scraper: count number of sub-references (on dewiki). (Nov 19 2025, 2:22 PM)
awight renamed this task from Scraper: count number of sub-references (on dewiki) to [prirority] Scraper: count number of sub-references (on dewiki). (Nov 23 2025, 5:03 PM)
awight triaged this task as High priority.
awight updated the task description.
awight renamed this task from [prirority] Scraper: count number of sub-references (on dewiki) to [priority] Scraper: count number of sub-references (on dewiki). (Nov 24 2025, 7:53 AM)
WMDE-Fisch changed the task status from Open to In Progress. (Nov 26 2025, 7:01 AM)

Nothing to review here. This ticket is resolved once the scraper run on dewiki is done; see T410251: Run scraper on dewiki, to count number of sub-references.

Verifying the snapshot date: the dewiki snapshot currently has "date_modified" => "2025-11-21T00:38:19.054220838Z", which is very recent. It would make sense if this was also the snapshot that the scraper ran against (TODO: log the snapshot date at the beginning of each run), but that doesn't explain why the discovered subref article count is so low.
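
The TODO about logging the snapshot date could look roughly like the sketch below. It is not the scraper's actual code: the function names and the shape of the `metadata` dict are assumptions for illustration, and only the "date_modified" key and its timestamp format are taken from the value quoted above. The timestamp carries nanosecond precision, so the sketch trims it to microseconds before parsing.

```python
import logging
import re
from datetime import datetime, timezone

logger = logging.getLogger("scraper")  # hypothetical logger name

# Matches timestamps such as 2025-11-21T00:38:19.054220838Z
_TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_snapshot_date(value: str) -> datetime:
    """Parse a UTC timestamp, truncating fractional seconds to microseconds."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected snapshot timestamp: {value!r}")
    base, fraction = match.groups()
    # Keep at most six fractional digits, padding short fractions with zeros.
    micros = (fraction or "0")[:6].ljust(6, "0")
    parsed = datetime.strptime(f"{base}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def log_snapshot_date(metadata: dict) -> datetime:
    """Log the snapshot date at the start of a run and return it."""
    snapshot_date = parse_snapshot_date(metadata["date_modified"])
    logger.info("Scraping snapshot modified at %s", snapshot_date.isoformat())
    return snapshot_date


# Example using the value quoted in this task.
snapshot = log_snapshot_date({"date_modified": "2025-11-21T00:38:19.054220838Z"})
```

Recording this value alongside each run's outputs would make it possible to tell afterwards which snapshot a given summary was computed from.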

The next step could be to get the current subref pages directly from the category (done in T409945: Measure sub-reference additions by individual edit) and read their latest revisions. As a diagnostic, we could also compare the latest revision in the snapshot dump against revisions from the page history.
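
One way to frame that diagnostic is as a set comparison between the page titles in the category and the titles the scraper flagged. The sketch below assumes both lists have already been obtained by some other means; the function name and the example titles are hypothetical, and nothing here reflects how the scraper or T409945 actually stores its results.

```python
from typing import Dict, Iterable, List


def compare_page_sets(
    category_titles: Iterable[str], scraped_titles: Iterable[str]
) -> Dict[str, List[str]]:
    """Split page titles by whether the category and the scraper agree on them."""
    category = set(category_titles)
    scraped = set(scraped_titles)
    return {
        # In the category but not detected: candidates for a scraper bug
        # or for pages edited after the snapshot was taken.
        "missed_by_scraper": sorted(category - scraped),
        # Detected but not in the category: candidates for stale category data.
        "not_in_category": sorted(scraped - category),
        "in_both": sorted(category & scraped),
    }


# Hypothetical titles, for illustration only.
result = compare_page_sets(
    category_titles=["Page A", "Page B", "Page C"],
    scraped_titles=["Page B", "Page D"],
)
print(result["missed_by_scraper"])
```

Checking the revision timestamps of the pages in "missed_by_scraper" against the snapshot date would then help separate abrupt growth from an analysis failure.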