One of our project metrics is:
The number of human-performed edits adding at least one sub-reference reaches 2000 on dewiki. (Baseline: 0)
We'll get an estimate of this number by simply counting the sub-reference tags in T409944: [priority] Scraper: count number of sub-references (on dewiki). But we also care about the number of edits and how these articles evolved, so we'll do some additional one-off analysis, as follows:
- Produce a list of all articles on German Wikipedia with at least one sub-reference.
- Iterate through the revision history of these articles and fetch the Parsoid rendering for each.
- Compare consecutive revisions pairwise, counting the sub-references in each and keeping a tally of the revisions in which sub-references were added or removed.
- Work in progress code: https://gitlab.com/wmde/technical-wishes/scrape-revision-history/-/merge_requests/1
- Build a table of edits which added sub-references.
- Move revision-crawling logic into the scrape-wiki-dump repo.
- Emit outputs to EventGate.
- This should be done by adapting the page summary schema, preferably adding a metadata dimension to keep it separate from snapshot scraping results.
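The pairwise comparison step can be sketched roughly as follows. This is a minimal illustration, not the code from the merge request: it assumes the sub-reference count for each revision has already been extracted from the Parsoid rendering, and the names `SubrefChange` and `tally_subref_changes` are hypothetical. It also assumes the first revision is compared against an empty baseline, so a page created with sub-references counts as an addition.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class SubrefChange:
    """One revision in which the sub-reference count changed (hypothetical record)."""
    rev_id: int
    parent_rev_id: Optional[int]  # None for the first revision of a page
    delta: int                    # positive: added, negative: removed


def tally_subref_changes(counts: Iterable[Tuple[int, int]]) -> List[SubrefChange]:
    """Compare consecutive revisions and record where the count changed.

    `counts` yields (rev_id, subref_count) pairs, oldest revision first.
    Assumption: the first revision is compared against a baseline of 0.
    """
    changes: List[SubrefChange] = []
    prev_id: Optional[int] = None
    prev_count = 0
    for rev_id, count in counts:
        delta = count - prev_count
        if delta != 0:
            changes.append(SubrefChange(rev_id, prev_id, delta))
        prev_id, prev_count = rev_id, count
    return changes


# Example with made-up revision IDs and counts.
history = [(100, 0), (101, 2), (102, 2), (103, 1), (104, 3)]
changes = tally_subref_changes(history)
added = [c.rev_id for c in changes if c.delta > 0]
removed = [c.rev_id for c in changes if c.delta < 0]
```

Comparing counts only detects net changes: an edit that removes one sub-reference and adds another would not be tallied. That may be acceptable given the goal of much simpler detection logic, but it is worth keeping in mind when reading the resulting table.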
There was some related work in T400013: VisualEditor deletes list-defined references if there's a reference containing an ISBN and magic linking is enabled and T404421: [Bug] List-defined refs only used inside of a template are removed by visual editor, which introduced a repository, https://gitlab.com/wmde/technical-wishes/ref-damage, to analyze revisions looking for a particular type of edit. We will take the same basic approach, but with a different way of targeting articles and revisions, and with much simpler detection logic.
Limitations
- Won't find pages which once had sub-references but currently do not. Potentially this can be worked around or, better yet, the categorymembers API could provide a "historical member" option. The needed historical data seems to be available in the database.
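To illustrate why this limitation arises: the Action API's `list=categorymembers` query returns only pages that are in the category right now. The sketch below builds such a query URL, assuming the article list comes from a tracking category. The category name used in the example is a placeholder, not the real one, and `categorymembers_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://de.wikipedia.org/w/api.php"


def categorymembers_url(category: str,
                        cont: Optional[str] = None,
                        limit: int = 500) -> str:
    """Build a query URL for the current members of a category.

    `category` is the full title including the namespace prefix.
    `cont` is the `cmcontinue` token from a previous response, if any.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": 0,   # articles only
        "cmlimit": limit,
        "format": "json",
    }
    if cont is not None:
        params["cmcontinue"] = cont
    return f"{API_URL}?{urlencode(params)}"


# Placeholder category title, for illustration only.
url = categorymembers_url("Kategorie:Beispiel")
```

Nothing in these parameters selects past members, so a page that lost its last sub-reference drops out of the result set.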
Implementation
New repository: https://gitlab.com/wmde/technical-wishes/scrape-revision-history