Instrument {{Delete ...}} template adding/removing on Commons and create a historical dataset
Open, Needs Triage, Public

Description

Data Engineering recently wrote a job to copy an XML dump into an Iceberg table [1] and found that it was feasible and not too greedy on resources. With this in mind, it seems more feasible than it did in the past to parse through wikitext for various useful information.

One such useful thing is knowing, for every revision, whether a {{Delete ...}} template [2] was added or removed. We could combine this with the knowledge of whether that page was eventually deleted to build a metric around this workflow. Monitoring this has value for Structured Data as they implement their hypothesis in service of WE1.2 [3]. More generally, it is a measure of how much of this kind of work a particular community does. In theory it could be extended to any template, or indeed to any relationship between an article and some other type of on-wiki entity.
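As a rough illustration of what such a metric could look like, here is a minimal sketch in plain Python. It assumes a hypothetical input of (page_id, rev_id, kind) tuples, where kind is "added" or "removed", plus a set of page IDs known to have been deleted. The function name and the metric definition (share of nominated pages that were eventually deleted) are assumptions for illustration, not an agreed definition.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical event shape: (page_id, rev_id, kind), kind in {"added", "removed"}.
Event = Tuple[int, int, str]


def nominated_then_deleted_rate(
    events: Iterable[Event], deleted_page_ids: Set[int]
) -> Optional[float]:
    """Share of pages that ever had the template added and were later deleted.

    Returns None when no page was ever nominated, to avoid dividing by zero.
    """
    nominated = {page_id for page_id, _rev_id, kind in events if kind == "added"}
    if not nominated:
        return None
    return len(nominated & deleted_page_ids) / len(nominated)


if __name__ == "__main__":
    sample_events = [(1, 11, "added"), (1, 12, "removed"), (2, 20, "added")]
    print(nominated_then_deleted_rate(sample_events, deleted_page_ids={2}))
```

A real version would probably also need to account for restores and for timing (deletion after the nomination, not before), which this sketch ignores.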

Concrete Proposal

As a proof of concept, write a job that figures out whether {{Delete ...}} is added or removed by a particular revision, for every revision in commons.wikimedia.org history. If successful, join this information with page deletion and page restore data. If the data is useful, make follow-up tasks to productionize this data product and shop the idea around with other teams to see if other workflows can benefit from such metrics.
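The per-revision check could be sketched roughly as follows, in plain Python rather than as a Spark job. Everything here is an assumption for illustration: the Revision shape, the function names, and the regex, which matches only the literal template name "Delete" (first letter case-insensitive) and does not handle redirects or aliases of the template, nor occurrences inside comments or nowiki blocks. Input is assumed to be sorted by page and then chronologically.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

# Assumption: only the literal name "Delete" is matched; template redirects,
# a "Template:" prefix, comments, and <nowiki> sections are not handled.
DELETE_RE = re.compile(r"\{\{\s*[Dd]elete\s*(?:\||\}\})")


class Revision(NamedTuple):
    """Hypothetical minimal revision record."""
    page_id: int
    rev_id: int
    wikitext: Optional[str]


def has_delete_template(wikitext: Optional[str]) -> bool:
    """True if the wikitext contains a {{Delete ...}} call (per DELETE_RE)."""
    return bool(wikitext) and DELETE_RE.search(wikitext) is not None


def delete_template_events(
    revisions: Iterable[Revision],
) -> Iterator[Tuple[int, int, str]]:
    """Yield (page_id, rev_id, "added" | "removed") for each state change.

    Expects revisions sorted by page_id, then oldest to newest within a page.
    """
    prev_page: Optional[int] = None
    prev_has = False
    for rev in revisions:
        if rev.page_id != prev_page:
            # New page: before its first revision the template is absent.
            prev_page, prev_has = rev.page_id, False
        has = has_delete_template(rev.wikitext)
        if has and not prev_has:
            yield (rev.page_id, rev.rev_id, "added")
        elif prev_has and not has:
            yield (rev.page_id, rev.rev_id, "removed")
        prev_has = has


if __name__ == "__main__":
    history = [
        Revision(1, 10, "A photo"),
        Revision(1, 11, "{{delete|reason=copyvio}}\nA photo"),
        Revision(1, 12, "A photo"),
    ]
    for event in delete_template_events(history):
        print(event)
```

Comparing only presence or absence between consecutive revisions keeps the state per page to a single boolean, which should translate reasonably well to a window function over revisions if this moves to Spark.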

NOTE: this kind of parsing might have been done by the research team, so check with them for wisdom.

[1] https://wikitech.wikimedia.org/wiki/Milimetric/Learning_iceberg/Copy_large_table
[2] https://commons.wikimedia.org/wiki/Template:Delete (is this it? It looks like it was nominated for deletion, lol)
[3] https://docs.google.com/spreadsheets/d/1LYH0f8NIg_U--CMq4ErG5O0WxrSqWGhnoOTzhAiuLRA/edit#gid=1446270873