
Maintenance templates: Generate dataset and analysis/modeling (first round)
Open, Needs Triage, Public

Description

In T406203: Start formal collaboration on understanding the use of maintenance templates, we identified a large set (~8K) of maintenance templates across more than 100 wikis.
The goal of this task is to

  • generate the dataset of all revisions where these templates were added/removed
  • conduct descriptive analysis (summary statistics, time evolution of use of maintenance templates)
  • start developing models to predict maintenance templates for new articles/revisions

Event Timeline

weekly update:

  • no updates this week

weekly update:

  • implemented several refinements in the processing pipeline: resolving template redirect names, and marking each edit with the specific template that was added or removed
  • next step is to expand the pipeline to include all maintenance templates from a single wiki
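The two refinements above can be sketched as follows. This is a simplified illustration, not the actual pipeline code: the redirect map, function names, and the regex-based template extraction (which ignores nested templates) are all assumptions.

```python
import re

# Hypothetical redirect map: alias -> canonical template name.
# In the real pipeline this would be built from the wiki's redirect data.
REDIRECTS = {
    "Unreferenced section": "Unreferenced",
    "Citneeded": "Citation needed",
}

# Naive template matcher: captures the name after "{{" up to "|" or "}}".
# Ignores nested templates, so it is only a sketch of the idea.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+?)\s*(?:\||\}\})")

def extract_templates(wikitext):
    """Return the set of canonical template names used in a revision."""
    names = {m.group(1).strip() for m in TEMPLATE_RE.finditer(wikitext)}
    # Resolve redirect names to their canonical template.
    return {REDIRECTS.get(n, n) for n in names}

def template_diff(old_text, new_text):
    """Mark an edit with the specific templates it added and removed."""
    old, new = extract_templates(old_text), extract_templates(new_text)
    return {"added": new - old, "removed": old - new}
```

For example, an edit replacing `{{Unreferenced section}}` with `{{Citation needed|date=May 2024}}` would be marked as adding "Citation needed" and removing "Unreferenced" (after redirect resolution).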

weekly update:

  • Extracted the full dataset for 4 languages: simple, de, bn, hi. en is still pending, as we need to figure out resource-allocation settings to avoid memory issues.
  • We are manually looking through a small subset of samples to spot-check processing issues. One potential issue we have identified is that in some edits one template is removed while another one (or more) is added at the same time. This might indicate that the original issue is not resolved, but rather that the new templates characterize it more specifically.
  • Next step: start the analysis of basic summary stats (e.g. number of templates, affected articles) over time.
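The replacement pattern noted above (one template removed while another is added in the same edit) could be flagged with a small classifier over each edit's template diff. This is a hypothetical sketch, not the project's actual code:

```python
def classify_edit(added, removed):
    """Classify a revision by its maintenance-template changes.

    added/removed: sets of template names added/removed by the edit.
    """
    if added and removed:
        # Possible refinement: the new templates may characterize the
        # flagged issue more specifically rather than resolve it.
        return "replaced"
    if added:
        return "added"
    if removed:
        return "removed"
    return "unchanged"
```

Keeping "replaced" as its own category lets these ambiguous edits be excluded from (or analyzed separately in) statistics about how long it takes to resolve a flagged issue.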

weekly update:

  • Refactored the dataset pipeline into 2 parts: i) extracting the full (raw) dataset, ii) applying the filtering step. With this split, we can now also run the pipeline for enwiki, and we are fairly confident that we can, in principle, run it on all wikis.
  • We built a dataset for an initial set of 6 languages (selected based on language familiarity, so that we can manually check results): bn, de, en, hi, pt, simple. We created a smaller random subsample for manual investigation/verification.
  • We started to identify the main summary statistics to report a high-level overview of the dataset (number of templates, number of revisions, number of articles, time between adding/removing a template).
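One of the summary statistics listed above, the time between adding and removing a template, can be sketched by pairing each add event with the next remove event for the same (article, template) pair. Event field names and ordering are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime

def template_lifetimes(events):
    """Compute how long maintenance templates stay on articles.

    events: list of (timestamp, article, template, action) tuples,
    sorted by timestamp, where action is "add" or "remove".
    Returns a list of timedeltas from each add to its matching remove.
    """
    open_adds = defaultdict(list)  # (article, template) -> pending add timestamps
    lifetimes = []
    for ts, article, template, action in events:
        key = (article, template)
        if action == "add":
            open_adds[key].append(ts)
        elif action == "remove" and open_adds[key]:
            # Pair with the earliest unmatched add for this template.
            lifetimes.append(ts - open_adds[key].pop(0))
    return lifetimes
```

Templates with an add event but no matching remove are still open at the end of the observation window; a real analysis would need to treat these censored cases separately rather than drop them.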

weekly update:

  • manual check of 100 samples in 6 languages in this spreadsheet. Overall, the data looks reliable, though we identified some parsing issues, such as nested templates, multi-tag templates, and some false positives probably due to reverted edits. We plan to fix these in the next iteration.
  • We are starting the analysis of high-level metrics.
  • We are starting to parse the content of the templates to map them to the corresponding policy or guideline (e.g. by identifying links to the Wikipedia namespace). The first approach is to capture all links from the page to the Wikipedia namespace and then filter them manually.
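The first approach described above, capturing all Wikipedia-namespace links from a template's page as candidates for the policy or guideline it refers to, could look roughly like this. The regex-based extraction is a simplification and an assumption, not the project's actual parser:

```python
import re

# Match wiki links whose target is in the Wikipedia namespace,
# e.g. [[Wikipedia:Verifiability|WP:V]]. Stops at "|", "#", or "]]"
# so only the page target is captured.
WP_LINK_RE = re.compile(r"\[\[\s*(Wikipedia:[^\]|#]+)")

def wikipedia_namespace_links(wikitext):
    """Return the Wikipedia-namespace link targets found in the wikitext."""
    return sorted({m.group(1).strip() for m in WP_LINK_RE.finditer(wikitext)})
```

The output is a candidate list per template; as noted above, mapping each template to the single most relevant policy/guideline would then be done by manual filtering.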