
[Data Quality] Implement wiki completeness check for MediaWiki History
Open, Needs Triage | Public | 5 Estimated Story Points


User Story

As a data analyst working on content and contributor metrics using MediaWiki History, I need the most recently generated snapshot to have all the production wikis that were open/active at the time of the snapshot so that I'm working with data from all wikis – including languages/projects recently graduated from Incubator – allowing me to paint a more complete picture of growth and productivity in the Wikimedia movement.



Event Timeline

mpopov renamed this task from [Data Quality] Implement completeness check for MediaWiki History to [Data Quality] Implement wiki completeness check for MediaWiki History. May 16 2024, 9:12 PM

An idea for an automated quality check step:

Use @Hghani's site creation scraper to scrape the newest wikis from the Site Creation Log, and cross-check them against canonical_data.wikis.

(This could, additionally, be helpful for quality checking canonical_data.wikis itself).
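A minimal sketch of that cross-check, assuming the three wiki lists have already been loaded as collections of database names. The function name, the report class, and the sample wiki names are hypothetical and for illustration only; how the lists are actually fetched (scraper output, a query on canonical_data.wikis, a query on the snapshot) is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessReport:
    """Result of comparing the three wiki lists (hypothetical structure)."""
    missing_from_snapshot: frozenset
    missing_from_canonical: frozenset

    @property
    def is_complete(self) -> bool:
        # The snapshot passes only if no expected wiki is absent from it.
        return not self.missing_from_snapshot


def check_wiki_completeness(scraped_wikis, canonical_wikis, snapshot_wikis):
    """Compare scraped and canonical wiki lists against the snapshot.

    All three arguments are iterables of wiki database names. Treating the
    union of scraped and canonical wikis as the expected set is an
    assumption of this sketch.
    """
    scraped = frozenset(scraped_wikis)
    canonical = frozenset(canonical_wikis)
    snapshot = frozenset(snapshot_wikis)

    expected = scraped | canonical
    return CompletenessReport(
        # Wikis we expected but the snapshot does not contain.
        missing_from_snapshot=expected - snapshot,
        # Newly created wikis that canonical_data.wikis does not list yet.
        missing_from_canonical=scraped - canonical,
    )


# Illustrative inputs only; "newlangwiki" is a made-up name.
report = check_wiki_completeness(
    scraped_wikis={"enwiki", "newlangwiki"},
    canonical_wikis={"enwiki", "frwiki"},
    snapshot_wikis={"enwiki", "frwiki"},
)
print(sorted(report.missing_from_snapshot))
print(sorted(report.missing_from_canonical))
print(report.is_complete)
```

Reporting the two differences separately keeps the snapshot check and the canonical_data.wikis check distinguishable, so a failure can be routed to the right dataset owner.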

lbowmaker set the point value for this task to 5. May 29 2024, 2:19 PM
lbowmaker added subscribers: gmodena, Ahoelzl, lbowmaker.

Notes from chat with @gmodena:

  • Should be easy enough for someone in the DE team to implement
  • Maybe 3 points: about 1 day for coding, then reviews and testing could stretch it to between 3 days and 1 week of effort
  • We would also need to think if we wanted alerts too.

I’ll tentatively put this into Q1 plans as a 5; we wouldn’t get to this by end of June.