User Story
As a data analyst working on content and contributor metrics using MediaWiki History, I need the most recently generated snapshot to have all the production wikis that were open/active at the time of the snapshot so that I'm working with data from all wikis – including languages/projects recently graduated from Incubator – allowing me to paint a more complete picture of growth and productivity in the Wikimedia movement.
Context
- T349743: 12 new wikis missing from the mediawiki_history dataset
- T329119: 13 new wikis missing from mediawiki_history
- T299548: 22 small wikis missing from the mediawiki_history dataset
- T220456: Many small wikis missing from mediawiki_history dataset
Notes
- canonical_data.wikis (sourced from https://github.com/wikimedia-research/canonical-data/blob/master/wiki/wikis.tsv) is updated pretty frequently
- In addition to that, there are two other sources:
  - @KCVelaga's structured list of when each Wikimedia project was created and, if applicable, when it was closed
  - @Hghani's site creation scraper, which collects all wikis listed on the Site Creation Log page
- Some checks are already implemented; see T354692: [Data Quality] Implement basic data quality metrics for MW history
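
A completeness check along the lines of the notes above could be sketched as follows. This is only an illustration, not an existing check: the column names `database_code` and `status`, the value `open`, and the helper names `expected_wikis` and `missing_wikis` are assumptions, and the sample rows are made up. In practice the expected list would come from canonical_data.wikis and the observed list from the distinct wiki databases present in the latest mediawiki_history snapshot.

```python
import csv
import io


def expected_wikis(tsv_text, db_col="database_code", status_col="status", open_value="open"):
    """Return the set of wiki database codes marked as open in a wikis TSV.

    The column names and the "open" value are assumptions for this sketch;
    adjust them to match the actual canonical list.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[db_col] for row in reader if row[status_col] == open_value}


def missing_wikis(expected, snapshot_wiki_dbs):
    """Return expected wikis that do not appear in the snapshot, sorted by name."""
    return sorted(set(expected) - set(snapshot_wiki_dbs))


# Made-up sample data standing in for the canonical list and for the
# distinct wiki databases found in a snapshot.
sample_tsv = (
    "database_code\tstatus\n"
    "enwiki\topen\n"
    "examplewiki\topen\n"
    "oldwiki\tclosed\n"
)
snapshot = ["enwiki"]

print(missing_wikis(expected_wikis(sample_tsv), snapshot))  # ['examplewiki']
```

A non-empty result would flag the snapshot as incomplete. Closed wikis are excluded from the expected set here, matching the "open/active at the time of the snapshot" wording of the user story.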