NEW FEATURE REQUEST: Dataset with active and non-active Wikis
A dataset that can be used to determine how many and which wikis exist - and whether they're active - over time. The use cases I'm aware of have centered around Wikipedias, although I can imagine questions around other wikis as well.

Currently, if we need to know the number of active Wikipedias, we reference this data page: . However, that only goes back to 2020. The page has more history, but getting historical data depends on going through edit revisions - and the page did not always differentiate between active and non-active languages.

The use case is external communications about the scale and scope of our work. The most recent example is a blog post being prepared for Diff - a technology spotlight about how we keep Wikimedia running. We were asked, "can we get data for how the number of active wikipedias has changed over 10+ years?"

Based on the timeline and my team's bandwidth, I suggested focusing on our current number of languages. But it would be useful to have this data readily available for future public communications.


Just for anyone that grabs this: we already define "active wikis" and use that definition in datasets like public geoeditors. The query for this dataset would be something like:

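-- edits per editor, per wiki, per month
with editors as (
 select wiki_db,
        <<parse month out of event_timestamp>> month,
        <<editor identifier>> editor,
        count(*) edits
   from mediawiki_history
  where <<latest snapshot and filter out deleted, reverted, etc as desired>>
  group by wiki_db, <<parse month>>, <<editor identifier>>
),
-- active editors: more than 4 edits in the month
active_editors as (
 select wiki_db,
        month
   from editors
  where edits > 4
)
-- active wikis: more than 2 active editors in the month
 select wiki_db,
        month
   from active_editors
  group by wiki_db, month
 having count(*) > 2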
EChetty set the point value for this task to 5. Jan 16 2023, 4:46 PM


Do we have an existing definition of active we want to use here?

Millimetric has:

   from editors
  where edits > 4

   from active_editors
  group by wiki_db, month
 having count(*) > 2

Are you okay with this? And can we set this in stone somewhere?
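
One way to pin the definition down is a small executable sketch that can be checked against toy data. The version below is only an illustration: the thresholds (`edits > 4`, `count(*) > 2`) come from the query quoted above, while the function name, the input shape, and the assumption that edits are counted per editor per wiki per month are hypothetical and would need to be confirmed against the geoeditors definition.

```python
from collections import Counter

# Thresholds taken from the query quoted in this task; everything else
# (names, input shape, per-editor counting) is an assumption for illustration.
MIN_EDITS_PER_EDITOR = 5   # equivalent to "edits > 4"
MIN_ACTIVE_EDITORS = 3     # equivalent to "count(*) > 2"


def active_wikis_by_month(edits):
    """Map each 'YYYY-MM' month to the sorted list of active wikis.

    `edits` is an iterable of (wiki_db, editor, timestamp) tuples, where
    timestamp is an ISO-style string starting with 'YYYY-MM'.
    """
    # Count edits per (wiki, month, editor); the month is the 'YYYY-MM' prefix.
    edits_per_editor = Counter(
        (wiki_db, timestamp[:7], editor) for wiki_db, editor, timestamp in edits
    )
    # Count, per (wiki, month), the editors who reach the edit threshold.
    active_editors = Counter(
        (wiki_db, month)
        for (wiki_db, month, _editor), n in edits_per_editor.items()
        if n >= MIN_EDITS_PER_EDITOR
    )
    # Keep only the (wiki, month) pairs with enough active editors.
    result = {}
    for (wiki_db, month), n in active_editors.items():
        if n >= MIN_ACTIVE_EDITORS:
            result.setdefault(month, []).append(wiki_db)
    return {month: sorted(wikis) for month, wikis in sorted(result.items())}


# Toy data: wiki_a has three editors with five edits each in January 2023;
# wiki_b has a single editor with twenty edits.
sample = (
    [("wiki_a", f"user{u}", "2023-01-10T00:00:00") for u in range(3) for _ in range(5)]
    + [("wiki_b", "solo", "2023-01-11T00:00:00")] * 20
)
print(active_wikis_by_month(sample))  # {'2023-01': ['wiki_a']}
```

Here wiki_b is not active despite having more edits overall, because only one editor crosses the five-edit threshold. That is the behaviour the two-step definition is meant to capture.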