
NEW FEATURE REQUEST: Dataset with active and non-active Wikis
Open, Medium, Public, 5 Estimated Story Points

Description

Data Platform Request Form

Is this a request for a:

  • Dataset
  • Data Pipeline
  • Data Feature

Please provide the description of your request:
A dataset that can be used to determine how many and which wikis exist - and whether they're active - over time. The use cases I'm aware of have centered around Wikipedias, although I can imagine questions around other wikis as well.

Currently, if we need to know the number of active Wikipedias, we reference this data page: https://commons.wikimedia.org/wiki/Data:Wikipedia_statistics/meta.tab . However, that only goes back to 2020. The https://meta.wikimedia.org/wiki/List_of_Wikipedias page has more history, but getting historical data depends on going through edit revisions - and the page did not always differentiate between active and non-active languages.

Use Case: (Please explain what this feature will be used for):
External communications about the scale and scope of our work. The most recent use case is a blog post being prepared for Diff: a technology spotlight about how we keep Wikimedia running (https://phabricator.wikimedia.org/T323230). We were asked, "Can we get data for how the number of active Wikipedias has changed over 10+ years?"

Based on the timeline and my team's bandwidth, I suggested focusing on our current number of languages. But it would be useful to have this data readily available for future public communications.

Priority:
Low

Ideal Delivery Date:

Dataset Checklist
  • Provide link to CSV/GSheet example data. Link: ____
  • Provide link to the desired Table Schema. Link: ____
  • Does this data contain anything that is sensitive, PII or Private?
    • Yes
    • No
    • I don't know
  • Who will own the data (Fix issues, update descriptions & metadata etc.)?
Data Pipeline Checklist
  • Do you have the transformation you would like applied? Link: ____
  • Does this data need to be linked to other data in the Data Lake?
    • Yes
    • No
Data Feature Checklist

Please link to the following if applicable.

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | <add link here>
Product One Pager | Yes | <add link here>
Product Requirements Document (PRD) | Yes | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>
For Data Engineering Team to fill out:
Value Calculator | Rank
Will this improve the efficiency of a team's workflow? | 1-3
Does this have an effect on our Core Metrics? | 1-3
Does this align with our strategic goals? | 1-3
Is this a blocker for another team? | 1-3

Event Timeline

Just for anyone who grabs this: we already define "active wikis" and use the definition in datasets like public geoeditors. The query for a dataset would be something like:

with editors as (
 -- edits per user, per wiki, per month
 select wiki_db,
        <<parse month out of event_timestamp>> month,
        event_user,
        count(*) edits
   from mediawiki_history
  where <<latest snapshot and filter out deleted, reverted, etc as desired>>
  group by wiki_db, <<parse month>>, event_user
),
active_editors as (
 -- active editors: more than 4 edits in the month
 select wiki_db,
        month,
        event_user
   from editors
  where edits > 4
)

-- active wikis: more than 2 active editors in the month
 select wiki_db,
        month
   from active_editors
  group by wiki_db, month
 having count(*) > 2
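
To make the two thresholds concrete, here is a minimal sketch that runs the same query shape against toy data using Python's built-in sqlite3. Everything specific in it is an assumption for illustration: the `toy_history` table stands in for mediawiki_history, the rows are made up, `substr(event_timestamp, 1, 7)` is just one way to parse the month out of an ISO-style timestamp, and the snapshot and deleted/reverted filters are left out entirely.

```python
import sqlite3

# Hypothetical in-memory stand-in for mediawiki_history (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table toy_history (wiki_db text, event_timestamp text, event_user text)"
)


def add_edits(wiki_db, month, user, n):
    """Insert n made-up edit events for one user in one month."""
    rows = [(wiki_db, f"{month}-01 00:00:00", user) for _ in range(n)]
    conn.executemany("insert into toy_history values (?, ?, ?)", rows)


# wiki_a: three users with 5 edits each -> 3 active editors -> active wiki.
for user in ("u1", "u2", "u3"):
    add_edits("wiki_a", "2023-01", user, 5)

# wiki_b: the third user has only 4 edits -> 2 active editors -> not active.
add_edits("wiki_b", "2023-01", "u1", 5)
add_edits("wiki_b", "2023-01", "u2", 5)
add_edits("wiki_b", "2023-01", "u3", 4)

ACTIVE_WIKIS_SQL = """
with editors as (
  select wiki_db,
         substr(event_timestamp, 1, 7) as month,  -- assumed month parsing
         event_user,
         count(*) as edits
    from toy_history
   group by wiki_db, substr(event_timestamp, 1, 7), event_user
),
active_editors as (
  select wiki_db, month, event_user
    from editors
   where edits > 4
)
select wiki_db, month
  from active_editors
 group by wiki_db, month
having count(*) > 2
 order by wiki_db, month
"""

active_wikis = conn.execute(ACTIVE_WIKIS_SQL).fetchall()
print(active_wikis)
```

With this toy data only wiki_a should come back for 2023-01, since wiki_b falls one active editor short.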
EChetty set the point value for this task to 5. Jan 16 2023, 4:46 PM

@kzimmerman

Do we have an existing definition of active we want to use here?

Millimetric has:

 from editors
where edits > 4

and

  from active_editors
 group by wiki_db, month
having count(*) > 2

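That is, an active editor has more than 4 edits on a wiki in a month, and an active wiki has more than 2 active editors in a month. Are you okay with this? And can we set this in stone somewhere?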