Create a structured list of Wikimedia projects' creation and closure dates
Open, Medium, Public

Description

Create a structured list of when a Wikimedia project is created, and if applicable, the closure date as well.

Sources for project creation

Sources for project closure

  • MediaWiki git blame data for closed dbs
  • incubator import log

Details

Other Assignee
KCVelaga_WMF

Event Timeline

Learnings from the progress made during the hackathon:

Wikimedia project creation date

The git files were created on 2012-02-24, which is why many wikis have their date recorded as 2012-02-24. This source can only be used for wikis created after 2012-02-24. For wikis created before that date, three sources are under consideration, depending on the period.

  • 2001 to mid-2006: Wikimedia Incubator was set up in June 2006, creating a framework for new language projects to launch their own wikis, and a site creation log has been maintained since then. For wikis created before that, the earliest available revision timestamp across the revision and archive tables may be used as a proxy for the creation date.
  • mid-2006 to August 2010: A manually cleaned version (output as a CSV file) of the incubator site creation log is used.
    • This could potentially be used beyond 2010 as well. However, the list doesn't follow a consistent HTML format that can be scraped. Since the historical records don't change, a one-time manual extraction seems reasonable to fill this gap.
  • August 2010 to March 2012: The new projects mailing list was created in 2010, and an automated message is sent to it every time a new wiki is created. The date of the email can be used as a proxy for the creation date.
  • Post March 2012: The git blame date for database creation can be used.
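A rough sketch of how the sources above could be consolidated into one date per wiki, recording which source was used. The function names, the source labels, and the fallback order between the mailing list and the incubator log are assumptions for illustration, not what the notebook actually does. It also assumes timestamps from the revision and archive tables are in MediaWiki's 14-digit YYYYMMDDHHMMSS form.

```python
from datetime import date, datetime
from typing import Iterable, Optional, Tuple

# Date the git files were created. A git blame date equal to this is only a
# placeholder and says nothing about when the wiki was really created.
GIT_FILES_CREATED = date(2012, 2, 24)


def earliest_timestamp(*timestamp_lists: Iterable[str]) -> Optional[datetime]:
    """Return the earliest of several lists of MediaWiki timestamps.

    Intended for combining timestamps from the revision and archive tables.
    Returns None if every list is empty.
    """
    parsed = [
        datetime.strptime(ts, "%Y%m%d%H%M%S")
        for timestamps in timestamp_lists
        for ts in timestamps
    ]
    return min(parsed, default=None)


def pick_creation_date(
    git_blame: Optional[date],
    mailing_list: Optional[date],
    incubator_log: Optional[date],
    earliest_revision: Optional[date],
) -> Tuple[Optional[date], str]:
    """Pick one creation date and report which source it came from."""
    # Trust git only when the date is later than the placeholder date.
    if git_blame is not None and git_blame > GIT_FILES_CREATED:
        return git_blame, "git_blame"
    # Otherwise fall back through the older sources (order is an assumption).
    fallbacks = (
        (mailing_list, "mailing_list"),
        (incubator_log, "incubator_log"),
        (earliest_revision, "earliest_revision"),
    )
    for value, source in fallbacks:
        if value is not None:
            return value, source
    return None, "unknown"
```

For a wiki whose git date is the 2012-02-24 placeholder, the function skips git and returns the first older source that has a value, along with its label, so the label can go into a source column.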
KCVelaga_WMF renamed this task from Create a structured list of wiki_db creation and closure dates to Create a structured list of Wikimedia projects' creation and closure dates. Jun 9 2023, 3:22 PM

@KCVelaga_WMF have you had a chance to work on the last pieces of this? Totally fine if not, but if you want to hand it over, I'm happy to take it on 😊

KCVelaga_WMF updated Other Assignee, added: KCVelaga_WMF.

@nshahquinn-wmf thanks for checking on this.

I have moved the repository under wikimedia-research on GitHub: https://github.com/wikimedia-research/wikimedia_project_creation_closure_dts. I was able to complete the creation dates part, but not the closure dates. I don't think I will be able to get to it before the end of Q1, so if you can take over, that's wonderful. I am happy to collaborate on thinking through the approaches, reviewing, and consolidating everything. I will go ahead and assign the task to you.

Here are some updates:

  • There was no single approach that works for all wikis, so the data is a consolidation of various sources. I thought it would be best to add a column to mention which source/approach was used. Also, if you can review the creation dates notebook, that will be helpful.
  • For the closure dates, the notebook is a mess! I was just trying a bunch of things, so feel free to scrap it. Anyway, the approaches I was trying/thinking of (mostly suggestions I got during the hackathon) were:
    • For wikis closed after Feb 2012, the git blame data can be used.
    • For wikis closed before Feb 2012:
      • The last revision or log action performed by an account that is neither a steward nor one of a few extension/script accounts ('Flow talk page manager', 'Global rename script', 'MediaWiki message delivery', 'Maintenance script').
      • The earliest recorded date for imports in the log table of the incubator, as the content usually gets imported to Wikimedia Incubator after closure.
      • There is a consistent closure message added to the wiki's main page (example). The date of the edit that added the message/template could also be a source.
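To make the first pre-Feb-2012 idea concrete, here is a minimal sketch of "last activity by a non-steward, non-script account". The function name and the shape of the input (pairs of MediaWiki 14-digit timestamp and account name, plus a set of steward names supplied by the caller) are assumptions. The script account names are the ones listed above.

```python
from datetime import datetime
from typing import Iterable, Optional, Set, Tuple

# Extension/script accounts named in the comment above. Their edits and log
# actions can happen long after a wiki was closed.
SCRIPT_ACCOUNTS = frozenset({
    "Flow talk page manager",
    "Global rename script",
    "MediaWiki message delivery",
    "Maintenance script",
})


def last_community_activity(
    events: Iterable[Tuple[str, str]],
    stewards: Set[str],
) -> Optional[datetime]:
    """Return the latest event not made by a steward or script account.

    `events` holds (timestamp, account_name) pairs for revisions and log
    actions, with timestamps as YYYYMMDDHHMMSS strings. Returns None when
    no event qualifies.
    """
    candidates = [
        datetime.strptime(timestamp, "%Y%m%d%H%M%S")
        for timestamp, account in events
        if account not in stewards and account not in SCRIPT_ACCOUNTS
    ]
    return max(candidates, default=None)
```

The result is only a proxy for the closure date. It could be cross-checked against the earliest incubator import and the main page closure message edit where those exist.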