
Create a structured list of Wikimedia projects' creation and closure dates
Open, Medium, Public

Description

Create a structured list of when a Wikimedia project is created, and if applicable, the closure date as well.

Sources for project creation

Sources for project closure

  • MediaWiki git blame data for closed dbs
  • incubator import log

Details

Other Assignee
KCVelaga_WMF

Event Timeline

Learnings from the progress made during the hackathon:

Wikimedia project creation date

The git files were created on 2012-02-24, which is why many wikis have their date recorded as 2012-02-24. This source can therefore only be used for wikis created after that date. For wikis created earlier, three sources are under consideration, depending on the period.

  • 2001 to mid-2006: Wikimedia Incubator was set up in June 2006, which created a framework for new language projects to launch their own wikis. After it was set up, a site creation log was maintained. For wikis created before that, the earliest available revision timestamp from the revision and archive tables may be considered as a proxy for the creation date.
  • mid-2006 to August 2010: A manually cleaned version (output as CSV file) of the incubator site creation log is used.
    • This could potentially be used beyond 2010 as well, but the list doesn't follow a consistent HTML format that can be scraped. Since the historical records don't change, a one-time manual extraction seems reasonable to fill this gap.
  • August 2010 to March 2012: The new projects mailing list was created in 2010, and an automated message is sent to it every time a new wiki is created. The date of the email can be considered as a proxy for the creation date.
  • Post March 2012: The git blame date for database creation can be used.
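The period-to-source mapping above can be sketched as a small lookup. This is only an illustration: the function name, the source labels, and the exact cut-off days within each month are assumptions (the list only gives months), and the input is a rough estimate of the creation date, such as the earliest revision timestamp.

```python
from datetime import date

# Assumed cut-offs: the list above only names months, so the day of the
# month chosen here is arbitrary.
INCUBATOR_LOG_START = date(2006, 6, 1)   # Wikimedia Incubator set up in June 2006
MAILING_LIST_START = date(2010, 8, 1)    # new projects mailing list period begins
GIT_BLAME_START = date(2012, 3, 1)       # git blame data usable after this point


def pick_creation_source(approx_creation: date) -> str:
    """Return a (hypothetical) label for the source to consult for a wiki
    whose creation date is roughly `approx_creation`."""
    if approx_creation < INCUBATOR_LOG_START:
        return "earliest_revision_timestamp"
    if approx_creation < MAILING_LIST_START:
        return "incubator_site_creation_log"
    if approx_creation < GIT_BLAME_START:
        return "new_projects_mailing_list"
    return "git_blame"
```

For example, `pick_creation_source(date(2008, 5, 17))` selects the manually cleaned incubator site creation log. Recording the returned label next to each date would also give the "which source was used" column.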
KCVelaga_WMF renamed this task from Create a structured list of wiki_db creation and closure dates to Create a structured list of Wikimedia projects' creation and closure dates. Jun 9 2023, 3:22 PM

@KCVelaga_WMF have you had a chance to work on the last pieces of this? Totally fine if not, but if you want to hand it over, I'm happy to take it on 😊

KCVelaga_WMF updated Other Assignee, added: KCVelaga_WMF.

@nshahquinn-wmf thanks for checking on this.

I have moved the repository under wikimedia-research on GitHub: https://github.com/wikimedia-research/wikimedia_project_creation_closure_dts. I was able to complete the creation dates part, but not the closure dates. I don't think I will be able to get to it before the end of Q1, so if you can take over, that's wonderful. I am happy to collaborate, for thinking through the approaches, reviews and consolidating everything. I will go ahead and assign the task to you.

Here are some updates:

  • No single approach works for all wikis, so the data is a consolidation of various sources. I thought it would be best to add a column recording which source/approach was used. Also, if you can review the creation dates notebook, that will be helpful.
  • For the closure dates, the notebook is a mess! I was just trying a bunch of things, so feel free to scrap it. Anyway, these are the approaches I was trying/thinking of (all suggestions I got during the hackathon):
    • For wikis closed after Feb 2012, the git blame data can be used.
    • For wikis closed before Feb 2012:
      • The last revision or log action performed by an account that is neither a steward nor one of certain extension/script accounts ('Flow talk page manager', 'Global rename script', 'MediaWiki message delivery', 'Maintenance script').
      • The earliest recorded date for imports in the log table of the incubator, as the content usually gets imported to Wikimedia Incubator after closure.
      • There is a consistent closure message added to the wiki's main page (example); the date of the edit that added the message/template could also be a source.
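A minimal sketch of the first pre-2012 approach (last action by a non-steward, non-script account). It assumes the revisions and log actions have already been pulled into `(user_name, timestamp)` pairs, with timestamps in MediaWiki's 14-digit `YYYYMMDDHHMMSS` string form, which sorts chronologically as text. The function name and the `stewards` parameter are hypothetical; the list of steward usernames would have to come from elsewhere.

```python
from typing import Iterable, Optional, Tuple

# Script/extension accounts named in the list above.
SYSTEM_ACCOUNTS = frozenset({
    "Flow talk page manager",
    "Global rename script",
    "MediaWiki message delivery",
    "Maintenance script",
})


def last_community_action(
    actions: Iterable[Tuple[str, str]],
    stewards: Iterable[str] = (),
) -> Optional[str]:
    """Return the latest timestamp among actions not performed by a steward
    or a known script account, or None if no such action exists.

    `actions` holds (user_name, timestamp) pairs; timestamps are
    YYYYMMDDHHMMSS strings, so string comparison matches time order.
    """
    excluded = SYSTEM_ACCOUNTS | frozenset(stewards)
    candidates = [ts for user, ts in actions if user not in excluded]
    return max(candidates, default=None)
```

The result is only a proxy for the closure date: a wiki can sit inactive for a long time before it is formally closed.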

I just worked on a Wikipedia 25-related request from the WMF Communications department for:

  • The monthly article count for each Wikipedia during its history
  • The creation date for each Wikipedia
  • The first article created at each Wikipedia

I was able to provide the first (with a bunch of caveats around the precision of the earliest data), but not the second and third.

In the process, I ended up digging into the data that @KCVelaga_WMF prepared. Here's a current brain dump from the experience.

For terminology, see mw:MediaWiki_history.

UseModWiki-first wikis

For wikis that started off using UseModWiki, I don't think there's any structured way to get the creation date (or first article). My understanding is that the conversion process from UseModWiki (to either Phase II or MediaWiki) did not preserve history. Although some of that lost history has been restored ad hoc, I think we generally have to assume that it is lost.

For these wikis, the lowest revision ID will correspond to the first revision on the newer software, with old content imported later. The lowest revision timestamp will correspond either to that same revision or to a revision imported later using the XML import feature rather than during the conversion process. That imported revision might have been made on the same wiki's UseModWiki install, but it is at least as likely to have been imported from a different wiki.

For these wikis, likely the only way to get the answer is manual historical work (e.g. digging through mailing lists and old wiki pages for references to the wikis being created). This would probably be able to get us the answer in most cases, but there will likely still be (1) some cases where we can't confirm the exact date and only know the range of, say, a couple weeks and (2) some ambiguity about what constitutes "creation".

Ultimately, we could probably get to sensible answers, but it would take a significant amount of historical work and judgement calls. If we opened a community discussion/call for evidence, we would probably get good answers and a lot of participation, but it would be a significant chunk of work to manage it.

Phase II– or MediaWiki–first wikis

For wikis which didn't start off using UseModWiki, I believe the answer is simpler: the time of the revision with the lowest ID. This seems to produce results as good as or better than any other method. I took 7 wikis where there was a big discrepancy between the "least revision" and KC's results and established the creation date through historical research, and in all 7 cases, the "least revision" was accurate (and the other method was not).
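A minimal sketch of the "timestamp of the lowest revision ID" method, assuming the revisions are available as `(rev_id, rev_timestamp)` pairs (the names mirror the `revision` table columns, but the function itself is hypothetical). The point it illustrates is that the lowest ID and the lowest timestamp can disagree once older revisions have been imported.

```python
from typing import Iterable, Optional, Tuple


def creation_proxy(revisions: Iterable[Tuple[int, str]]) -> Optional[str]:
    """Return the timestamp of the revision with the lowest rev_id,
    or None if there are no revisions.

    Timestamps are YYYYMMDDHHMMSS strings. Imported revisions keep their
    original (older) timestamps but receive new, higher rev_ids, so they
    do not affect the result.
    """
    first = min(revisions, key=lambda rev: rev[0], default=None)
    return first[1] if first is not None else None


# Hypothetical data: revision 57 was imported later from another wiki,
# so it carries an older timestamp than revision 1.
sample = [
    (1, "20040312101500"),
    (2, "20040312103000"),
    (57, "20020801090000"),
]
print(creation_proxy(sample))  # 20040312101500
```

Taking the minimum timestamp instead would return the 2002 import here, which is the failure mode the lowest-ID method avoids.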

Distinguishing the two groups

Unfortunately, there also doesn't seem to be an easy way to distinguish the two groups (wikis which started off with UseModWiki and wikis which didn't). When wikis were converted from UseModWiki, content was imported to the new software using the username "Conversion script". As far as I can tell, this username has not been used for any other system activity, so I thought its presence in a wiki's history was a sure sign that it started off with UseModWiki.

Alas, it turns out that it shows up in other wikis too, because those conversion revisions have been imported elsewhere, and identifying imported revisions is not at all easy (see T22148).

There's actually a wikiBirthday maintenance script that uses the "timestamp of minimum rev ID" method, which I found to be the best option despite it being mostly wrong for UseModWiki-first wikis.

In the WMF Slack, @Michael just asked about getting wiki creation dates for some newer wikis to understand why they're missing a particular table (T414600).

This is a case where it would be helpful for us to provide the creation date for only Phase II– or MediaWiki–first wikis, even if the data has to remain null for UseModWiki-first wikis. The main blocker for that is figuring out how to identify UseModWiki-first wikis.