Import 2001 wikipedia data
Open, Low, Public

Description

Tim was just telling me there is an XML dump of old MediaWiki history from 2001 that has not been imported into the live dbs in prod. We should get it!

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 10 2017, 5:16 PM

In 2010, I found a backup of Wikipedia from August 2001. Aside from the usual UseModWiki database, I discovered the files diff_log and rclog which contained a previously unknown copy of every revision to the site as of that date. See this wikimedia-l post. The backup is now hosted at https://dumps.wikimedia.org/archive/2001/2001-08-17/

There were some challenges to interpreting this data. Most notably, there was no move log, but by looking at the rclog and diffs it was possible to reconstruct the move operations. UseModWiki had a checkbox in the move page action which allowed the user to update all links to the moved page -- these diffs were not represented in diff_log and also had to be reconstructed. I developed the script which can now be found at https://phabricator.wikimedia.org/diffusion/EWMA/browse/master/importUseModWikipedia.php , which contains all my discoveries and produces a MediaWiki XML backup file as output.

The one missing component was incorporating page renames from the time of the backup to the current date, but @Milimetric's recent work in reconstructing page renames should help with that.

You can easily confirm that these revisions have not yet been imported. For example, the history of the Afghanistan article only goes back to November 2001, and starts with a minor edit. In fact the first edit was on January 16, 2001.
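A minimal sketch of that check, assuming the standard MediaWiki Action API on en.wikipedia.org (`action=query` with `prop=revisions` and `rvdir=newer` to list the oldest revision first). The helper name `earliest_revision_url` is made up for illustration; the sketch only builds the request URL and does not fetch it.

```python
from urllib.parse import urlencode

# Assumption: the public Action API endpoint of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def earliest_revision_url(title: str) -> str:
    """Build an API URL that asks for the oldest surviving revision of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,        # only the first revision in the chosen direction
        "rvdir": "newer",    # enumerate from oldest to newest
        "rvprop": "timestamp|user|comment|flags",
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(earliest_revision_url("Afghanistan"))
```

Opening the printed URL in a browser should show the timestamp of the first surviving revision, which can then be compared with the dates recorded in the 2001 backup.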

I see that there is also a UseModWiki backup in dumps.wikimedia.org from January 2002. Unfortunately, it does not contain diff_log or rclog, so there might not be any unimported revisions in it.

Nuria moved this task from Incoming to Dashiki on the Analytics board. Jan 23 2017, 5:05 PM

I've already checked the so-called January 2002 dump ... It's just a UseModWiki version of the Nostalgia Wikipedia dump.

Also, re the actual 2001 dump, I think it'd be best to only import edits when there is no or only a trivial gap in the page history between the last edit in those dumps and the first surviving one in the Wikipedia database.

Why is that, @Graham87? I mean I see why a big gap might be confusing, but it seems better than an even bigger gap without those imported edits. Is there another reason I'm missing?

If the initial author in the surviving Wikipedia database uses an edit summary like "Fixed link", that would become misleading if earlier edits were imported with gaps. It would also make the history more confusing IMO ... people would be like "Wuh? Where did all *that* content come from?"

That's just the principle *I* use though ... I'd be OK if consensus was against me.

Milimetric triaged this task as Low priority. May 8 2017, 2:34 PM
mforns moved this task from Dashiki to Backlog (Later) on the Analytics board. Aug 10 2017, 4:19 PM
He7d3r added a subscriber: He7d3r. Aug 16 2017, 6:46 PM
fdans added a subscriber: fdans. Oct 9 2017, 4:14 PM

We'd like to have our editing APIs in place before we import this data. Aiming to have this in Q3.

Nuria moved this task from Wikistats Production to Dashiki on the Analytics board. Jan 18 2018, 6:37 PM
Milimetric moved this task from Dashiki to Incoming on the Analytics board. Apr 2 2018, 3:32 PM
Nuria moved this task from Incoming to Data Quality on the Analytics board. Apr 5 2018, 5:15 PM

I only found out by accident, but it appears XML versions of these dumps were put up on dumps.wikimedia.org last October ...
https://dumps.wikimedia.org/archive/2001-xml/

I'm planning to do my own imports from them. But it'd be nice if something like Nostalgia Wikipedia could be set up for them ... hopefully with usernames intact.

I would be planning to work with them, but I get the following error using importDump.php to import them under MW1.25 (old, I know, but I don't think an update would fix this):
A database query error has occurred.
Query: INSERT INTO logging (log_id,log_type,log_action,log_timestamp,log_user,log_namespace,log_title,log_comment,log_params) VALUES (NULL,'move','move','20010322014545',NULL,'0','PythagoreanTheorem','','Pythagorean_Theorem\n1')
Function: WikiRevision::importLogItem
Error: 1048 Column 'log_user' cannot be null (localhost)

Oh I see now from the source code for importUseModWikipedia.php: the account UseModWiki admin needs to exist first.
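One way to catch this before running importDump.php might be to list every contributor name the dump mentions, so the matching accounts can be created first. A rough sketch, assuming the dump follows the usual MediaWiki XML export layout (a `<contributor><username>` element inside each `<revision>` and `<logitem>`); `contributor_usernames` and `local_name` are made-up helpers, not part of MediaWiki.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def contributor_usernames(xml_source) -> set:
    """Collect every <username> value in an export file.

    xml_source may be a file path or a binary file object.
    """
    names = set()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "username" and elem.text:
            names.add(elem.text.strip())
        elif name in ("revision", "logitem"):
            # The usernames inside have already been seen; drop the
            # element's contents to keep memory use down on large dumps.
            elem.clear()
    return names
```

If the result includes a name such as UseModWiki admin that has no account on the target wiki, that account would need to exist before the import is attempted.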

The import script seems to halt on the title "Vector space]" or somewhere around there, using the filtered XML dump. So it's not working quite yet.

After replacing all instances where the title was "Vector_Space]" with "Vector_Space1", the XML file imported perfectly here!

I've imported a few pages, including the page on admins and the one on Atlas Shrugged.

I've also created accounts and user pages on enwiki for Page move link fixup script and UseModWiki admin.

However the dump doesn't contain all the edits to C. Northcote Parkinson ... rc.log contains four here while the dump only contains two. Hmmm ...

A more serious omission from the dump is edits to "The Most Remarkable Formula In The World" ... the dump only has history up to the end of March 2001 but the Nostalgia Wikipedia has many other edits from July.
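Discrepancies like these could be hunted for systematically by counting the revisions each page has in the XML dump and comparing the totals against another source such as rc.log. A small sketch, assuming the standard MediaWiki export layout in which each `<page>` holds a `<title>` followed by its `<revision>` elements; `count_revisions` is a hypothetical helper, and parsing rc.log itself is left out because its format is not described here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(xml_source) -> Counter:
    """Return a Counter mapping page title to number of <revision> elements.

    xml_source may be a file path or a binary file object.
    """
    counts = Counter()
    current_title = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            # <title> closes before any <revision> of the same page.
            current_title = elem.text
        elif name == "revision":
            counts[current_title] += 1
            elem.clear()  # keep memory use down on large dumps
        elif name == "page":
            current_title = None
            elem.clear()
    return counts
```

Calling `count_revisions` on the dump file and looking up a title gives the number to compare, for instance, against the four rc.log entries mentioned above for C. Northcote Parkinson.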

mforns added a subscriber: mforns. Apr 22 2019, 4:03 PM

Can we do this in a hackathon?

mforns raised the priority of this task from Low to Needs Triage. Apr 22 2019, 4:03 PM
mforns triaged this task as Low priority.
Retro added a subscriber: Retro. Jul 19 2019, 6:12 PM

I just created a Wikipedia page on this topic (https://en.wikipedia.org/wiki/Wikipedia:Starling_archive_imports) and was just about to bring it somewhere when I stumbled upon this task.

From what I've seen, a good number of these pages have uncomplicated histories with single-digit numbers of missing revisions, so they can be imported by hand (by directly modifying the XML file with the new revisions and re-importing it if I'm not mistaken). On the other hand, there are some pages such as HomePage, with 418 missing revisions, that should most probably be imported with some sort of script or bot. (I know I'm saying that like it's something easy.)

Nope, the easiest way to do the imports is to import the XML dump into a separate MediaWiki database and use Special:Export from there to create the XML files, which can then be edited by hand, if necessary. That way there's no practical limit on the number of revisions that can be imported at a time. I've worked on a few pages this way, slowly. A few points though:
* I seem to be in the minority with this, but I really would prefer it if we only imported revisions where existing history (either on the English or Nostalgia Wikipedias) already went back to the 17th of August or earlier. I see gaps in history as potentially very confusing.
* We need to make sure usernames line up correctly on the English and August 2001 Wikipedias, including username changes (either by choice or through software changes like the "~enwiki" prefix). Local usernames should be preserved in all cases.
* We need to only import exactly the revisions we need, per T175357 (which still applies here).
* All of an article's revisions should ideally be in one place ... a lot of cut-and-paste moves were made back in 2001!
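The first and third points could be combined into a simple selection rule. The sketch below is only an illustration under two loud assumptions: timestamps are ISO 8601 strings as found in MediaWiki XML exports, and a revision counts as "already present" when a surviving revision has exactly the same timestamp (real matching would probably need to compare more than that). `revisions_to_import` and `DUMP_DATE` are made-up names.

```python
from datetime import datetime, timezone

# Assumption: "the 17th of August or earlier" means up to the end of that day, UTC.
DUMP_DATE = datetime(2001, 8, 17, 23, 59, 59, tzinfo=timezone.utc)


def parse_ts(ts: str) -> datetime:
    """Parse an export-style timestamp such as '2001-01-16T12:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def revisions_to_import(dump_timestamps, existing_timestamps):
    """Return the dump timestamps worth importing for one page, oldest first.

    Nothing is returned when the surviving history starts after the dump
    date, because importing would then leave a gap in the page history.
    """
    existing = {parse_ts(t) for t in existing_timestamps}
    if not existing or min(existing) > DUMP_DATE:
        return []
    # Assumption: an identical timestamp means the revision already survives.
    missing = [t for t in dump_timestamps if parse_ts(t) not in existing]
    return sorted(missing, key=parse_ts)
```

For a page whose surviving history only starts in November 2001, the function returns an empty list, which matches the cautious policy described in the first point; relaxing that policy would only mean dropping the early return.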

That is fair enough, thank you for your input. Regarding your other points:
* While I display some reluctance in that matter insofar as it applies to HomePage (which contains Wikipedia's first ever edit), I'm overall neutral as to whether holding off on creating such gaps is beneficial. Perhaps I can make a list of such pages if it would be of use.
* I've noticed that before, with such usernames as StasK, etc. Perhaps I can make a list of such usernames, if such a list would be useful and doesn't already exist.
* That's true, we don't want to accidentally import revisions that survive and/or are on Nostalgia.
* That's true.

Also, I'd like to note that subpages should be considered independently of their parent pages, given that mainspace subpages are now (thankfully) disabled.

*While I display some reluctance in that matter insofar as it applies to HomePage (which contains Wikipedia's first ever edit), I'm overall neutral as to whether holding off of creating such gaps is beneficial. Perhaps I can make a list of such pages if it would be of use.

I've just done the HomePage import. I've got my own list of pages to import; if you want to create your own, go ahead.

*I've noticed that before, with such usernames as StasK, etc. Perhaps I can make a list of such usernames, if such a list would be useful and doesn't already exist.

I don't think it already exists. My only problem with such a list would be that it should be restricted to people who haven't changed their username to hide their real name.

Also, I'd like to note that subpages should be considered independently of their parent pages, given that mainspace subpages are now (thankfully) disabled.

Yes, and some titles at subpages got moved to non-subpage names.