
Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
Open, Lowest, Public

Description

I propose to add an option to have the Internet Archiver (and/or other robots) retrieve raw wikitext. This way, if a wiki goes down, it will be possible to more easily create a successor wiki by gathering that data from the Internet Archive. As it is now, all that can be obtained are the parsed pages. That's fine for a static archive, but one might want to revive the wiki for further editing.

I would say that someone should write a script to convert the parsed pages retrieved from the Internet Archive back into wikitext, but that will run into problems with templates and such, unless it's designed to identify them and recreate them. It would be a much easier and cleaner solution to just make the wikitext available from the get-go.


Version: 1.23.0
Severity: enhancement

Details

Reference
bz62468

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 2:56 AM
bzimport set Reference to bz62468.
bzimport added a subscriber: Unknown Object (MLST).

I was going to say, there should also be an option to let the Archiver access Special:AllPages or a variant of it, so that all the pages can be easily browsed; currently it seems like, when browsing archived pages, it's often necessary to find the page one is looking for by going from link to link, category to category, etc.

Theoretically, you could put something in your robots.txt allowing the Internet Archiver to index the edit pages: https://www.mediawiki.org/wiki/Robots.txt#Allow_indexing_of_edit_pages_by_the_Internet_Archiver

I'm not sure how well the particular implementation suggested there works, though; from what I can tell, it doesn't. Also, most archived wiki pages I've seen haven't had an "Edit" link.

It makes little sense to "archive" wikitext via action=edit; there is action=raw for that. But the IA crawler won't follow action=raw links (there are none), and as you say, there is no indication that fetching action=edit would work.
I propose two things:

  1. Install Heritrix and check whether it can fetch action=edit: if not, file a bug and see what they say; if yes, ask the IA folks on the "FAQ" forum and see what they say;
  2. Just download the data yourself and upload it to archive.org: you only need wget --warc, then upload in your favorite way.
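For the second option, you first need the list of action=raw URLs to feed to wget. A minimal Python sketch of building them (the base URL and page titles are hypothetical examples, not from any particular wiki):

```python
from urllib.parse import quote

def raw_url(base: str, title: str) -> str:
    """Build an action=raw URL for a page title on a MediaWiki site.

    `base` is the path to index.php, e.g. "https://example.org/w/index.php".
    Spaces become underscores, as MediaWiki does in its URLs.
    """
    return f"{base}?title={quote(title.replace(' ', '_'))}&action=raw"

# Hypothetical page titles; in practice you would enumerate them,
# e.g. via Special:AllPages or the API.
titles = ["Main Page", "Help:Contents"]
urls = [raw_url("https://example.org/w/index.php", t) for t in titles]
print(urls[0])  # https://example.org/w/index.php?title=Main_Page&action=raw
```

Writing those URLs to a file then lets wget archive them with something like `wget --warc-file=wiki -i urls.txt`.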

I suspect most MediaWiki installations have robots.txt set up as recommended at [[mw:Manual:Robots.txt#With_short_URLs]], with

User-agent: *
Disallow: /w/

See for example:

* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit

So, they couldn't retrieve action=raw even if they wanted to. In fact, if I were to set up a script to download it, might I not be in violation of robots.txt, which would make my script an ill-behaving bot? I'm not sure my moral fiber can handle an ethical breach of that magnitude. However, some sites do allow indexing of their edit and raw pages, e.g.

https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=edit
https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=raw
https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=edit
https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=raw

Dramatica and RationalWiki use all kinds of secret sauces, though, so who knows what's going on there. Normally, edit pages have a <meta name="robots" content="noindex,nofollow" /> tag, but that's not the case with the Dramatica or RationalWiki edit pages. Is there some config setting or extension that changes the robot policy on edit pages? Also, I wonder whether they had to tell the Internet Archive to archive those pages, or whether the Internet Archive did it on its own initiative.
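Whether a given URL is blocked by such a robots.txt can be checked mechanically with Python's standard-library robotparser; a small sketch (host and bot name are hypothetical, and note that robotparser does plain prefix matching, without Google's wildcard extension):

```python
import urllib.robotparser

# The common MediaWiki robots.txt discussed above.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /w/"])

# Anything under /w/ -- including action=raw and action=edit -- is blocked...
blocked = rp.can_fetch("mybot", "https://example.org/w/index.php?title=X&action=raw")
# ...while the pretty /wiki/ URLs are not (the prefix "/w/" does not match "/wiki/").
allowed = rp.can_fetch("mybot", "https://example.org/wiki/X")
print(blocked, allowed)  # False True
```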

IA doesn't crawl on request.
On what "Allow" directives and other directives do or should take precedence, please see (and reply) on https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine

(In reply to Nemo from comment #5)

IA doesn't crawl on request.
On what "Allow" directives and other directives do or should take precedence, please see (and reply) on https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine

I might reply to that, as more information becomes available. Today, I set my site's robots.txt to say:

User-agent: *
Disallow: /w/

User-agent: ia_archiver
Allow: /*&action=raw

So, I guess a few months from now, I'll see whether the archive of my wiki for 12 March 2014 and thereafter has the raw pages. If not, that's a bug, I think.
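Whether that Allow line takes precedence depends on wildcard support, which the original robots.txt standard lacks; under Google's extension, the longest matching rule wins and Allow beats Disallow on ties. A rough sketch of that matching logic (my own toy implementation, not IA's actual one, which may well differ):

```python
import re

def allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide fetchability under Google-style robots.txt semantics:
    '*' matches any characters, the longest matching pattern wins,
    and Allow beats Disallow on ties.

    `rules` is a list of ("Allow"|"Disallow", pattern) pairs.
    """
    best_len, best_verdict = -1, True  # no matching rule => allowed
    for verdict, pattern in rules:
        # Turn the robots.txt pattern into an anchored regex.
        regex = "^" + re.escape(pattern).replace(r"\*", ".*")
        if re.match(regex, path):
            if len(pattern) > best_len or (len(pattern) == best_len and verdict == "Allow"):
                best_len, best_verdict = len(pattern), (verdict == "Allow")
    return best_verdict

# The robots.txt above, as rule pairs.
rules = [("Disallow", "/w/"), ("Allow", "/*&action=raw")]
print(allowed("/w/index.php?title=Main_Page&action=raw", rules))   # True: Allow is longer
print(allowed("/w/index.php?title=Main_Page&action=edit", rules))  # False: only Disallow matches
```

Under these semantics the action=raw URLs would indeed be fetchable while everything else under /w/ stays blocked.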

(In reply to Nathan Larson from comment #6)

So, I guess a few months from now, I'll see whether the archive of my wiki for 12 March 2014 and thereafter has the raw pages. If not, that's a bug, I think.

Did you include dofollow links to action=raw URLs in your skin?

(In reply to Nemo from comment #7)

Did you include dofollow links to action=raw URLs in your skin?

I put this in MediaWiki:Sidebar:

** {{fullurl:{{FULLPAGENAMEE}}|action=raw}}|View raw wikitext

As a backup, I also added a sidebar link to Special:WikiWikitext (per instructions at [[mw:Extension:ViewWikitext]]) just to be sure. Of course, most people won't want to have that on their sidebar. I started a page on this at [[mw:Manual:Internet Archive]].