
Automated archiving of URLs
Closed, Duplicate · Public

Description

Various mechanisms are in place on different wikis to address link rot, e.g. parameters like "archiveurl=" and "archivedate=" in templates like cite_web and cite_news on enwp that allow linking to archived versions of the cited URL, or https://ru.wikipedia.org/wiki/%D0%A3%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA:WebCite_Archiver , which creates and adds archival links automatically.

It would be good if Citoid provided similar functionality, so as to address link rot in a way that is more consistent across wikis.

Event Timeline

Daniel_Mietchen raised the priority of this task from to Needs Triage.
Daniel_Mietchen updated the task description.
Daniel_Mietchen added a project: Citoid.
Daniel_Mietchen moved this task to IO Tasks on the Citoid board.
Daniel_Mietchen subscribed.
Mvolz triaged this task as Medium priority. Feb 16 2015, 3:30 PM
Mvolz set Security to None.

I like archive.today better in terms of predictability of URLs (it might be faster, since we don't have to wait for the URL to be generated before linking to it; archive.org's URLs embed the capture timestamp too precisely to be predictable) and because it ignores robots.txt (as it is allowed to do, since the archiving happens as the result of a direct request from a user rather than a crawler). On the downside, it's younger and privately funded. Archive.org is more established and a non-profit, and probably more reliable.
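To make the predictability point concrete, here is a minimal sketch (Python; the function name is my own) of how a Wayback Machine snapshot URL is structured: the path embeds a 14-digit capture timestamp, so the final link is only knowable after the snapshot has actually been taken.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"

def wayback_snapshot_url(target_url: str, captured_at: datetime) -> str:
    """Build a Wayback Machine snapshot URL.

    The path embeds the capture timestamp as YYYYMMDDhhmmss, which is
    why a tool cannot pre-compute the link before archive.org has
    finished archiving the page.
    """
    stamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{target_url}"

# Example: the capture time must already be known to form the link.
url = wayback_snapshot_url(
    "https://example.com/article",
    datetime(2015, 2, 16, 15, 30, 0, tzinfo=timezone.utc),
)
```

An archive.today short URL, by contrast, is opaque but issued as soon as the capture is requested, which is the predictability/speed trade-off described above.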

Thoughts? @mobrovac? @Jdforrester-WMF, is this something we should consult legal for?

I recommend checking out https://perma.cc/ as well which is specifically designed for this purpose and backed by DPLA, Internet Archive and others. @Yana

If we use something like perma.cc or archive.org for citations on wikis, we probably need some sort of clear warning that the actual site is no longer available, so that contributors can verify that the site was not removed precisely because the statement it is cited for on Wikipedia is no longer true.

@Yana, I'm not sure exactly what you're requesting. We would leave the original URL in, which would allow users to verify the current existence or non-existence of the original URL themselves. Doing this would not change the user's experience of the site in terms of being able to verify why a link is no longer available; we are just doing automatically something that is typically done manually or by a bot, see: https://en.wikipedia.org/wiki/Wikipedia:Link_rot

Note that archive.org has reached out to us before about this and would be happy to be an active partner in this. I'd be happy to set up the meeting there.

Note to everyone that I am about 70% done developing an archive bot for enwiki that makes use of archive.org, a.k.a. the Wayback Machine. Current features of the bot include testing whether a link is dead, requesting that the Wayback Machine archive pages that have no archived copy yet, and of course linking a source to an archive when the live link goes dead. A BRFA is currently open.

@Cyberpower678 awesome! CC-ing @Ocaasi who will be happy to hear that!

For posterity, the bot is up at https://tools.wmflabs.org/iabot, where there is also an option to add archive parameters to a specific page's citations before they have a chance to die.

Hey czar! Is there a place on https://www.mediawiki.org/wiki/Citoid/Enabling_Citoid_on_your_wiki (or elsewhere) where you could document something about the bot maybe, especially if you welcome/need community input? Starting from the blurb it already features would be ok, I think.

@Cyberpower678 Is there a good place to document IABot on mediawiki.org? I'm not familiar with where the IABot repo is stored or whether it's capable of running on other MediaWiki installations.

I meant documenting it in relation to Citoid and citing needs on related pages - thanks for the links in the meantime.

I'm not as familiar with Citoid as you guys are. But my API can provide some of the fastest responses for the URLs IABot has encountered. If tools can be integrated with the API, it could create a network of clients that collectively work to improve IABot's DB.
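IABot's actual API endpoints and response format aren't documented in this thread, so purely as an illustration of the "network of clients" idea, here is a hypothetical caching client (Python; the endpoint, parameters, and field names are all invented for this sketch, not IABot's real interface): each tool caches liveness answers locally and only hits the shared service on a miss.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameters -- invented for illustration,
# not IABot's documented API.
IABOT_API = "https://tools.wmflabs.org/iabot/api.php"

class LinkStateClient:
    """Sketch of a caching client: many tools sharing one link-state
    service, each avoiding repeat lookups via a local cache."""

    def __init__(self) -> None:
        self._cache: dict[str, bool] = {}

    @staticmethod
    def parse_live(payload: str, url: str) -> bool:
        """Assumed response shape: {"urls": {"<url>": {"live": true}}}."""
        data = json.loads(payload)
        return bool(data.get("urls", {}).get(url, {}).get("live", False))

    def is_live(self, url: str) -> bool:
        if url not in self._cache:
            query = urllib.parse.urlencode({"action": "checkurl", "url": url})
            with urllib.request.urlopen(f"{IABOT_API}?{query}") as resp:
                self._cache[url] = self.parse_live(resp.read().decode(), url)
        return self._cache[url]
```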

If anybody can provide details around the relationship between this tool and Citoid, they are welcome to do so at the page linked above. TY.

I can, if there is a place on Meta perhaps, but there is no inherent connection between IABot and Citoid to generalize on MediaWiki (and there isn't a good place on enwp's Citoid page either).