
Investigation: Migrate dead external links to archives
Closed, Resolved · Public · 8 Estimated Story Points

Description

See Fixing dead links on Meta. There are still some questions to answer:

  • Are there existing contacts between the WMF and the Internet Archive that we could pursue?

Alex (@Sadads) and others have been working with the Internet Archive; they're very interested in helping.

Alex says: "We have made significant progress with en:User:Cyberbot II adding links to archive URLs, but there needs to be a good technical way to store them. Talked with Jdforrester (WMF) about building it into citoid at WikiCon USA. The Internet Archive was there, and expressed an interest in pushing their APIs to the limit to fix the 404s and other errors on Wikipedia."

Also investigate how an ecosystem of link-fixing could work; there are multiple possible approaches.

Related:
T89438: Automated archiving of URLs
T115224: On URL submission, look up the archived page in the Internet Archive's index and add to the return data

Event Timeline

DannyH raised the priority of this task to Needs Triage.
DannyH updated the task description.
DannyH added a project: Community-Tech.
DannyH moved this task to Needs Discussion on the Community-Tech board.
DannyH subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
DannyH edited a custom field.
DannyH edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari renamed this task from Investigation: Migrate dead links to Wayback Machine to Investigation: Migrate dead links to Wayback Machine (or other archiving service). Dec 8 2015, 6:38 PM
kaldari updated the task description.
DannyH triaged this task as Medium priority. Dec 10 2015, 6:25 PM
DannyH renamed this task from Investigation: Migrate dead links to Wayback Machine (or other archiving service) to Investigation: Migrate dead links to Wayback Machine. Dec 16 2015, 1:28 AM
kaldari moved this task from In Development to Ready on the Community-Tech-Sprint board.

Adding seth; he wrote a bot for the German Wikipedia (dewp) that can submit pages to the Wayback Machine.

The method I use with CamelBot is quite simple:

Given $from_id, every 60 seconds CamelBot runs:

-- Fetch the next batch of up to 1000 external links, together with the
-- title of the page each link appears on.
SELECT `el_id`, `el_from`, `el_to`, `p`.`page_title` AS `page`
FROM `externallinks`
LEFT JOIN `page` AS `p`
	ON `p`.`page_id` = `externallinks`.`el_from`
WHERE `el_id` >= $from_id  -- resume where the previous run stopped
ORDER BY `el_id`
LIMIT 1000

and updates $from_id on Toolserver, so that all new external URLs are collected.
These URLs are filtered in Perl to exclude links to wiki[mp]edia, wikibooks, wikidata, etc., and to reduce redundancy. Any new URLs that remain are submitted to https://web.archive.org/save/...
Additionally or alternatively, www.webcitation.org/archive?email=$email_address&source=$wp_page_url&url=$encoded_url could be used.
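
Roughly, that loop might look like the following Python sketch (hypothetical; the real bot is written in Perl, and the database connection and state storage are assumptions):

import re
import requests

# Skip links to Wikimedia sister projects, as in seth's Perl filter.
EXCLUDE = re.compile(r'wiki[mp]edia|wikibooks|wikidata')

def poll_once(conn, from_id):
    """One pass: fetch up to 1000 new external links from a replica
    database connection `conn` and submit them to the Wayback Machine.
    Returns the next from_id, which the caller persists before sleeping
    ~60 seconds and running again."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT el_id, el_to FROM externallinks"
            " WHERE el_id >= %s ORDER BY el_id LIMIT 1000",
            (from_id,),
        )
        rows = cur.fetchall()
    seen = set()
    for el_id, url in rows:
        from_id = el_id + 1
        if EXCLUDE.search(url) or url in seen:  # filter and de-duplicate
            continue
        seen.add(url)
        # Ask the Wayback Machine to capture the page.
        requests.get("https://web.archive.org/save/" + url, timeout=60)
    return from_id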

I think the main issue is getting links to the archived copies back into Wikimedia projects; the Internet Archive already scrapes our pages for new external links (cf. https://www.mediawiki.org/wiki/Archived_Pages#IA_is_crawling_Wikipedia_outlinks).

Yes: the blocker is definitely bringing them back into wiki space. The Internet Archive archives by default everything that persistently appears as an external link.

@Fhocutt, I got info from @Trizek-WMF on how the French WP link archiving works.

The user who's running it is User:Kelson -- https://fr.wikipedia.org/wiki/Utilisateur:Kelson

He's running a service called Kiwix, which creates offline versions of Wikipedia. Kiwix archives some external links to archive.wikiwix.com. You can see examples in the references here, for the links marked [archive]:

https://fr.wikipedia.org/wiki/Mod%C3%A9lisme_ferroviaire#R.C3.A9f.C3.A9rences

Here's some more information on Kiwix, from the Wikimedia blog:

http://blog.wikimedia.org/2014/09/12/emmanuel-engelhart-inventor-of-kiwix/

@Fhocutt, there may also be a bot on German WP to look at.

From this discussion:
https://meta.wikimedia.org/wiki/Talk:2015_Community_Wishlist_Survey/Status_report_1

User:° said: "While the dead link bot is a good thing for small wikis, where the community does not act on dead links, in the case of the German Wikipedia there is a bot (GiftBot) active that does something different: dead links are marked on the talk page and an archive link is proposed, so that a human editor can decide what to do (use a new link, delete the link, use the provided archive link, or use an archive link but a different archived version)."

Here's the link for GiftBot on de.wp: https://de.wikipedia.org/wiki/Benutzer:GiftBot

I used Google Translate on that page; if I'm reading it right, it looks like archive linking is one feature among many that GiftBot provides.

This looks like the relevant phrase on that page:

  1. dwl*.{sh,tcl}: Finden und Melden von defekten Weblinks (find and report broken web links)

So it's worth checking out what they're doing too.

Some thoughts on @Legoktm's Lua module proposal (https://www.mediawiki.org/wiki/User:Legoktm/archive.txt). This is a similar idea to T115224, but instead of looking up and adding the archive URL at the time of citation creation, it would be looked up and added for all citations at the time the article is saved (during template parsing). This might generate a heavy volume of traffic to the Internet Archive's API, so Lego suggests creating a local cache or mirror of the Internet Archive's archive data.

Personally, I think T115224 is probably a better route to pursue (as it won't require a local mirror), although its downside is that it only fixes new citations, not existing ones. The other advantage of the citoid solution is that the editor could theoretically check the archive URL before the page is saved. (Not all IA archives are actually valid.) I also worry about page save times being affected by the URL look-ups, although this would also be mitigated by having a local mirror. Another downside of the Lua module approach is that after a few years (when almost all articles have been saved at least once with the new Lua module in place), its functionality will be completely redundant with the citoid functionality.

In summary, I think T115224 is probably the best solution for new citations and a bot or gadget is probably the best solution for existing citations. Other opinions are welcome.
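
As a concrete illustration of that direction: the Internet Archive exposes a public Availability API that returns the closest snapshot for a URL, which is the kind of lookup either citoid or a bot would need. A minimal Python sketch (whether citoid would use this exact endpoint is an assumption):

import requests

def closest_snapshot(url, timestamp=None):
    """Return the URL of the closest archived snapshot of `url`, or None.
    `timestamp` is an optional YYYYMMDDhhmmss string (e.g. the citation's
    access date) that biases the lookup toward a contemporaneous copy."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    r = requests.get("https://archive.org/wayback/available",
                     params=params, timeout=10)
    r.raise_for_status()
    closest = r.json().get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

# For example: closest_snapshot("http://example.com/", "20151201")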

DannyH renamed this task from Investigation: Migrate dead links to Wayback Machine to Investigation: Migrate dead external links to archives. Feb 11 2016, 12:18 AM
DannyH updated the task description.
kaldari moved this task from In Development to Ready on the Community-Tech-Sprint board.

I dug up some useful info about the French tool. It's basically a combination of http://archive.wikiwix.com/, which archives all the external links in the French Wikipedia (not sure by what mechanism), and a default gadget called ArchiveLinks. All the default gadget does is add an archive link to every citation; by that, I mean it just takes the existing URL and sticks http://archive.wikiwix.com/cache/?url= on the front of it. It doesn't check whether the archive actually exists or works. For example, here's an archive that is broken (from an actual citation). It's also important to note that the gadget doesn't actually change the wikitext of the article: the archive links are inserted after each citation dynamically on the client side when you load the article. If you turn off the gadget in your preferences, all the archive links disappear.

The code for the gadget can be seen at https://fr.wikipedia.org/wiki/MediaWiki:Gadget-ArchiveLinks.js.
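
Expressed in Python rather than the gadget's JavaScript, the whole transformation amounts to a string prefix (the function name here is hypothetical):

WIKIWIX_PREFIX = "http://archive.wikiwix.com/cache/?url="

def wikiwix_archive_link(url):
    # The gadget simply prefixes the original URL with the Wikiwix cache
    # endpoint; it never checks that a working copy actually exists there.
    return WIKIWIX_PREFIX + url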

Oh, now that makes sense. I couldn't figure out why I couldn't find the edit where the Wikiwix bot inserted the archive links. That's kind of disappointing.

We've collected information on this page: https://meta.wikimedia.org/wiki/Fixing_dead_links

There are currently bots running on the English, German, Italian, and Spanish Wikipedias that link to the Internet Archive. On French, there's a default gadget, Liens archives (ArchiveLinks), which links to archived pages at Wikiwix (archive.wikiwix.com). More info is on the Meta page.

I'm closing this investigation ticket. There's more work going on in various tickets connected to T120433.