
Investigation: Automatic archive for new external links
Closed, Resolved · Public · 5 Estimated Story Points

Description

Wish #8 on the 2016 Community Wishlist Survey is "Automatic archive for new external links" -- to automatically keep a snapshot of an external reference that's cited in Wikipedia articles, and then add a link to the archived page in the reference template.

https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Bots_and_gadgets#CW2016-R008

Some of the requested functionality is already happening: Internet Archive is already archiving all new external links added to Wikipedia. Cyberbot II looks at all of the existing links, and replaces dead links with a link to the archive.

The elements of the proposal that aren't being done yet: automatically generating an archive link for the specific access date when the link was added, and putting that archive link in the reference template before the link goes dead.

Mark G at Internet Archive is interested in this, and wants to talk about it.

This investigation is to determine whether these elements of the proposal are worth doing, and what we can do to make them happen.

Event Timeline

Restricted Application added a subscriber: Aklapper.
kaldari set the point value for this task to 5 · Oct 17 2017, 11:10 PM
kaldari triaged this task as Medium priority · Oct 17 2017, 11:30 PM

Proposal analysis

I did an analysis of the proposal and the discussion that followed to better understand what exactly was being asked for. Here's a summary of the discussion:

Problem: Web pages disappear and we are left with broken links. Adding a permanent link is more work for editors.
Proposed solution:

  1. Automatically add a link to the corresponding page in the Internet Archive if a URL and an access-date are provided in a citation (see the sketch after this list).
  2. Automatically add an access-date to citations that lack one when an edit is saved.
  3. Automatically request archival of web pages in the Internet Archive if they are not available there. - This is already achieved by the monitoring service that IA maintains, which archives all new outgoing links from Wikipedia.
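
Both the snapshot lookup in #1 and the archival request in #3 map onto public Internet Archive endpoints: the Wayback Machine "availability" API can return the snapshot closest to a given timestamp, and Save Page Now can be asked to crawl a page that has no snapshot yet. Here is a minimal Python sketch of that flow; it illustrates the endpoints involved and is not IABot's actual code.

```
# Sketch only: look up the Wayback Machine snapshot closest to a
# citation's access-date, and request archival if none exists yet.
import requests

WAYBACK_AVAILABLE = "https://archive.org/wayback/available"
SAVE_PAGE_NOW = "https://web.archive.org/save/"

def closest_snapshot(url, access_date):
    """Return the archive URL closest to access_date ('YYYYMMDD'), or None."""
    resp = requests.get(WAYBACK_AVAILABLE,
                        params={"url": url, "timestamp": access_date},
                        timeout=30)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]
    return None

def archive_or_lookup(url, access_date):
    """Prefer an existing snapshot; otherwise ask IA to archive the page now."""
    archive_url = closest_snapshot(url, access_date)
    if archive_url is None:
        # Save Page Now: a GET to /save/<url> asks IA to crawl the page.
        requests.get(SAVE_PAGE_NOW + url, timeout=60)
    return archive_url

if __name__ == "__main__":
    print(archive_or_lookup("https://example.com/", "20161201"))
```

Keying the lookup on the access-date is what ensures the archived copy matches the version the editor actually consulted (see Q4 below).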

The proposer states:

I have no preference as to whether it should be done in real time or by a bot, but it would be nice if it is done either way. My main points are to reduce the workload of editors that provide references, and to ensure that references can be accessed permanently, regardless of what the editor does. I do not think that I should prescribe the specific implementation, I am sure that people knowledgeable about bots and Wikimedia processes would know what is the best or easiest way to accomplish this.

and some questions they posed when they were told about the existence of IABot, with Cyberpower's responses:

Q1. Why do we wait for the link to die before offering a link to IA?
A1. That decision is up to the community; this option can be controlled via an on-wiki configuration page that IABot uses to determine how to run on the wiki.

Q2. Wouldn't it be better to add the link to IA as soon as it is available, so it points to the version of the web page that was used as a reference?
A2. This can be done regardless of whether the archive link is added immediately or later.

Q3. Web pages do change with time, and the information that is being used as a reference could disappear in later versions. Should we consider automatically adding access dates if they are not included in the citation?
A3. IABot does this for cite templates that are missing access dates. It can extrapolate the access date if the reference is a plain link (one possible mechanism is sketched below).
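
One plausible mechanism for that extrapolation (an assumption on my part, not a description of IABot's internals) is to walk the article's revision history and take the timestamp of the first revision that contains the URL:

```
# Sketch: find when a plain link first appeared in an article by walking
# its revision history through the MediaWiki API (oldest first).
# Continuation past the first 50 revisions is omitted for brevity.
import requests

API = "https://en.wikipedia.org/w/api.php"

def first_seen(title, url):
    """Return the timestamp of the oldest revision containing url, or None."""
    params = {
        "action": "query", "format": "json", "prop": "revisions",
        "titles": title, "rvprop": "timestamp|content",
        "rvslots": "main", "rvlimit": 50, "rvdir": "newer",
    }
    resp = requests.get(API, params=params, timeout=60).json()
    page = next(iter(resp["query"]["pages"].values()))
    for rev in page.get("revisions", []):
        if url in rev["slots"]["main"]["*"]:
            return rev["timestamp"]
    return None
```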

Q4. Are we making sure that the version linked to in IA is the one corresponding to the access date?
A4. This again can be controlled via the config page. On enwiki, it's set to fetch the snapshot closest to the access-date, or whatever snapshot is already saved in IABot's database of URLs.
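
For context on what "adding the link" means mechanically: the bot has to edit the citation's template parameters in wikitext. Below is a rough sketch using the mwparserfromhell library; the get_snapshot callback is a hypothetical stand-in for a lookup like the availability-API sketch above, and the parameter names mirror enwiki's {{cite web}}.

```
# Sketch (not IABot's actual code): add |archive-url= / |archive-date=
# to {{cite web}} templates that have a URL but no archive link yet.
import mwparserfromhell

def add_archive_params(wikitext, get_snapshot):
    """get_snapshot(url, access_date) -> (archive_url, archive_date) or None."""
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates(
            matches=lambda t: t.name.matches("cite web")):
        if not template.has("url") or template.has("archive-url"):
            continue  # no URL to archive, or an archive link already present
        url = str(template.get("url").value).strip()
        access_date = None
        for alias in ("access-date", "accessdate"):  # both spellings accepted
            if template.has(alias):
                access_date = str(template.get(alias).value).strip()
                break
        snapshot = get_snapshot(url, access_date)
        if snapshot is not None:
            archive_url, archive_date = snapshot
            template.add("archive-url", archive_url)
            template.add("archive-date", archive_date)
            # enwiki's marker (at the time) that the original is still live
            template.add("dead-url", "no")
    return str(code)
```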

Based on Q3/A3, I think we can pretty much rule out #2 in the proposed solution. If the bot can already extract that information and add it when it parses the page, it does not seem worth the effort to do it separately.

Proposed solution

There are two ways in which we could solve this. They are not mutually exclusive; however, #1 is self-sufficient while #2 is not.

  1. Let IABot work as it currently does, with one additional change: have it add archive-urls for live links as well as dead ones.
    • Pros:
      • Easy to implement: community consensus and a config change to IABot.
      • Sufficiently takes care of the proposal's request.
    • Cons:
      • The bot is slow, and it will take a long time to cover all the reference links on a big wiki.
      • The bot is not currently running on all wikis.
      • People could still encounter dead references, because there can be a delay between a URL going dead and the bot supplying the archive-url.
  2. Find a way to add the archive-url as soon as the external link is added. One possible implementation would be to hook into the service on the Internet Archive end that monitors new outgoing links from Wikipedia, and have it trigger a bot account to add the archive-url as soon as the link gets archived (a sketch of this glue follows the list).
    • Pros:
      • This would be the ideal scenario: once every reference has an archive-url, users would never have to look at dead reference URLs at all.
      • This would reduce the project's dependency on IABot as it currently runs. Technically, that seems like a benefit: the bot is slow, consumes huge amounts of database storage, and parsing the same pages over and over is wasteful.
      • Doing this would also address the request about adding access-dates.
    • Cons:
      • This would take some development time and effort, possibly from the folks at the Internet Archive who wrote the monitoring service, from Cyberpower, or from us. It could potentially use IABot's account for making the edits, but it will definitely take some work.
      • Not a self-sufficient method: the bot would still have to add archive-urls for all the URLs already on the wiki, though once that backlog is done, the bot would be redundant.
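
To make solution #2 concrete, here is a hypothetical sketch of the glue between IA's monitoring service and the wiki edit. The notification endpoint and payload fields are invented for illustration (nothing like this is confirmed to exist on IA's side); the wiki edit uses the real mwclient library, and add_archive_params() is the template-editing sketch shown earlier.

```
# Hypothetical glue for solution #2. Assumes IA's link-monitoring service
# could POST a JSON notification when it archives a newly added external
# link; the endpoint and payload fields below are invented.
import mwclient
from flask import Flask, request

app = Flask(__name__)
site = mwclient.Site("en.wikipedia.org")
# site.login("ExampleBot", "example-password")  # a bot account is assumed

@app.route("/link-archived", methods=["POST"])
def link_archived():
    # Invented payload: {"page_title", "url", "archive_url", "archive_date"}
    event = request.get_json()
    page = site.pages[event["page_title"]]
    old_text = page.text()
    # Only fill in the one URL this notification is about.
    new_text = add_archive_params(
        old_text,
        lambda url, _date: ((event["archive_url"], event["archive_date"])
                            if url == event["url"] else None))
    if new_text != old_text:
        page.save(new_text, summary="Add archive-url for newly archived reference")
    return "", 204
```

The appeal is that the monitoring service already knows which page gained which link and when it was archived, so the Wikipedia-side work reduces to a single template edit per notification.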

Other things that were brought up in the discussion

  1. Using archive.is alongside IA.
    • Archive.org URLs sometimes stop working, such as when a domain changes hands and the new owner puts up a restrictive robots.txt file. Archive.is archives aren't subject to this. Perhaps it's a better choice, or both should be used.
    • There *are* problems with archive.org getting information deleted, partly because it is based in the USA and thus subject to DMCA takedown requests and other types of governmental interference (it is working on a Canadian disaster-recovery site, but that is NOT anywhere near operational). There are also technological limitations with respect to certain filetypes and pagetypes. Archive.is handles interactive JavaScript-heavy pages the best, and already attempted to help Wikipedia a few years back, but was turned down for sociological reasons (worry that it would disappear being one of the main ones, if memory serves). Webcitation.org has some advantages in a few corner cases.
  2. Making sure that the IA URL being added is not a 404 itself (see the previous bullet).
  3. Making sure the bot works on all language Wikipedias, as well as Wikidata and Commons.

In the above list, #2 is possibly the lowest-hanging fruit (a sketch of such a check follows below); it would need some changes to IABot as it currently runs.
#1 is not very feasible, because archive.is does not automatically archive all of the outgoing links from Wikipedia the way IA does. Maybe we could add an additional link to archive.is if the link from IA is dead; in that case it's related to #2.
#3 is in progress. Wikidata and Commons will come with their own set of challenges, since the bot is so far designed to work only on Wikipedias. I'm not sure what changes will be needed, but this is a work in progress, so we can leave it be and come back to it once the bot is running on all (or most) Wikipedias.
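
For #2, the check itself is straightforward. A sketch follows; the archive.today /newest/<url> pattern used as a fallback is an assumption based on its public convenience redirect, not a documented API.

```
# Sketch: verify that a Wayback snapshot URL actually resolves before
# writing it into a citation, falling back to archive.today if it doesn't.
import requests

def is_alive(url):
    """True if the URL answers with a non-error status."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=30)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def usable_archive_url(wayback_url, original_url):
    """Prefer the Wayback snapshot; otherwise try archive.today's newest copy."""
    if is_alive(wayback_url):
        return wayback_url
    fallback = "https://archive.ph/newest/" + original_url  # assumed pattern
    return fallback if is_alive(fallback) else None
```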

@Niharika: Thanks for the thorough analysis! It seems the best way forward is:

  • Start a discussion on English Wikipedia about enabling your proposed solution #1 (allowing IABot to add archive URLs for non-dead links).
  • If solution #1 is adopted, talk to the Internet Archive about adding proposed solution #2 (getting the Internet Archive to trigger IABot whenever a new external link URL is added to the Internet Archive).

As far as the other things brought up in the discussion, they all involve doing additional development of IABot, so I'll defer to @Cyberpower678 on those.

Note that access-date must not be added by a bot, quoting https://en.wikipedia.org/wiki/Template:Cite_web :

Full date when the content pointed to by url was last verified to support the text in the article

That means setting the access-date is an editorial decision.