
[3.7] Instant caching of linked resources via Internet Archive
Closed, ResolvedPublic

Description

Output
We will deepen a partnership with the Internet Archive to facilitate the immediate and widespread caching of resources linked from Wikimedia projects, and to prioritize external efforts to digitize sources cited in Wikimedia projects.

Target
Compared to last year, archive links from all Wikipedia language editions.

Reduce the time between link creation and link archiving to less than 1 minute.

Increase by 50% the volume of links recovered in Wikipedia outside of the English language edition.

Dependencies
T199189

Primary team: Community Programs

Event Timeline

Restricted Application added a subscriber: Aklapper.
DarTar triaged this task as Medium priority. Jul 12 2018, 3:30 PM

I'm happy that the WMF has decided to do this, since it will have benefits beyond the Wikimedia community. However, given that the Internet Archive does have its limitations, I think there are issues with only having the targets mentioned in the task description. (One minute is probably unnecessary, given that the IA interface itself merges "snapshots" where a page was archived more than twice within an hour.) My post in T199701 (edited and expanded slightly) is below.


Some websites, like YouTube, OpenStreetMap and Google Maps, can't be archived using the Internet Archive because of their use of JavaScript or other technology. Others, like large parts of the Hong Kong government's website, refuse connections from the Internet Archive's Save Page Now tool, whether deliberately or not; others can't be archived because of their robots.txt; and still others, like the New York Times, deliberately show an error message even after the content has loaded, to prevent their paywall from being circumvented.

Increasing the frequency of the Internet Archive's IRC bot (T199193) is not as helpful as it could be if the bot is still unable to archive a lot of pages. It would be really, really nice if the process of archiving links (regardless of the entity running it) had fallback methods for archive attempts that don't work. (I know the Internet Archive does have a YouTube collection, but it's fairly difficult to actually get content into it,* it seems to be only for copyleft/public-domain videos, and I don't think the bot is capable of adding to it.)
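To illustrate the fallback idea, here is a minimal Python sketch of what such a chain might look like; the secondary service endpoint and the success check are placeholders (not real APIs), and the Save Page Now /save/ URL format is the one discussed further down this thread.

```python
"""Illustrative sketch of a fallback chain for archiving a single URL.

The second entry and the success check are placeholders, not real service APIs.
"""
from typing import Optional

import requests

ARCHIVERS = [
    ("Wayback Machine", "https://web.archive.org/save/{url}"),
    # Hypothetical stand-in for any secondary archiving service:
    ("fallback service", "https://example-archiver.invalid/save?url={url}"),
]

def archive_with_fallback(url: str, timeout: int = 60) -> Optional[str]:
    """Try each archiver in order; return the name of the first that appears to succeed."""
    for name, endpoint in ARCHIVERS:
        try:
            resp = requests.get(endpoint.format(url=url), timeout=timeout)
            if resp.ok:  # crude check; a real tool would inspect the response body
                return name
        except requests.RequestException:
            continue  # refused connection, timeout, etc. -> try the next service
    return None  # everything failed; flag the link for manual or crawler follow-up
```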

w:en:Help:Archiving a source lists five web archiving services, including the Internet Archive. All five of them have drawbacks which prevent them from being used to archive everything.

  • Save Page Now of IA's Wayback Machine is great, but has the aforementioned drawbacks which limit it to a minor but significant degree (YouTube and Facebook are two of the four websites more popular than Wikipedia, but neither can be archived through Save Page Now).
  • Archive.is has a limit of about 30 pages per minute and actively tries to prevent mass archival. I know this because they've (probably) blocked me for trying to archive too many pages (don't worry about it; it only happened after tens of thousands of pages). However, it has several fairly good fallback mechanisms for archiving pages, including the Google cache and some other websites. It's also permanently logged in to a Facebook account, and is the only one of the five to have a .onion address on the Tor network.
  • Webcitation.org is fairly bad at archiving CSS, in my experience, and also fails on some Hong Kong government websites. It doesn't really seem to have any benefits over web.archive.org.
  • Perma.cc is only used in 62 English Wikipedia articles as of this writing. I don't know much about it, but it has a limit of 10 pages per month per account. @Green_Cardamom might know more.
  • Webrecorder is only used in 16 articles as of this writing. It's a nonprofit which was created this year, and as far as I'm aware is the only archiving service of the five which can save YouTube videos* (try loading the comments – you'll get to exactly where I stopped scrolling), OpenStreetMap's "slippy map",** and other interactive/dynamic content. It also has virtual machines, presumably for viewing outdated HTML. However, it would be more difficult to fix dead links with it because all archive links contain the name of the account which was used to create them (i.e. URLs can't automatically redirect like they do at web.archive.org). It would also probably be fairly difficult to crawl sites without support from the site owners.

Having a ninety-something percent archival success rate is still not really ideal when it's out of millions of links per year, many of which will be gone within a few months of their addition to articles. The main reason I'm posting this here is that there is currently a Village pump (idea lab) thread on the English Wikipedia proposing that the WMF run its own archiving service. While I personally think this is somewhat unnecessary and would be an inappropriate use of WMF resources, especially when there are other non-profits already dedicated to web archiving, the current options are not satisfactory for archiving everything that should be archived.

(* – The YouTube archival on Webrecorder's website doesn't fully work: YouTube's links generate some sort of error, and not all of the video quality settings can be chosen. For YouTube specifically, the IA has custom software that displays separately downloaded YouTube videos within captures, but usually videos are not saved along with the rest of the page, and many videos are not saved at all, because downloading and re-uploading a YouTube video is much more difficult than archiving a normal static page.)
(** – Unfortunately, OSM resets the link to the default location because I didn't set a marker; after the page loads, paste the link into the address bar again and London should show up.)

|                                | archive.org | archive.is | webrecorder.io | webcitation.org | perma.cc |
| YouTube                        | sort of     | no         | yes            | no              | no       |
| Facebook                       | yes         | yes        | yes            | no              | no       |
| maps                           | no          | not really | yes            | no              | no       |
| avoiding some user agent blocks| yes         | yes        | yes            | no              | no       |
| avoiding soft paywalls         | no          | yes        | yes            | no              | no       |
| avoiding some logins           | yes         | no         | yes            | no              | no       |
| API of some sort               | yes         | yes, for WP| no             | not really      | no       |
| usage limits                   | N/A         | ?          | 50 GB/acct     | ?               | 10 pg/mo/acct |
| owner                          | IA          | ???        | other 501(c)   | this person     | other 501(c) |
  • YouTube is important for both culture-related Wikipedia articles and for writing about more recent events.
  • Facebook is useful for some primary sources, although it's not really necessary.
  • Google Maps is used in more than 60,000 English Wikipedia articles, although this is not necessarily a good thing.
  • I recently had to fix some dead links which were "killed" because the Hong Kong government's website structure was modified. Save Page Now does not work on the government's website, but the Internet Archive's internal crawlers seem to be able to archive its pages. The various departments also might not archive their own information, and in general, small websites like those made by contractors for infrastructure projects tend to go dark soon after the projects are finished and are very rarely found by clicking links on other websites (and thus are unlikely to be archived). Being linked from the government's website might make them more likely to be archived if the Wayback Machine could find the links, but I don't know how the IA's internal crawlers work.
  • While it's not necessarily a good thing to get past paywalls, logins and other software limits, content that isn't preserved in some other database may be difficult to recover in the future, and it's always useful to have some sort of verification for information (even if the only things that will ever look at it again are some future AI systems).
  • Obviously, it would be difficult and a waste of time and resources to have actual humans manually archive every link blocked by a robots.txt.

Update, 20 April 2019: Edited table to reflect capabilities of beta Save Page Now.

Webrecorder.io uses open-source technology (https://github.com/webrecorder/pywb). The webrecorder.io website is sort of a showcase for the technology and for small amounts of archiving. It would be possible for this technology to be used in-house at Wikipedia. The British Library's web archive is converting to it, and I imagine others are looking as well. It can do things no one else can that modern websites require.
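As a rough illustration of what the capture side of an in-house setup could look like, here is a minimal sketch using warcio, another open-source library from the Webrecorder project; pywb would handle replay of the resulting WARC file. The filename and URL below are placeholders.

```python
"""Minimal sketch: record one HTTP request/response into a WARC file with warcio.

pywb can then index and replay the resulting WARC. Filename and URL are placeholders.
"""
from warcio.capture_http import capture_http
import requests  # per warcio's docs, import after capture_http so its traffic is captured

# Record the request and response for one cited URL into a WARC file.
with capture_http('cited-sources.warc.gz'):
    requests.get('https://example.com/cited-source')
```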

I think the things the website can't handle right now are mostly things like newer video codecs (which would explain the YouTube limitations), DRM (which we probably don't need to care about), and the thing where the YouTube JavaScript changes the URL in the address bar. I haven't tested the software that's on GitHub.

Another thing that would be nice would be the ability to regularly archive some URL or some URL format, like "archive [these charts for every country available from two days ago] on spotifycharts.com, every day" or "archive [these two rotating lists of the top fifty thousand websites] on alexa.com every day between 1800 UTC and 1600 UTC the day after". (I have spent a lot of time trying to do both of these and more on my own computer, and I was recently asked by someone who knows better than I do to stop, to keep my computer from dying young.) A lot of data sets, in bulk, could be useful for Wikidata at some point, even if that time isn't now; but it's no good waiting until the data is out of copyright (unless the website allows earlier use, which is quite unlikely), because Alexa wipes its data every day and Spotify doesn't publish charts from more than about two years ago.
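For illustration, a rough sketch of what such a recurring job could look like, assuming the original Save Page Now endpoint; the chart URL pattern is made up, and a real deployment would use a proper scheduler (cron, systemd timers, a job queue) rather than a sleep loop.

```python
"""Rough sketch of a recurring archiving job of the kind described above (illustrative only)."""
import time
from datetime import date, timedelta

import requests

SAVE_PAGE_NOW = "https://web.archive.org/save/"

# Hypothetical daily target: a per-date charts page, lagging two days behind.
DAILY_TEMPLATES = [
    "https://example-charts.invalid/daily/{d}",
]

def run_once() -> None:
    d = (date.today() - timedelta(days=2)).isoformat()
    for template in DAILY_TEMPLATES:
        try:
            requests.get(SAVE_PAGE_NOW + template.format(d=d), timeout=120)
        except requests.RequestException:
            pass  # a real job would log the failure and retry later

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(24 * 60 * 60)  # once a day; replace with a real scheduler
```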

pywb on GitHub is the software package behind webrecorder.io -- like MediaWiki is the software behind Wikipedia.

(What I meant was that I hadn't downloaded and tested the software myself; the website might have some unrelated issues that caused those problems.)

Js - small correction: archive.is does have an API. Not sure what is meant by unlimited archival, but they will archive everything on Wikipedia, which seems unlimited for our purposes. I asked them about mass archival and they said no problem, but to set up a feed of links to them so they can better manage resources.

By "unlimited" I meant "the Internet Archive hasn't tried to stop me even though I've bot-archived a few million URLs".

Oh, I'd also note that Wikimedia Deutschland has in some capacity apparently already worked with Rhizome, the organization which runs Webrecorder, but for unrelated reasons:

The gathering [the WikibaseNYC conference] was made possible through a generous grant by the Sloan Foundation. It was hosted and sponsored by one of the early adopters of this technology, Rhizome (documented in our Many Faces of Wikibase blog series), and co-organized with Wikimedia Germany (Deutschland).

A few months ago, the Internet Archive released a "beta" Save Page Now, which has a larger feature set (including archiving all links on a page) and is capable of archiving content that the original Save Page Now either can't archive (e.g. Facebook, gov.hk) or refuses to archive (e.g. Snopes, although the archived content can't be viewed). I've updated the table to reflect this.

The beta Save Page Now does require JavaScript, but presumably a specialized caching feed would use the internal IA archiving software.

A clarificatory note: The "API" table row is somewhat misleading, since it doesn't really refer to actual APIs, but it essentially tries to reflect whether it would be possible to script archival without the website operators' explicit permission. Of course, for archival projects with Wikimedia backing, this would probably be a non-issue.

  • The Wayback Machine allows for fairly easy scripting, since the original Save Page Now is served as plain HTML (with the URL format https://web.archive.org/save/https://example.com). Both list-based and recursive archival can be performed even from the command line (I've archived a large amount of content with one-liners and lists of URLs; a minimal sketch follows this list). The new Save Page Now would be more difficult to script, since it would require both POST requests and browser JavaScript support.
  • archive.is/archive.today can prevent scripted archival by leaving requests stuck at the /submit page, but it is possible to script both list-based and recursive archival, although this is still limited compared to IA/Wayback.
  • Webrecorder's software is open-source and would allow for in-house archival, although it is also possible to script archival by accessing URLs in the format https://webrecorder.io/record/https://example.com through a modern JavaScript-supporting browser. The latter approach is, again, more limited than IA/Wayback, and the website structure would make it difficult for others to find the archived content.
  • WebCite could probably be scripted, but I would avoid doing so because the site just doesn't work very well.
  • Perma.cc has such a small per-user limit that scripting archival would be almost pointless.
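As referenced in the first bullet above, here is a minimal sketch of list-based archival through the original Save Page Now's /save/ URL format; the input filename is arbitrary, and a real script should respect rate limits and check responses for errors.

```python
"""Minimal sketch of list-based archival via the original Save Page Now /save/ URL format."""
import time

import requests

def archive_url_list(path: str, delay: float = 10.0) -> None:
    """Submit each URL in a text file (one URL per line) to Save Page Now."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        try:
            requests.get("https://web.archive.org/save/" + url, timeout=120)
        except requests.RequestException:
            pass  # skip failures; a real script would log and retry
        time.sleep(delay)  # stay polite to the service

# Example: archive_url_list("dead-link-candidates.txt")
```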

All: what is left to be done on this task? (I'm removing Research for now and will add it back if we define goals for the next 3 months.)

Samwalton9-WMF claimed this task.

As far as I know this is operational.