
The Internet Archive can't archive everything
Closed, Duplicate · Public

Description

Some websites, like YouTube, OpenStreetMap and Google Maps, can't be archived using the Internet Archive because of their use of JavaScript or other technology. Others, like large parts of the Hong Kong government's website, actively or passively refuse connections from the Internet Archive's Save Page Now tool; others can't be archived because of their robots.txt; and still others, like the New York Times, deliberately show an error message even after the content has loaded, so that their paywall can't be circumvented.
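To make the robots.txt case concrete: a crawler can check in advance whether a site's robots.txt shuts it out, using only Python's standard library. This is a minimal sketch; the `ia_archiver` user-agent string (historically associated with the Internet Archive's crawler) and the sample robots.txt are illustrative assumptions, not taken from any particular site.

```python
from urllib.robotparser import RobotFileParser

def allowed_to_archive(page_url: str, robots_txt: str,
                       user_agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)

# Hypothetical example: a site that excludes the archiver entirely.
ROBOTS = """\
User-agent: ia_archiver
Disallow: /
"""
```

With that robots.txt, `allowed_to_archive("https://example.com/page", ROBOTS)` returns `False` for the archiver but `True` for other user agents, which is exactly the situation where a snapshot attempt would be refused.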

Increasing the frequency of the Internet Archive's IRC bot (T199193) is less helpful if the bot is still unable to archive a lot of pages. It would be really, really nice if the process of archiving links (regardless of the entity running it) could have fallback methods for archive attempts that don't work. (I know the Internet Archive does have a YouTube collection, but it's fairly difficult to actually get content into it, it seems to be only for copyleft/public domain videos, and I don't think the bot is capable of adding to it.)
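The fallback idea can be sketched as a simple chain: try each archiving backend in order and return the first successful snapshot URL. The `archive_with_fallback` helper and the backend interface below are hypothetical; real integrations would call each service's own API (for example, Save Page Now at `https://web.archive.org/save/<url>`).

```python
from typing import Callable, Optional

# A backend takes a page URL and returns the snapshot URL,
# or None if that service could not archive the page.
Backend = Callable[[str], Optional[str]]

def archive_with_fallback(url: str, backends: list) -> Optional[str]:
    """Try each archiving backend in order; return the first snapshot URL, else None."""
    for backend in backends:
        try:
            snapshot = backend(url)
        except Exception:
            # e.g. a refused connection, as with some Hong Kong government hosts
            continue
        if snapshot is not None:
            return snapshot
    return None  # every service failed; the link may need manual archiving
```

The point of the chain is that a failure mode specific to one service (robots.txt exclusion, paywall detection, rate limiting) doesn't have to mean the link goes unarchived everywhere.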

w:en:Help:Archiving a source lists five web archiving services, including the Internet Archive. All five of them have drawbacks which prevent them from being used to archive everything.

  • Save Page Now of IA's Wayback Machine is great, but has the aforementioned drawbacks which limit it to a minor but significant degree (YouTube and Facebook are two of the four websites more popular than Wikipedia, but neither can be archived through Save Page Now).
  • Archive.is has a limit of about 30 pages per minute, and actively tries to prevent mass archival. I know this first-hand: they appear to have blocked my ISP after I tried to archive too many pages (don't worry about it; it only happened after tens of thousands of pages). However, it has several fairly good fallback mechanisms to archive pages, including the Google cache and some other websites. It's also permanently logged in to a Facebook account, and is the only one of the five to have a .onion address on the Tor network.
  • Webcitation.org is fairly bad at archiving CSS, in my experience, and also fails on some Hong Kong government websites. It doesn't really seem to have any benefits over web.archive.org.
  • Perma.cc is only used in 62 English Wikipedia articles as of this writing. I don't know much about it but it has a limit of 10 pages per month per account. Perhaps @Green_Cardamom might know more.
  • Webrecorder is only used in 16 articles as of this writing. It's a nonprofit which was created this year, and as far as I'm aware is the only archiving service of the five which can save YouTube videos* (try loading the comments – you'll get to exactly where I stopped scrolling), OpenStreetMap's "slippy map", and other interactive/dynamic content. It also has virtual machines, presumably for viewing outdated HTML. However, it would be more difficult to fix dead links with it because all archive links contain the name of the account which was used to create them (i.e. URLs can't automatically redirect like they do at web.archive.org). It would also probably be fairly difficult to crawl sites without support from the site owners.

Having a ninety-something percent archival success rate is still not really ideal if it's out of millions of links per year, many of which will be gone within a few months of their addition to articles. The main reason I'm posting this here is that there is currently a Village pump (idea lab) thread on the English Wikipedia proposing that the WMF runs their own archiving service. While I personally think this is somewhat unnecessary and would be an inappropriate use of WMF resources, especially when there are other non-profits which are already dedicated to web archiving, the current options are not satisfactory for archiving everything that should be archived.

(* The YouTube archival doesn't fully work: YouTube's links generate some sort of error, and not all of the video quality settings can be chosen.)

|  | archive.org | archive.is | webrecorder.io | webcitation.org | perma.cc |
| --- | --- | --- | --- | --- | --- |
| YouTube | sort of | no | yes | no | no |
| Facebook | no | yes | yes | no | no |
| maps | no | not really | yes | no | no |
| avoiding paywalls and logins | no | sometimes | yes? | no | no |
| API of some sort | yes | not really | no | not really | no |
| unlimited archival | yes | no | no | maybe | no |

Event Timeline

Jc86035 added subscribers: RoySmith, Izno, Beetstra.
Jc86035 added a subscriber: Samwalton9.
Jc86035 updated the task description.
Jc86035 updated the task description.

(I see an ongoing discussion on one Wikimedia wiki, so I hope the discussion will remain centralized there and not also take place in this Phab task in parallel.)

https://meta.wikimedia.org/wiki/Mission comes to my mind.

@Aklapper Surely if there is a Knowledge Integrity WMF project, this would be somewhat relevant to it? (I don't think a few enwiki users like me are suddenly going to come up with a working software solution, though I agree the discussion should stay there for now.)

This is some great information @Jc86035, but it might make more sense as a comment on a task like T199193, since this seems more like a discussion point than an actionable task.

Should I merge the task and copy it over as a comment?

> Should I merge the task and copy it over as a comment?

That sounds good :)