
InternetArchiveBot adds links to archives that have been excluded from the Wayback Machine
Closed, Declined · Public

Description

The bot added this archive link (only when told to add archives to live links, of course):
https://web.archive.org/web/20210317192722/https://www.azcentral.com/story/news/local/arizona-breaking/2021/03/16/jake-angeli-jacob-chansley-video-shows-storming-capitol/4716926001/


Subsequent report:

Feature summary (what you would like to be able to do):
IABot should avoid adding URLs for sites that have excluded themselves from the Wayback Machine.

Steps to reproduce (a list of clear steps to create the situation that made you report this, including full links if applicable):

  • Run the bot on any Wikipedia page with a reference to a site that has excluded itself from the Wayback Machine (Snopes.com is one)
  • Alternatively, look at this edit
  • Click the archived link to verify that it only shows "Sorry. This URL has been excluded from the Wayback Machine."

Use case(s) (describe the actual underlying problem which you want to solve, and not only a solution):
Prevent adding useless, empty archive links to references.
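
For concreteness, here is a minimal sketch (in Python, not IABot's actual code) of the check the steps above perform by hand: ask the public Wayback Machine Availability API whether a snapshot is currently being served before adding it. That excluded URLs come back with an empty archived_snapshots object is an assumption that would need verifying against the API.

```
"""Sketch: check whether the Wayback Machine currently serves a snapshot
for a live URL, via the Availability API (https://archive.org/wayback/available).
Assumption to verify: excluded URLs return an empty "archived_snapshots"."""
import json
import urllib.parse
import urllib.request


def playable_snapshot(url: str, timestamp: str | None = None) -> str | None:
    """Return the closest playable snapshot URL, or None if nothing is served."""
    query = {"url": url}
    if timestamp:
        query["timestamp"] = timestamp  # e.g. "20210317192722"
    api = "https://archive.org/wayback/available?" + urllib.parse.urlencode(query)
    with urllib.request.urlopen(api, timeout=30) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


if __name__ == "__main__":
    # The azcentral.com reference from the description above.
    live = ("https://www.azcentral.com/story/news/local/arizona-breaking/2021/"
            "03/16/jake-angeli-jacob-chansley-video-shows-storming-capitol/4716926001/")
    print(playable_snapshot(live, "20210317192722") or "no playable snapshot")
```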

Event Timeline

I do not know which of these are not yet on the blacklist, but they are all excluded on web.archive.org, and quite a few of them are still being added by the bot.

azcentral.com
snopes.com
desmoinesregister.com
nationalpost.com
spacenews.com
11alive.com
9news.com
argusleader.com
armytimes.com
citizen-times.com
clarionledger.com
courier-journal.com
defensenews.com
idahostatesman.com
kare11.com
psychologytoday.com
statesmanjournal.com
tennessean.com
wtsp.com
wkyc.com

I don't understand the "bug" you are reporting.

URLs may be archived even if their playback is "excluded" from the Wayback Machine.

Also... snopes.com is not excluded.

This is a hard problem, as the status can flip back and forth. As Mark Graham (Director of the Wayback Machine) noted above, an archive that shows as excluded is behind a curtain: it still exists in the Wayback Machine and could flip back to active in the future based on a policy decision. The reason these links are being added to the wiki anyway is that IABot keeps a separate cache database, and when it first detected each URL the archive was active. As a friend recently noted, one of the hardest things in computing is keeping caches accurate. IABot is designed to rely on its cache rather than query the Wayback Machine for every URL it encounters, which has pros and cons.
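
To make those pros and cons concrete, here is a hedged sketch of such a cache with a time-based recheck. None of these names come from IABot's actual schema; they only illustrate why a cached answer can drift out of sync with the Wayback Machine's current playback policy.

```
"""Sketch only: a cache that remembers the archive URL it saw first and
rechecks it at most once per interval. Not IABot's design or schema."""
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CachedArchive:
    archive_url: str      # snapshot URL as it was first discovered
    checked_at: float     # unix time of the last live visibility check
    visible: bool = True  # what the Wayback Machine reported at that time


@dataclass
class ArchiveCache:
    recheck_after: float = 30 * 86400  # seconds before a live recheck is allowed
    entries: dict[str, CachedArchive] = field(default_factory=dict)

    def get(self, live_url: str,
            live_check: Callable[[str], Optional[str]]) -> Optional[str]:
        """Return a cached archive URL for live_url, rechecking only stale entries.

        live_check maps a live URL to a playable snapshot URL or None
        (for instance the Availability API sketch earlier in this task)."""
        entry = self.entries.get(live_url)
        now = time.time()
        if entry is None or now - entry.checked_at > self.recheck_after:
            snapshot = live_check(live_url)
            entry = CachedArchive(
                # Keep the old URL even when playback is excluded: the archive
                # still exists behind the "curtain", only its visibility changed.
                archive_url=snapshot or (entry.archive_url if entry else ""),
                checked_at=now,
                visible=snapshot is not None,
            )
            self.entries[live_url] = entry
        # Between rechecks the cache can still answer "visible" for an archive
        # that has since been excluded -- the mismatch reported in this task.
        return entry.archive_url if entry.visible and entry.archive_url else None
```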

Harej renamed this task from "add blocked pages as archives" to "InternetArchiveBot adds links to archives that have been excluded from the Wayback Machine". Mar 29 2022, 12:13 AM
Harej subscribed.

There is no feasible solution to this problem. Our caches are built on the assumption that, so long as we are aware of an archive link, that archive link will always exist (this is generally how the Internet Archive operates). And indeed, even when an archive is hidden from public view, it still technically exists. To solve this problem feasibly at scale, we would need to regularly check the visibility status of each archive, since that status can change with policy as highlighted above. Even if we kept a hot cache of known "invisible" archives, it would create additional operational strain: the bot would have to go through and remove the archives that are no longer visible, and then re-add them if a subsequent policy decision restores access.
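
A hedged sketch of the kind of revalidation sweep described (and rejected as infeasible at scale) above: walk every cached archive, re-query its visibility, and emit an edit whenever the status flips in either direction. The recheck_visibility callable and the edit actions are hypothetical stand-ins, not IABot internals.

```
"""Sketch only: the revalidation workload implied above, not IABot code."""
from typing import Callable, Iterator


def revalidation_sweep(
    cache: dict[str, bool],                     # archive URL -> last known visibility
    recheck_visibility: Callable[[str], bool],  # live query against the Wayback Machine
) -> Iterator[tuple[str, str]]:
    """Yield ("remove" | "re-add", archive_url) for every archive whose status changed."""
    for archive_url, was_visible in cache.items():
        now_visible = recheck_visibility(archive_url)
        if was_visible and not now_visible:
            yield ("remove", archive_url)  # archive went behind the "curtain"
        elif not was_visible and now_visible:
            yield ("re-add", archive_url)  # policy flipped back; restore the link
        cache[archive_url] = now_visible   # values may change during iteration; keys do not
```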

In the future we plan on supporting multiple archive providers per underlying URL. This would allow editors to select an alternative archive provider if the archive.org archive is somehow deficient.
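
As an illustration only (no committed design), a structure like the following could hold several snapshots per live URL and let the first currently playable one win; the provider names and the is_playable check are assumptions, not planned IABot features.

```
"""Sketch only: one possible shape for multi-provider archive records."""
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snapshot:
    provider: str      # e.g. "archive.org", "archive.today" (illustrative)
    archive_url: str


def best_snapshot(
    snapshots: list[Snapshot],
    is_playable: Callable[[str], bool],
) -> Optional[Snapshot]:
    """Pick the first snapshot whose provider currently serves the page."""
    for snap in snapshots:
        if is_playable(snap.archive_url):
            return snap
    return None
```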