
InternetArchiveBot incorrectly replacing archive URL
Closed, Resolved · Public

Description

The Pandora Archive is a web-archiving service run by the National Library of Australia that contains some URLs that are not on the Wayback Machine. It is used in thousands of articles on Wikipedia:
https://en.wikipedia.org/w/index.php?title=Special:LinkSearch/pandora.nla.gov.au&limit=500&offset=10000&target=http%3A%2F%2Fpandora.nla.gov.au

Internet Archive Bot replaced the Pandora link with a dead link in this diff:
https://en.wikipedia.org/w/index.php?title=Kieran_Modra&diff=771512151&oldid=761483969

I've disabled the bot to stop it from doing something like this in the future.

Event Timeline

Graham87 created this task. · Mar 22 2017, 2:39 AM
Restricted Application assigned this task to Cyberpower678. · Mar 22 2017, 2:39 AM
Restricted Application added a project: Internet-Archive.
Restricted Application added a subscriber: Aklapper.
Cyberpower678 triaged this task as High priority. · Mar 22 2017, 2:48 AM

This will take some doing. The newest update force-validates archive URLs to make sure they are actually archive URLs. As I went around Wikipedia researching ways to improve the reliability of IABot, I saw too many cases where the archiveurl parameter was being misused. The bot recognizes numerous archiving services; I'll work on adding the ones I missed in the first round of development.

On a side note, archive URLs get ignored if the bot doesn't see a need to archive the original URL.
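
The validation described above can be sketched roughly as follows. This is only an illustration, not IABot's actual code: the function name `is_known_archive_url` and the host list `KNOWN_ARCHIVE_HOSTS` are hypothetical, and the list is deliberately short. The point is that a host missing from the list (as pandora.nla.gov.au apparently was) makes a valid archive URL look invalid.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of archive hosts. The bot's real list of
# recognized services is not shown in this task.
KNOWN_ARCHIVE_HOSTS = {
    "web.archive.org",
    "pandora.nla.gov.au",
    "www.webcitation.org",
    "archive.is",
}


def is_known_archive_url(url: str) -> bool:
    """Return True if the URL's host belongs to a recognized archive service."""
    host = (urlsplit(url).hostname or "").lower()
    return host in KNOWN_ARCHIVE_HOSTS
```

A host-only check like this says nothing about whether the path is a well-formed snapshot URL; a real validator would also need per-service path patterns.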

Cool. Could you also check to see if the bot has accidentally overwritten any other archive pages in this way? I've checked other pages about Australian Paralympians, where I've used the Pandora Archive often, but can't find any more examples.

Krinkle added a subscriber: Krinkle. · Edited · Mar 22 2017, 3:50 AM

Internet Archive Bot replaced the Pandora link with a dead link in this diff:
https://en.wikipedia.org/w/index.php?title=Kieran_Modra&diff=771512151&oldid=761483969

Not exactly a dead link.

Both of these links "work" and contain an archived copy of what was visible at URL http://www.ausport.gov.au/olym96/paracycl.html at some point in time.

The problem is that at some point between 2000 and 2006, the original page ceased to exist and started to respond with a "Page Not Found" page. Archive.org did crawl this URL in 2006 and did archive it; however, what it archived was a copy of the "Page Not Found" page.

http://www.ausport.gov.au/olym96/paracycl.html
"Page Not Found" as it looks today (2017)

To prevent this, the Wikipedia bot would need to verify not only that Archive.org has a copy, but also that the copy is not a "Page Not Found" page. The bot can use the HTTP status code recorded for the capture to verify this (without needing to inspect the page itself).

However, this won't work for this particular example, because ausport.gov.au did not have its server configured correctly back in 2006. It was serving the "Page Not Found" page with a 200 OK (Success) status code instead of 404. This is a common server misconfiguration: the server does return a page, but that page is only a placeholder meaning that there is no page. The current version of ausport.gov.au has fixed this and marks the Page Not Found page as a real 404, so the bot correctly will not try to use the newer copies.
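
Since the status code alone cannot catch a case like this, the only remaining option is to look at the page text. Below is a minimal sketch of such a "soft 404" heuristic. The function name `looks_like_soft_404`, the phrases in the pattern, and the 2000-character cutoff are all assumptions made for illustration; nothing here is taken from the bot, and a phrase match like this can produce false positives.

```python
import re

# Hypothetical phrases that often appear on error pages served with 200 OK.
NOT_FOUND_PATTERN = re.compile(r"page\s+not\s+found|404\s+error", re.IGNORECASE)


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Return True when a response claims success but reads like an error page."""
    if status_code != 200:
        # A real error status is already caught by the plain status check.
        return False
    # Only inspect the start of the document, where titles and headings usually sit.
    return bool(NOT_FOUND_PATTERN.search(body[:2000]))
```

For example, `looks_like_soft_404(200, "<title>Page Not Found</title>")` flags the 2006 capture described above, while a genuine 404 response is left to the normal status check.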


Cool. Could you also check to see if the bot has accidentally overwritten any other archive pages in this way? I've checked other pages about Australian Paralympians, where I've used the Pandora Archive often, but can't find any more examples.

I couldn't find any other instances. You switched the bot off soon after it was turned back on.

The bot already checks the status code, and actually instructs the Wayback Machine to deliver only snapshots with a 200, 203, or 206 status. The fact that the snapshot was a 404 page means the site wasn't set up properly at that time.
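
For readers unfamiliar with status filtering, here is one way such a request could look. This sketch builds a query for the Wayback Machine's CDX server using its `filter=statuscode:<regex>` parameter; whether IABot uses this endpoint or a different one is not stated in this task, so treat the endpoint choice and the helper name `build_cdx_query` as assumptions. No network request is made here.

```python
from urllib.parse import urlencode

# Wayback Machine CDX server endpoint (assumed here; the bot may use another API).
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(original_url: str, statuses=(200, 203, 206)) -> str:
    """Build a CDX query URL that only lists captures with the given statuses."""
    # The filter value is a regular expression matched against the status field.
    status_regex = "|".join(str(status) for status in statuses)
    params = {
        "url": original_url,
        "output": "json",
        "filter": f"statuscode:{status_regex}",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

As discussed above, a filter like this cannot exclude the 2006 capture, because that capture was recorded with a 200 status.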

Wow, NLA certainly has a bunch of different formats for the same snapshot.

Cyberpower678 closed this task as Resolved. · Mar 22 2017, 8:32 PM

Updated the archive validation subroutines.