
InternetArchiveBot adds unavailable links (robots.txt restricted?)
Closed, Declined · Public

Description

https://en.wikipedia.org/w/index.php?title=Eversource_Energy&diff=756943309

This is an example of a problematic edit by InternetArchiveBot. It adds five archive.org links for references that were previously marked as dead links, but all five archive links are nonfunctional: they lead to "Page cannot be displayed due to robots.txt."
The bot should not be adding archive links in this case.
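For illustration only, here is a minimal Python sketch (not IABot's actual code) of the kind of pre-insertion check that would have caught these snapshots. It assumes the public Wayback availability API at https://archive.org/wayback/available and uses the error text quoted above; the exact wording of the block page may vary.

import requests

WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"
ROBOTS_MESSAGE = "Page cannot be displayed due to robots.txt"

def snapshot_is_usable(original_url, timestamp=""):
    """Return True only if the Wayback Machine reports a snapshot for
    original_url and that snapshot is not robots.txt-blocked."""
    # Ask the availability API for the closest snapshot to the timestamp.
    resp = requests.get(
        WAYBACK_AVAILABILITY_API,
        params={"url": original_url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return False

    # Fetch the snapshot itself and reject robots.txt-blocked pages.
    snap = requests.get(closest["url"], timeout=30)
    if snap.status_code != 200 or ROBOTS_MESSAGE in snap.text:
        return False
    return True

# Example: the first reference from the diff above.
print(snapshot_is_usable("http://www.nu.com/investors/corporate_gov/default.asp",
                         "20071030022411"))

A check like this at edit time would keep blocked snapshots out of articles in the first place.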

Furthermore, the bot messes with the links in a way that removes information from the reader.
In this case the prior links were bare links, e.g.:

<ref>http://www.nu.com/investors/corporate_gov/default.asp {{Dead link|date=December 2016}}</ref>

and they were changed to:

<ref>{{cite web|url=http://www.nu.com/investors/corporate_gov/default.asp |title=Archived copy |accessdate=2007-09-19 |deadurl=yes |archiveurl=https://web.archive.org/web/20071030022411/https://www.nu.com/investors/corporate_gov/default.asp |archivedate=2007-10-30 |df= }}</ref>

That means the formatted display has changed from:

http://www.nu.com/investors/corporate_gov/default.asp[dead link]

to:

"Archived copy". Archived from the original on 2007-10-30. Retrieved 2007-09-19.

This is manifestly less useful. Changing the link anchor from the URL itself (which conveys some information: that this is a page about corporate governance) to "Archived copy", which conveys no information at all, is a step backwards. I recognize that bare URLs without titles are not best practice on Wikipedia, but since they are allowed, IAB should not make them worse.

It should especially not do both of these things together.
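For what it's worth, keeping the original URL visible when no real page title is known would avoid that loss of information. A hypothetical sketch of such a conversion follows; this is only a suggestion, not how the bot currently works, and the template parameters simply mirror the diff quoted above.

def build_archived_ref(original_url, archive_url, archive_date,
                       access_date=None, title=None):
    """Build a {{cite web}} ref that never hides the original URL behind
    a generic "Archived copy" label."""
    display_title = title or original_url  # fall back to the URL itself
    parts = [
        "url=" + original_url,
        "title=" + display_title,
        "archiveurl=" + archive_url,
        "archivedate=" + archive_date,
        "deadurl=yes",
    ]
    if access_date:
        parts.append("accessdate=" + access_date)
    return "<ref>{{cite web |" + " |".join(parts) + " }}</ref>"

# The reference from the example above, with the URL kept as the visible title.
print(build_archived_ref(
    "http://www.nu.com/investors/corporate_gov/default.asp",
    "https://web.archive.org/web/20071030022411/https://www.nu.com/investors/corporate_gov/default.asp",
    "2007-10-30",
    access_date="2007-09-19",
))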

Thanks.

Event Timeline

Restricted Application added a project: Internet-Archive.
Restricted Application added subscribers: Cyberpower678, Aklapper.

You can use https://tools.wmflabs.org/iabot/index.php?page=manageurlsingle to alter archive data for URLs by searching for the original URL and removing or changing its associated archive snapshot. As for converting external links to cite templates, you should get a consensus to change the way the bot handles these. My current impression is that it's a welcome feature of the bot.

You're saying it is "correct" for the bot to replace a dead link with a link it claims is live but isn't really? That can't possibly be right!

No, I'm saying you can fix those with the link I provided above.

I should not have to do manual work to clean up after a bot. That means the bot is broken. It has made the page worse than it was without the bot. That's a bug.

Archive snapshots can go down too; sometimes the Wayback API delivers a snapshot that no longer works. The bot does check snapshots when it first retrieves them, but if they die in the interim, you can't really blame the bot for that. The best thing to do is to help the bot by telling it the snapshots are bad, so it stops using them. If you don't want to do that, then just revert the bot, but I can't single-handedly handle every bad archive report I get, which is why that tool exists.

If it makes you feel any better, I am planning DB maintenance for very soon, which will do some major cleanup on the DB and filter out bad archive snapshots.
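(For illustration, a rough sketch of what such a filtering pass could look like, assuming a hypothetical SQLite table archive_urls(archive_url, valid) and the same robots.txt error text as above; IABot's real schema and maintenance job are not described in this task and will differ.)

import sqlite3
import requests

ROBOTS_MESSAGE = "Page cannot be displayed due to robots.txt"

def filter_bad_snapshots(db_path):
    """Mark stored snapshots invalid when they no longer resolve cleanly."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT archive_url FROM archive_urls WHERE valid = 1"
    ).fetchall()
    flagged = 0
    for (archive_url,) in rows:
        try:
            resp = requests.get(archive_url, timeout=30)
            bad = resp.status_code != 200 or ROBOTS_MESSAGE in resp.text
        except requests.RequestException:
            bad = True
        if bad:
            conn.execute(
                "UPDATE archive_urls SET valid = 0 WHERE archive_url = ?",
                (archive_url,),
            )
            flagged += 1
    conn.commit()
    conn.close()
    return flagged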

Cirdan added a subscriber: Cirdan.

T16720 is about MediaWiki/Wikimedia robots.txt configuration issues.