
Double {{webarchive}}
Closed, Declined | Public

Description

In some situations IABot adds two duplicate {{webarchive}} templates at the same time:

https://en.wikipedia.org/w/index.php?title=Little_Scioto_River_%28Scioto_River%29&type=revision&diff=781996058&oldid=738891165


In other situations (more common) it adds a {{webarchive}} where one already exists:

https://en.wikipedia.org/w/index.php?title=Apollo_Global_Management&type=revision&diff=782230258&oldid=778748506

This is because the first {{webarchive}} was added by Cyberbot II when its edit policy was to add it to the end of the ref:

https://en.wikipedia.org/w/index.php?title=Apollo_Global_Management&type=revision&diff=706435401&oldid=705307144

IABot's policy changed to adding it right after the link, so now it adds a double. Can IABot search the ref for a duplicate archive URL before adding a new {{webarchive}}? Or better, add the new {{webarchive}} in the right place but delete the old one (this is what Medic does).

Event Timeline

I see a lot more complexity here. In the first edit, it sees two links, so it adds two archives, one for each link. In the second, it sees an archive after a different, unrelated link and binds those two together, so it adds that archive at the beginning. Handling this will start to add new layers of complexity to the bot. Do you suppose it's worth the cost and effort?

In the first diff, it added the same {{webarchive}} twice, even though the source URLs are different.

Medic has been fixing these for a long time; it is able to detect and remove duplicates. They are less common now because Medic has been cleaning them up, but they still occur. Most come from Cyberbot II's different edit policy, and all of those will get fixed eventually if Medic keeps following behind IABot: IABot adds a duplicate next to the source URL, and Medic detects the duplicate and removes the one at the end of the ref. Sort of a round-robin.
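As a rough illustration of the cleanup described above (not Medic's actual code), here is a minimal Python sketch that keeps the first {{webarchive}} for a given archive URL inside a ref and drops any later copy. The function names and the regular expression are hypothetical, and it assumes the template carries a `url=` parameter and contains no nested templates.

```python
import re

# Hypothetical pattern: a {{webarchive|...}} template plus any whitespace before it.
# Assumes no nested templates inside the parameter list.
WEBARCHIVE_RE = re.compile(r"\s*\{\{\s*webarchive\s*\|([^{}]*)\}\}", re.IGNORECASE)


def template_url(params):
    """Return the value of the url= parameter, or None if it is missing."""
    for param in params.split("|"):
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "url":
            return value.strip()
    return None


def drop_duplicate_webarchives(ref_text):
    """Keep the first {{webarchive}} per archive URL and remove later copies."""
    seen = set()

    def keep_first(match):
        url = template_url(match.group(1))
        if url is not None and url in seen:
            return ""  # later copy, e.g. the old one at the end of the ref
        if url is not None:
            seen.add(url)
        return match.group(0)

    return WEBARCHIVE_RE.sub(keep_first, ref_text)
```

Because the copy next to the source URL comes first in the ref, keeping the first occurrence has the effect of removing the one at the end.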

If we want to retire Medic eventually, IABot should take over this cleanup, because duplicates remain a problem: editors will add the {{webarchive}} somewhere else in the ref. Maybe an easy solution is just to check for a duplicate before adding {{webarchive}}.
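To make the "check before adding" idea concrete, here is a minimal Python sketch. It is not IABot's code (IABot is a separate codebase); the helper names are hypothetical, and it assumes the ref wikitext is available as a string, that the template uses `url=` and `date=` parameters, and that the source URL appears either bare or as the target of a bracketed external link.

```python
import re

# Hypothetical pattern; assumes no nested templates inside {{webarchive}}.
WEBARCHIVE_RE = re.compile(r"\{\{\s*webarchive\s*\|([^{}]*)\}\}", re.IGNORECASE)


def archive_urls(ref_text):
    """Return the url= value of every {{webarchive}} template in ref_text."""
    urls = []
    for match in WEBARCHIVE_RE.finditer(ref_text):
        for param in match.group(1).split("|"):
            name, sep, value = param.partition("=")
            if sep and name.strip().lower() == "url":
                urls.append(value.strip())
    return urls


def add_webarchive(ref_text, source_url, archive_url, date):
    """Insert {{webarchive}} right after the link unless the ref already has it."""
    if archive_url in archive_urls(ref_text):
        return ref_text  # already present somewhere in the ref: do nothing
    start = ref_text.find(source_url)
    if start == -1:
        return ref_text  # source link not found: leave the ref alone
    end = start + len(source_url)
    # For a bracketed link like [url Title], insert after the closing bracket.
    if start > 0 and ref_text[start - 1] == "[":
        close = ref_text.find("]", end)
        if close != -1:
            end = close + 1
    template = " {{webarchive|url=%s |date=%s}}" % (archive_url, date)
    return ref_text[:end] + template + ref_text[end:]
```

The check is an exact string match on the archive URL, so it would not catch an existing template that points at a different snapshot of the same page.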

Cyberpower678 moved this task from Inbox to v1.4 on the InternetArchiveBot board.
Cyberpower678 moved this task from Unsorted to Bugs on the InternetArchiveBot (v1.4) board.

Regarding the first diff, it's because the wrong archive was associated with the URL. I have now fixed it in the DB. The other issues are because of inconsistent archive URLs associated with URLs.