
IABot breaking functionality of links to pages within a PDF
Closed, Declined (Public)

Description

If an archive-url ends with “pdf#page=” followed by an integer, the bot should not remove “id_” from a Wayback archive-url, because doing so breaks the link to a specific page within the PDF. In fact, if the bot isn’t adding “id_” to archive-urls that end with “pdf#page=” and an integer, it is breaking the spirit, if not the letter, of WP:CITE, which says that “No editor is required to add page links [to links to Google Books citations], but if another editor adds them, they should not be removed without cause”.

The bot should not remove “id_” from archive-urls that contain “pdf#page=” because, without it, the Wayback Machine serves the archived PDF wrapped in the Wayback toolbar, and this mixed-content (HTML–PDF) hybrid prevents the “pdf#page=” part of the IRI from sending the user to the referenced page.
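To make the condition concrete, here is a minimal Python sketch of how the two properties described above could be checked. The function names and regular expressions are hypothetical illustrations, not IABot's actual code, and the pattern assumes the usual `web.archive.org/web/<timestamp>` URL layout.

```python
import re

# Hypothetical patterns for illustration; not taken from IABot.
# Assumes a Wayback URL of the form .../web/<digits>[id_]/<original URL>.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)(id_)?/")
PDF_PAGE_RE = re.compile(r"pdf#page=\d+$")


def is_wayback_pdf_page_link(archive_url: str) -> bool:
    """True if the URL is a Wayback capture ending in pdf#page=<integer>."""
    return bool(WAYBACK_RE.match(archive_url)) and bool(PDF_PAGE_RE.search(archive_url))


def has_id_modifier(archive_url: str) -> bool:
    """True if the Wayback timestamp is followed by the id_ modifier."""
    match = WAYBACK_RE.match(archive_url)
    return bool(match and match.group(2))
```

A link for which `is_wayback_pdf_page_link` is true but `has_id_modifier` is false is the case this report is about.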

The before and after of this diff (https://en.wikipedia.org/?diff=856918530) show the problem:
• archive-url before IABot, sending the user to the 225th page of the archived PDF: https://web.archive.org/web/20180825193304id_/energy.gov/sites/prod/files/2017/09/f36/EIS-0527_FEIS_CH11.pdf#page=225
• archive-url after IABot, sending the user to the 1st page of the archived PDF: https://web.archive.org/web/20180825193304/http://energy.gov/sites/prod/files/2017/09/f36/EIS-0527_FEIS_CH11.pdf#page=225

Note that the absence of the http:// before energy.gov in the old URL doesn’t change the link’s functionality and isn’t related to this bug (even if the old URL were https://web.archive.org/web/20180825193304id_/http://energy.gov/sites/prod/files/2017/09/f36/EIS-0527_FEIS_CH11.pdf#page=225, it would have worked fine).
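The requested behaviour could be sketched roughly as follows. This is only an illustration of the request in this report, under the assumption that the rewrite happens on the finished archive-url string; `keep_pdf_page_link` is a hypothetical name and is not part of IABot.

```python
import re

# Hypothetical patterns for illustration; not taken from IABot.
_WAYBACK = re.compile(r"^(https?://web\.archive\.org/web/\d+)(id_)?/(.*)$")
_PDF_PAGE = re.compile(r"pdf#page=\d+$")


def keep_pdf_page_link(archive_url: str) -> str:
    """Ensure id_ follows the timestamp when the URL ends in pdf#page=<integer>.

    Any other URL is returned unchanged.
    """
    match = _WAYBACK.match(archive_url)
    if not match or not _PDF_PAGE.search(archive_url):
        return archive_url
    prefix, _modifier, target = match.groups()
    # Re-emit the URL with id_ present, whether or not it was there before.
    return f"{prefix}id_/{target}"
```

Applied to the “after IABot” URL above, this yields the id_ form with http:// retained; applied to the “before IABot” URL, it returns the input unchanged.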

A less urgent problem is the access-date parameter in the same citation in the same diff:
• the access-date was added to a citation that didn’t have one
• the access-date was added to the middle of a quote parameter’s content

An even less urgent problem is the addition of blank df parameters to every citation the bot touches, which at best adds nothing useful for the reader, editor, or software.

Until the bugs are addressed, I will revert the parts of the diff in question.

Many thanks for this priceless service you’re providing,

LLarson

Event Timeline

I have consulted with the devs of the Wayback Machine. IABot is correctly normalizing the URLs; the problem turns out to be a bug in their software. The fix will be on their end.