
InternetArchiveBot does not handle URLs of the form http://music.cbc.ca/!#!/... correctly
Closed, Invalid · Public

Description

I don't think this is very standard, but there are several such URLs on the Joni Mitchell article, which lead to old CBC music blog posts. InternetArchiveBot archives them to a completely different blog post (always the same one):
https://en.wikipedia.org/w/index.php?title=Joni_Mitchell&diff=777412220&oldid=776441987

The Wayback Machine doesn't seem to handle these URLs well either, because when I searched for one of them, http://music.cbc.ca/#/blogs/2013/6/Exclusive-Joni-Mitchell-talks-to-Jian-Ghomeshi-about-death-hippies-art-and-getting-Banffed, I got http://music.cbc.ca.

The URLs on the talk page notice are also truncated:
https://en.wikipedia.org/w/index.php?title=Talk:Joni_Mitchell&diff=777412224&oldid=759256939

Event Timeline

Graham87 created this task. Apr 27 2017, 8:20 AM
Restricted Application added a project: Internet-Archive. Apr 27 2017, 8:20 AM
Restricted Application added a subscriber: Aklapper.
Cyberpower678 triaged this task as Normal priority. Apr 27 2017, 3:33 PM
Cyberpower678 moved this task from Unsorted to Bugs on the InternetArchiveBot (v1.3) board.
Cyberpower678 raised the priority of this task from Normal to High. Apr 27 2017, 3:40 PM

This needs to be reported to the developers of the Wayback Machine. I'll forward this to them.

Actually, I honestly don't know how this is going to be fixed. That URL is a complete violation of the rules of URLs. While IABot's URL recognition system is already very flexible with URLs that bend these rules, a # is usually an explicit indicator of something that points the browser to a page anchor.

Looking at those URLs, they all go to the same page in the web browser. I'm guessing the URL itself is screwed up. Any service that uses # in their URLs needs a new web developer. I don't think this will be fixed.

Encoding the URL to http://music.cbc.ca/%23/blogs/2013/6/Exclusive-Joni-Mitchell-talks-to-Jian-Ghomeshi-about-death-hippies-art-and-getting-Banffed takes me to a 404 Not Found page. So these URLs are just useless. I'm not sure the effort to support this is even worthwhile considering the URLs, if they're actually supposed to go somewhere, are illegal.

Cyberpower678 closed this task as Invalid. Apr 27 2017, 6:09 PM

Those URLs don't work now, but certainly did in the past (I distinctly remember reading at least one of the pages there). Lemme see if I can find anything useful there; I'll report back if I do. Yes, that web developer needs to be fired, if he/she wasn't already.

I'm not sure how it was set up in the past, but my web browser interprets the URL the same way the bot and the Wayback Machine do. It sees a root URL with a large fragment. Properly encoding the pound symbol takes me to a 404.
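
For anyone who wants to see that split for themselves, here is a minimal sketch using Python's standard `urllib.parse`. It is only an illustration of generic URL parsing, not IABot's or the Wayback Machine's actual code; the variable names are made up for the example.

```python
from urllib.parse import urlsplit

# One of the URLs from the Joni Mitchell article.
url = (
    "http://music.cbc.ca/#/blogs/2013/6/"
    "Exclusive-Joni-Mitchell-talks-to-Jian-Ghomeshi-about-death-"
    "hippies-art-and-getting-Banffed"
)

parts = urlsplit(url)

# Everything after the first "#" is the fragment. A client keeps the
# fragment to itself, so the server is only ever asked for the path.
print("host:    ", parts.netloc)    # music.cbc.ca
print("path:    ", parts.path)      # /
print("fragment:", parts.fragment)  # /blogs/2013/6/Exclusive-...

# Percent-encoding the "#" moves the blog path into the request path,
# which is a different resource (the one that returns a 404 today).
encoded_url = url.replace("#", "%23", 1)
encoded_parts = urlsplit(encoded_url)
print("encoded path:", encoded_parts.path)  # /%23/blogs/2013/6/Exclusive-...
```

With the raw URL, the only thing a crawler can request is the site root, which would explain why every such link ends up at the same archived page.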

I did searches for "Joni" and "Banff", two terms in the first URL in the edit to Joni Mitchell linked above, in the Wayback Machine for http://music.cbc.ca. FWIW it coughed up the following URLs, neither of which actually works:
http://music.cbc.ca:80/blogs/blogpost.aspx?modPageName=&year=2013&month=6&title=Exclusive-Joni-Mitchell-talks-to-Jian-Ghomeshi-about-death-hippies-art-and-getting-Banffed&permalink=/blogs/2013/6/Exclusive-Joni-Mitchell-talks-to-Jian-Ghomeshi-about-death-hippies-art-and-getting-Banffed
http://music.cbc.ca:80/blogs/2013/6/Exclusive-Joni-Mitchell-talks-to-Jian-Ghomeshi-about-death-hippies-art-and-getting-Banffed

I wonder if the least bad thing to do when the bot encounters URLs with "/#/" in them would be to just mark them as dead on site. It really shouldn't try to link to another post, as it did in the diff above.
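
A rough sketch of how such a check might look, again in Python with the standard `urllib.parse`. The function name `looks_like_hash_route` and the rule it applies (a fragment that starts with `/` or `!/`) are assumptions made up for this illustration; nothing here is taken from IABot.

```python
from urllib.parse import urlsplit


def looks_like_hash_route(url: str) -> bool:
    """Heuristic: True when the fragment looks like a path, not an anchor.

    Ordinary anchors ("#History") name a spot on a page. Fragments such as
    "#/blogs/2013/6/..." or "#!/blogs/..." look like client-side routes, so
    an archive lookup on the part before "#" will likely hit the wrong page.
    """
    fragment = urlsplit(url).fragment
    return fragment.startswith(("/", "!/"))


candidates = [
    "http://music.cbc.ca/#/blogs/2013/6/Exclusive-Joni-Mitchell-talks-to-"
    "Jian-Ghomeshi-about-death-hippies-art-and-getting-Banffed",
    "https://en.wikipedia.org/wiki/Joni_Mitchell#Legacy",
    "http://music.cbc.ca",
]

for candidate in candidates:
    # Hypothetical policy: skip archive rescue and flag the link as dead.
    action = "mark dead, do not rescue" if looks_like_hash_route(candidate) else "handle normally"
    print(f"{action}: {candidate}")
```

A check like this would only stop the bot from substituting an unrelated archive; it would not recover the original post.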