
IABot API - truncates at %20 with modifyurl
Closed, Invalid · Public

Description

When submitting a modifyurl request where the archive URL contains a %20, the URL gets truncated at the %20.

Example for ID 545593

urlid=545593&overridearchivevalidation=1&archiveurl=https://web.archive.org/web/20061222222052/http://www.battleshipiowa.org/Mare%20Island%20Naval%20Shipyard.htm

It ends up in the database as

http://www.battleshipiowa.org/Mare


Example for 578485

urlid=578485&overridearchivevalidation=1&archiveurl=https://web.archive.org/web/20110724021411/http://bic.cass.cn/english/infoShow/Arcitle_Show_Forum2_Show.asp?ID=320&Title=The%20Humanities%20Study&strNavigation=Home-%3EForum&BigClassID=4&SmallClassID=8

It ends up as

https://web.archive.org/web/20110724021411/http://bic.cass.cn/english/infoShow/Arcitle_Show_Forum2_Show.asp?ID=320&Title=The


Using '+' instead of %20 seems to work, but the API should be able to accept %20.

Event Timeline

Restricted Application added a project: Internet-Archive.

You're encoding them wrong. You're giving the API a URL inside a URL-encoded request, so the %20 is decoded on receipt, arrives as " ", and the URL gets truncated there.
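To illustrate the point above, here is a minimal sketch of what a form parser does with the first example's request body. It uses Python's `urllib.parse` as a stand-in for the server's own parser, and the final "cut at the first space" step is only an assumption about how the truncation happens, not IABot's actual code.

```python
from urllib.parse import parse_qs

# Request body from the first example, with the archive URL pasted in as-is.
body = (
    "urlid=545593&overridearchivevalidation=1"
    "&archiveurl=https://web.archive.org/web/20061222222052/"
    "http://www.battleshipiowa.org/Mare%20Island%20Naval%20Shipyard.htm"
)

# Form decoding turns every %20 in the value into a literal space.
received = parse_qs(body)["archiveurl"][0]
print(received)

# Hypothetical server-side step (an assumption for illustration only):
# keep just the text before the first whitespace character.
truncated = received.split()[0]
print(truncated)
```

The value the parser hands back already contains spaces, so anything downstream that treats whitespace as the end of a URL would stop at "Mare".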

Ugh, that's a small oversight. Fortunately it won't be difficult to go back and rerun all the ones that had a % in the URL. For some reason I was intentionally decoding the URL before encoding it. I don't remember why, but it will probably work to do a single encoding with no pre-decode.
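A single encoding pass with no pre-decode might look like the sketch below. It is only an illustration in Python using `urllib.parse`; the parameter names are the ones that appear in this task, and nothing is actually sent to the API.

```python
from urllib.parse import parse_qs, quote, urlencode

# Archive URL exactly as stored, including its own %20 escapes.
archive_url = (
    "https://web.archive.org/web/20061222222052/"
    "http://www.battleshipiowa.org/Mare%20Island%20Naval%20Shipyard.htm"
)

params = {
    "action": "modifyurl",
    "urlid": "545593",
    "overridearchivevalidation": "1",
    "archiveurl": archive_url,
}

# Encode each value once. The existing "%" becomes "%25", so "%20" is sent
# as "%2520" and survives the server's single decode as "%20".
body = urlencode(params, quote_via=quote)
print(body)

# Round trip: one decode gives back the original archive URL.
decoded = parse_qs(body)["archiveurl"][0]
print(decoded == archive_url)
```

The point of skipping the pre-decode is that the URL's own escapes are treated as data: they get wrapped in one more layer for transport and unwrapped exactly once on receipt.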

Some of them are not taking.

ID 311991

Source URL (Greek-language characters):
https://archive.is/20130217191756/http://arenalarissa.gr/archives/24525λα-στην-αελ.html

URL Encoded as part of the archive:
https://archive.is/20130217191756/http://arenalarissa.gr/archives/24525%CE%BB%CE%B1-%CF%83%CF%84%CE%B7%CE%BD-%CE%B1%CE%B5%CE%BB.html

URL Encoded a second time for API post:

--post-data='action=modifyurl&urlid=311991&overridearchivevalidation=1&archiveurl=https%3A%2F%2Farchive%2Eis%2F20130217191756%2Fhttp%3A%2F%2Farenalarissa%2Egr%2Farchives%2F24525%25CE%25BB%25CE%25B1%2D%25CF%2583%25CF%2584%25CE%25B7%25CE%25BD%2D%25CE%25B1%25CE%25B5%25CE%25BB%2Ehtml'

The API returns "Error", and the database shows the archive URL containing the original Greek characters. Maybe it doesn't matter.
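For reference, the two encoding layers described above can be checked with a short Python sketch. It only uses `urllib.parse` to confirm that the strings are consistent with each other; it says nothing about why the API returns "Error". Unlike the posted request, Python leaves "." and "-" unescaped, which decodes to the same value.

```python
from urllib.parse import quote, unquote

# URL with the Greek path segment percent-encoded once (as in the archive URL).
once = (
    "https://archive.is/20130217191756/"
    "http://arenalarissa.gr/archives/24525"
    "%CE%BB%CE%B1-%CF%83%CF%84%CE%B7%CE%BD-%CE%B1%CE%B5%CE%BB.html"
)

# Second pass for the POST body: every "%" becomes "%25".
twice = quote(once, safe="")
print(twice)

# One decode (what the API should see) restores the once-encoded URL.
print(unquote(twice) == once)

# A second decode yields the raw Greek characters, which is the form
# reported as showing up in the database.
raw = unquote(once)
print(raw)
```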

archive.is is probably giving it that URL instead. If it loads, I would ignore it.