Converting + to %20
Closed, Resolved · Public

Description

Diff:

https://en.wikipedia.org/w/index.php?title=Internal_ballistics&diff=prev&oldid=774935163

The conversion from '+' to '%20' broke the URL.

This is a common point of confusion: a '+' in the path means a literal '+', while in the query portion (after the '?') a '+' means a space.

Example:

http://example.com/blue+light%20blue?blue%2Blight+blue

In this case the + in the path portion is a literal '+', while space is encoded as %20. In the query portion, space is encoded as '+', though it can be substituted with %20; for Wayback URLs it is generally safer to preserve the + in the query.
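
For reference, PHP's built-in encoders already reflect this path/query split; a minimal illustration (not taken from IABot's code):

```php
<?php
// urlencode() follows form encoding (space -> '+'), rawurlencode() follows
// RFC 3986 (space -> '%20'); both escape a literal '+' as '%2B'.

echo rawurlencode( "blue light" );   // "blue%20light"  (safe for a path)
echo urlencode( "blue light" );      // "blue+light"    (query-style space)
echo rawurlencode( "blue+light" );   // "blue%2Blight"  (literal plus escaped)

// Decoding is asymmetric in the same way: urldecode() turns '+' into a
// space, while rawurldecode() leaves it alone.
echo urldecode( "blue+light" );      // "blue light"
echo rawurldecode( "blue+light" );   // "blue+light"
```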

Event Timeline

Restricted Application added a subscriber: Aklapper.

Medic now has a check and fix for these cases as it comes across them, but hopefully the problem can be identified at the source.

I discovered an encoding protocol mismatch during the URL sanitization process. I matched them up, and now the pluses in the paths get converted to %2B.

http://www.cabelas.com/story-123/boddington_short_mag/10201/The%2BShort%2BMag%2BRevolution.shtml, which works with IA URLs.
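
A minimal sketch of that path fix (the helper name is hypothetical; the real change lives inside the URL sanitizer):

```php
<?php
// Escape literal pluses in the path only, leaving the query untouched.
// Illustrative helper, not the actual sanitizer code.

function encodePathPluses( $url ) {
    $parts = explode( '?', $url, 2 );
    $path  = str_replace( '+', '%2B', $parts[0] );
    return $path . ( isset( $parts[1] ) ? '?' . $parts[1] : '' );
}

// encodePathPluses( "http://www.cabelas.com/The+Short+Mag+Revolution.shtml" )
// returns "http://www.cabelas.com/The%2BShort%2BMag%2BRevolution.shtml"
```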

The archive validation subroutine was improperly handling the URLs. That work has now been offloaded entirely to the URL sanitizer, which properly normalizes the URL.

The database has URLs with "+" instead of "%20" due to the above bug.

https://en.wikipedia.org/w/index.php?title=Joe_Gqabi_District_Municipality&type=revision&diff=776808688&oldid=774194824

According to the management interface this URL is used in 265 articles. It will be a difficult problem to fix, as it's not clear when a URL should use + or %20; the only way to tell is by testing which works. I have some ideas for how it might be done, but wanted to re-open the ticket as WaybackMedic is picking up lots of broken links.

It would certainly seem that %20 is more commonly used than +, so I converted the sanitizer to encode spaces as %20 instead of + in the query.

https://github.com/wikimedia/DeadlinkChecker/pull/18

Cyberpower678 changed the task status from Open to Stalled. May 1 2017, 2:16 AM

My data confirms %20 in the query is most common. There are still some cases where + is the only one that works.

The bigger problem is database corruption. It might be possible to discover which URLs have a problem by comparing the original/source URL against the archive URL, checking for differences in +, %20 and %2B. If differences are found, do a web page check and try various combinations until a working URL is found. I have the code for this in Medic (a sketch follows below). It uses 6 variations:

%20 to + in path
%20 to + in query
%20 to + in path and query
+ to %20 in path
+ to %20 in query
+ to %20 in path and query

If you want a command-line util that does this, let me know: pass it the original URL and it will return the working URL, for use in a database cleanup script.
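
A rough sketch of the check (hypothetical function names; Medic itself is not written in PHP): generate the six +/%20 variations and probe each one until a candidate returns a 2xx status.

```php
<?php
// Build the six variations listed above by swapping %20 and + in the
// path, the query, or both, then test them in order.

function plusSpaceVariants( $url ) {
    $parts = explode( '?', $url, 2 );
    $path  = $parts[0];
    $query = isset( $parts[1] ) ? '?' . $parts[1] : '';

    return [
        str_replace( '%20', '+', $path ) . $query,                            // %20 -> + in path
        $path . str_replace( '%20', '+', $query ),                            // %20 -> + in query
        str_replace( '%20', '+', $path ) . str_replace( '%20', '+', $query ), // %20 -> + in both
        str_replace( '+', '%20', $path ) . $query,                            // + -> %20 in path
        $path . str_replace( '+', '%20', $query ),                            // + -> %20 in query
        str_replace( '+', '%20', $path ) . str_replace( '+', '%20', $query ), // + -> %20 in both
    ];
}

function findWorkingVariant( $url ) {
    foreach ( plusSpaceVariants( $url ) as $candidate ) {
        $ch = curl_init( $candidate );
        curl_setopt( $ch, CURLOPT_NOBODY, true );          // HEAD request is enough
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
        curl_exec( $ch );
        $status = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
        curl_close( $ch );
        if ( $status >= 200 && $status < 300 ) {
            return $candidate;                             // first working variant wins
        }
    }
    return null;                                           // none of the six worked
}
```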

How fast is it? Also, this is prone to false positives. Labs is blacklisted from lots of domains.

It's in C so it's fast, but it's async so it's slow. But I was thinking of it for use in a script, not IABot, so it can take as long as it needs. The blacklisting is not a problem because it's only checking archive sites, mostly archive.org. The original link to the website wasn't sanitized, right? Only the archive link?

Both links are sanitized. But when the sanitizer changes, the original URL becomes inaccessible and a new entry is created.

It should still work: the script checks the +/%20 combinations of the archive URL, and if it finds a working match it extracts the source URL from the archive URL and updates the source. It just means it will have to process every link in the database containing a %20 or +; my data shows that is 3% of links (from a random selection of 25,000 links). How many links are in the database in total?
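
The extraction step could look like this (a sketch; the helper name is made up, and the regex assumes the standard web.archive.org/web/<timestamp>/<url> layout):

```php
<?php
// Pull the original URL back out of a Wayback snapshot URL so the source
// field can be updated to the working encoding. Illustrative helper only.

function sourceFromWayback( $archiveUrl ) {
    if ( preg_match( '~^https?://web\.archive\.org/web/\d+[a-z_]*/(.*)$~i',
        $archiveUrl, $m ) ) {
        return $m[1];   // everything after the timestamp is the original URL
    }
    return null;        // not a recognizable Wayback URL
}

// sourceFromWayback( "https://web.archive.org/web/20170101000000/http://example.com/a+b?c%20d" )
// returns "http://example.com/a+b?c%20d"
```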

There are over 2.1 million links in the DB with archives associated.

4% would be 84,000 links; that's reasonable using GNU parallel over a night. If you want, send me a list of archive links containing +, %2B or %20 and I'll check them with my home computer rather than Tools.

I don't see any pluses in paths getting converted. As you mentioned, + == %20 in queries, while + != %20 in paths.

Apparently it is sometimes significant on the Wayback Machine (and always on archive.is). The problem might be that the Wayback Machine treats the entire URL as a path, with no query. Or maybe it depends: an older version of the Wayback software did it that way, and newer versions are able to differentiate, but it depends on how the URL got added to their database.

One solution: if an archive URL preexists in the wikitext and there is nothing in the database, retain the existing encoding used in the wikitext, because that is probably going to be accurate (human checked).

We should probably take this to the Wayback devs. The URLs are technically the same.

Why does it need to sanitize the + and %20? These are problematic.

Remember when you asked me to filter out the :80 in the URLs? Well, the URL sanitizer in the CID (CheckIfDead) does that, so I feed the URLs into the sanitizer. I'd really rather not have to build a custom function to clean the URLs up.

It seems straightforward to create a rawurldecode-v2 that traps certain edge cases ('+' in the query) and passes the rest through rawurldecode(). This gives you total control over edge-case problems that come up.
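
A possible shape for that wrapper, sketched under the assumption that the surrounding sanitizer decodes and re-encodes the whole URL (the function name and placeholder byte are illustrative):

```php
<?php
// Shield literal '+' in the query so a decode/re-encode round trip cannot
// turn it into a space; everything else goes through rawurldecode().

function rawurldecodeV2( $url ) {
    $parts = explode( '?', $url, 2 );
    $path  = rawurldecode( $parts[0] );
    if ( !isset( $parts[1] ) ) {
        return $path;
    }
    $query = str_replace( '+', "\x01", $parts[1] );  // \x01 never appears in a valid URL
    $query = rawurldecode( $query );
    $query = str_replace( "\x01", '+', $query );
    return $path . '?' . $query;
}
```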

Does rawurldecode() have a "safe" argument to tell it to leave certain characters alone? Python has this.

See https://github.com/wikimedia/DeadlinkChecker/blob/master/src/CheckIfDead.php#L391 for how the sanitizer works. BTW, check your Slack.

Cyberpower678 changed the task status from Open to Stalled. May 8 2017, 6:04 PM

Since this will be resolved on the Wayback end, I'll stall this ticket for now.

There are 20+ other archive services. In my experience they are not as flexible as Wayback when it comes to interchangeable interpretation of encoding.

You are probably right. I'll remove the sanitizer from the other archiving services.

I modified the archive handlers to instead sanitize the original URL. I think that was a hidden bug to begin with. The original URL needs to be consistent, or else it will not work with the DB. I'll test it later on the interface.

Is it going to modify the original URL in the wikitext, as in the |url= field? That is the URL used to create snapshots at other archive providers and to find snapshots there. If the snapshot was created years ago and the source URL is later modified, it won't be possible to find it (in cases where sanitization occurs).

No, it's not. It's just for saving in the DB and making it more easily accessible.

Sample ongoing bot wars over %20/+ in the query string (not a complete list):

https://en.wikipedia.org/w/index.php?title=2009_flu_pandemic_in_the_United_States_by_state&action=history 2009 flu pandemic in the United States by state
https://en.wikipedia.org/w/index.php?title=Academi&action=history Academi
https://en.wikipedia.org/w/index.php?title=In_Absentia&action=history In Absentia
https://en.wikipedia.org/w/index.php?title=All_Nippon_Airways_Flight_60&action=history All Nippon Airways Flight 60
https://en.wikipedia.org/w/index.php?title=Played-A-Live_(The_Bongo_Song)&action=history Played-A-Live (The Bongo Song)
https://en.wikipedia.org/w/index.php?title=E.Digital_Corporation&action=history E.Digital Corporation
https://en.wikipedia.org/w/index.php?title=Lord_Tweedsmuir_Secondary_School&action=history Lord Tweedsmuir Secondary School
https://en.wikipedia.org/w/index.php?title=Li_Jiawei&action=history Li Jiawei

I know the answer is to wait for IA to update, but there are a couple of issues:

  1. It may take a while, and in the meantime the bots are battling.
  2. It still may break other archive services that don't support %20/+ interchangeability.
  3. Even if it's working at Wayback today, it may break if the URL has to be moved to a different provider in the future.

I think there is enough confusion and non-standard practice in the real world that IABot should try to maintain whatever is in the wikitext and not canonicalize + to %20. The out-of-the-box PHP rawurlencode() doesn't have a "safe" argument to tell it not to encode a list of custom characters. Python has this, as do Nim and others; it is often called quote().
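
For illustration, Python's quote(s, safe='...') behaviour can be approximated on top of rawurlencode() (the helper below is hypothetical, not part of PHP):

```php
<?php
// Encode everything per RFC 3986, then un-escape the characters the caller
// declared safe, mimicking Python's quote(s, safe='+').

function rawurlencodeSafe( $str, $safe = '' ) {
    $encoded = rawurlencode( $str );
    foreach ( str_split( $safe ) as $ch ) {
        $encoded = str_replace( rawurlencode( $ch ), $ch, $encoded );
    }
    return $encoded;
}

// rawurlencodeSafe( "blue+light blue", "+" ) returns "blue+light%20blue"
```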

A simple solution:

https://github.com/wikimedia/DeadlinkChecker/blob/master/src/CheckIfDead.php#L464

Prior to the explode() at line 468, convert all '+' to '_NOPLUSIABOT_', and after line 482 convert back to '+'.
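
In standalone form, the placeholder trick would look roughly like this (illustrative only; the line numbers above refer to the CheckIfDead.php of the time):

```php
<?php
// Shield literal pluses with a token that should never occur in a real URL,
// let the existing explode()/decode/re-encode sanitization run, then restore.

$url = str_replace( '+', '_NOPLUSIABOT_', $url );

// ... existing sanitization of $url runs here ...

$url = str_replace( '_NOPLUSIABOT_', '+', $url );
```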

Number 2 isn't an issue; as I mentioned a while back, the sanitizer has been removed from the archiving services, so the URLs are not being sanitized except for WebCite and IA.
Since the bots are quibbling over small stuff, maybe tagging those sources with cbignore would be helpful; IABot respects flags to stay away above all else. I'm not following number 3.

As for your proposed solution, there is another point to this sanitizer: it helps keep the differently formatted URLs that point to the same page consistent. Before I implemented the sanitizer, the DB was 37% larger, with duplicate URLs in different formattings. This caused issues with reporting dead links and false positives, not to mention making the URLs more difficult to look up. The sanitizer centralizes that and makes the collected record of pages where the URLs were found more reliable and easier to access.

Number 3 is a variant of 2: when bare links are converted to Wayback they get sanitized, and if the Wayback link stops working in the future, one needs to extract the original URL from the Wayback URL to search other services; since it was previously sanitized, it may not be found at the new service. This is mostly true for archive.is, since they crawled Wikipedia saving URLs as they were found at the time of the crawl.

Thanks for explaining the reason for the sanitizing; that helps. I have some ideas for how to solve it on the database side, but they are pretty complex, so I won't go there.

I'll manually add cbignore to the most egregious edit-war cases and wait for IA to change the surt library; we can re-evaluate once this part is cleared up.