Do not replace = by %3D, implement convert_archives_encoding=0 functionality
Closed, Resolved · Public

Description

Why would you want to do that anyway? It doesn't make the URL prettier.
https://nl.wikipedia.org/w/index.php?title=Arjen_Robben&diff=next&oldid=49465457

Event Timeline

It's not about pretty in many cases. It's about ensuring functionality by making sure the proper encoding is being sent. This is a normal, properly conforming, URL.

The first URL doesn't get encoded that way. = is only encoded in the path. This is done because = isn't actually legal in the URL and is only used to define values to parameters in the queries, based on RFC standards.

The second URL is being normalized to ensure operation of the archive.

The first URL doesn't get encoded that way. = is only encoded in the path. This is done because = isn't actually legal in the URL and is only used to define values to parameters in the queries, based on RFC standards.

OK

The second URL is being normalized to ensure operation of the archive.

Why? The archive link already works.

The biggest reason is to keep IABot's DB from being cluttered with duplicates. The second reason is that if the archive URL conforms to proper standards, it is easier for the Wayback Machine to load it, and should a change ever break support for improper encoding, this URL is less likely to break.

I would prefer to keep the original URL behind the https://wayback.archive.org/web/XXXXXXXX/ part for obvious reasons:

  • knowing what the exact original URL was
  • it works - don't change it unless needed
  • it's more readable

The problem is that sanitization is hard-coded into the analysis routine: the URL is sanitized and then saved into the DB. Skipping sanitization would start cluttering the DB. I have no idea how it could be "turned off" for certain wikis while retaining the sanitized URL in the DB.

When reading the URLs, you sanitize each URL to store it in the database. You then write that sanitized URL back to Wikipedia, but with a few exceptions:

  • use + instead of its encoded variant
  • use = instead of its encoded variant (unless you really want to comply with RFC standards)

Does this sound difficult to you?
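
The proposal above can be sketched roughly as follows. This is only an illustration in Python, not IABot's actual code: `sanitize_for_db` is a hypothetical stand-in for the external sanitization library, and `for_wikitext` is a hypothetical post-processing step that restores `=` and `+` before the URL is written back to the wiki.

```python
import re

def for_wikitext(db_url: str) -> str:
    """Hypothetical post-processing: restore '=' and '+' from their encoded forms."""
    url = re.sub(r"%3D", "=", db_url, flags=re.IGNORECASE)
    return re.sub(r"%2B", "+", url, flags=re.IGNORECASE)

def sanitize_for_db(url: str) -> str:
    """Hypothetical stand-in for the sanitizer: one canonical spelling per URL.

    Any existing %3D/%2B escapes are decoded first, so mixed or lower-case
    spellings collapse to the same database key.
    """
    return for_wikitext(url).replace("=", "%3D").replace("+", "%2B")

plain = "https://web.archive.org/web/2017/https://example.org/w/index.php?title=A+B&oldid=1"
mixed = plain.replace("=", "%3d")

# Both spellings share one DB key, so no duplicate rows are created.
assert sanitize_for_db(plain) == sanitize_for_db(mixed)
# The link written back to the wiki keeps the readable '=' and '+'.
assert for_wikitext(sanitize_for_db(plain)) == plain
```

One caveat: in a query string a literal `+` is commonly read as a space, so restoring `%2B` there can change what the URL means.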

Encoding + also complies with the standards. And yes, it is difficult, since the sanitization routine relies on the standard encoding provided by an external library.

You could post-process it when it comes out of the external library. I will ask others for opinions on this matter.

That's the next problem: bloating the code. This is also in the global parser, which means the change would also apply to the other 5 wikis it runs on.

We chatted about this, and we came up with an idea: if we can disable converting the encoding for existing archive.org URLs, I can live with this solution. I propose to call this "convert_archives_encoding=0" but feel free to change the parameter name.
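
A minimal sketch of how such a switch might behave, assuming a per-wiki setting named `convert_archives_encoding` (the only name taken from this task). The host list and the functions `is_archive_url` and `choose_archive_url` are hypothetical and do not reflect how IABot actually implements it.

```python
from urllib.parse import urlsplit

# Assumed set of Wayback Machine hosts; illustrative, not exhaustive.
ARCHIVE_HOSTS = {"web.archive.org", "wayback.archive.org"}

def is_archive_url(url: str) -> bool:
    """Return True when the URL points at one of the assumed archive hosts."""
    host = (urlsplit(url).hostname or "").lower()
    return host in ARCHIVE_HOSTS

def choose_archive_url(existing_url: str, sanitized_url: str,
                       convert_archives_encoding: int = 1) -> str:
    """Pick the URL to write back to the wiki.

    With convert_archives_encoding=0, an archive URL that is already on the
    page is left exactly as the editor wrote it; the sanitized form is still
    available for the database.
    """
    if not convert_archives_encoding and is_archive_url(existing_url):
        return existing_url
    return sanitized_url

on_page = "https://wayback.archive.org/web/2017/https://example.org/w/index.php?title=X"
sanitized = on_page.replace("=", "%3D")

assert choose_archive_url(on_page, sanitized, convert_archives_encoding=0) == on_page
assert choose_archive_url(on_page, sanitized, convert_archives_encoding=1) == sanitized
```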

Smile4ever renamed this task from "Do not replace = by %3D" to "Do not replace = by %3D, implement convert_archives_encoding=0 functionality". Jul 31 2017, 6:27 PM

Implemented in v1.5beta2.

How can a wiki change the behaviour?