By reading a few Phabricator tasks, code, and VCL comments, I understand the problem, solution, and current reality of path normalization as follows.
Problem
These two URLs represent the same page:
- https://en.wikipedia.org/wiki/Steve_Fuller_%28sociologist%29
- https://en.wikipedia.org/wiki/Steve_Fuller_(sociologist)
As discovered by @Catrope (T29935#317586):
Pages with parentheses and certain other special characters in their titles
have more than one correct URL: one with literal parentheses, one with
parentheses URL-encoded, and I'm sure mixes of those two are accepted as well.
However, when a page changes, Squid only purges the urlencoded URL and doesn't
purge the others.
We need URLs to be converted to a single, univocal representation before caching and fetching objects from cache. Also, the conversion needs to happen before purging, so that if Steve_Fuller_(sociologist) is cached, a PURGE for Steve_Fuller_%28sociologist%29 invalidates the object.
The question is: given a URL, which characters should be encoded (e.g.: ! -> 0x21), which hex escapes should be decoded (e.g.: 0x7e -> ~), and which characters/hex escapes should be left untouched?
The approach, in theory
RFC 3986 Section 2 splits the 256 possible byte values into 3 disjoint sets: Unreserved, Disallowed, and Reserved:
- 66 Unreserved: 0-9 A-Z a-z - . _ ~
- 172 Disallowed: 0x00-0x20 0x7F-0xFF < > | { } " % \ ^ `
- 18 Reserved: : / ? # [ ] @ ! $ & ' ( ) * + , ; =
Unreserved and Disallowed characters do not present any issue from the point of view of obtaining a univocal representation: we always decode the former and always encode the latter.
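As a quick sanity check of this split, here is a throwaway sketch of my own (not existing code) that classifies every byte value into the three sets and counts them; it should report the 66/18/172 figures above.

```c
#include <stdio.h>
#include <string.h>

/* 0 = Unreserved, 1 = Reserved, 2 = Disallowed, per RFC 3986 Section 2. */
static int rfc3986_class(unsigned char c) {
    if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
        (c >= 'a' && c <= 'z') || (c && strchr("-._~", c)))
        return 0;
    if (c && strchr(":/?#[]@!$&'()*+,;=", c))
        return 1;
    return 2;
}

int main() {
    int counts[3] = { 0, 0, 0 };
    int i;

    for (i = 0; i < 256; i++)
        counts[rfc3986_class((unsigned char)i)]++;

    /* Expected: unreserved=66 reserved=18 disallowed=172 */
    printf("unreserved=%d reserved=%d disallowed=%d\n",
           counts[0], counts[1], counts[2]);
    return 0;
}
```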
Troubles begin when Reserved characters are used. According to RFC 3986 Section 2.2:
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
Application-specific knowledge is required to choose what to do with Reserved characters:
If data for a URI component would conflict with a reserved character's
purpose as a delimiter, then the conflicting data must be percent-encoded
before the URI is formed.
For example, in the case of MediaWiki, 0x2f can be decoded to /, given that as far as MW is concerned slashes are fine in titles:
```
$ varnishlog -q 'ReqMethod eq "PURGE" and ReqURL ~ "Profiling_Python"' | grep ReqURL &
$ curl -X PURGE 'http://127.0.0.1:3128/wiki/User:Ema%2fProfiling_Python'
- ReqURL /wiki/User:Ema%2fProfiling_Python
- ReqURL /wiki/User:Ema/Profiling_Python
```
This should not happen for RESTBase, however, as that application uses / as a delimiter: T127387
Without application-specific knowledge, the following rules should be followed:
- Unreserved hex escapes should always be decoded: 0x7e -> ~
- Disallowed characters should be encoded to their hex escape representation: > -> 0x3e
- Reserved characters (and their hex escape representations) should be left as-is (see the sketch after this list)
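To make the three rules concrete, here is a minimal sketch of my own illustrating them (not the production VCL code; normalize_path and its output-buffer convention are assumptions for the example). It decodes Unreserved escapes, percent-encodes Disallowed bytes, and leaves Reserved characters and their escapes alone.

```c
#include <stdio.h>
#include <string.h>

/* RFC 3986 Unreserved characters: always safe to decode. */
static int is_unreserved(unsigned char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '-' || c == '.' || c == '_' || c == '~';
}

/* RFC 3986 Reserved characters: left alone without application knowledge. */
static int is_reserved(unsigned char c) {
    return c != '\0' && strchr(":/?#[]@!$&'()*+,;=", c) != NULL;
}

/* Value of a hex digit, or -1 if c is not a hex digit. */
static int hexval(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* out must hold at least 3 * strlen(in) + 1 bytes. */
static void normalize_path(const char *in, char *out) {
    while (*in) {
        unsigned char c = (unsigned char)*in;
        int hi, lo;

        if (c == '%' && (hi = hexval((unsigned char)in[1])) >= 0 &&
                        (lo = hexval((unsigned char)in[2])) >= 0) {
            unsigned char decoded = (unsigned char)(hi * 16 + lo);
            if (is_unreserved(decoded)) {
                *out++ = (char)decoded;          /* Unreserved escape: decode, e.g. 0x7e -> ~ */
            } else {
                *out++ = in[0];                  /* Reserved/Disallowed escape: keep as-is */
                *out++ = in[1];
                *out++ = in[2];
            }
            in += 3;
        } else if (is_unreserved(c) || is_reserved(c)) {
            *out++ = (char)c;                    /* literal Unreserved/Reserved character: keep */
            in++;
        } else {
            out += sprintf(out, "%%%02X", (unsigned)c); /* Disallowed: encode, e.g. > -> 0x3e */
            in++;
        }
    }
    *out = '\0';
}

int main() {
    char buf[256];
    normalize_path("/wiki/Steve_Fuller_%28sociologist%29", buf);
    printf("%s\n", buf);   /* parentheses escapes are Reserved: left untouched */
    return 0;
}
```

Note that under these generic rules the two Steve_Fuller URLs from the Problem section still do not collapse: ( and ) are Reserved, so both the literal and the %28/%29 forms are left untouched.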
With application-specific knowledge, we can carefully normalize Reserved characters too.
Given that we are operating on the Path component, which is delimited by ? or #, those 2 characters should be left unchanged. The remaining 16 characters in the Reserved set can be encoded/decoded with application-specific knowledge:
: / [ ] @ ! $ & ' ( ) * + , ; =
We'll call this set of characters the Customizable set.
When it comes to MediaWiki, each of the 16 characters in the Customizable set can be assigned to one of two subsets, always-decode or always-encode, giving us "complete" normalization:
mediawiki_decode_set = : / @ ! $ ( ) * , ;
mediawiki_encode_set = [ ] & ' + =
RESTBase is similar to MediaWiki, but needs to accept MediaWiki titles with slashes in the %2F form, while still keeping its own functional path-delimiting slashes unencoded as mentioned earlier.
restbase_decode_set = : @ ! $ ( ) * , ;
restbase_encode_set = [ ] & ' + =
For upload.wikimedia.org, on the other hand, all file titles go through PHP's rawurlencode() when their storage URL is generated. We thus want all characters in the Customizable set to be encoded. If my understanding of [[https://github.com/wikimedia/puppet/commit/fc9faef4abb1abec475ff3ce3eadd167b19ad099#diff-85c2947ba52184e62144144065c91c9cR6|the comment above the definition of sub normalize_upload_path]] is correct, the two subsets should be:
upload_decode_set =
upload_encode_set = : / [ ] @ ! $ & ' ( ) * + , ; =
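As a sketch of how these per-application subsets could translate into lookup tables in the style of the 256-entry "decoder rings" examined in the next section (build_ring and the action constants are my own illustration, not the production code):

```c
#include <string.h>

#define ACTION_ENCODE 0   /* percent-encode the literal character   */
#define ACTION_DECODE 1   /* decode the hex escape to the character */
#define ACTION_KEEP   2   /* leave both forms untouched             */

static int is_unreserved(unsigned char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '-' || c == '.' || c == '_' || c == '~';
}

/* Fill a 256-entry action table from an application's two Customizable subsets. */
static void build_ring(const char *decode_set, const char *encode_set, int ring[256]) {
    static const char *reserved = ":/?#[]@!$&'()*+,;=";
    int i;

    ring[0] = ACTION_ENCODE;                   /* NUL is Disallowed */
    for (i = 1; i < 256; i++) {
        unsigned char c = (unsigned char)i;
        if (is_unreserved(c))
            ring[i] = ACTION_DECODE;           /* Unreserved: always decode */
        else if (strchr(decode_set, c))
            ring[i] = ACTION_DECODE;           /* application-specific decode */
        else if (strchr(encode_set, c))
            ring[i] = ACTION_ENCODE;           /* application-specific encode */
        else if (strchr(reserved, c))
            ring[i] = ACTION_KEEP;             /* ? # and anything not customized */
        else
            ring[i] = ACTION_ENCODE;           /* Disallowed: always encode */
    }
}

/* The subsets described above; upload encodes the whole Customizable set. */
static const char *mediawiki_decode_set = ":/@!$()*,;", *mediawiki_encode_set = "[]&'+=";
static const char *restbase_decode_set  = ":@!$()*,;",  *restbase_encode_set  = "[]&'+=";
static const char *upload_decode_set    = "",           *upload_encode_set    = ":/[]@!$&'()*+,;=";

int main() {
    int mw[256];
    build_ring(mediawiki_decode_set, mediawiki_encode_set, mw);
    /* e.g. mw['('] == ACTION_DECODE, mw['['] == ACTION_ENCODE, mw['?'] == ACTION_KEEP */
    return 0;
}
```

Applied to the Problem example: ( and ) land in the MediaWiki decode set, so both Steve_Fuller URLs collapse to the literal-parentheses form.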
Current reality
In order to verify all this, I've written an ignoble C program that iterates through the currently defined decoder rings and prints out the action taken for each Customizable character:
```c
#include <stdio.h>

void print_decoder(const char *name, const size_t decoder[256]) {
    int i, j;
    int customizable[16] = {
        ':', '/', '[', ']', '@', '!', '$', '&',
        '\'', '(', ')', '*', '+', ',', ';', '='
    };

    for (i = 0; i < 256; i++) {
        for (j = 0; j < 16; j++) {
            if (i == customizable[j]) {
                if (decoder[i] == 0) {
                    printf("%s encode %c as 0x%x\n", name, i, i);
                }
                if (decoder[i] == 1) {
                    printf("%s decode 0x%x to %c\n", name, i, i);
                }
                if (decoder[i] == 2) {
                    printf("%s leave 0x%x or %c unchanged\n", name, i, i);
                }
            }
        }
    }
}

int main() {
    static const size_t mediawiki_decoder_ring[256] = {
        // 0x00-0x1F (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        // ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
        0,1,0,2,1,0,2,2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,2,0,2,
        //@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,0,2,0,1,
        //` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ <DEL>
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,
        // 0x80-0xFF (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    };

    static const size_t upload_decoder_ring[256] = {
        // 0x00-0x1F (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        // ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
        0,0,0,2,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,2,
        //@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,
        //` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ <DEL>
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,
        // 0x80-0xFF (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    };

    static const size_t restbase_decoder_ring[256] = {
        // 0x00-0x1F (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        // ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
        0,1,0,2,1,0,2,2,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,0,2,0,2,
        //@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,0,2,0,1,
        //` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ <DEL>
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,
        // 0x80-0xFF (all unprintable)
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    };

    print_decoder("upload", upload_decoder_ring);
    print_decoder("mediawiki", mediawiki_decoder_ring);
    print_decoder("restbase", restbase_decoder_ring);
}
```
Judging by the program output, the current MediaWiki reality seems to be:
mediawiki_decode = ! $ ( ) * , / : ; @
mediawiki_encode =
In the RESTBase case, instead (note the missing /):
restbase_decode = ! $ ( ) * , : ; @
restbase_encode =
For both MediaWiki and RESTBase above, the fact that encode is empty seems to be due to this:
- XXX However, for this first commit, we're merely switching to this new
- normalization code with minimal behavior change from the old code, which
- means not enforcing the Always-Encode set above.
For upload, the only difference between practice and my understanding of theory is that we decode /.
upload_decode = /
upload_encode = ! $ & ' ( ) * + , : ; = @ [ ]
ATS
At this time we need to implement path normalization for PURGE requests only: normalization is performed on all requests at the frontend layer and, given that all non-PURGE requests go through the frontends, they will already be normalized by the time they reach ATS.
Upload normalization is enforced for requests with Host: upload.wikimedia.org, RESTBase for requests with URL ~ "^/api/rest_v1/" or Host: cxserver.wikimedia.org, while all other request paths are normalized using the rules for MediaWiki.
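For illustration only, here is how that routing could look. I'm sketching it in C to match the test program above, even though the actual implementation would live in Lua; pick_ruleset and the example paths are made up for the example.

```c
#include <stdio.h>
#include <string.h>

enum ruleset { RULESET_MEDIAWIKI, RULESET_RESTBASE, RULESET_UPLOAD };

/* Map a PURGE request to a normalization rule set, per the routing above. */
enum ruleset pick_ruleset(const char *host, const char *path) {
    if (strcmp(host, "upload.wikimedia.org") == 0)
        return RULESET_UPLOAD;
    if (strncmp(path, "/api/rest_v1/", strlen("/api/rest_v1/")) == 0 ||
        strcmp(host, "cxserver.wikimedia.org") == 0)
        return RULESET_RESTBASE;
    return RULESET_MEDIAWIKI;
}

int main() {
    /* 1 (RESTBase) and 0 (MediaWiki) respectively */
    printf("%d\n", pick_ruleset("en.wikipedia.org", "/api/rest_v1/page/html/Foo"));
    printf("%d\n", pick_ruleset("en.wikipedia.org", "/wiki/Steve_Fuller_(sociologist)"));
    return 0;
}
```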
We need to implement, in Lua, either the current reality or the approach that also enforces encoding for MediaWiki and RESTBase.