This is a very rough sketch based on a conversation with @Legoktm.
The goal is to allow caching proxies to canonicalize MediaWiki request URLs.
Query canonicalization means rewriting the query strings of incoming web requests such that bogus parameters are discarded and valid parameters appear in sorted order. Canonicalizing queries at the edge cache layer would improve cache efficiency by ensuring that semantically-equivalent requests map to the same cache entry. It would also protect the application servers from cache-busting attacks that use bogus query parameters to bypass the cache.
Consider three (otherwise identical) GET requests:
- https://en.wikipedia.org/w/index.php?title=Pizza&oldid=1087938671
- https://en.wikipedia.org/w/index.php?oldid=1087938671&title=Pizza (parameter order reversed)
- https://en.wikipedia.org/w/index.php?title=Pizza&oldid=1087938671&xyzzy=123 (adding junk parameter)
With query canonicalization, these three requests would be rewritten into the same normal form, and could be served from a single entry in the cache.
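As a rough illustration of the rewrite described above, here is a minimal sketch in Python. The `canonicalize` function and the `KNOWN_PARAMS` allowlist are hypothetical names; in the proposal the allowlist would be exported by MediaWiki, and a real edge cache would implement the rewrite in its own configuration or plugin language.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist. Under the proposal, this set would be exported
# by MediaWiki rather than hard-coded in the proxy.
KNOWN_PARAMS = frozenset({"title", "oldid", "action"})


def canonicalize(url, known=KNOWN_PARAMS):
    """Drop unknown query parameters and sort the remaining ones by name."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in known]
    # list.sort() is stable, so repeated keys keep their relative order.
    kept.sort(key=lambda pair: pair[0])
    return urlunsplit(parts._replace(query=urlencode(kept)))


REQUESTS = [
    "https://en.wikipedia.org/w/index.php?title=Pizza&oldid=1087938671",
    "https://en.wikipedia.org/w/index.php?oldid=1087938671&title=Pizza",
    "https://en.wikipedia.org/w/index.php?title=Pizza&oldid=1087938671&xyzzy=123",
]

for request in REQUESTS:
    # Each line prints:
    # https://en.wikipedia.org/w/index.php?oldid=1087938671&title=Pizza
    print(canonicalize(request))
```

Note that decoding and re-encoding the query string also normalizes percent-encoding as a side effect, which may or may not be desirable at the edge.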
To perform query canonicalization, the caching proxy needs to be confident that changing the order of parameters does not alter the semantics of the request, and it needs to know which query parameters are meaningful to MediaWiki. (Currently, the only way to know that is to audit all the code manually, because any MediaWiki code can read and parse the query string.)
I propose we do this by moving from the current, ad-hoc handling of query parameters in PHP code to a model where all query parameters must be declared in advance (during initialization, prior to any request processing) before they can be accessed. This ensures that the full set of meaningful query parameters is collected by some registry object, and can be exported via a MediaWiki API endpoint for consumption by caching proxies.
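The declare-before-use model can be sketched as a small registry that accepts declarations during initialization and is then frozen. This is a language-neutral illustration written in Python, not a proposed MediaWiki API; the class and method names are hypothetical.

```python
class QueryParamRegistry:
    """Hypothetical registry of declared query parameters.

    Declarations are accepted during initialization only. Once freeze()
    has been called (before any request is processed), the set is fixed.
    """

    def __init__(self):
        self._declared = set()
        self._frozen = False

    def declare(self, *names):
        """Declare one or more query parameter names."""
        if self._frozen:
            raise RuntimeError("query parameters must be declared during initialization")
        self._declared.update(names)

    def freeze(self):
        """Close the registry; later declarations are rejected."""
        self._frozen = True

    def is_declared(self, name):
        return name in self._declared

    def export(self):
        """Return the declared names in sorted order, e.g. for an API endpoint."""
        return sorted(self._declared)
```

Freezing the registry before request processing is what lets the exported list be treated as complete: nothing can add a parameter after the proxies have fetched it.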
The migration to declarative query processing can be done incrementally, without breaking backward compatibility.
- Provide a mechanism for declaring query parameters in advance in PHP:
  - For code in core, there will be a static array.
  - MediaWiki extensions will list query parameters in their manifest.
- Provide a new interface in MediaWiki for reading query parameters, which only exposes query parameters that have been pre-declared.
- Add a new configuration variable to core, $wgCanonicalizeQueries, with three possible states:
  - off: Existing behavior (no query canonicalization).
  - warn: Accessing undeclared query parameters (or the raw query string) causes a warning to be logged.
  - A third state in which canonicalization is enforced:
    - Interfaces that return the raw query string (like WebRequest::getRawQueryString) return a canonicalized query string.
    - Interfaces that return an array of parameters (like WebRequest::getQueryValues) return an array that contains only known parameters, in sorted order.
    - If the PHP filter extension is present, MediaWiki will register a FILTER_CALLBACK that canonicalizes query data in $_GET, $_REQUEST and $_SERVER.
- Implement the API endpoint for exporting the set of meaningful query parameters for consumption by edge caches.
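The intended behavior of the three $wgCanonicalizeQueries states can be sketched as follows. This is an illustration in Python rather than MediaWiki code: the function names are hypothetical, and `ON` is an assumed label for the state in which canonicalization is enforced.

```python
import logging
from enum import Enum

log = logging.getLogger("query-canonicalization")


class Mode(Enum):
    OFF = "off"
    WARN = "warn"
    ON = "on"  # assumed name for the enforcing state


def read_param(query, name, declared, mode):
    """Return query[name] or None, applying the per-mode policy."""
    if name in declared or mode is Mode.OFF:
        return query.get(name)
    if mode is Mode.WARN:
        # Undeclared access still works, but leaves a trace for migration.
        log.warning("undeclared query parameter accessed: %s", name)
        return query.get(name)
    # Mode.ON: undeclared parameters are invisible to the application.
    return None


def visible_params(query, declared, mode):
    """Return the parameters the application is allowed to see."""
    if mode is Mode.ON:
        # Only declared parameters, in sorted key order.
        return {key: query[key] for key in sorted(query) if key in declared}
    return dict(query)
```

The warn state is what makes the migration incremental: undeclared accesses keep working while the log shows which declarations are still missing.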
Questions / Concerns
- If the declaration of query parameters is separated from use (i.e., if they occur in different places in the code), how do we make sure the two are in sync? If I remove code that handles a particular query parameter, is there anything to remind me to remove the declaration? Can I even remove the declaration — i.e., can I be confident that no other code is using the query parameter?
- Query parameters are generally only meaningful for certain endpoints and in combination with certain other parameters. For example, the recursivesubmodules query parameter is only meaningful for requests to /w/api.php where action = help. The proposal above does not capture this sort of relationship between query parameters. Among other things, this means that it would still be possible to bypass caching using nonsensical combinations of query parameters (query parameters that are each individually meaningful somewhere, but not meaningful together).
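The second concern can be made concrete with a small sketch. It assumes a single global allowlist (`GLOBAL_PARAMS`, a hypothetical name) that includes recursivesubmodules because the parameter is meaningful for api.php; the `cache_key` helper is likewise only illustrative.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical global allowlist: every parameter that is meaningful
# somewhere, without recording the endpoint it belongs to.
GLOBAL_PARAMS = frozenset({"title", "oldid", "action", "recursivesubmodules"})


def cache_key(path, query, known=GLOBAL_PARAMS):
    """Build a cache key from the path and the canonicalized query string."""
    pairs = sorted(
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if key in known
    )
    return path + "?" + urlencode(pairs)


# recursivesubmodules means nothing to index.php, yet it survives
# canonicalization and still splits the cache.
a = cache_key("/w/index.php", "title=Pizza&recursivesubmodules=1")
b = cache_key("/w/index.php", "title=Pizza&recursivesubmodules=2")
print(a != b)  # prints True
```

Closing this gap would require the exported data to record which parameters apply to which endpoint (and in which combinations), which the proposal as written does not do.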
A solution to some of these issues could be to tie query parameter declaration and usage together, the way the MediaWiki action API does. This is hard because MediaWiki is still partially reliant on file-based request routing. An architecture where all routes are declared to a central controller, for example, would make this easier. That is a much more ambitious undertaking, though. The proposal above might be a useful stepping-stone in that direction.
Related: T302459, T138093