Advance declaration of query parameters
Open, Medium, Public

Description

This is a very rough sketch based on a conversation with @Legoktm.

What

The goal is to allow caching proxies to canonicalize MediaWiki request URLs.

Query canonicalization means rewriting the query strings of incoming web requests such that bogus parameters are discarded and valid parameters appear in sorted order. Canonicalizing queries at the edge cache layer would improve cache efficiency by ensuring that semantically-equivalent requests map to the same cache entry. It would also protect the application servers from cache-busting attacks that use bogus query parameters to bypass the cache.

Example

Consider three (otherwise identical) GET requests:

With query canonicalization, these three requests would be rewritten into the same normal form, and could be served from a single entry in the cache.
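As a hypothetical illustration of that rewriting, here is a minimal Python sketch. The URLs, the `KNOWN_PARAMS` allow-list, and the `canonicalize` helper are all assumptions made for the example and are not taken from this task; a real edge cache would implement this in its own configuration language and would need to treat percent-encoding more carefully.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical allow-list; under the proposal this set would be exported by MediaWiki.
KNOWN_PARAMS = {"title", "action", "oldid"}

def canonicalize(url, known=KNOWN_PARAMS):
    """Drop undeclared parameters and sort the rest by name, then by value."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k in known)
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Three hypothetical requests: reordered parameters and one bogus parameter.
requests = [
    "https://example.org/w/index.php?title=Foo&action=history",
    "https://example.org/w/index.php?action=history&title=Foo",
    "https://example.org/w/index.php?title=Foo&bogus=123&action=history",
]
canonical = {canonicalize(u) for u in requests}
print(canonical)  # a single normal form, so a single cache entry
```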

Proposal

To perform query canonicalization, the caching proxy needs to be confident that changing the order of parameters does not alter the semantics of the request, and it needs to know which query parameters are meaningful to MediaWiki. (Currently, the only way to know that is to audit all the code manually, because any MediaWiki code can read and parse the query string.)

I propose we do this by moving from the current, ad-hoc handling of query parameters in PHP code to a model where all query parameters must be declared in advance (during initialization, prior to any request processing) before they can be accessed. This ensures that the full set of meaningful query parameters is collected by a registry object and can be exported via a MediaWiki API endpoint for consumption by caching proxies.

The migration to declarative query processing can be done incrementally, without breaking backward compatibility.

  • Provide a mechanism for advance declaration of query parameters in PHP:
    • For code in core, there will be a static array.
    • MediaWiki extensions will list query parameters in their manifest.
  • Provide a new interface in MediaWiki for reading query parameters, which only exposes query parameters that have been pre-declared.
  • Add a new configuration variable to core, $wgCanonicalizeQueries, with three possible states:
    • off: Existing behavior (no query canonicalization).
    • warn: Accessing undeclared query parameters (or the raw query string) causes a warning to be logged.
    • on:
      • Interfaces that return the raw query string (like WebRequest::getRawQueryString) return a canonicalized query string.
      • Interfaces that return an array of parameters (like WebRequest::getQueryValues) return an array that contains only known parameters, in sorted order.
      • If the PHP filter extension is present, MediaWiki will register a FILTER_CALLBACK that canonicalizes query data in $_GET, $_REQUEST and $_SERVER.
  • Implement the API endpoint for exporting the set of meaningful query parameters for consumption by edge caches.
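
The steps above can be modelled roughly as follows. This is a language-neutral sketch written in Python rather than PHP, and every name in it (`QueryParamRegistry`, `Mode`, `declare`, `get`, `get_query_values`, `export`) is hypothetical; it only illustrates how the three states of $wgCanonicalizeQueries might behave, not how MediaWiki would implement them.

```python
import logging
from enum import Enum

log = logging.getLogger("query-params")

class Mode(Enum):
    OFF = "off"    # existing behavior
    WARN = "warn"  # log access to undeclared parameters
    ON = "on"      # expose only declared parameters, in sorted order

class QueryParamRegistry:
    def __init__(self, mode=Mode.OFF):
        self.mode = mode
        self._declared = set()

    def declare(self, *names):
        """Called during initialization by core and by each extension."""
        self._declared.update(names)

    def export(self):
        """What an API endpoint for edge caches might return."""
        return sorted(self._declared)

    def get(self, raw, name, default=None):
        """Read one parameter from the raw request parameters."""
        if name not in self._declared:
            if self.mode is Mode.WARN:
                log.warning("undeclared query parameter accessed: %s", name)
            elif self.mode is Mode.ON:
                return default
        return raw.get(name, default)

    def get_query_values(self, raw):
        """Return all parameters; filtered and sorted only when mode is ON."""
        if self.mode is not Mode.ON:
            return dict(raw)
        return {k: raw[k] for k in sorted(raw) if k in self._declared}

registry = QueryParamRegistry(Mode.ON)
registry.declare("title", "action")
raw_params = {"title": "Foo", "bogus": "1", "action": "view"}
values = registry.get_query_values(raw_params)
```

With the mode set to `Mode.ON`, `values` holds only `action` and `title`, in that order, and `registry.get(raw_params, "bogus")` returns the default.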

Questions / Concerns

  • If the declaration of query parameters is separated from use (i.e., if they occur in different places in the code), how do we make sure the two are in sync? If I remove code that handles a particular query parameter, is there anything to remind me to remove the declaration? Can I even remove the declaration — i.e., can I be confident that no other code is using the query parameter?
  • Query parameters are generally only meaningful for certain endpoints and in combination with certain other parameters. For example, the recursivesubmodules query parameter is only meaningful for requests to /w/api.php where action = help. The proposal above does not capture this sort of relationship between query parameters. Among other things, this means that it would still be possible to bypass caching using nonsensical combinations of query parameters (query parameters that are each individually meaningful somewhere, but not meaningful together).

A solution to some of these issues could be to tie query parameter declaration and usage together, the way the MediaWiki action API does. This is hard because MediaWiki is still partially reliant on file-based request routing. An architecture where all routes are declared to a central controller, for example, would make this easier. That is a much more ambitious undertaking, though. The proposal above might be a useful stepping-stone in that direction.
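
To make the idea of endpoint-scoped declarations concrete, here is a small Python sketch. The `RULES` table and both helper functions are hypothetical; the only detail taken from this task is that recursivesubmodules is meaningful only for /w/api.php when action=help.

```python
# Each rule: (path, required parameter values, parameters declared under that condition).
# The table contents are illustrative assumptions, not an actual MediaWiki route map.
RULES = [
    ("/w/api.php", {}, {"action", "format"}),
    ("/w/api.php", {"action": "help"}, {"recursivesubmodules"}),
]

def allowed_params(path, params):
    """Collect the parameters declared for this path and parameter combination."""
    allowed = set()
    for rule_path, condition, names in RULES:
        if rule_path == path and all(params.get(k) == v for k, v in condition.items()):
            allowed |= names
    return allowed

def filter_params(path, params):
    """Keep only parameters that are meaningful for this request, in sorted order."""
    allowed = allowed_params(path, params)
    return {k: params[k] for k in sorted(params) if k in allowed}

help_request = filter_params("/w/api.php", {"recursivesubmodules": "1", "action": "help"})
query_request = filter_params("/w/api.php", {"recursivesubmodules": "1", "action": "query"})
```

In this sketch `help_request` keeps both parameters, while `query_request` drops recursivesubmodules because its condition is not met, which closes the "individually meaningful but not meaningful together" loophole for that one case.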

Related: T302459, T138093

Event Timeline

This reminds me of T140664, which is a proposal from a few years ago going in a similar direction (albeit for a different use case).

In any event, establishing such a router will be a significant and non-trivial effort. Using a warning phase, as you describe, will be critical to gaining confidence and identifying the long tail over time.

My thinking behind T140664 was to be able to export a static route map via the API, to potentially consume in another service. The use case I had in mind was for CDN edges to know whether something will respond with a skinned response or not (e.g. to tell apart /wiki and other implicit index.php?action=view requests from things like /wiki/Special:Export/Foo, Special:BannerLoader, and /w/index.php?action=raw/render), so that a potentially separate instance of the MW skin at the edge can, for example, flush part of the skin template while awaiting other bits to be injected into the output stream. Today this is not possible, as the above code paths aren't skin HTML responses yet appear similar to them in terms of route. I do note that a number of these can (and imho, should) be deprecated and replaced with something else (e.g. api.php modules or rest.php routes).

Regarding parameter order - One thing that comes to mind from a previous experiment long ago (I don't recall specifics and couldn't find any) is OAuth verification.

OAuth is sensitive to the query string being preserved unaltered. There may be other components that similarly HMAC or otherwise verify query parameters. In addition to OAuth, other areas that might be affected are: SecurePoll, CentralAuth, and SpecialRunJobs.

ori renamed this task from Query canonicalization for MediaWiki to Advance declaration of query parameters. Jun 22 2022, 11:16 PM

Re-scoping this to be about advance declaration of query parameters, and moving discussion of parameter ordering to T138093.

@Krinkle AIUI the OAuth 1 spec stipulates that parameters be normalized prior to computing a signature, so that should be OK. Not sure about 2.0.
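
For reference, the OAuth 1.0 parameter normalization step (RFC 5849, section 3.4.1.3.2) percent-encodes each name and value and then sorts the pairs before they enter the signature base string. A simplified Python sketch of just that step follows; the helper names are hypothetical, and this is not the code used by the MediaWiki OAuth extension.

```python
from urllib.parse import quote

def pct(s):
    """Percent-encode everything except RFC 3986 unreserved characters."""
    return quote(s, safe="")

def normalize_params(pairs):
    """Encode names and values, sort by name then value, and join with '&'."""
    encoded = sorted((pct(k), pct(v)) for k, v in pairs)
    return "&".join(f"{k}={v}" for k, v in encoded)

as_sent = normalize_params([("title", "Main Page"), ("action", "raw")])
reordered = normalize_params([("action", "raw"), ("title", "Main Page")])
```

Both calls yield the same string, so reordering alone would not change an OAuth 1.0 signature. Removing a parameter that was part of the signed request still would.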