
Move URL-routing logic into MediaWiki
Open, Medium, Public

Description

Wikimedia's URL-routing logic straddles five layers: Varnish, Apache, the file system, multiwiki, and MediaWiki code. This makes the URL dispatching patterns harder to discern and harder still to reason about, and it couples us tightly to a particular software stack. It's a problem for other users of MediaWiki, too: anyone who has spent a bit of time fielding questions on #mediawiki knows that users often naively expect pretty URLs to work out of the box, and they are confused when we send them away to tinker with mod_rewrite rules.
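For context, the mod_rewrite tinkering referred to is typically a small ruleset like the following (a sketch assuming the conventional /w/ script path and /wiki/ article path; actual installs vary):

```apache
# Map pretty URLs like /wiki/Main_Page onto the real entry point.
RewriteEngine On
RewriteRule ^/?wiki(/.*)?$ %{DOCUMENT_ROOT}/w/index.php [L]
```

Getting even this much right trips up new users regularly, which is part of the motivation for moving the dispatching into MediaWiki itself.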

Let's replace all that cruft by picking a good URL dispatcher library and by gradually making it MediaWiki's exclusive web entry-point.

Some popular choices:

Event Timeline

ori raised the priority of this task from to Needs Triage.
ori updated the task description. (Show Details)
ori added a project: MediaWiki-General.
ori subscribed.
ori set Security to None.
Jdforrester-WMF added a project: TechCom-RFC.
Jdforrester-WMF removed a subscriber: Krenair.

How is any routing library going to help us get pretty URLs to work out of the box? Something like http://localhost/w/index.php/Main_Page already works out of the box, but convincing the webserver to call /w/index.php/ for /wiki/ doesn't seem like something we can do from PHP.

Beyond that, what exactly are the bits of MediaWiki URL routing that are spread over Varnish, Apache, the file system, and multiwiki? It's hard to comment without knowing what we're supposed to be commenting about.

For "filesystem", I guess that's referring to the pile of symlinks under /srv/mediawiki-staging/docroot/? But it looks like everything under there that gets to MediaWiki at all already winds up at the same "multiwiki" endpoint, so there's not really any routing going on.

As for "multiwiki", it would be hard to have MediaWiki decide which deployment branch to use when we have to pick a deployment branch in order to start MediaWiki. It would also be hard to have MediaWiki decide which $wgDBname and set of configuration to use when it would probably need to be configured in order to decide that.

As for T29935, we can't serve a 3xx redirect from /wiki/Foo_%28bar%29 to /wiki/Foo_(bar) for reasons already explained in that task, so any such rewriting would have to be internal-only and therefore probably pointless to do inside MediaWiki unless it's already coupled to a stack with something above MediaWiki to catch these supposed-to-be-internal redirects.

Let's keep this task limited to refactoring and centralising the URL routing for MediaWiki (and potentially switching to an upstream library). Changing the URLs themselves (e.g. to use slugs or action paths, or phasing out entry points other than index.php) can be discussed later in separate tasks.

I think the work in recent months related to canonical URLs and normalisation has already centralised and cleaned up a fair amount of the handling; however, there are still some things that live exclusively in Varnish configuration and may or may not match MediaWiki's equivalent logic. This gets especially tricky when trying to change the handling, as it's very easy to accidentally cause an infinite loop or an inaccessible URL.

I'll let @ori clarify, but I think the intent here is to get rid of any MediaWiki-specific URL normalisation and redirecting (if any) from the Varnish layer and instead let MediaWiki "own" that.

@GWicke brought up that we should keep some normalisation at the Varnish level (applied to all requests, not specific to MediaWiki) to reduce cache duplication from fringe and odd ways of encoding parts of the URL, in a way that benefits all services (including e.g. RESTBase and misc web services). This kind of normalisation should be conservative, probably limited to what a real request from a major browser could look like, and strictly a subset of what MediaWiki itself does, so that we never come to rely on it.
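A conservative, service-agnostic edge normalisation of that kind might look roughly like this in VCL (a sketch assuming Varnish 4+ with the bundled std vmod; not the actual Wikimedia configuration):

```vcl
vcl 4.0;
import std;

sub vcl_recv {
    # Conservative normalisation only: lowercase the host and sort
    # query parameters, so trivially different spellings of the same
    # request share one cache object. Anything more aggressive should
    # live in MediaWiki, per the discussion above.
    set req.http.Host = std.tolower(req.http.Host);
    set req.url = std.querysort(req.url);
}
```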

For hardcoded exceptions (such as tilde and parenthesis) we may want to update MediaWiki to purge all encoding variations of a URL from Varnish (for URLs of page names that contain such characters), or document that Squid/Varnish users must normalise those at the edge. For stock installs we'd only need MediaWiki to normalise them for the file cache.
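The purge idea could be sketched roughly like this (Python for illustration only; the character list and helper are hypothetical, not an existing MediaWiki API):

```python
from itertools import product

# Characters that edge configs have historically special-cased
# (hypothetical list for illustration).
SPECIAL = {"~": "%7E", "(": "%28", ")": "%29"}

def purge_variants(path: str) -> set[str]:
    """Return every encoding variant of `path` where each special
    character appears either literally or percent-encoded, so a
    purge can hit all cached spellings of the same page."""
    parts = []
    for ch in path:
        if ch in SPECIAL:
            parts.append((ch, SPECIAL[ch]))  # two possible spellings
        else:
            parts.append((ch,))              # one spelling
    return {"".join(combo) for combo in product(*parts)}

# A title with a tilde and a pair of parentheses yields 2**3 = 8 variants.
variants = purge_variants("/wiki/Foo_(bar)~baz")
```

The combinatorial blow-up is exactly why normalising at the edge (so only one spelling is ever cached) may be preferable to purging every variant.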

(See also T106793)

I removed the #MediaWiki-RfCs tag for now. It was added by @Jdforrester-WMF (with my support) because for a moment it seemed like this might be critical for something VisualEditor wanted to do and would thus happen sooner rather than later. That's not the case any more. I still think this should happen, but the URL routing libraries that I have found are almost all PHP 5.4+, so I am shelving that until T118932: RfC: Raise MediaWiki's PHP version requirement to 5.5 and update coding standards is settled.

Suggestion from @BBlack:

There should be some config/metadata that says which URLs are session-cookie-sensitive and which aren't.

> There should be some config/metadata that says which URLs are session-cookie-sensitive and which aren't.

I think that would require major architecture changes (basically separating the current tangle of code into business logic, UI code and controllers, and providing a way for controllers to describe themselves to some routing layer without actually being invoked). You might try to hack up something with regular expressions, but it won't get you far: for example, https://ja.wikipedia.org/w/index.php?title=%E6%97%A5%E6%A0%B9%E7%A5%9E%E7%A4%BE&action=raw does not need sessions (assuming that you are on a public wiki, have not enabled any fancy page visibility extension, and the page you are looking at does not have a fancy content handler), while https://ja.wikipedia.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E3%83%87%E3%83%BC%E3%82%BF%E6%9B%B8%E3%81%8D%E5%87%BA%E3%81%97&action=raw does.

From the endpoints, I think only index.php, api.php and img_auth.php need sessions (and the installer, but we probably don't care about that), but that's easier to hardcode than to come up with a metadata format for it.
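Hardcoding that per-endpoint list at the edge could be sketched like this in VCL (hypothetical; a stock install would adjust the /w/ path, and the endpoint list is the one suggested above, not a vetted one):

```vcl
sub vcl_recv {
    # Only index.php, api.php and img_auth.php are assumed to need
    # session cookies; strip cookies on everything else so those
    # responses stay cacheable.
    if (req.url !~ "^/w/(index|api|img_auth)\.php") {
        unset req.http.Cookie;
    }
}
```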