Page MenuHomePhabricator

http://en.wikipedia.org//wiki/Main_Page raise "Redirect loop detected!" because of the double slash
Closed, ResolvedPublic

Description

Visit mediawiki.org or wikipedia.org and get this error:

Wikimedia app servers have a few errors 500 when hitting an URL having double slash at the beginning of the path. Ie:

Ex with http://en.wikipedia.org//wiki/Main_Page

Redirect loop detected!
This means the wiki got confused about what page was requested; this sometimes happens when moving a wiki to a new server or changing the server configuration.
The wiki is trying to interpret the page title from the URL path portion (PATH_INFO), which sometimes fails depending on the web server. Try setting "$wgUsePathInfo = false;" in your LocalSettings.php, or check that $wgArticlePath is correct. (500) from /srv/mediawiki/php-1.26wmf7/includes/MediaWiki.php:277

step to reproduce

$ curl http://en.wikipedia.org//wiki/Main_Page

Yields:

<!DOCTYPE html>
<html><head><title>Internal Server Error</title></head>
<body><h1>Internal Server Error</h1><p>Redirect loop detected!

This means the wiki got confused about what page was requested; this sometimes happens when moving a wiki to a new server or changing the server configuration.

The wiki is trying to interpret the page title from the URL path portion (PATH_INFO), which sometimes fails depending on the web server. Try setting &quot;$wgUsePathInfo = false;&quot; in your LocalSettings.php, or check that $wgArticlePath is correct.</p></body></html>

Details

Related Gerrit Patches:

Event Timeline

hashar created this task.May 29 2015, 12:10 PM
hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 29 2015, 12:10 PM

That is in mediawiki/core wfExpandUrl() I guess.

TTO added subscribers: TTO, an_stol.

Merged this the "wrong" way, because the description here contains more info.

Possibly to do with Wikimedia-Apache-configuration?

hashar triaged this task as Medium priority.Jun 3 2015, 12:49 PM

I can reproduce this locally (with some debug info added):

Something is normalising the "original" request URL when it shouldn't. This happens due to wfRemoveDotSegments() normalising // to /. As such, MediaWiki is unable to find any difference between the current and canonical url and refrains from entering a potential redirect loop.

hashar removed a subscriber: hashar.Mar 10 2018, 9:10 PM
Krinkle removed a subscriber: Krinkle.Mar 13 2018, 12:44 AM

For the record, it doesn't matter whether these urls will result in a page view or not. What does matter is that these urls do not produce an HTTP 5xx error, given that this is (in its current state) purely a client error for using a url pattern that we don't support (Bad Request, or Not Found; would be more appropriate).

However, in the current state we produce an HTTP 5xx error because there is in fact a server-side bug in play: MediaWiki is working with state that is corrupted mid-process. At point A, MediaWiki is aware of the real url (with two slashes) and recognises it as different from the canonical url. At point B, MediaWiki sees the current url as having one slash (normalised). Somewhere between point A and B, either it looks at different source of state, or the state is changed, e.g. by wfRemoveDotSegments().

@Krinkle added a project: MediaWiki-Platform-Team.

I'm concerned that that project add is misusing the team as a "miscellaneous MediaWiki" team, but I looked at the task anyway.

However, in the current state we produce an HTTP 5xx error because there is in fact a server-side bug in play: MediaWiki is working with state that is corrupted mid-process. At point A, MediaWiki is aware of the real url (with two slashes) and recognises it as different from the canonical url. At point B, MediaWiki sees the current url as having one slash (normalised). Somewhere between point A and B, either it looks at different source of state, or the state is changed, e.g. by wfRemoveDotSegments().

"Point B" is MediaWiki.php lines 348-349.

"Point A" doesn't really exist, although I suppose you could say step #1 below is "point A" even though it's only recognizing that double-slashes doesn't match $wgArticlePath rather than that it doesn't match the canonical URL.

What happens is:

  1. WebRequest::getPathInfo() uses $_SERVER['REQUEST_URI'] to pass to PathRouter, which tries to recognize it based on $wgArticlePath. In the normal case it finds a match for WebRequest::interpolateTitle() to interpolate, while in the double-slash case it doesn't.
  2. MediaWiki::parseTitle(), if it finds no title in WebRequest, falls back to using the main page.
  3. MediaWiki::tryNormaliseRedirect(), in the normal case, finds that the passed $title deriving from step 2 matches the interpolated title from step 1, and so never reaches point B. In the double-slash case, on the other hand, it doesn't match and so continues on to reach point B.
  4. At point B, it turns out that WebRequest::getGlobalRequestURL() removes multiple leading slashes to avoid confusion with protocol-relative URLs.

We could probably fix it by having WebRequest::getPathInfo() use WebRequest::getGlobalRequestURL() instead of using REQUEST_URI directly. That would result in it treating the double-slash URL as being the same as the normal URL, no redirect attempted. I don't know if that would have any other negative side effects, such as screwing up somehow when REQUEST_URI isn't set and it falls back to HTTP_X_ORIGINAL_URL or SCRIPT_NAME+QUERY_STRING.

Or we could make a version of WebRequest::getGlobalRequestURL() that does less changing (and probably partially duplicates wfExpandUrl()) to use at point B. That would result in any double-slash URL (e.g. http://en.wikipedia.org//wiki/Not_the_Main_Page) being redirected to the wiki's main page.

@Krinkle added a project: MediaWiki-Platform-Team.

I'm concerned that that project add is misusing the team as a "miscellaneous MediaWiki" team, but I looked at the task anyway.

I don't see Platform as a catch-all for misc things. I added the tag because this bug doesn't seem to be about a misc feature to me. Rather, things like interpolateTitle are imho very central to core infrastructure. To me, as "platform" as it gets.

Regarding the tag, I do think Platform Team is the best team to decide what is in-scope for Platform Team to handle :) – and to reject or delegate elsewhere if out-of-scope. In this particular case, the following two conditions were met in my mind: 1) Existing docs don't identify an obvious owner; 2) It is not a misc feature, but rather a central piece of infrastructure, which might be in Platform's scope.

[..]
We could probably fix it by having WebRequest::getPathInfo() use WebRequest::getGlobalRequestURL() instead of using REQUEST_URI directly. That would result in it treating the double-slash URL as being the same as the normal URL, no redirect attempted. I don't know if that would have any other negative side effects, such as screwing up somehow when REQUEST_URI isn't set and it falls back to HTTP_X_ORIGINAL_URL or SCRIPT_NAME+QUERY_STRING.
Or we could make a version of WebRequest::getGlobalRequestURL() that does less changing (and probably partially duplicates wfExpandUrl()) to use at point B. That would result in any double-slash URL (e.g. http://en.wikipedia.org//wiki/Not_the_Main_Page) being redirected to the wiki's main page.

Current state: The interpolateTitle uses raw url, the redirect comparison uses normalised url. Result:

  • Double slashes are visible to the router (making it not match), but invisible to the canonicalising redirect.
  • No redirect happens given it (wrongly) thinks that redirecting would go the same url as the current url (infinite loop).
  • Intended page is not viewed, given the Uncaught Exception (HTTP 5xx).
  1. Change interpolateTitle (WebRequest::getPathInfo) to use normalised url. Result:
    • Double slashes are invisible to all code. (normalise always)
    • No redirect.
    • Intended page is viewed.
  2. Change redirect comparison from normalised to raw url, matching interpolateTitle. Result:
    • Double slashes are visible to all code. (raw always)
    • Redirect happens.
    • Intended page is not viewed, because interpolateTitle will not match any route (fallback: Main_Page).

I'd like to propose a third option:

  1. Change interpolate title to use normalised url, change redirect comparison to use raw url. Result:
    • Double slashes are invisible (normalised) to router, but visible (raw) to the canonicalising redirect.
    • Redirect happens.
    • Intended page is viewed, after redirect.
tstarling added a subscriber: tstarling.EditedJul 6 2018, 4:48 AM

I'd like to propose a third option:

  1. Change interpolate title to use normalised url, change redirect comparison to use raw url. Result:
    • Double slashes are invisible (normalised) to router, but visible (raw) to the canonicalising redirect.
    • Redirect happens.
    • Intended page is viewed, after redirect.

If we normalize the entire path in PathRouter then we won't be able to have double slashes in article titles. There are 967 such titles in the English Wikipedia, for example https://en.wikipedia.org/wiki/.hack//Sign . We really need PathRouter to normalize only the parts that match prefixes (and other literals), not parameters.

We really need PathRouter to normalize only the parts that match prefixes (and other literals), not parameters.

I couldn't think of a way to do exactly this, it's very ambiguous. We already have multiple patterns with an order of precedence, ideally normalisation would only be tried when every pattern fails to match against the non-normalised path. Matching a string literal against the normalised version of the path at a certain offset would require looking ahead to the end of the string. So I went with a simplified algorithm, which is to just run the whole path parser against the non-normalised path, and if that doesn't match, to normalise it and run the parser again. Now working on the second part, "change redirect comparison to use raw URL".

Change 444803 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Avoid a redirect loop when the request URL is not normalized

https://gerrit.wikimedia.org/r/444803

We really need PathRouter to normalize only the parts that match prefixes (and other literals), not parameters.

I couldn't think of a way to do exactly this, it's very ambiguous. We already have multiple patterns with an order of precedence, ideally normalisation would only be tried when every pattern fails to match against the non-normalised path. Matching a string literal against the normalised version of the path at a certain offset would require looking ahead to the end of the string. So I went with a simplified algorithm, which is to just run the whole path parser against the non-normalised path, and if that doesn't match, to normalise it and run the parser again.

As for "change redirect comparison to use raw url": I tried this, but in fact the redirect does not happen by simply making this change. tryNormaliseRedirect() returns early since $title matches the title extracted from the URL. I uploaded the patch anyway since a 200 response is better than a 5xx. I need to think some more about whether redirecting is feasible and ideal.

If we normalize the entire path in PathRouter

Note that the redirect loop only occurs when the path begins with multiple slashes, because PathRouter doesn't normalize that to one / while WebRequest::getGlobalRequestURL() does.

If there can be double slashes elsewhere in the routed portion of the path, e.g. if $wgArticlePath is /foo/bar/$1, then for something like http://example.com/foo//bar/Title WebRequest::getGlobalRequestURL() will preserve the mid-path double slashes and therefore MediaWiki won't see a loop when trying to redirect to http://example.com/foo/bar/Main_Page with a single slash in the middle. We might prefer that redirects to http://example.com/foo/bar/Title instead of to the wiki's main page, but at least it's not a weird 5xx error.

Other things such as http://example.com/./foo/bar/Title have the same redirect-to-main-page behavior.

We really need PathRouter to normalize only the parts that match prefixes (and other literals), not parameters.

Would it work to go the opposite direction and have PathRouter denormalize the regex instead of normalizing the input path? i.e. near the start of PathRouter::extractTitle() do something like

// Ignore double-slashes and dot path components
$regexp = preg_replace( '#/#u', '/(?:\.?/)*?', $regexp );

Handling .. that way would be difficult, though.

Change 444803 merged by jenkins-bot:
[mediawiki/core@master] Avoid a redirect loop when the request URL is not normalized

https://gerrit.wikimedia.org/r/444803

Anomie closed this task as Resolved.Jul 24 2018, 4:28 PM

The fix should be deployed to Wikimedia wikis with 1.32.0-wmf.15 or later.