Random page redirects are incorrectly performed with HTTP 302
Closed, Declined · Public

Description

Author: gutza

Description:
Patch for includes/OutputPage.php

When a user requests a random page, the MediaWiki software responds with an HTTP 302 "Found" status code. According to RFC 2616, HTTP status code 302 is meant to be used when a resource has been temporarily moved. As such, "Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests." Of course, this is not appropriate for random pages -- and the result of this implementation is that spiders improperly index random pages.

I have changed the code so as to provide the proper response, i.e. HTTP status code 303 "See Other", which seems a lot more appropriate: "This method exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource. The new URI is not a substitute reference for the originally requested resource. The 303 response MUST NOT be cached, but the response to the second (redirected) request might be cacheable."

I'm attaching patches to this bug. The diffs are made against the latest versions of the files in the SVN repository (OutputPage.php Revision 57608, SpecialRandomPage.php Revision 55188).
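For illustration only, the gist of the change can be sketched in plain PHP -- this is a sketch of the idea, not the attached diff, and the function name is made up:

    <?php
    // Sketch only (hypothetical helper, not the attached patch). Sends the
    // random-page redirect as 303 "See Other" instead of the default 302,
    // telling clients the target is not a substitute reference for
    // Special:Randompage.
    function redirectToRandomArticle( $targetUrl ) {
        header( 'HTTP/1.1 303 See Other' );
        // Per RFC 2616 section 10.3.4, a 303 response MUST NOT be cached.
        header( 'Cache-Control: no-cache' );
        header( 'Location: ' . $targetUrl );
    }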


Version: unspecified
Severity: minor

attachment op[1].diff ignored as obsolete

Details

Reference
bz21179

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 10:47 PM
bzimport set Reference to bz21179.
bzimport added a subscriber: Unknown Object (MLST).

gutza wrote:

Diff for includes/OutputPage.php

Attached:

gutza wrote:

Diff for includes/specials/SpecialRandompage.php

Attached:

gutza wrote:

This was a manual diff, as I don't have a working copy checked out, so I decided to provide a bit more context for the patch tool to work with in case someone changes the repository before this patch makes it in. Given that the patch tool works fine with both formats, and that I wouldn't include SVN-specific metadata anyway, would it really help if I provided unified diffs in addition to the current ones?

Are you aware of anything that specifically treats a 303 differently from a 302?

I'm a little worried that a couple of quick web searches don't turn up clear evidence of actual useful support (say, a page from Google's webmaster guidelines), while the spec itself says that a 302 is safer and does the same thing:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

Note: Many pre-HTTP/1.1 user agents do not understand the 303 status. When interoperability with such clients is a concern, the 302 status code may be used instead, since most user agents react to a 302 response as described here for 303.

gutza wrote:

The RFC you're quoting is almost ten years old; I'm sure all compatibility wrinkles have been sorted out by now. As for Google or any other piece of software following standards, I don't think we should be concerned with that -- we should follow them regardless of what they're doing.

Having said that, I think the Googlebot actually does distinguish between the two, at least judging by the fact that they document them separately here: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40132 ("The server returns this code when the requestor should make a separate GET request to a different location to retrieve the response.")

That page just mirrors the RFC and doesn't describe any distinction in Googlebot's crawl behavior or indexing behavior. In the absence of documented evidence that search indexes treat these differently, dropping as WONTFIX.

gutza wrote:

That's a surprising decision. You know our implementation is incorrect, you know how to fix it, you have the actual code to fix it, and yet you refuse to fix it because you have no evidence that Google distinguishes between the incorrect implementation and the correct one. This should be fixed even if we had evidence that Google DIDN'T distinguish between them -- I simply don't understand why you'd make a deliberate decision not to follow standards based on the assumption that following them might produce the same results as not following them.

On the contrary, the spec specifically says our behavior is both correct and more compatible. Please provide actual arguments in favor of a change if you wish to continue.

gutza wrote:

Fair enough, let's analyze the specification. But first, let's define the concepts we're using. When you access the URI for Special:Randompage you don't identify a specific resource, but rather make a request for a service (random redirection). As such, the URI for Special:Randompage MUST NOT be associated with whatever resource ends up being served as a result; in that specification's context we can exchange "MUST NOT be associated" with "MUST NOT be cached", since they're basically the same thing.

The specification for 302 reads "The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests." When I access Special:Randompage, did the resource at Special:Randompage move temporarily under a different URI? As such, should the client keep using the Special:Randompage URI in order to reach the (random) resource it ended up with? (The former is obviously incorrect, and the latter is Google's current behavior, which forces us to use robots.txt in order to disallow access to random pages.)

The spec for 303 on the other hand reads "The response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource. This method exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource. The new URI is not a substitute reference for the originally requested resource." This is altogether more appropriate for our purpose: the specification itself says the request is processed by a script and provides a redirection (precisely what we're doing); in addition the new URI is explicitly said not to be a substitute reference for the original request -- and that's exactly what we're after.

If that wasn't enough, the spec for 302 reads "This response is only cacheable if indicated by a Cache-Control or Expires header field", whereas the spec for 303 specifically states "The 303 response MUST NOT be cached".

Regarding your concern with the compatibility note, that reads "When interoperability with [pre-HTTP/1.1 user agents] is a concern, the 302 status code may be used instead". Which pre-HTTP/1.1 agents are you concerned about? Also, even if we could conceivably find some archaic HTTP 1.0 clients still in sporadic use today, we'd only hinder their access to the random page functionality, which is in no way a core feature of MediaWiki -- so basically we'd be concerned with a negligible minority being negligibly affected.

The spec's not at issue.

Really what we need here is some evidence that 1) they actually behave differently, so there's some benefit to changing, and 2) client support is consistent enough (including in hacked-together client-side bot tools) that there's no downside.

gutza wrote:

In your previous reply you asserted the specification "specifically says our behavior is both correct and more compatible" -- now the spec's not an issue. I'm not sure what I should address, if not the issues you're raising.

Regarding benefits, your logic is fallacious: the fact that Google or anyone else does or does not abide by standards doesn't mean that OUR abiding by standards isn't beneficial -- the two statements are not dependent on one another. Let's say Google completely disregards status codes (which it doesn't; we know there's a difference between the way it treats 301 and 302, we're just not sure about 302 versus 303). If everybody uses 301, 302 and 303 properly then Google will take note, even if it hasn't so far. And since MediaWiki is widely deployed, this is one of the places where we actually have some leverage to push for standards. And then again -- this is only IF Google disregards the difference, which we don't know for sure.

As for your other argument, I'm sorry, but that's nothing short of preposterous. We'll never find a study regarding the behavior of "hacked-together client-side bot tools" with respect to HTTP status code 303 -- but if you do have a study regarding those tools' behavior with respect to 302, I'd love to read it. Of course you don't; and if we were to continue this discussion you'd say they have worked so far -- but by that rationale we'd never change anything, which I'm sure you'd disagree with in other respects. Lack of proof for something unprovable is not an argument.

I'm not reopening this -- we've been running in circles for a couple of exchanges and the fact that you're asking me to prove something which is utterly impossible to prove tells me this conversation is useless.

(In reply to comment #10)

> Regarding your concern with the compatibility note, that reads "When
> interoperability with [pre-HTTP/1.1 user agents] is a concern, the 302 status
> code may be used instead". Which pre-HTTP/1.1 agents are you concerned about?
> Also, even if we could conceivably find some archaic HTTP 1.0 clients still in
> sporadic use today, we'd only hinder their access to the random page
> functionality, which is in no way a core feature of MediaWiki -- so basically
> we'd be concerned with a negligible minority being negligibly affected.

IIRC, Wikimedia's Squid servers still use HTTP/1.0. It wouldn't be very standards-compliant for them to respond with "HTTP/1.0 303 See Other", as that status code doesn't exist in 1.0 :)
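(A sketch of one possible compromise, assuming plain PHP rather than the Squid layer itself -- the helper below is hypothetical: downgrade to 302 when the request came in over HTTP/1.0, and send 303 otherwise.)

    <?php
    // Hypothetical helper, for illustration only: fall back to 302 for
    // pre-HTTP/1.1 clients, use 303 "See Other" otherwise.
    function sendSeeOtherRedirect( $targetUrl ) {
        $proto = isset( $_SERVER['SERVER_PROTOCOL'] )
            ? $_SERVER['SERVER_PROTOCOL'] : 'HTTP/1.0';
        if ( $proto === 'HTTP/1.0' ) {
            header( 'HTTP/1.0 302 Moved Temporarily' );
        } else {
            header( 'HTTP/1.1 303 See Other' );
        }
        header( 'Location: ' . $targetUrl );
    }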

gutza wrote:

That's a legitimate concern -- Wikimedia's servers currently respond with "HTTP/1.x 302 Moved Temporarily", which I assume means the response is intended to be backwards-compatible. As such, switching the whole thing to HTTP/1.1 might involve more complications than I had anticipated.

Bogdan, what I mean by "the spec's not at issue" is that we don't disagree about what the spec says -- we're disagreeing on the importance and relative benefits/risks of using an HTTP 1.1-only feature that is more semantically correct.

Given that we interoperate with both HTTP 1.0 (really, HTTP 1.0 plus the Host header...) and "real" 1.1 clients, using an HTTP 1.1-only response doesn't make much sense to me without a clear benefit. The spec explicitly says that 302 is a correct and backwards-compatible usage, so in the absence of a practical benefit I believe sticking with 302 is the best behavior.

gutza wrote:

I wasn't aware of Wikimedia's servers' current behavior -- now I have seen the light and I completely understand this is not worth the hassle in that particular context (it may not be worth it even if Google treats it properly, since the problem is easily avoided altogether with robots.txt).

Having said that, MediaWiki is a stand-alone product -- how difficult would it be to implement this on an optional basis, so as to allow other installations to benefit from the improvements in HTTP 1.1 in this regard?
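(To sketch what "optional" could look like -- the setting name below is hypothetical, not an existing MediaWiki configuration variable:)

    <?php
    // Hypothetical setting in LocalSettings.php, shown only to illustrate
    // the idea; the default could remain 302 for HTTP/1.0 proxies.
    $wgRandomPageRedirectCode = 303;

    // The special page would then pass the configured code through,
    // assuming OutputPage::redirect() accepts a response code as its
    // second parameter:
    // $wgOut->redirect( $title->getFullURL(), $wgRandomPageRedirectCode );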

gutza wrote:

For the record, I tested this on my own MediaWiki installation -- Brion's intuition was right, Google doesn't discriminate between 302 and 303. That's quite surprising to me, but it's the way it is.
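(For anyone reproducing the server-side half of the check, a quick way to see which status code an installation actually sends -- the URL below is a placeholder, and this obviously says nothing about how Google indexes the result:)

    <?php
    // Inspect the status code Special:Random returns (placeholder URL).
    $ch = curl_init( 'http://example.org/wiki/Special:Random' );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_NOBODY, true ); // HEAD-style request
    curl_exec( $ch );
    echo curl_getinfo( $ch, CURLINFO_HTTP_CODE ), "\n"; // 302 stock, 303 patched
    curl_close( $ch );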

gutza wrote:

Update: I brought this up explicitly on the Google Webmaster Central forum: http://www.google.com/support/forum/p/Webmasters/thread?tid=7f545a23e5276203&hl=en

While I haven't (yet) received a definitive answer on how Google treats status code 303, it appears the most useful approach is to use rel=canonical in order to specify the proper URL. Interestingly, MediaWiki already does that for page redirects -- but only for page redirects (search for "canonical" in includes/Article.php). Why isn't that included unconditionally?
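(For reference, the hint in question is a single element in the page head. A minimal sketch of emitting it -- the variable name is hypothetical, not MediaWiki's:)

    <?php
    // Minimal sketch: emit a rel=canonical hint in the page <head> so
    // crawlers index the target article rather than Special:Random.
    echo '<link rel="canonical" href="' .
        htmlspecialchars( $canonicalUrl ) . '"/>' . "\n";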