Page MenuHomePhabricator

/w/rest.php/v1/page/:title/ should not follow indirect redirects
Closed, ResolvedPublic3 Estimated Story Points

Description

MediaWiki generally treats redirect chains longer than 1 as invalid. That is, only direct redirects are valid, indirect ones are invalid and not followed.

History sidebar: Prior to MediaWiki 1.39 in 2022, this feature was configurable via $wgMaxRedirects. However, even then the default was 1, and it was 1 on WMF wikis, and it turned out to not work correctly with values higher than 1 and so was removed in T296430.

Steps to reproduce:

  • A wiki page "C" exists with some content.
  • A wiki page "B" redirects to "C".
  • A wiki page "A" redirects to "B" (it does not matter in which order you create these. Each page is independent and stateless).
  • Request A from the REST API, e.g. example.org/w/rest.php/v1/page/A/with_html

Other information

When you visit C normally in the browser (eg. example.org/wiki/C), you get the content of C.

When you visit B (eg. example.org/wiki/B), you get the content of C, with a note that we followed a redirect from B.

When you visit A (eg. example.org/wiki/A), you receive a placeholder page indicating that "A" is a redirect to B, but the redirect is not followed (because it is invalid, because chains longer than 1 are not valid).

The REST API represents wiki-style redirects as native HTTP redirects. This "native" HTTP redirecting is different from how MediaWiki pageviews and the MediaWiki Action API work, which is to respond with the destination and a JSON or other header indicating that we normalized and/or redirected the title. This native HTTP format, combined with a lack some sort of lookahead validation, means that A>B>C actually works. What that could be an interesting feature in the future, this is currently inconsisistent with the rest of MediaWiki and thus unexpected.

Actual:

A redirect response with HTTP 30x and a Location header to B.

Expected:

A JSON response that represents the page "A" indicating that is a redirect (but not followed).

Related Objects

StatusSubtypeAssignedTask
ResolvedNone
ResolvedBPirkle

Event Timeline

Security release is due this week... I'm guessing this may not be fixed to be included in it?

HCoplin-WMF set the point value for this task to 3.

Security release is due this week... I'm guessing this may not be fixed to be included in it?

I'm about to start looking into this, but can't guarantee when it will be complete/reviewed/merged.

Summarizing to make sure I understand the details. Assuming the A => B => C redirect chain from the task description:

  1. requesting A should result in response indicating that A is a redirect, but not an HTTP response that would cause a client to actually redirect
  2. requesting B should result in an HTTP redirect (unchanged from current behavior)
  3. requesting C should result in a response based on C (unchanged from current behavior)

For #1, the response will be a 200, with a response body as if "redirect=false" had been sent by the client.

Determining if #1 is needed can be implemented by lookahead. If the page being requested is a redirect, then look at the redirect target. If that is also a redirect, then do #1. The RedirectStore already injected into PageRedirectHelper can be used for this.

After this change, clients that receive a response of type #1 could choose to inspect the details of that response and follow the chain, but would not automatically do so from an HTTP perspective. In other words, clients would have to actually implement logic to intentionally do this, rather than accidentally doing so because they are coded to follow all 3xx responses.

How should normalizations be handled? Absent any specific direction otherwise, I'll retain the existing code to first check for normalization redirects and, if present, return a 301. Redirect chains will only be checked after that. In other words, it is possible we'll do a normalization redirect then another redirect.

As an aside, I think we should have a look at how we compute the ETag for redirect responses. The ETag should represent the response body, which in our case is empty. So perhaps we could just detect if we are going to return a redirect early, and if we do, omit the etag, or return one that is cheap to compute, e.g. the title of the redirect target.

The web unterface handles a redirect chain like A -> B -> C by returning the rendered contents of B when the user requests A. The browser is "redirected" to a URL of B that has redirect=no. I'd expect the REST API to do the same: just include redirect=no in the target URL of wiki redirects. No "look-head" needed.

The web [interface] handles a redirect chain like A -> B -> C by returning the rendered contents of B when the user requests A. The browser is "redirected" to a URL of B that has redirect=no. I'd expect the REST API to do the same: just include redirect=no in the target URL of wiki redirects. No "look-head" needed.

This sounds good to me, and would indeed more closely match the rest of MediaWiki.

I do see a potential downside and want to a offer compromise between what Bill proposed (a new JSON response format for redirects) and Daniel (redirect to target1+redirect=no).

The downside of redirecting to target1 with redirect=no is that this can look like a bug, with an undesirable workaround for that "bug". In the web interface we don't actually HTTP-redirect "A" or "B", but respond directly with target1 contents, an annotation about where the problem is, and a silent non-submitting address bar text replacement. This means that the response is immediate, and you can get to the "source" of the problem by clicking on "Redirected from A".

I can see API consumers potententially getting confused by why it redirects at all when we know it is invalid. An API consumer going through their logs would notice and may wonder abou this, and "fix" it by stripping redirect=no, since there is no obvious reason for why this is appended and it works when you remove it.

A possible compromise here that might work better is to have A in the REST API respond directly and the same as A?redirect=no. The API caller asked for HTML and they get HTML. The HTML conveys the not-followed redirect, just as they'd get for any invalid redirect page today, and the same as they would get (after 1 redirect) in Daniel's proposal. But, we'd give it to them directly on "A". This reduces confusion, I think, and also removes the misunderstood "workaround" potential, since there wouldn't be a direct reference to the REST URL for "B". Instead of wondering why "A" works but then the REST API does something "wrong" for "B", they'll instead have either no surprise (the response is transparently just HTML as they'd expect, no mentions of redirect=no in their logs, can be served directly to a user) or at most, their end-user might wonder why "A" displayed a placeholder page. And... that's exactly the queston they should be asking! After all, when A>B>C exists, the place that needs fixing is "A", not "B" (B is fine).

This too would be consistent with the MediaWiki web interface, in that it viewing "A" presents the user with placeholder HTML, with attribution to the source of the problem. On the web that attribution is "Redirect from A". On the API that would be the current title more generally. I worry that first following one level, hides this information from the API end-users, which makes it harder to understand why it is broken. I'd expect a bug report along the lines of "viewing B is broken in the API but works on the web", since there'd be no mention of "A", and the end-user would not know per-se that something originally directed them at "A". This is an inherent difference between web and API.

I think we should consider this for pageviews as well. But I think it would be fine to apply in the API first, this wouldn't be inconsistent since it plays to the strength of the format. The fact that the source is hidden in API requests is an inherent difference and we should not ignore that. Unlike redirect loops or new JSON formats, either of these is backwards-compatible and addresses the bug in the task description.

Up to you!

I see your point about potential confusion, especially the fact that in the API, there's no equivalent of "redirected from". But I still prefer HTTP redirects. Some thoughts:

If we stop responding for A with an HTTP redirect if there is a redirect chain A -> B -> C , we should stop responding with an HTTP redirect if there's just A -> B (and set the Content-Location header). Following redirects only sometimes seems even more confusing. I don't like the consequences of that however:

  • Serving the target page's content from redirect pages may break existing clients' expectation (including clients written for RESTbase).
  • We'd be serving entirely different content for the same URL depending on whether it has redirect=no specified.
  • We'd duplicate the page content in the CDN cache.
  • It would be inconsistent with the web UI's behavior - when you visit A, you end up viewing B. The fact that this is done with JS magic rather than an HTTP redirect is invisible and irrelevant to the user.

That leaves us with the question how to avoid confusion when API clients sometimes see redirect pages, instead of being redirected to the target page. I honestly think the potential for confusion there is less than what we'd get with the alternative, but we should still think about it.

The only mechanism I can see that would allow us to implement something like "redirected from" would be the Referrer header - but clients apparently don't set that when following redirects - not even browsers, it seems.

So it boils down to a product decision.

For comparison, here are results of some testing, using a local A => B => C chain, in both Desktop and Action API:

NOTE: we shouldn't ignore precedent set by Desktop and Action API, but we should also recognize that caller expectations of a REST API may be different.
WARNING: Be careful about trying to reproduce the following results on a public wiki, as your test pages may change unexpectedly. I initially tried testing using Daniel's chain on mediawiki.org. But it had been by Revibot to be A => C, B => C instead of A => B => C and therefore gave unexpected results. That was very confusing before I realized what had happened.

Browser

In all cases below where the page content included something like "↳C", the "C" part was a link.

Url I hit: http://localhost:8080/wiki/A

(Redirected from A)
Redirect page
↳C 
This is the intermediate step in an indirect redirect chain.

Url I hit: http://localhost:8080/wiki/B

(Redirected from B)
This is the final step in an indirect redirect chain.

Url I hit: http://localhost:8080/wiki/C

This is the final step in an indirect redirect chain.

Url I hit: http://localhost:8080/wiki/A?redirect=no

Redirect page
↳B 
This is the initial step in an indirect redirect chain.

Url I hit: http://localhost:8080/wiki/B?redirect=no

Redirect page
↳C 
This is the intermediate step in an indirect redirect chain.

I doubt it matters, but all browser tests were performed on MacOS 14.5 using Brave Version 1.76.74 Chromium: 134.0.6998.89 (Official Build) (arm64)

Action API

Not sure if this is the most useful/representative way to invoke Action API in this context, but here's what I did. All calls made through API Sandbox.

Url I hit: http://localhost:8080/w/api.php?action=query&format=json&titles=A&redirects=1&formatversion=2

  • http code: 200
  • response:
{
    "batchcomplete": true,
    "query": {
        "redirects": [
            {
                "from": "A",
                "to": "B"
            },
            {
                "from": "B",
                "to": "C"
            }
        ],
        "pages": [
            {
                "pageid": 557,
                "ns": 0,
                "title": "C"
            }
        ]
    }
}

Url I hit: http://localhost:8080/w/api.php?action=query&format=json&titles=B&redirects=1&formatversion=2

  • http code: 200
  • response:
{
    "batchcomplete": true,
    "query": {
        "redirects": [
            {
                "from": "B",
                "to": "C"
            }
        ],
        "pages": [
            {
                "pageid": 557,
                "ns": 0,
                "title": "C"
            }
        ]
    }
}

Url I hit: http://localhost:8080/w/api.php?action=query&format=json&titles=C&redirects=1&formatversion=2

  • http code: 200
  • response:
{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 557,
                "ns": 0,
                "title": "C"
            }
        ]
    }
}

Url I hit: http://localhost:8080/w/api.php?action=query&format=json&titles=A&formatversion=2

  • http code: 200
  • response:
{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 559,
                "ns": 0,
                "title": "A"
            }
        ]
    }
}

Url I hit: http://localhost:8080/w/api.php?action=query&format=json&titles=B&formatversion=2

  • http code: 200
  • response:
{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 558,
                "ns": 0,
                "title": "B"
            }
        ]
    }
}

Url I hit: http://localhost:8080/w/api.php?action=query&format=json&titles=C&formatversion=2

  • http code: 200
  • response:
{
    "batchcomplete": true,
    "query": {
        "pages": [
            {
                "pageid": 557,
                "ns": 0,
                "title": "C"
            }
        ]
    }
}

As I understand things, we have two proposals on the table for REST API behavior. Assuming a redirect chain of A => B => C, and the existing with_html response format:

Option One: HTTP redirects (per task description, modified slightly per Daniel comment):

  1. Caller requests A. Caller receives a 307 to B with redirect=no. Following that gives caller a 200 and content of B.
  2. Caller requests B. Caller receives a 307 to C, with redirect=no. Following that gives caller a 200 and content of C.
  3. Caller requests C. Caller receives a 200 and content of C. Straightforward.

In #1, where caller is redirected to B and receives the content of B, caller may choose to examine the json response and use the redirect_url field to continue on to C. Or not, if the caller just wants the content and doesn't care.

In #2, the "redirect=no" doesn't really do anything. It is mostly there for consistency, and to eliminate the need for lookahead.

Option Two: No HTTP redirects (per Timo comment):

  1. Caller requests A. Caller receives a 200 and content of A (including a redirect_url field in the json indicating that A redirects to B).
  2. Caller requests B. Caller receives a 200 and content of C.
  3. Caller requests C. 200 and content of C. Straightforward.

In no case are 307s involved. In #1, the caller may choose to follow that redirect. Or not, if the caller just wants the content and doesn't care. A lookahead is required to know whether to do case #1 or case #2.

Did I understand both those options correctly? I'm not certain I faithfully represented Option Two, case #2.

It's getting late here and I'm getting fuzzy, so maybe I'm missing something, but I wonder if things would have all been simpler if we had made "redirect=no" the default behavior when no redirect query parameter was provided. That would give:

  1. Caller requests A. Caller receives a 200 and content of A, plus a redirect_url field in the json indicating that A redirects to B.
  2. Caller requests B. Caller receives a 200 and content of B, plus a redirect_url field in the json indicating that B redirects to C.
  3. Caller requests C. 200 and content of C. Straightforward.

This would be very RESTish. We're giving the caller exactly what they asked for. What they do with it (including possibly following redirects is up to the caller.

That wouldn't prevent us from allowing a "redirect=yes" mode that returned 307s, which we'd have to sort out the details of. But I wonder if that would really be necessary, or if the above would be sufficient.

For callers like PCS who are currently following redirects, this be a breaking change, necessitating a new API version and a deprecation period.

It doesn't match desktop behavior. But I'm not convinced that's a requirement.

Caller requests A. Caller receives a 200 and content of A, plus a redirect_url field in the json indicating that A redirects to B.

This is only possible if the client actually requested JSON. If the client requested plain HTML, they wouldn't notice that it's a redirect. We could put something into the HTML <head>, but then we'd require the client to part the HTML, which may be expensive for large responses (and useless for most cases). Or we could include a header (just Location, perhaps). The client would have to know to look for it.

It seems odd to invent a new way to represent redirects in HTTP responses, when HTTP already has a standard way to represent them.

This would be very RESTish. We're giving the caller exactly what they asked for. What they do with it (including possibly following redirects is up to the caller.

True, though expressing redirects with HTTP vocabulary seems even more RESTish to me, especially if the redirect response also contains the rendered representation of the redirect page (not the target page). The client can then choose to folow the recirect or use the rendered redirect page directly.

After sync discussion, the product decision, pending agreement from Traffic, is to proceed with the "HTTP redirects" option, per the above description.

The question for Traffic is mostly about cache behavior regarding including "redirect=no" on all 307 responses, and whether that would be problematic. I'll reach out to them, maybe after I post a comment on this thread summarizing the original issue, proposed solution, and remaining concern as it affects them.

(We noted in the sync discussion that it would be possible, at the expense of slightly more complicated code and some additional database queries, to only include "redirect=no" on a subset of 307s. That remains a possible optimization depending on Traffic's response.)

tl;dr: we're considering adding a query parameter certain MW REST API redirect responses. Our proposed change may have caching implications. We'd like to know if Traffic has any objection.

Summary for Traffic:

Wiki pages sometimes redirect to each other. The MediaWiki REST API (aka rest.php) handles these redirects differently than the desktop interface or Action API. Desktop and Action API handle everything in the context of one request, MW REST API, by design, instead uses 307 redirects. Callers must make multiple requests if they wish to follow redirects.

An example request looks like this:
https://en.wikipedia.org/w/rest.php/v1/page/Foo/with_html

The current MW REST API implementation returns 307s without limit. So as an example, if there's a redirect chain like A => B => C, callers who make a request for A will receive a 307 redirecting them to B. If they follow that redirect, they'll receive a 307 redirecting them to C. This is undesirable for several reasons, and callers who automatically follow http redirects may be redirected many times. By convention, we should instead only support one level of automatic redirection

NOTE: title normalization redirects are handled in MW REST API via 301s. There will never be multiple 301s, so they are not considered in this task. It is possible for the initial request to result in a 301 followed by a 307. We are considering this normal behavior and do not intend to change it.

MW REST API already supports a "redirect" query parameter. If this parameter is falsey, then an API request that would otherwise return a 307 will instead return a 200 and a json response including the contents of the redirect page. For example:

https://en.wikipedia.org/w/rest.php/v1/page/Foo/with_html?redirect=no

In this case, the json response includes a "redirect_target" allowing interested callers to follow the redirect if coded to do so at the application level. We see that as not problematic, because any such callers would be making additional requests intentionally.

Our proposed solution for limiting redirect chain length (Option One in this comment) is to always include the "redirect=no" query parameter with 307 responses. So for an A => B => C chain:

  • Caller requests A. Caller receives a 307 to B with redirect=no. Following that gives caller a 200 including content of B.
  • Caller requests B. Caller receives a 307 to C, with redirect=no. Following that gives caller a 200 including content of C.
  • Caller requests C. Caller receives a 200 and content of C. (unchanged from current behavior)

Our concern is that adding a query parameter may be undesirable from a caching perspective. Future usage of the MW REST API is hard to predict, but we don't want to code anything that would be problematic if traffic increased significantly.

So, the question for Traffic is whether the proposed change is acceptable.

@BBlack @KOfori

I don't think this seems very problematic from our perspective. If anything, it will reduce traffic from agents which formerly followed the whole 307 chain, and now may stop earlier on a 200.

I don't think this seems very problematic from our perspective. If anything, it will reduce traffic from agents which formerly followed the whole 307 chain, and now may stop earlier on a 200.

Thank you!

Change #1135473 had a related patch set uploaded (by BPirkle; author: BPirkle):

[mediawiki/core@master] REST: limit non-normalization redirects (307s) to one level

https://gerrit.wikimedia.org/r/1135473

Change #1135473 merged by jenkins-bot:

[mediawiki/core@master] REST: limit non-normalization redirects (307s) to one level

https://gerrit.wikimedia.org/r/1135473