Page MenuHomePhabricator

Support following MediaWiki redirects when retrieving HTML revisions
Closed, ResolvedPublic0 Estimated Story Points

Description

For clients like mobile apps and API-driven frontends, it would be useful if RESTBase supported following MediaWiki's #redirects. However, VisualEditor in particular also needs the continued ability to retrieve the redirect page itself, so that it can be edited to change the redirect target, or converted into a regular content page by removing the redirect.

Issues

Poor redirect support in browsers

Browsers have traditionally not exposed redirect information to JavaScript, which lead to some sites relying on this for XSS protection. For this reason, even the newer fetch spec hides the response body from JS, even for same-origin requests with redirect: 'manual' option set.

Newer browsers do expose a [responseURL attribute](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/responseURL), which can be used to detect that a redirect happened. However, older but still supported browsers and IE lack support for this, so relying on it is problematic.

An alternative is to indicate the response location in a Content-Location header. This can be used to detect redirects in all browsers. There were some early issues with Content-Location being used as base href, but I think no browser implements this any more.

Option 1: Status 302 & location header, but return the HTML; rely on client not following redirect

When encountering an internal redirect (same wiki, possibly wikimedia project), RESTBase will respond with a 302 temporary redirect status & a location header pointing to the destination page. It will also return the HTML body & etag as with current responses. In this model, the client receives both redirect information and the body data. This means that it can choose to follow the redirect, or edit the page using the returned body.

At present, the redirect destination is only available inline as a meta tag. This means that we'd need to extract this meta tag from the HTML content, which is certainly quite hacky. In the longer term, it would be desirable for Parsoid to return information about the redirect target in accompanying page metadata (JSON blob). There is some discussion of this in T105845: RFC: Page components / content widgets.

Issue: Browser clients cannot access redirect responses

Sadly, this solution is basically unusable for browser clients, as they can't access redirect responses for security reasons (see Issues above).

Option 2: Status 302 for normal requests, method to avoid redirects (URL param or header), Content-Location header

As in Option 1, but provide a means to avoid being redirected by either

  1. an URL parameter, for example ?redirect=false, or
  2. a custom HTTP header, for example Redirect: false.

Option 1) is more discoverable, and does not conflict with CORS restrictions. The downside is that responses would be hard to purge, so we'd disable caching in those cases. However, the request volume with this parameter set is expected to be low, which means that the performance impact should be negligible. With Xkey support in Varnish we could make the no-redirect response cacheable, but hit rates would very likely be low, as most accesses would be to the redirected variant.

For clients like VisualEditor, the request flow would be:

  1. Perform a regular request & get a cached response.
  2. Check if Content-Location changed. If it changed, a redirect was followed.
  3. If the goal is to edit the redirect, repeat request from 1) with ?redirect=false; get an un-cached response.

Alternatively, the first request could specify ?redirect=false, at the cost of foregoing caching in the normal (non-redirect) case.

Option 3: Status 302 for normal requests, URL parameter to request not following temporary redirects

Like Option 2, but with a URL parameter only.

Additionally, implement Varnish logic to

  • match the query parameter, strip it & remember it as a flag on the request, and
  • if this flag is set, massage 302 responses post-cache by
    • replacing the status with 200, and
    • stripping the location header.

Advantages

  • Simple yet performant: Clients like VisualEditor would always ask for redirects to be disabled, but would still get to share regular cached responses.
  • No need to detect that a redirect happened.
  • Query strings can be reliably preserved across unrelated redirects (ex: title normalization).

Disadvantages

  • Some extra (but generic) Varnish logic.

See also

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: RESTBase.
GWicke raised the priority of this task from Medium to High.Nov 17 2015, 5:01 AM
GWicke set Security to None.

I think for clients is good to obtain always a response, and flag it as redirected or not redirected in JSON.

I think a most common API is action=query to fetch the article.
In this case I would like to have the redirected handled by default, and return a JSON like (pseudo-code here, not json syntax):
[query][pages]:{ [7955] : {[page] : {DNA html is here}, [redirect] : false }
[query][pages]:{ [7956] : {[page] : {DNA html is here}, [redirect] : true }
if not redirect is specified:
[query][pages]:{ [7956] : {[page] : {Dna is redirect}, [redirect] : false } }, [redirects] : { [7956] : { Final redirected Page goes here, or alternatively a whole list of redirect pointing to final redirect } }

@Gg4u, I don't think we want to wrap all HTML responses in JSON. This would be a drastic change, and would introduce some non-trivial overheads. The proposal is to use regular HTTP semantics by sending a HTTP redirect, along with the regular response body. This means that the full HTML of redirects *is* included.

Clients can decide whether to follow the redirect automatically or not, by setting the corresponding flag in curl / fetch / urllib / $HTTP_LIB. Many will want to, but others like VisualEditor will disable automatic following to edit the redirect itself.

Some implementation notes based on a conversation @Pchelolo & I had in the office earlier today:

  • The MediaWiki API already returns a boolean 'redirect' flag indicating that the *latest* revision of this title is a redirect (it's only stored in the page table). We already store this, so can use it to find redirects in *current* revisions.
    • We can't really trust this for old revisions, so might need to check the content for now.
  • We need to add a per-revision redirect boolean in the upcoming restrictions table.
  • We discussed whether we should redirect requests asking for a specific revision as well. On balance, it seems more consistent to do so (and support ?redirect=no).

@BBlack, what is your take on the Varnish part of option 3? Our goal is to share cache entries between redirected responses (302, location header, body) & no-redirect variants (200, no location header, body). Concretely, we would do something like the following for requests to /api/rest_v1/:

  • Match and strip ?redirect=no in the URL; set a no_redirect flag on the request.
  • In the response (cached or not) path, if status 302 and no_redirect flag is set:
    • Replace status with 200
    • Strip Location header

Does this sound sane to you?

I hate it, but I don't see any better way to solve the problem at present.

To reword to be sure I have the intent:

  1. On the vcl_recv side, for /api/rest_v1 requests, strip and note (via req.http.Foo as an internal request flag) the ?redirect=no argument. We'll need to support that in all its forms without screwing up other arguments - we've done that before elsewhere in VCL. I think the best example is in templates/varnish/analytics.inc.vcl.erb, in function analytics_provenance_recv_ where we strip the wprov argument from any position in the argument list by using a pair of regexen.
  1. On the vcl_deliver side, if the flag from above is set && status == 302, we'll overwrite the status as "200 OK" and delete the Location header.

@BBlack: Yes, that captures the idea very well.

I'm also not too fond of the VCL part of this, but couldn't come up with anything better.

We'll initially deploy this without caching for the ?redirect=no requests, so there is no rush to implement the VCL bits. We'll probably look into that later this quarter.

We'll initially deploy this without caching for the ?redirect=no requests, so there is no rush to implement the VCL bits. We'll probably look into that later this quarter.

The mailing list post mentioned ?redirect=false. Will any falsey value work with this parameter?

I have some vague memory that the value of some URL parameters, when passed to MediaWiki's api.php, will change the response behavior regardless of the parameter's value. The URL parameter being present in the request URL is all that's needed, as I recall. I'm wondering if matching that behavior here would be great or terrible.

The mailing list post mentioned ?redirect=false. Will any falsey value work with this parameter?

We've added redirect=no support to RESTBase to match MW syntax about redirects, but basically it's a false.

In RESTBase false is a string false, FALSE, 1 and now no. However, we don't want to advertise all of these options, as for caching it's better to have just one, so we will probably stick with false in all the documentation/mailing lists/etc.

In MW api the parameters which just need to be present have a semantics of true with a false default, however here we have the opposite, so it's better to be explicit here.

I have some vague memory that the value of some URL parameters, when passed to MediaWiki's api.php, will change the response behavior regardless of the parameter's value. The URL parameter being present in the request URL is all that's needed, as I recall. I'm wondering if matching that behavior here would be great or terrible.

Right... from https://en.wikipedia.org/w/api.php#main.2Fdatatypes:

Boolean parameters work like HTML checkboxes: if the parameter is specified, regardless of value, it is considered true. For a false value, omit the parameter entirely.

What happens with ?redirect=yes or other truthy values?

What happens with ?redirect=yes or other truthy values?

It's considered to be true and the redirect happens. However here redirect is the default behaviour, so there's no point to be explicit.

I guess the question remains, though: if the Varnish redirect-stripper sees ?redirect with a non-falsey value, should it still strip the parameter?

I guess the question remains, though: if the Varnish redirect-stripper sees ?redirect with a non-falsey value, should it still strip the parameter?

Yes. We care only about the false-y case.

Pchelolo claimed this task.

The new redirect handling code has been deployed in production.