Consider changing wikipage redirects to be proper HTTP redirects
Open, MediumPublic

Description

This RFC has been scheduled for public discussion on June 28 (Wednesday) at 21:00 UTC (2pm PDT, 23:00 CEST). As always, the discussion will take place in the IRC channel #wikimedia-office

Problem statement

Currently /wiki/Redirect responds with the Target page, with three notable differences:

  • A "Redirected from .." message is rendered.
  • A piece of JavaScript is inserted that will jump to the destination #Section (if any).
  • A piece of JavaScript is inserted to normalise the address bar from /wiki/Redirect to /wiki/Target#Section.

Internals:

  • Page markup (wikitext) refers to the title as specified – [[Redirect]] is saved as-is (no pre-save transformation).
  • Page rendering refers to the title as specified – HTML: <a href="/wiki/Redirect">
  • When clicking on such link within the wiki, or otherwise navigating "to" the redirect, the server responds with a modified version of the response for the target of the redirect. – /wiki/Redirect responds the same as /wiki/Target, including a <link rel=canonical> specifying Target as canonical, and a special header informing the reader "Redirected from Redirect".
  • Once the page is rendered, JavaScript does two things:
    • Jump to the intended section (if any).
    • Swap the address bar from /wiki/Redirect to /wiki/Target. This encourages further social sharing to share the canonical link, and makes it so that those other people will not needlessly see "Redirected from Redirect" - which doesn't apply to those readers.

Benefits:

  • Parser cache: Changing the redirect does not require invalidating parser cache of incoming links.
  • Performance: "Following" a redirect responds quickly (no HTTP redirect).
  • Usability: Users know when they followed a redirect (header message "Redirected from"), and the message doesn't show for others when you share the link.
  • Search engines: No indexing of duplicate content. (due to canonical url).

Problems:

  • Redirect to heading: When redirecting to a heading on a destination page (e.g. Topic redirecting to General#Topic), the browser does not natively jump to this heading because it's rendering content at /wiki/Topic not Topic#Topic or General#Topic. This is currently worked around with JavaScript. While this technically works it is bad in two ways:
    • Performance: The jump happens very late. Sometimes 5-10 seconds after the first paint, because it waits for all content to arrive and all JavaScript to arrive and execute.
    • Fallback: In our Grade C experience for older (but supported) browsers, the jump never happens.
    • Fallback: It also means that on supported browsers, the jump doesn't happen if there were intermittent connection issues.
    • Fallback: There is no recovery for the user when the jump doesn't happen. There is no manual way to get to this information (it's hidden in invisible JSON data)

Solution 1

Change the response of /wiki/Redirect to be an HTTP redirect to /wiki/Target?rdfrom=Redirect#Section. When viewing /wiki/Target?rdfrom=Redirect#Section, client-side code normalises the address bar to /wiki/Target#Section.

This addresses all the problems.

  • Drawback: Users will be subject to an HTTP redirect.
  • Benefit (compared to Solution 2): Redirects are instantly up-to-date (no need to wait for job queue).
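
A minimal sketch of the two URL transformations Solution 1 describes: building the redirect destination, and the address-bar normalisation that drops rdfrom again. The function names are hypothetical, the title and section encoding is deliberately simplified (spaces to underscores plus percent-encoding) and is not MediaWiki's actual logic, and the normalisation would really run client-side rather than in Python.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def redirect_location(source_title, target_title, section=None):
    """Build the destination of the HTTP redirect (hypothetical helper)."""
    # Simplified title encoding: spaces become underscores, the rest is percent-encoded.
    url = "/wiki/" + quote(target_title.replace(" ", "_"), safe=":/")
    url += "?rdfrom=" + quote(source_title.replace(" ", "_"), safe="")
    if section:
        url += "#" + quote(section.replace(" ", "_"), safe="")
    return url


def normalise_address(url):
    """Drop rdfrom from the query string, keeping the path and the #Section."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "rdfrom"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


location = redirect_location("Camden Market Fire", "Camden Market", "Incidents")
print(location)
print(normalise_address(location))
```

Because the fragment is part of the redirect destination, the browser can jump to the section natively, without waiting for JavaScript.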

Solution 2

Same as solution 1, but in addition, optimise the common case by changing the Parser to resolve to this url ahead of time.

  • Benefit: Faster user experience for the common case.
  • Drawback: Updating of urls depends on job queue, similar to template transclusion updates.

Original task description:

With Javascript disabled/not available:

*Links to sections work correctly, e.g. Albatross#Biology
*Links to sections via a redirect work correctly, e.g. Beiruit#History -- ([[Beiruit]] is a redirect to [[Beirut]]).
*Links via a redirect that itself targets a section do not work correctly, e.g. Camden Market Fire takes you to the top of the target page but not to the section -- ([[Camden Market Fire]] is a redirect to [[Camden Market#Incidents]])

Fixing this may partially fix the section links issue in T53122: Visual Editor: wrong URL when opening red links and section links from editor

See also:

Details

Reference
bz51736

Related Objects

Event Timeline


I've updated the task description to provide a summarised problem statement.

@petr.matas Let's explore adding a query parameter. There are multiple kinds of query parameters we could consider, but let's start by considering a query parameter to indicate where we redirected "from".

Rendering of page links in the content would change to link to Target#Section instead of Redirect, and we'd also add a query parameter that says we came from Redirect.

  • Page rendering: <a href="/wiki/Redirect"> would instead become <a href="/wiki/Target?rdfrom=Redirect#Section">.

Impact:

  • Usability: The order of information in the url is slightly confusing. However, this seems like a small trade-off and probably unavoidable. At least for JavaScript-enabled users we'll be able to hide it, similar to how we already replace the address bar to avoid the redirect for social sharing.
  • Parser cache: We'll have to start purging parser cache of all pages with incoming links when editing a redirect. This is a small cost we'll pay internally, but nothing we can't handle, and won't affect the user.
  • Purging: Also, we need to purge the exact url Target?rdfrom=Redirect whenever Target is edited. Right now we purge Redirect because /wiki/Redirect currently renders the content of Target (+ header message). We'll need to update that logic to also purge Target?rdfrom=Redirect.

One problem this also exposes is that we currently discourage use of query parameters on "pretty" urls. E.g. they're supposed to be used on /w/index.php, not /wiki/. So perhaps it's better if we link to /w/index.php?title=Target&rdfrom=Redirect#Section. The problem with that is that that makes rewriting to /wiki/Target harder because it changes directory from /w to /wiki which causes bugs when resolving relative links on the page. (e.g. ./Foo becomes /wiki/Foo or /w/Foo).

In addition, using query parameters may add the expectation of it being "okay" to use them in a different order. E.g. rdfrom first: /w/index.php?rdfrom=Redirect&title=Target#Section. That should be fine, though, given we already have that with action=raw and we only purge the official order; the other ones will become stale. In fact, it might actually be useful to consider putting rdfrom first, in that it makes the url nicer (it puts Target and #Section next to each other). But that should be invisible to most users, and it's better to be consistent with other urls and always put title first.
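
To illustrate the parameter-order point, here is a small sketch. It assumes that cached copies are keyed and purged by the exact url string (minus the fragment, which is never sent to the server); both function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def same_page(url_a, url_b):
    """True if both urls carry the same path and the same query parameters."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return a.path == b.path and parse_qs(a.query) == parse_qs(b.query)


def same_cache_object(url_a, url_b):
    """True only if the urls match exactly once the #fragment is dropped."""
    return url_a.split("#", 1)[0] == url_b.split("#", 1)[0]


official = "/w/index.php?title=Target&rdfrom=Redirect#Section"
swapped = "/w/index.php?rdfrom=Redirect&title=Target#Section"

print(same_page(official, swapped))          # same parameters, same page
print(same_cache_object(official, swapped))  # different strings, different cache entries
```

Under that assumption, a purge of the official order leaves the swapped-order copy untouched, which is why it would go stale.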

Lastly, when JavaScript fails or is otherwise unavailable, the fallback in the current situation is rendering (and sharing) of /wiki/Redirect, which isn't a bad url. The only downside is that people may see "Redirected from", which is fine. The same would've happened if Redirect wasn't a redirect at the time of sharing and only became a redirect later. The url pattern is consistent and transparent to the user. However, when we change redirects to use /w/ with query parameters, the fallback is showing /w/index.php?title=Target&rdfrom=Redirect#Section, which is a url we'd prefer not to spread. But again, perhaps it doesn't matter?

All in all, it seems we can deal with the impact without any significant downsides.

May I ask for an explanation [..] of the URL parameter in HTTP redirect target solution's cons, please? [..] What does the stripping in Varnish mean and what are its implications?

The "stripping in Varnish" refers to a minor optimisation of the query parameter proposal. If we remove the "Redirected from" message from "/wiki/Redirect" and instead change pages to link directly to /wiki/Target?rdfrom=Redirect, that would work fine, but has the minor downside of increasing the number of different versions of Target we need to store in Varnish (our page caching system). If we configure Varnish to treat /wiki/Article?rdfrom=... exactly the same as /wiki/Article, we would only have 1 thing to cache per article, instead of "two" versions of it. Then, in JavaScript, if you view the page with rdfrom (which would be ignored by the server), we can re-create the "Redirected from .." message and add it to the top of the page.
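
The idea can be sketched as a cache-key function that ignores rdfrom. This is an illustration in Python, not actual Varnish configuration; the function name and the `ignored` parameter are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def cache_key(url, ignored=("rdfrom",)):
    """Return the key a cache could use if it strips the ignored parameters."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in ignored]
    # The #fragment is never sent to the server, so it is not part of the key.
    return parts.path + ("?" + urlencode(query) if query else "")


print(cache_key("/wiki/Article?rdfrom=Redirect"))
print(cache_key("/wiki/Article"))
```

Both urls map to the same key, so only one copy of the article would be stored; the "Redirected from .." message then has to be re-created client-side from the rdfrom value in the address bar.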

@Krinkle Thank you for your thorough explanation. I just thought that the rdfrom parameter's purpose was to be able to provide the "Redirected from .." message while using a HTTP 302 redirect:
If /wiki/Redirect responds with something like
302 Found; Location: /wiki/Target?rdfrom=Redirect#Section, we will:

  • not need to purge /wiki/Redirect upon modification of Target,
  • not need to alter <a href="/wiki/Redirect"> in rendered pages,
  • not need to purge pages linking to Redirect upon its modification,
  • have to purge /wiki/Target?rdfrom=anything upon modification of Target.
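
As a rough sketch of the last bullet: when Target is edited, the urls to purge would be the plain target url plus one rdfrom variant per redirect. This assumes the set of rdfrom values is limited to the known redirects pointing at Target; the function name and the simplified title encoding are hypothetical.

```python
from urllib.parse import quote


def urls_to_purge(target_title, redirect_titles):
    """List the cached urls that become stale when the target page is edited."""
    base = "/wiki/" + quote(target_title.replace(" ", "_"), safe=":/")
    urls = [base]
    for source in redirect_titles:
        # One cached variant per redirect that points at the target.
        urls.append(base + "?rdfrom=" + quote(source.replace(" ", "_"), safe=""))
    return urls


for url in urls_to_purge("Barack Obama", ["barak obama"]):
    print(url)
```

Note that /wiki/Redirect itself is absent from the list: under this scheme it only holds the 302 response, which does not change when Target is edited.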

Naive question: What would the analytics impact of this be - would we continue to register a webrequest / pageview for the redirect page, or would this change to the target page? (Cf. T121912)

In the past, we discussed adding the "redirected from" thing in client-side javascript.

It's unclear (to me) from this remark whether you spotted that the purpose of this issue is, at least in part, to handle cases where client-side JavaScript is not present (e.g. a browser that does not support JavaScript, or that has JavaScript disabled). Probably you already understood that, but I'm adding this clarificatory remark just in case not ;)

To clarify: I was talking about the "redirected from" notice, which at least the current description does not call out as needs-to-work-without-JS. While fundamental functionality like linking clearly shouldn't rely on JS, it might make sense to rely on JS for less critical features like small notices aimed primarily at editors, especially when that enables better cache performance.

Naive question: What would the analytics impact of this be - would we continue to register a webrequest / pageview for the redirect page, or would this change to the target page? (Cf. T121912)

In case of HTTP 301/302 I think it should not be difficult to process the pair of requests as a single event and account for it in any way that we find appropriate.

Well maybe, but it would be better to have the Analytics Engineering team confirm that it will indeed not be difficult for them to adapt their code and infrastructure to this change.

FWIW, per https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews/Redirects the current infrastructure does not count 301 redirects as pageviews, so the proposal would seem to cause major changes in the pageview data. In some aspects it might actually be desirable, in others less so.

(CCing @mforns as the author of that documentation page, @Nuria as the team's lead engineer, plus @Milimetric and @MusikAnimal who IIRC have an interest in the topic too)

So perhaps it's better if we link to /w/index.php?title=Target&rdfrom=Redirect#Section.

Note that a much prettier /?title=Target&rdfrom=Redirect#Section works too and with the parameters swapped I like it even more. How about introducing something like $wgIndexPath="/"? (See T18659#215617)

Just an FYI, in mobile we show a notification (see screenshot) when a redirect happens:

Screen Shot 2017-06-27 at 8.55.52 AM.png (455×379 px, 73 KB)

Will it be possible to retain this behaviour?

In case of HTTP 301/302 I think it should not be difficult to process the pair of requests as a single event and account for it in any way that we find appropriate.

Coupling requests in our system is never easy, as we have no session identifier sent to the backend, but regardless, in this case we should not need to do that if a 301 is followed by a 200.
Now, it is a bit more complicated for other metrics that are not pageviews, because cookies are now being set on your initial 301 request. So cookies are set "before" a pageview happens.

Right now, a pageview for "barak_obama" internally redirects to "Barack_Obama" and translates (from the user's standpoint) to a single 200 request. If we follow the "canonical" HTTP way of doing redirects, this will translate to two requests (if I understand the proposed changes): a 301 from "barak_obama" to "Barack_Obama" and a 200 for "Barack_Obama". In the case of "pageview counting" we will be fine: the 200 will be counted, the 301 will not. One pageview happened.
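
A toy illustration of that counting rule, assuming a heavily simplified pageview definition in which only 200 responses count. The real pageview definition has more conditions, and the function name is hypothetical.

```python
from collections import Counter


def count_pageviews(requests):
    """Count one pageview per 200 response, keyed by the requested title."""
    return Counter(title for title, status in requests if status == 200)


# Current behaviour: the redirect url itself answers with a 200.
current = [("barak_obama", 200)]
# Proposed behaviour: a 301 for the redirect, then a 200 for the target.
proposed = [("barak_obama", 301), ("Barack_Obama", 200)]

print(count_pageviews(current))
print(count_pageviews(proposed))
```

The total stays at one pageview in both cases, but under the proposed behaviour it is recorded against the target title rather than the redirect title.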

In the case of metrics that rely on cookies being set on the initial request (that used to be a 200), things will no longer work the same, as our assumption in this case is that "cookies are being set on a pageview-worthy request", a 200. This will affect counting of Unique Devices. You can see a high-level explanation of how this metric is computed, and why it is important whether a cookie is set or updated, here: https://blog.wikimedia.org/2016/03/30/unique-devices-dataset/

I bet this effect of cookies being set "earlier" than they were before (on 301 request versus 200 request) might affect other things too.

This is not to say we cannot adjust on our end, we can, we just need a heads up.

It's good that the total number of pageviews won't be affected (speaking as someone who tracks trends in this metric).
However, it seems that views will now be counted towards the target page ("Barack_Obama") instead of the redirect page ("barak_obama"), and that's a major change. As I said above, it may have advantages and disadvantages - but it is definitely something to be aware of, and to alert users of this data about when the time comes. It may also be worth setting up a separate mechanism for tracking redirect views then, because that information definitely gets used (see e.g. https://en.wikipedia.org/wiki/Wikipedia:Redirect#Reasons_for_not_deleting #5 ).

Will there be anyone with UX expertise at the IRC discussion? This seems more like a user experience question than a technical one. And the UX is broken in multiple ways (for example, the "redirected from" text is at the top of the article, but when section targeting works the user does not see that message), so we might want to look for a more innovative solution.

@Tgr Do you have anyone in particular in mind? Feel free to ping anyone you think should be involved.

I can't find any tag for "this is UX relevant"... why is that?

Just an FYI, in mobile we show a notification (see screenshot) when a redirect happens:
F8538184
Will it be possible to retain this behaviour?

Depending on what we end up doing, you'll probably need to reimplement some pieces, but surely it'll be still possible. This seems functionally equivalent to the "Redirected from" notice on desktop and we're keeping that.

@Tgr Do you have anyone in particular in mind? Feel free to ping anyone you think should be involved.

I can't find any tag for "this is UX relevant"... why is that?

The closest thing is Design I guess.

@Tgr Do you have anyone in particular in mind? Feel free to ping anyone you think should be involved.

No one specific. Maybe @Nirzar or @Pginer-WMF can recommend someone who would be interested. (Is this a reading or an editing feature? Seems to be a bit of both.)

Quick UX summary: when you visit a redirect page, we display the target page with a small notice below the title (so the user understands how they ended up here after clicking on a different title), and use Javascript to update the URL to that of the target page. When the redirect points to a specific section of the larger article (e.g. https://en.wikipedia.org/wiki/Camden_Market_Fire ), we use Javascript to scroll to the target section. This has a couple of problems: with JS, the user never sees the redirection notice and might not understand what is happening. (It doesn't help that sometimes the connection between the redirect and the target section is not obvious.) Without JS, the user sees the top of the article (with the notice, but it only tells which article they were redirected from, not the target section).

@Tgr I will take a look. we did some work around redirect notice using toast messages for mobilefrontend.

So perhaps it's better if we link to /w/index.php?title=Target&rdfrom=Redirect#Section.

Note that a much prettier /?title=Target&rdfrom=Redirect#Section works too and with the parameters swapped I like it even more. How about introducing something like $wgIndexPath="/"? (See T18659#215617)

Let's keep that outside the scope for now. /w/index.php? is the established entry point for MediaWiki. The problem is that users should, as much as possible, see the canonical address (/wiki/Target) for the destination article - which is the url we want users to see and share. If we need a different url to set a query parameter, the compromise is already made. How long that "special" will be is of secondary concern, and I'd say consistency triumphs here. If we do want to adopt shorter urls for urls with query parameters, that should be, and already is, a separate proposal that will apply to everything at once.

To clarify: I was talking about the "redirected from" notice, which at least the current description does not call out as needs-to-work-without-JS. While fundamental functionality like linking clearly shouldn't rely on JS, it might make sense to rely on JS for less critical features like small notices aimed primarily at editors, especially when that enables better cache performance.

Agreed. But we already use separate Varnish cache objects for these today and I don't think any of the proposed solutions are affected by whether or not we strip the query parameter. I'd prefer we keep this optimisation as a follow-up change proposal.

One question:

Without JS, the user sees the top of the article (with the notice, but it only tells which article they were redirected from, not the target section).

Why do we need JS to actually scroll to the position? What if the source link contains the section name and we append the section as # at the end of the URL, wouldn't the browser scroll to that position on its own?

like this > https://en.wikipedia.org/wiki/Yucat%C3%A1n_Peninsula#Climate

The bug described here in particular is about redirecting to a section. Regular links to sections within an article work fine. A link [[Foo#Bar]] will produce a link like https://en.wikipedia.org/wiki/Foo#Bar which, as you say, will make the browser naturally jump to the anchor on that page in question.

The problem here is if you create a link to [[Foo 2017]], and then imagine Foo 2017 being a redirect to Foo events#Foo 2017. In that case your browser will follow the link as https://en.wikipedia.org/wiki/Foo_2017 which then results in the server responding with the content for the Foo_events article. There is no anchor in your address bar at this point, because it wasn't specified on the incoming link. In theory you could link to [[Foo 2017#Foo 2017]], but then again, if the editor knew Foo 2017 was a redirect it would've just linked to Foo events#Foo 2017 proper. The scenario here is a redirect. E.g. when articles are merged, or when a topic doesn't yet have its own article.

Currently, when you request an article that is a redirect, the server responds with the content of the target article and a message saying "Redirected from: Foo 2017". In addition, if the redirect page specifies a section anchor, there is also some invisible meta data that then instructs some JavaScript code to scroll to section Foo 2017 and even update your browser address bar from https://en.wikipedia.org/wiki/Foo_2017 to https://en.wikipedia.org/wiki/Foo_events#Foo_2017.

Example: https://en.wikipedia.org/wiki/Economy_of_Christmas_Island
Renders invisibly as https://en.wikipedia.org/wiki/Christmas_Island with a "Redirected from" message on top, JS jumps to #Economy, JS updates address bar.

This proposal is to get rid of that and come up with something better. Solution 1 is to remove the logic that makes the server respond with the content of the target of the redirect and instead respond with a native browser redirect. This means the browser will be told, "nothing here, go to https://en.wikipedia.org/wiki/Foo_events#Foo_2017 instead.". The browser then essentially aborts the load, updates the browser bar, and goes to make another request to the server (transparent to the user). This solves the problem, but does come at the cost of performance. (As users' devices will now be sending a request, getting a response, and then doing the ceremony a second time.)

In the past, we discussed adding the "redirected from" thing in client-side javascript.

It's unclear (to me) from this remark whether you spotted that the purpose of this issue is, at least in part, to handle cases where client-side JavaScript is not present (e.g. a browser that does not support JavaScript, or that has JavaScript disabled). Probably you already understood that, but I'm adding this clarificatory remark just in case not ;)

To clarify: I was talking about the "redirected from" notice, which at least the current description does not call out as needs-to-work-without-JS. While fundamental functionality like linking clearly shouldn't rely on JS, it might make sense to rely on JS for less critical features like small notices aimed primarily at editors, especially when that enables better cache performance.

Speaking as an editor who:

  • almost never uses JavaScript, and
  • reads and makes use of ("consumes"?) the "redirected from" notices

I would be pretty puzzled - and subsequently annoyed - if they suddenly stopped being shown, even if this bug is fixed. Frankly, if they are removed in the process of fixing this bug, then I would feel compelled to file a new bug report about their removal, asking for them to be restored.

This proposal is to get rid of that and come up with something better. Solution 1 is to remove the logic that makes the server respond with the content of the target of the redirect and instead respond with a native browser redirect. This means the browser will be told, "nothing here, go to https://en.wikipedia.org/wiki/Foo_events#Foo_2017 instead.". The browser then essentially aborts the load, updates the browser bar, and goes to make another request to the server (transparent to the user). This solves the problem, but does come at the cost of performance. (As users' devices will now be sending a request, getting a response, and then doing the ceremony a second time.)

(Emphasis mine.)

For a user (like me) who almost never uses JavaScript on Wikipedia, Solution 1 as described in your quote would not come at the cost of performance. Instead, it would get me to the relevant anchor much faster than is presently the case. This is because waiting for a second HTTP request/response to complete is almost inevitably quicker than the current process of:

  • being puzzled about why you are being shown something that does not seem to match the link you clicked
  • looking for a little notice about having been redirected
  • remembering that this bug exists (if you even know that it does, which I'm sure many editors don't)
  • trying to figure out which place in the present page your browser would have taken you to if you used JavaScript
  • manually scrolling your viewport to that part of the page

The process above typically takes several seconds for a practised user. For an unpractised user, they might never identify the intended part of the page (i.e. they might find themselves at the top of a long page whose lede does not make it obvious that it addresses the matter they intended to navigate to, and give up).

By contrast, an HTTP request/response takes fractions of a second.

So, for me at least, and other editors and readers like me, Solution 1 would be a big performance improvement :-)

[..] instead respond with a native browser redirect. [..] This solves the problem, but does come at the cost of performance. (As users' devices will now be sending a request, getting a response, and then doing the ceremony a second time.)

[..]

So, for me at least, and other editors and readers like me, Solution 1 would be a big performance improvement :-)

Agreed, definitely. Even for users with JavaScript enabled (who don't have to do any manual work) the section will appear sooner with an HTTP redirect because currently, they:

  1. Start loading the page.
  2. Wait for 2 or 3 JavaScript requests.
  3. Wait for the page to finish loading.
  4. Wait for JavaScript to jump to the section.

Those 2-3 JavaScript requests, being HTTP requests, take just as long as multiple HTTP redirect requests would take. So for the use case of redirect-to-a-section, this proposal is a universal improvement.

The cost I was referring to is for the more common "plain" redirects, e.g. Foo > Bar, with no section being involved. In that case you previously just follow a link to load a page, and you now follow a link, wait for the browser to receive the redirect, wait for the browser to follow that link instead, and then the page will load.

According to this page it takes 22 requests to load the Wikipedia's main page. It counted 2 CSS files and 1 Javascript file, according to your answer it might sometimes take 2 or 3 Javascript requests. Most of the requests are due to images, which can get into hundreds for more rich pages. So as I see it, adding a couple of HTTP redirects, which influence only latency but not bandwidth, does not really matter from the performance point of view. If you want to dramatically improve performance, you should be limiting the number of images per page. Note that limiting the number of requests per page will conflict with good user experience in every case.

which influence only latency but not bandwidth, does not really matter from the performance point of view.

The effect of redirects on user-perceived performance is well known and documented. Perceived performance is driven by the time it takes to see the first content displayed. HTTP redirects, by definition, delay the first paint of your page, as it now happens X milliseconds later, X being the time it took to do the redirects. A page with a myriad of image requests could actually display content immediately; the number of image requests and perceived performance are not necessarily related (they could be, but that will not be true for every site).

Regardless we might decide to go with the HTTP redirects for other reasons (SEO, whatever).

For reference: note rule 11 in this document; these guidelines are almost a decade old: http://stevesouders.com/hpws/rules.php

The rule number 1 in the document is to make fewer HTTP requests in general, but allegedly there are 2 requests for CSS and 2-3 requests for Javascript, so I don't get why you're so strict about rule number 11.

Anyway, how much longer does it actually take to load a redirect page to the same domain compared to a normal page? How much are you willing to accept?

Consider keep-alive connections and all other optimizations that make sense for this kind of redirects.

In T53736#3386679, @Tgr wrote:

[...] And the UX is broken in multiple ways

I've forked that issue to T169282: Improve the UX for redirects to sections to avoid multiple issues here.

During the RFC discussion on June 28, it was agreed that this RFC be put on Final Call: if no new pertinent concerns are raised and remain open by July 12, this RFC will be approved for implementation.

Option 1 was identified as a good first step, with the option to implement Option 2 or other optimizations later, if they prove necessary.

IRC log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-06-28-21.00.log.html

Maybe this advantage has been mentioned as SEO already, but stating it explicitly:
Machine-understanding of HTTP redirects does not require any advanced parsing.

Since no objections have been raised during the last call period, this RFC has been approved for implementation.

It seems that there is no resourcing for this at the moment. It could perhaps be picked up by Readers or by Platform.

This was one of the older RFCs that we approved before we were stricter about ownership and commitment. The redirect system has no current owner listed on mediawiki.org.

Tagging #reading-infrastructure-team-backlog and Platform Engineering for potential interest. If claimed, you'd basically have the ability to choose to accept the change from a product perspective (and decide on its priority, doesn't need to be soon), or to decline the change (closing the task).