Page MenuHomePhabricator

Content service doesn't handle URL fragments when redirecting
Closed, ResolvedPublic

Description

Certain wiki pages are redirects to a specific section of another article. For example, [[Web 1.0]] redirects to [[Web 2.0#Web 1.0]].

However, the content service doesn't seem to preserve the section (fragment) of the redirect link:
https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_1.0

The MediaWiki API propagates this as a tofragment parameter in its response. Therefore, ideally the content service should also pass the fragment through in a similar way.

Event Timeline

Dbrant created this task.Oct 19 2016, 2:27 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 19 2016, 2:27 PM
Dbrant added a project: RESTBase.

Is MCS doing any redirect handling of its own, or is RB handling this?

bearND added a subscriber: bearND.EditedOct 20 2016, 5:51 AM

I think we got rid of any redirect handling in MCS. RB does the HTTP redirect.

@Dbrant are you referring to the redirected property in the mobileview API response? The tofragment is only available in the title search query AFAIK.
What the impact for the app?

Regarding the redirected property I have some interesting observations to share:

https://en.m.wikipedia.org/w/api.php?action=mobileview&format=json&formatversion=2&prop=text%7Csections%7Clanguagecount%7Cthumb%7Cimage%7Cid%7Crevision%7Cdescription%7Clastmodified%7Cnormalizedtitle%7Cdisplaytitle%7Cprotection%7Ceditable&onlyrequestedsections=1&sections=all&sectionprop=toclevel%7Cline%7Canchor&noheadings=true&thumbsize=1024&page=Web_1.0

{
"mobileview": {
"displaytitle": "Web 2.0",
"normalizedtitle": "Web 1.0",
"redirected": "Web 2.0#Web_1.0",
//...

The MCS response includes the redirected property, too,
http://localhost:6927/en.wikipedia.org/v1/page/mobile-sections/Web_1.0

{
"lead": {
"displaytitle": "Web 2.0",
"normalizedtitle": "Web 1.0",
"redirected": "Web 2.0#Web_1.0",

but the RB hosted response doesn't:
http://localhost:7231/en.wikipedia.org/v1/page/mobile-sections/Web_1.0 or
https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_1.0
do an HTTP redirect to http://localhost:7231/en.wikipedia.org/v1/page/mobile-sections/Web_2.0 and
https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_2.0 respectively

{
"lead": {
"displaytitle": "Web 2.0",
"normalizedtitle": "Web 2.0",
//...

I also notice that the value of normalizedtitle has been changed from "Web 1.0" to "Web 2.0".

In conclusion, there's an HTTP redirect happening in RB land, which causes the redirect property to get dropped and normalizedtitle to be changed.
I think this is by design, so that RB doesn't have to store essentially the same page multiple times.

The requests first go through RB which does the redirect and then contact MCS only for the final location (unless redirect=false is passed to it, which is never the case for the Android App).

Anchoring the response for HTML makes definitely sense, but I'm not sure that anchoring the response for JSON makes sense or can even be defined. Perhaps RB should just pass in an extra header to back-ends indicating the redirect has an anchor?

In general we don't care about the fragment in RESTBase mainly because the request to RB with and without the fragment would return the same exact content (we don't have a feature of getting just a fragment).

I'm failing to understand what's the expected behaviour here and how is it going to affect what app users are seeing?

If this is about the RB HTTP redirect, then we should probably check whether a) we send any fragment id hashes in the Location header, and b) if browsers actually support redirecting to a new fragment.

IIRC, at some point we looked into whether browsers drop the fragment from the first page when receiving a redirect without fragment. In that case, IIRC the fragment was preserved & then applied to the redirected page. If the Location header contained a hash as well, would it override the original hash?

Kaartic added a comment.EditedOct 20 2016, 4:49 PM

Can anybody compare the above results obtained for Web 1.0 with that you get for Lexical scoping. It is also a similar redirect link to the Lexical scoping section of the Scope (computer science). Unlike the Web 1.0 redirect link, the Lexical scoping link works correctly in the app. Please refer T147836 for details about this.

  • Lexical_scoping is redirected with Location: Scope_(computer_science)
  • Web_1.0 is redirected with Location: Web_2.0

Neither of these location headers contain a hash currently, which explains why the section is lost. I think Parsoid provides the hash, so *if* browsers respect a Location with a hash, then it should be possible to preserve it in RB without too much trouble.

Do you mean that Lexical scoping does not redirect correctly ?

It seems that the app correctly recognises the link and navigates to the Lexical scoping section correctly while it doesn't for theWeb 1.0 link. See the below screen shots :

Lexical scoping - Recognised as redirection link

Section displayed correctly

Web 1.0 - Redirection link not found it only suggests the Web 2.0 article.

Why's the redirection link not found as it was for Lexical scoping ?

This example with search doesn't involve RESTBase at all, I guess the bug is in the search here?

As for RESTBase side - we don't handle redirects to fragments and we don't forward fragments to MCS or react to them anyhow. Having spent some time trying to make RB preserve fragments on redirects I began to wonder if that makes any sense for the app at all?

@bearND Could you please give some light on how the app handles these anchors? Would it be indeed useful if request to mobile-sections/Web 1.0 redirected to mobile-sections/Web 2.0#Web 1.0 while the returned content would be exactly the same?

Another issue here is purging. We don't purge URIs with fragments, so it could create issues for some pages potentially.

Another issue here is purging. We don't purge URIs with fragments, so it could create issues for some pages potentially.

Browsers don't send the fragment to the server (and if they do, it would be stripped in Varnish). So, this would not be an issue. The question remains though whether redirecting to a location with a fragment would really be useful.

On reflection, I am also starting to lean towards "this would not be useful". Even if browsers accept a location header with fragment for primary page requests, this won't work for API requests pulling the content as a sub-resource. None of the browser sub-request primitives (XHR, fetch) expose the post-redirect location (for security reasons), which seems to make it impossible to correctly handle fragments in those cases. Since nobody is going to load raw RB HTML for normal browsing use cases, the API request use case is more critical.

Kaartic added a comment.EditedOct 22 2016, 6:51 PM

This example with search doesn't involve RESTBase at all, I guess the bug is in the search here?

I don't know if it's related to RESTBase (as I'm new here) but, it's definitely related to fragments.

The Lexical scoping redirect link is

#REDIRECT [[Scope_(computer_science)#Lexical_scoping]]

and the redirect link for Web 1.0 is

#REDIRECT [[Web 2.0#Web 1.0]]

Although both redirect to a particular section (fragment) of the article the former seems to be recognised by the app i.e., the app correctly navigates to the section (shown in the second screen shot in my previous comment) correctly while this is not the case with the latter.

I also should state the fact that the app was correctly navigating to the Web 1.0 section when the redirect link was

#REDIRECT [[Web 2.0#"Web 1.0"]]

I actually changed it to the one above, as I found those quotation marks in a section title weird. Now as the app doesn't seem to recognise the redirection link, I suspect that someone must have noticed this and added those quotation marks to make the link work. That's just my guess , though. I'll try reverting my edit back and seeing whether there is any difference. If the link does work after the revert then there is something wrong somewhere.

To find the problem, someone who knows which part of the app is reaponsible for loading the article must find why the results differ for the above cases.

Kaartic added a comment.EditedOct 25 2016, 6:44 PM

I did revert the edit as mentioned in my previous comment. The section related to Web 1.0 is now named "Web 1.0" (as it was before my edit) and not Web 1.0 and the redirect link is also modified accordingly. My guess seemed true.

Now the app goes to the "Web 1.0" section of the Web 2.0 article after I reverted the edit and modified the redirect link. And the bug reported in T147836 is now reproducible again as the app is now recognizing the redirection.

Weird things happening with the Web 1.0 redirection link.

but the RB hosted response doesn't:
http://localhost:7231/en.wikipedia.org/v1/page/mobile-sections/Web_1.0 or
https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_1.0
do an HTTP redirect to http://localhost:7231/en.wikipedia.org/v1/page/mobile-sections/Web_2.0 and
https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_2.0 respectively

{
"lead": {
"displaytitle": "Web 2.0",
"normalizedtitle": "Web 2.0",
//...

I also notice that the value of normalizedtitle has been changed from "Web 1.0" to "Web 2.0".
In conclusion, there's an HTTP redirect happening in RB land, which causes the redirect property to get dropped and normalizedtitle to be changed.
I think this is by design, so that RB doesn't have to store essentially the same page multiple times.

The results I get for https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_1.0 after I reverted the edits are :

{
  "lead": {
     //...
    "displaytitle": "Web 2.0",
    "normalizedtitle": "Web 1.0",
    "redirected": "Web 2.0#.22Web_1.0.22",
    //...

See that the normalizedtitle has changed and redirected property has started to appear.

@Kaare so it seems like the bug is in Mobile-Content-Service Let's brink attention of @Mholloway and @bearND to this issue. .22 in the redirected prop seems extremely weird since %22 is a URI-encoded ", and I have a gut feeling that 22 there is not a coincidence.

.22 in the redirected prop seems extremely weird since %22 is a URI-encoded ", and I have a gut feeling that 22 there is not a coincidence.

It is not.It's there because the redirect link https://en.wikipedia.org/w/index.php?title=Web_1.0&redirect=no is now :

Web 2.0#"Web 1.0"

after I reverted my edit.

Kaare removed a subscriber: Kaare.Oct 26 2016, 7:59 AM

The encoding issue is different from the general redirection issue. See T102209: Derive heading ids from heading name, the same way MW core does.

bearND added a subscriber: Jdlrobson.EditedAug 3 2017, 6:07 PM

Another issue here is purging. We don't purge URIs with fragments, so it could create issues for some pages potentially.

Browsers don't send the fragment to the server (and if they do, it would be stripped in Varnish). So, this would not be an issue. The question remains though whether redirecting to a location with a fragment would really be useful.
On reflection, I am also starting to lean towards "this would not be useful". Even if browsers accept a location header with fragment for primary page requests, this won't work for API requests pulling the content as a sub-resource. None of the browser sub-request primitives (XHR, fetch) expose the post-redirect location (for security reasons), which seems to make it impossible to correctly handle fragments in those cases. Since nobody is going to load raw RB HTML for normal browsing use cases, the API request use case is more critical.

This still would be useful, and it could be a blocker for mobile web. The fragment in this case is not coming from the browser but from MW. One simple but popular example is the redirect from [[Anion]] to [[Ion#Anions_and_cations]]. When I encounter a link to [[Anion]] I want the app to be able to not just go to the [[Ion]] article but also go to the Anions_and_cations section there. I don't want the user to have to scroll to the section, esp. if it's a long article.

The fragment should be in the same encoding as the section anchor to that the scroll to the right section works. Another, more elaborate example is a redirect I set up for testing MCS on testwiki: https://test.wikipedia.org/w/index.php?title=User:BSitzmann_(WMF)/MCS/Test/redirect_test3&redirect=no redirects to https://test.wikipedia.org/w/index.php?title=User:BSitzmann (WMF)/MCS/Test/redirect test3 target#Section %. Note that the fragment anchor is Section_.25. The % is anchorencoded to .25. The encoding issue could go away when Parsoid implements T152540.

The fragment in this case is not coming from the browser but from MW.

To be clear here, the MWAPI returns the value of https://github.com/wikimedia/mediawiki/blob/master/includes/api/ApiPageSet.php#L479 which returns the value of https://github.com/wikimedia/mediawiki/blob/master/includes/Title.php#L1386 is returning that value. ie. There's a bug here: https://github.com/wikimedia/mediawiki/blob/master/includes/Title.php#L1427 in that it doesn't convert Section % to Section_.25.

I know @MaxSem is working on an RFC to fix section heading ids. Hopefully that will also address this problem? If not we should patch core if we're keen to avoid the introduction of the anchorencode and locutus dependencies. This is an edge case after all...

Regarding the app use-case: redirects are handled by RESTBase and MCS has nothing to do with them. We could set the fragment portion in the Location header, that's not too hard, the question is whether the apps would be able to use it after automatically following the redirect? If yes, nothing prevents us from adding the fragment portion.

That's really a question for the app devs. Having worked on the Android app code in the past, I think it would be more convenient to have this in JSON but I believe an HTTP header like location could probably work, too. When the app devs implement this to use the location header they must check if there is a fragment in the location header value.

I would be in favour of the Location header as well. It always pays off to adhere to standards in the long term rather than re-inventing the wheel.

We've added quite a bit of custom logic around request and response headers in the recent past and I think updating to get any URL fragment from the Location header wouldn't be much trouble at all. +1 from me.

GWicke added a comment.EditedAug 7 2017, 5:41 PM

IIRC, the situation around redirect handling is sadly rather complicated, and we should make sure that we fully understand the support situation before committing.

From T118548, I remember that browsers do not expose Location header values, which makes it impossible to discover a section hash sent by the server as part of a redirect response. For non-browser clients, discovering the redirect fragment is possible, but involves manually re-implementing the redirect following logic, including redirect loop protections. For most clients, this seems more trouble than it is worth it.

For simple clients without an interest in the section hash, we should make sure that there are no regressions in usability. This means either that common client libraries reliably strip hashes before following redirects, or that we strip any erroneously sent hashes in the Varnish frontends, before any further request processing. I thought we had such hash stripping logic at some point, but I don't see it any more in the current VCL code.

GWicke triaged this task as Normal priority.Aug 8 2017, 9:42 PM

@bearND is this still an issue?

Just wondering if we need to come up with a solution

bearND added a comment.Apr 9 2018, 9:08 PM

@Fjalapeno Yes, this issue still exists.

curl -sI https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_1.0
HTTP/2 302 
[...]
location: Web_2.0
[...]

I think the first step is for RESTBase to expose the fragment portion in the location header or another HTTP header. Then Android may have to make some adjustments to take advantage of it.

mobrovac claimed this task.Apr 10 2018, 6:22 PM
mobrovac edited projects, added Services (doing), RESTBase-API; removed Services (next).

PR #983 fixes the issue. We should test this against MCS to make sure no breakage occurs due to this change in mobile clients (/cc @bearND @Mholloway)

Also note that we will likely need to make a special script that finds all redirects in RESTBase's storage and re-evaluates them in order to have the new (a.k.a. more complete) list of redirect targets.

Change 428706 had a related patch set uploaded (by BearND; owner: BearND):
[apps/android/wikipedia@master] Scroll to correct location when redirects have fragments

https://gerrit.wikimedia.org/r/428706

Pchelolo closed this task as Resolved.Jul 13 2018, 12:21 PM
Pchelolo edited projects, added Services (done); removed Patch-For-Review, Services (doing).

Now it does.

curl -i https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Web_1.0 | grep -i location

location: Web_2.0#%22Web_1.0%22