File: pages of images stored on commons result in 404s
Closed, Resolved · Public

Description

If I request the description page of an image from Parsoid, I get a 404. Parsoid doesn't automatically check Commons for it.
Somehow this works on the regular web site, e.g. via Media Viewer.

Example using Parsoid:
https://en.wikipedia.org/api/rest_v1/page/html/Thought
-> the actual link is "./File:ThinkingMan_Rodin.jpg" or https://en.wikipedia.org/wiki/File:ThinkingMan_Rodin.jpg, but if I didn't know better I would try to request that from Parsoid on the base wiki, and then I get
-> https://en.wikipedia.org/api/rest_v1/page/html/File:ThinkingMan_Rodin.jpg --> 404

Using regular site:
https://en.wikipedia.org/wiki/Thought
-> the actual link is pretty similar: <a href="/wiki/File:ThinkingMan_Rodin.jpg" class="image">
Media Viewer automatically resolves that one to
-> https://commons.wikimedia.org/wiki/File:ThinkingMan_Rodin.jpg --> 200

From @GWicke's comments on #mediawiki-parsoid it sounds like this should be handled by RESTBase and not Parsoid.

Event Timeline

bearND created this task. Nov 10 2015, 7:10 PM
bearND raised the priority of this task from to Needs Triage.
bearND updated the task description.
bearND moved this task to Needs Triage on the Parsoid board.
bearND added subscribers: bearND, GWicke.
Restricted Application added subscribers: StudiesWorld, Steinsplitter, Aklapper. Nov 10 2015, 7:10 PM

@bearND, could you tell us a bit more about your use case? Is this app users browsing to image description pages?

cscott added a subscriber: cscott. Nov 10 2015, 10:03 PM

When you do an imageinfo API request, imagerepository is one of the returned fields. That tells you whether this is a local or foreign page.

For example:
https://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&format=json&iiprop=metadata&titles=File%3ASpechtensee_gegen_Westen_01.JPG

Returns imagerepository: 'shared', and then:
https://en.wikipedia.org/w/api.php?action=query&meta=filerepoinfo&format=json
tells you the scriptDirUrl corresponding to the shared repo name, which lets you know to look at commons.

There's also a descBaseUrl which can be used more directly for a redirect.
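
The two-step lookup described above could be sketched roughly as follows. This is a minimal Python sketch, not code from OCG, Parsoid, or RESTBase: the function names are hypothetical, and the response shapes it assumes (`query.pages.*.imagerepository`, `query.repos[].name`, `query.repos[].descBaseUrl`) are trimmed to the fields mentioned in this comment and should be checked against the live API.

```python
from urllib.parse import quote, urlencode


def imageinfo_url(api_base, title):
    """Build the imageinfo query URL for one file title."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "format": "json",
        "iiprop": "metadata",
        "titles": title,
    }
    return api_base + "?" + urlencode(params)


def description_page_url(imageinfo_response, filerepoinfo_response):
    """Return the description page URL on the foreign repo.

    Returns None when the file is local, or when no matching repo with a
    descBaseUrl is found. Assumption: descBaseUrl already ends with the
    namespace prefix, so only the bare file name is appended.
    """
    pages = imageinfo_response["query"]["pages"]
    page = next(iter(pages.values()))
    repo_name = page.get("imagerepository", "")
    if repo_name in ("", "local"):
        return None
    # Drop the (possibly localized) namespace prefix from the title.
    file_name = page["title"].partition(":")[2].replace(" ", "_")
    for repo in filerepoinfo_response["query"]["repos"]:
        if repo.get("name") == repo_name and "descBaseUrl" in repo:
            return repo["descBaseUrl"] + quote(file_name)
    return None
```

A caller would fetch both API responses once per wiki and per file, then either redirect to the returned URL or serve the local page when the result is `None`.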

GWicke added a comment. Edited Nov 10 2015, 10:06 PM

@cscott, the request is for the image description page, not the thumbnail.

To me, the least painful place to handle this would be RESTBase looking at the title on revision info storage 404, and trying the API and / or commons if the request is for a file description page. The ugly part is the need to recognize localized namespace prefixes, which in turn requires API requests to fetch the configured aliases. However, we could perhaps split out Parsoid's code for this to a shared utility library & use that in RESTBase. We could also make an imageinfo API request for each request for a commons image, but this would likely be slower than a local match & retry.
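
As a rough illustration of the "local match" on localized namespace prefixes, here is a minimal Python sketch. It is not Parsoid's or RESTBase's code; the function names are hypothetical, and it assumes a siteinfo response (`siprop=namespaces|namespacealiases`, legacy JSON format) where each namespace name sits under the `*` key.

```python
FILE_NS_ID = 6  # numeric ID of the File namespace in MediaWiki


def _normalize(prefix):
    """Namespace prefixes are matched case-insensitively; '_' equals ' '."""
    return prefix.strip().replace("_", " ").lower()


def file_namespace_prefixes(siteinfo_response):
    """Collect the local name, canonical name, and aliases of the File namespace."""
    query = siteinfo_response["query"]
    ns = query["namespaces"][str(FILE_NS_ID)]
    names = {ns["*"], ns.get("canonical", ns["*"])}
    for alias in query.get("namespacealiases", []):
        if alias["id"] == FILE_NS_ID:
            names.add(alias["*"])
    return {_normalize(name) for name in names}


def is_file_title(title, prefixes):
    """True if the title has a non-empty name behind a known File prefix."""
    prefix, sep, rest = title.partition(":")
    return bool(sep) and bool(rest) and _normalize(prefix) in prefixes
```

The prefix set only needs to be fetched once per wiki, which is what makes this cheaper than an imageinfo request per page.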

@GWicke the same mechanism is used. OCG is also using this for the image description page (in addition to the thumbnail). That's what descBaseUrl is for -- it points at the appropriate image description page for this image.

At any rate, in order to do the same thing as mediawiki-core does, you need to do an imageinfo request and compare the response against the siteinfo for the wiki, since the imageinfo request doesn't directly name a wiki. And I agree with the suggestion that this is probably best done by RESTBase (or maybe via a special redirect issued by Parsoid) so that we don't fragment RESTBase's cache.

Oh, and the descBaseURL includes the localized namespace prefix AFAIK, FWIW.

GWicke triaged this task as Medium priority. Nov 10 2015, 11:13 PM
GWicke set Security to None.

Originally I had two issues:

  1. The Android app is looking for <a href=... class="image"> to know when the user clicked on an image inside the WebView, to then launch the image gallery and show that image there.
  2. Once the user is viewing the image in the gallery, the app also allows them to view the File page (image description page).

Issue #1 is hopefully solved by the Mobile-Content-Service adding the missing image class attributes. I could use some help refining the CSS selector to find all anchors which need this class added.
So far I've got 'figure a, span[typeof^=mw:Image] a'. Feedback here or directly on https://gerrit.wikimedia.org/r/#/c/252352/ is appreciated to make sure I haven't missed anything.
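
To see which anchors a selector like this picks up, the descendant matching can be approximated with Python's standard-library HTML parser. This is only a sketch for experimenting with sample markup, not the Mobile-Content-Service code; the class and function names are hypothetical, and it hard-codes the two rules from the selector above rather than implementing CSS selectors in general.

```python
from html.parser import HTMLParser

# Elements without an end tag; they are never pushed on the ancestor stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ImageAnchorFinder(HTMLParser):
    """Collect hrefs of <a> inside <figure> or <span typeof^="mw:Image">."""

    def __init__(self):
        super().__init__()
        self._stack = []  # open ancestors as (tag, attrs) pairs
        self.hrefs = []

    def _inside_image_container(self):
        for tag, attrs in self._stack:
            if tag == "figure":
                return True
            if tag == "span" and (attrs.get("typeof") or "").startswith("mw:Image"):
                return True
        return False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self._inside_image_container():
            self.hrefs.append(attrs.get("href"))
        if tag not in VOID_TAGS:
            self._stack.append((tag, attrs))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def find_image_anchors(html):
    finder = ImageAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.hrefs
```

One thing such an experiment makes visible: because 'figure a' is a descendant selector, it also matches links inside a figure's caption, not only the anchor that wraps the image.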

Issue #2 still needs to be solved, and that's what you've been discussing so far. So I think the discussion is going in the right direction. I think this should be fixed server side. A client should not have to deal with all the extra roundtrips.

As an aside, I'd also be interested in a solution/convenient API (npm module?) that I could use in the Mobile-Content-Service to get access to the various local namespace prefixes (File, Special, Talk seem to be the most interesting to me) and to what the main page is called. In the Android app code we have a script which does something like that and compiles the results into Java code. A more elegant solution would be preferred, though.

Niedzielski added a subscriber: Niedzielski. Edited Mar 14 2016, 3:44 PM

This issue is still present in the latest beta, v2.1.142.

We basically need better handling of shared repos (commons).
@GWicke Is this something that could get handled by RESTBase soon? It looks a bit ugly in the Android app to get a 404 when trying to go to the File page of an image.

We can handle it in RESTBase quite easily, so if Mobile needs a quick solution, I can do it tomorrow.

As we now have a title parsing library, we can check whether the requested page is in the File namespace. So we would request the page from storage and, on a 404, redirect to the shared repository for this wiki. The downside is elevated latency due to the redirect. As a small optimisation, we can implement support for the cache-control: only-if-cached header to avoid a roundtrip to the MW API on the first request to storage; however, latency would still not be perfect.
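
The storage-then-redirect flow described above could look roughly like this. It is a minimal Python sketch under stated assumptions, not the RESTBase implementation: `serve_html`, `fetch_local`, and `is_file_title` are hypothetical names, the 302 status is an assumption, and the base URL of the shared repository is passed in rather than looked up. The only-if-cached optimisation is not covered.

```python
from urllib.parse import quote


def serve_html(title, fetch_local, is_file_title, shared_base_url):
    """Return (status, headers, body) for an HTML request.

    fetch_local(title) -> (status, body) stands in for the storage lookup.
    is_file_title(title) -> bool stands in for the title parsing library.
    shared_base_url is the URL prefix on the shared repository, assumed to
    end with the File namespace prefix.
    """
    status, body = fetch_local(title)
    if status != 404 or not is_file_title(title):
        # Found locally, failed for another reason, or not a file page.
        return status, {}, body
    # Local 404 on a file page: point the client at the shared repository.
    file_name = title.partition(":")[2].replace(" ", "_")
    return 302, {"location": shared_base_url + quote(file_name)}, ""
```

The redirect costs the client one extra roundtrip, which is the latency downside mentioned above.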

A proper solution would be to implement it in Parsoid. Since Parsoid has access to image info, it knows that a particular file is on commons, so it should use a commons URI in the href and resource attributes of the image tags. This would allow mobile apps to request a page from the proper location right away and avoid higher latency for file description pages. @ssastry what do you think?

@bearND If this is a high-priority blocker, we can implement a RESTBase workaround quickly and then follow up with a proper implementation in Parsoid.

GWicke raised the priority of this task from Medium to High. Mar 16 2016, 5:10 AM
GWicke added a project: Services-next.

This one fell off my radar because it looked like the discussion had settled on RESTBase doing this. But, @cscott, is there a reason not to do what @Pchelolo suggests, i.e. use fully resolved URLs for the resource and href in Parsoid's HTML? I cannot think of any right now.

@ssastry We will handle it in RESTBase anyway, because we have lots of stored content and we need to be compatible with that stored content. But if Parsoid does that, then for newer content latency will be lower, and we'd be able to remove a hack from RESTBase at some point.

Right, understood.

GWicke added a comment. Edited Mar 17 2016, 12:13 AM

PR here: https://github.com/wikimedia/restbase/pull/556

This is now merged in RESTBase master. ETA for the deploy looks like Monday at this point.

Edit: Actually, Tuesday is more likely.

Pchelolo claimed this task. Mar 23 2016, 5:50 PM
Pchelolo closed this task as Resolved. Mar 28 2016, 9:20 PM

The change has been deployed, so mobile apps will now get a redirect for these pages. Redirecting more generally is blocked on VE issue T130757, but this task can be resolved now.

T118548: Support following MediaWiki redirects when retrieving HTML revisions discusses options for more general redirect handling, which we can then use to generally support following redirects to commons as well.