
Broken URL for track element on remote repositories
Closed, ResolvedPublic

Description

For videos with subtitles the original HTML (i.e. before manipulation by the JS player) contains track elements like

<track kind="subtitles"
 data-mwtitle="TimedText:Edward_Snowden_speaks_about_NSA_programmes_at_Sam_Adams_award_presentation_in_Moscow.webm.en.srt"
 data-mwprovider="wikimediacommons"
 type="text/x-srt"
 src="https://commons.wikimedia.org/w/index.php?title=:Edward+Snowden+speaks+about+NSA+programmes+at+Sam+Adams+award+presentation+in+Moscow.webm.en.srt&amp;action=raw&amp;ctype=text%2Fx-srt"
 srclang="en"
 data-dir="ltr"
 label="English (en) Untertitel">

(You can get this by putting [[File:Edward Snowden speaks about NSA programmes at Sam Adams award presentation in Moscow.webm]] on a page on a non-Commons site with Commons embedding enabled and previewing it.)

The URL https://commons.wikimedia.org/w/index.php?title=:Edward+Snowden+speaks+about+NSA+programmes+at+Sam+Adams+award+presentation+in+Moscow.webm.en.srt&action=raw&ctype=text%2Fx-srt returns a 404 error; note the missing namespace. It should be https://commons.wikimedia.org/w/index.php?title=TimedText:Edward+Snowden+speaks+about+NSA+programmes+at+Sam+Adams+award+presentation+in+Moscow.webm.en.srt&action=raw&ctype=text%2Fx-srt but even this isn't fully correct: it is served with the MIME type text/x-wiki, and the ctype parameter is ignored.
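The difference between the two URLs can be checked mechanically. The following is a minimal Python sketch using only the standard library; `raw_title` and `has_namespace` are hypothetical helper names for illustration and are not part of MediaWiki or TimedMediaHandler.

```python
from urllib.parse import urlsplit, parse_qs

# The src value reported above (with &amp; already decoded to &).
BROKEN = ("https://commons.wikimedia.org/w/index.php?title=:Edward+Snowden+speaks+about+NSA+programmes"
          "+at+Sam+Adams+award+presentation+in+Moscow.webm.en.srt&action=raw&ctype=text%2Fx-srt")
# The URL the report says it should be: same query, with the namespace restored.
EXPECTED = BROKEN.replace("title=:", "title=TimedText:", 1)


def raw_title(url):
    """Return the decoded `title` query parameter of an index.php URL."""
    return parse_qs(urlsplit(url).query)["title"][0]


def has_namespace(url):
    """A title that starts with ':' has an empty namespace prefix."""
    return not raw_title(url).startswith(":")


if __name__ == "__main__":
    for url in (BROKEN, EXPECTED):
        print(has_namespace(url), raw_title(url))
```

A check like this only detects the empty prefix; it says nothing about whether the server honours the ctype parameter.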

The JS video player retrieves the subtitles using a different API request, so for users with JS enabled this bug is only noticeable in the browser console, but users with disabled JS will not see the subtitles, even if their browser supports the track element.

Event Timeline

Schnark raised the priority of this task from to Needs Triage.
Schnark updated the task description. (Show Details)
Schnark added a project: TimedMediaHandler.
Schnark subscribed.

Users won't see the track elements anyway, because browsers don't support SRT subtitles, but yeah, we should probably fix this regardless.

Hmm, for me it generates:

<track kind="subtitles" data-mwtitle="TimedText:Edward_Snowden_speaks_about_NSA_programmes_at_Sam_Adams_award_presentation_in_Moscow.webm.en.srt" data-mwprovider="local" type="text/x-srt" src="//commons.wikimedia.org/w/index.php?title=TimedText:Edward_Snowden_speaks_about_NSA_programmes_at_Sam_Adams_award_presentation_in_Moscow.webm.en.srt&amp;action=raw&amp;ctype=text%2Fx-srt" srclang="en" data-dir="ltr" label="English (en) subtitles"></track>

On Commons, indeed the correct URL is used, but on the client wikis (tested on (de|en).wikipedia) the namespace is missing. So the issue seems to be that the client wiki doesn't know about the TimedText namespace from Commons and thus omits it.

This seems specific to the Wikimedia setup, probably because Wikimedia uses ForeignDBRepos (not InstantCommons).

To find subtitles in remote repositories, the TimedTextHandler runs an internal allpages API query. For InstantCommons it looks like:

https://commons.wikimedia.org/w/api.php?action=query&list=allpages&apnamespace=102&format=jsonfm&apprefix=Edward

result

{
    "batchcomplete": "",
    "query": {
        "allpages": [
            {
                "pageid": 29034050,
                "ns": 102,
                "title": "TimedText:Edward Snowden speaks about NSA programmes at Sam Adams award presentation in Moscow.webm.en.srt"
            }
        ]
    }
}
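As an illustration of the request and response shown above, here is a minimal Python sketch that builds the same allpages query and pulls the subtitle titles out of the result. The function names are hypothetical, and `format=json` is used instead of `jsonfm` because `jsonfm` wraps the output in HTML for human readers.

```python
import json
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"
TIMEDTEXT_NS = 102  # TimedText namespace id on Commons, as in the query above


def allpages_url(prefix, namespace=TIMEDTEXT_NS, fmt="json"):
    """Build the allpages query URL for pages in `namespace` starting with `prefix`."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "format": fmt,
        "apprefix": prefix,
    }
    return API + "?" + urlencode(params)


def subtitle_titles(response_text):
    """Extract the prefixed titles of TimedText pages from an allpages response."""
    data = json.loads(response_text)
    return [p["title"] for p in data["query"]["allpages"] if p["ns"] == TIMEDTEXT_NS]


# The response quoted above, used here as sample input.
SAMPLE = """
{
    "batchcomplete": "",
    "query": {
        "allpages": [
            {
                "pageid": 29034050,
                "ns": 102,
                "title": "TimedText:Edward Snowden speaks about NSA programmes at Sam Adams award presentation in Moscow.webm.en.srt"
            }
        ]
    }
}
"""

if __name__ == "__main__":
    print(allpages_url("Edward"))
    print(subtitle_titles(SAMPLE))
```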

I think for the internal query, it's the same. For constructing the URL, both foreignDB and InstantCommons use the same function. Therefore, I'm starting to suspect that the internal API, when run against the remote, is the one returning the invalid title :Edward Snowden speaks about NSA programmes at Sam Adams award presentation in Moscow.webm.en.srt which then ends up in the src. Bit hard to debug this...

It seems there is a ForeignApiQueryAllPages in TextHandler, which seems the most likely culprit.

Right, this fails in ApiQueryAllPages::run(). In this function, it uses Title::makeTitle( $row->page_namespace, $row->page_title ); to format the result entry. That means that it will use the Commons namespace id (102) to construct the title in the API result.

'pageid' => intval( $row->page_id ),
'ns' => intval( $title->getNamespace() ),
'title' => $title->getPrefixedText()

So getPrefixedText will generate the faulty title here, because it doesn't know the display name of namespace 102 (which is Commons-specific) when running in the context of another wiki.
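The failure mode can be modelled in a few lines. This is a Python sketch of the observed behaviour only, not of MediaWiki's actual Title code; the namespace tables and the `prefixed_text` function are hypothetical and contain just enough entries for the illustration.

```python
# Hypothetical namespace tables (id -> display name), trimmed for the example.
COMMONS_NAMESPACES = {0: "", 6: "File", 102: "TimedText"}
CLIENT_NAMESPACES = {0: "", 6: "File"}  # assumption: the client wiki has no entry for 102


def prefixed_text(namespaces, ns_id, text):
    """Mimic the observed result: an unknown namespace id yields an empty prefix."""
    if ns_id == 0:
        return text  # main namespace has no prefix and no colon
    name = namespaces.get(ns_id, "")  # unknown id -> empty name, colon is kept
    return f"{name}:{text}"


if __name__ == "__main__":
    page = "Example.webm.en.srt"
    print(prefixed_text(COMMONS_NAMESPACES, 102, page))  # resolved on Commons
    print(prefixed_text(CLIENT_NAMESPACES, 102, page))   # resolved on the client wiki
```

Resolving the same row against the client wiki's table gives the leading colon with no namespace name, which matches the title seen in the broken src attribute.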

This is all quite scary...

This situation can be compared a bit with our shared file description pages. For those we simply do an Http::get, which we cache. Perhaps we should do that here as well, instead of trying to hook up the API to a different database than we are supposed to.

Change 289976 had a related patch set uploaded (by TheDJ):
[WIP] Rewrite discovery of TimedText tracks

https://gerrit.wikimedia.org/r/289976

Geagea subscribed.
This comment was removed by Geagea.

Still not working.

We all know, as this bug report is open (see "Open, Normal" at the top) and the patch in T122737#2314493 is not "merged". No need to tell everybody again... :)

Hi folks. Could we get an update on the progress of solving this bug? Some of the non-English Wikipedia communities want this bug fixed before we continue to add English audio videos with subtitles to non-English Wikipedia pages.

Thanks!

Sorry guys, for the past 3 months I simply have had zero time to work on code for MediaWiki/Wikipedia... And apparently no one else has picked this up so far. I can't make promises as to when I get around to pushing this through the system.

I could not reproduce the bug as described by the reporter. Can we get a confirmation of these steps:

  1. Go to any page on any Wikimedia wiki (I chose Commons)
  2. Enter [[File:Edward Snowden speaks about NSA programmes at Sam Adams award presentation in Moscow.webm]] and save the page
  3. View source for the page that is returned

Chrome's "view source" should not change the output of the server, nor should the browser used. If I have mistaken any other steps, please advise.

( fyi my test page was https://commons.wikimedia.org/wiki/User:MarkTraceur_(WMF)/Caption_test )

@MarkTraceur Wikimedia Commons to my knowledge has always displayed subtitles. The issue is on Wikipedia.

This is the test page I used. https://en.wikipedia.org/wiki/Arteriosclerosis

Click on the video on the right of the page. Then turn on subtitles for any language. They don't display.

Hope that helps!

@OsmoseIt yes, I see the issue on that page. So this appears to be an issue with remote repositories, not a general problem with captions. Investigating...

MarkTraceur renamed this task from Broken URL for track element to Broken URL for track element on remote repositories. Aug 26 2016, 1:35 PM
MarkTraceur updated the task description. (Show Details)
MarkTraceur set Security to None.

In an absurd twist of fate, apparently InstantCommons is not affected by this problem, so it may be that the only affected foreign repos are ForeignDBViaLBRepo instances. Which basically means I'd need to go down one hell of a rabbit hole to debug this locally, which is a pain.

I will endeavour to locate some log files in production first, maybe there's a super-helpful error message that will tell me what is happening (I know, I know, I must be new here)

We really appreciate you looking into this! :)

Sorry, I was a little slow getting up to steam this morning. The patch in Gerrit looks good, and should fix the problem. I'm pinging @TheDJ now in IRC to make sure he doesn't disagree. We should be able to get this done in the next few days.

Hey folks! I wanted to know if we have an estimated release date for this patch? International Wikipedias are beginning to remove videos from their pages because subtitles don't show up. Thanks!

Change 289976 merged by jenkins-bot:
Rewrite discovery of TimedText tracks

https://gerrit.wikimedia.org/r/289976

Change 314856 had a related patch set uploaded (by Paladox):
Rewrite discovery of TimedText tracks

https://gerrit.wikimedia.org/r/314856

Change 314862 had a related patch set uploaded (by Brion VIBBER):
WIP: Rewrite discovery of TimedText tracks

https://gerrit.wikimedia.org/r/314862

Change 314862 merged by jenkins-bot:
Rewrite discovery of TimedText tracks

https://gerrit.wikimedia.org/r/314862

The primary issue is fixed, but now we are stuck with a Cross-Origin violation for some reason...

@brion and I are looking into why.

I'm now not convinced it ever worked right before except directly on the local site... As far as I know we only ever added a cross-origin exception for the API, not for action=raw stuff, so the cross-domain XHRs would not have gone through even if they were correct.

I've revived @TheDJ's serve-TimedText-via-API patch: https://gerrit.wikimedia.org/r/#/c/232214 and tweaked it to use the API on the <track>s plus the appropriate 'origin=*' magic param on remote files, which is the signal needed for anonymous CORS. Should work over both ForeignDBViaLBRepo and ForeignApiRepo regardless of local domain whitelisting... tested in MediaWiki-Vagrant with the 'commons' role, so two separate hostnames.
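As a rough illustration of the 'origin=*' idea, the Python sketch below adds the parameter only for remote files. The `api_url` helper and the example query parameters are hypothetical; the thread does not name the API module the patch uses, so `action=query` stands in as a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def api_url(api_base, params, remote):
    """Build an API URL, adding origin=* when the file lives on a remote repo."""
    query = dict(params)  # copy so the caller's dict is not modified
    if remote:
        # Signal for anonymous CORS; urlencode percent-encodes '*' as %2A,
        # which decodes back to '*' on the receiving side.
        query["origin"] = "*"
    return api_base + "?" + urlencode(query)


if __name__ == "__main__":
    # Placeholder parameters, not the real module from the patch.
    params = {"action": "query", "titles": "TimedText:Example.webm.en.srt", "format": "json"}
    print(api_url("https://commons.wikimedia.org/w/api.php", params, remote=True))
    print(api_url("https://commons.wikimedia.org/w/api.php", params, remote=False))
```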

If that seems reasonably sane we can clean it a little more and land it.

Deployment note: ideally this should hit Commons first so remote API requests from other sites will work, but since the current action=raw requests don't work anyway, they'll just fail in a different way until Commons gets the update too.

@TheDJ figured out what changed -- https://gerrit.wikimedia.org/r/#/c/316068 fixes the data attributes on the text track which we accidentally broke in other updates, and this makes some other hackarounds in the frontend start working again.

With the fix in, the URL in the <track> is actually not used -- it switches over to pulling from the API, over JSONP to work remote without CORS. Hah!

The switch to a dedicated API endpoint will still be superior and should get .vtt working at some point.

Change 316069 had a related patch set uploaded (by Brion VIBBER):
Repair text track attributes

https://gerrit.wikimedia.org/r/316069

Change 316069 merged by jenkins-bot:
Repair text track attributes

https://gerrit.wikimedia.org/r/316069

TheDJ claimed this task.
TheDJ moved this task from Doing to Done on the TimedMediaHandler board.

Right, we can finally call this fixed: both the original breakage of URLs in <track>s and the later issue where wikis were not able to retrieve subtitles from Commons.