Convert TimedText to make use of ContentHandler
Open, MediumPublic

Description

We should definitely rewrite the current TimedText page handling to make use of ContentHandler.

TimedTextContent:

  • subtitle format definition
  • current format
  • current type (captions/subtitles/chapters etc)
  • language ?
  • getDataInFormat( SRT/VTT/SSA )
  • Possibly also separate wikitext block for license ?
  • Validation ?
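
A rough sketch of what such a content object might hold, to make the list above concrete. It is written in Python for brevity; a real implementation would be a PHP Content subclass in MediaWiki. The names `Cue`, `TimedTextContent` and `get_data_in_format` are illustrative assumptions, and SSA output is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def _timestamp(ms: int, sep: str) -> str:
    """Format milliseconds as HH:MM:SS<sep>mmm (sep is ',' for SRT, '.' for WebVTT)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


@dataclass
class TimedTextContent:
    # Hypothetical model: cues plus the metadata listed in the task description.
    cues: List[Cue] = field(default_factory=list)
    kind: str = "subtitles"          # captions/subtitles/chapters etc.
    language: Optional[str] = None

    def get_data_in_format(self, fmt: str) -> str:
        fmt = fmt.lower()
        if fmt == "srt":
            blocks = [
                f"{i}\n{_timestamp(c.start_ms, ',')} --> "
                f"{_timestamp(c.end_ms, ',')}\n{c.text}\n"
                for i, c in enumerate(self.cues, start=1)
            ]
            return "\n".join(blocks)
        if fmt == "vtt":
            blocks = [
                f"{_timestamp(c.start_ms, '.')} --> "
                f"{_timestamp(c.end_ms, '.')}\n{c.text}\n"
                for c in self.cues
            ]
            return "WEBVTT\n\n" + "\n".join(blocks)
        raise ValueError(f"unsupported format: {fmt}")
```

Keeping the cues as structured data and generating each output format from them means the stored form never has to be one of the fragile text formats.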

TimedTextContentHandler

  • serialization
  • initially keep as wikitext
  • future convert to json ? Also store language, type and license ?

Editpage

  • allow you to link with a file ?
  • allow you to set language and type ? (captions/subtitles/chapters etc)

Move the current page view logic into a 'ViewAction'

Event Timeline

TheDJ claimed this task.
TheDJ raised the priority of this task from to Medium.
TheDJ updated the task description.
TheDJ added a project: TimedMediaHandler.
TheDJ added subscribers: TheDJ, brooke.
Restricted Application added a subscriber: Aklapper.

Change 236486 had a related patch set uploaded (by TheDJ):
[WIP] TimedText should use ContentHandler

https://gerrit.wikimedia.org/r/236486

Ideas for a JSON serialization content format (a sketch; the // comments are annotations and would not be part of valid JSON):

{
  "wikipage": "licenses, categories, doc pages, deletion template, redirects, etc ?",
  "track": {
    "type": "text/mw-srt",
    "kind": "captions|subtitles|descriptions|chapters",
    "language": "nl", // optional
    "label": "", // optional, to be shown in player menus ?
    "content": "subtitle file",
    "stylesheet": "" // optional WebVTT stylesheet here ? delivery ?
  }
}
  1. Currently type and language are also stored in the page name. What will the impact be if we decouple from that?
  2. Make track a tracks array indexed by language code? Some subtitle formats support multiple language tracks inside the same file, which would make serializing the language attribute a requirement.
  3. Another attribute of subtitle track behavior that is often captured is "forced", meaning that a player should always enable this specific track.
  4. Other metadata that might be relevant: author(s), date, source URL?
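
To make points 2 and 3 concrete, here is a small Python sketch of reading such a blob with a tracks array and grouping the tracks by language. The `tracks` and `forced` keys are assumptions taken from the ideas above, not an agreed format, and `index_tracks_by_language` is a hypothetical helper.

```python
import json
from typing import Dict, List

VALID_KINDS = {"captions", "subtitles", "descriptions", "chapters"}


def index_tracks_by_language(serialized: str) -> Dict[str, List[dict]]:
    """Group the tracks of one page by language code."""
    data = json.loads(serialized)
    by_language: Dict[str, List[dict]] = {}
    for track in data.get("tracks", []):
        if track.get("kind") not in VALID_KINDS:
            raise ValueError(f"unknown track kind: {track.get('kind')!r}")
        # With several tracks per page, language can no longer be optional.
        if not track.get("language"):
            raise ValueError("language is required when a page holds several tracks")
        track.setdefault("forced", False)  # assumed default: not forced
        by_language.setdefault(track["language"], []).append(track)
    return by_language


# Illustrative data only; "content" would hold the actual subtitle text.
example = json.dumps({
    "wikipage": "[[Category:Example]]",
    "tracks": [
        {"type": "text/mw-srt", "kind": "subtitles", "language": "nl", "content": "..."},
        {"type": "text/mw-srt", "kind": "captions", "language": "nl",
         "content": "...", "forced": True},
        {"type": "text/mw-srt", "kind": "subtitles", "language": "en", "content": "..."},
    ],
})
```

Each language maps to a list rather than a single track, since the same language can carry both subtitles and captions.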

I'm leery of putting the type/kind/language in the content, as that either has to be duplicated in the title or we have to fetch all the content entries to do discovery. Lemme check what kind of metadata we can squeeze into a WebVTT file directly as well (it'd be awesome if we can make the new style WebVTT-only, converting just old-style pages when transitioning, but not sure how easy that'll be).

Having spent much of today poking around the manually-maintained .srt.* pages from the TimedText namespace trying to figure out why literally half of them fail to convert using the captioning/captioning composer package, my inclinations are slightly changed.

(Note there are currently about 3750 .srt pages in TimedText namespace on Commons; I haven't counted any non-Commons local subtitle pages.)

I don't think we should be exposing entire low-level .srt or .vtt files for editing; they're too fragile. I see extra whitespace, removed whitespace, extra leading zeros in milliseconds, and missing leading zeros in milliseconds. I also see cues numbered out of order and cues with wrong end times, both due to cut-n-paste errors.

I recommend that the internal format reflect WebVTT's data model (possibly as a literal text blob, possibly as a JSON object model), with output to flat WebVTT files via the PHP web API (and SRT output if we need to keep the old mwembed player running a while longer; output is easier to manage).

If a good cue-oriented editor with player integration isn't going to happen immediately, we should at least have good input validation that points to syntax errors and prevents them from being saved into the system.
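
As a rough idea of what such validation could look like, here is a strict checker for SRT text, sketched in Python. It reports each problem with a line number instead of silently repairing it. The function name and the exact rules (sequential cue numbers, exactly three millisecond digits, end after start) are my assumptions about what "valid" should mean here.

```python
import re
from typing import List

# Strict SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(
    r"^(\d{2}):([0-5]\d):([0-5]\d),(\d{3}) --> "
    r"(\d{2}):([0-5]\d):([0-5]\d),(\d{3})$"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def validate_srt(text: str) -> List[str]:
    """Return a list of human-readable errors; an empty list means the text passed."""
    errors: List[str] = []
    expected_index = 1
    lines = text.replace("\r\n", "\n").split("\n")
    i = 0
    while i < len(lines):
        if lines[i].strip() == "":
            i += 1  # blank lines separate cues
            continue
        # First line of a cue: its sequence number, with no stray whitespace.
        if lines[i] != str(expected_index):
            errors.append(
                f"line {i + 1}: expected cue number {expected_index}, found {lines[i]!r}"
            )
        # Second line of a cue: the timing line.
        timing_line = lines[i + 1] if i + 1 < len(lines) else ""
        match = TIMING.match(timing_line)
        if match is None:
            errors.append(f"line {i + 2}: malformed timing line {timing_line!r}")
        else:
            groups = match.groups()
            if _to_ms(*groups[4:]) <= _to_ms(*groups[:4]):
                errors.append(f"line {i + 2}: end time is not after start time")
        expected_index += 1
        # Skip the cue text up to the next blank line.
        i += 2
        while i < len(lines) and lines[i].strip() != "":
            i += 1
    return errors
```

An edit form could refuse to save while this list is non-empty and show the messages next to the offending lines.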

Conversion out of .srt should probably be done as part of the transition to a dedicated TimedText ContentHandler, and may require manual cleanup after an automated pass -- supporting multiple backend formats with automatic conversion should probably not be a goal, as it's just messy.

Open questions to resolve:

  • Should the backend blob be a flat WebVTT file or a JSON blob with cues as arrays, and possibly extended metadata? (Metadata could be encoded in WebVTT, so one doesn't lock out the other)
  • if we're dropping the .srt extension, this gives a chance to change names -- switch from TimedText:Foo.webm.en.srt to TimedText:Foo.webm/en ?
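
For illustration, the rename in the second question could be computed mechanically from the old title. This is a Python sketch only: `new_style_title` is a hypothetical helper, and the language-code pattern is an assumption (lowercase, with an optional variant suffix such as zh-hans).

```python
import re
from typing import Optional

# Old style: TimedText:<file name>.<language code>.srt
OLD_TITLE = re.compile(
    r"^TimedText:(?P<file>.+)\.(?P<lang>[a-z]{2,3}(?:-[a-z0-9-]+)?)\.srt$"
)


def new_style_title(old_title: str) -> Optional[str]:
    """Map an old .srt title to the proposed subpage form, or None if it does not match."""
    match = OLD_TITLE.match(old_title)
    if match is None:
        return None
    return f"TimedText:{match['file']}/{match['lang']}"
```

Note that the resulting title still carries only the file name and the language, so anything else (such as the kind of track) would have to live somewhere other than the title.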

> if we're dropping the .srt extension, this gives a chance to change names -- switch from TimedText:Foo.webm.en.srt to TimedText:Foo.webm/en ?

I'd like this, but I note that in theory you can have multiple tracks in the same language. Consider subtitles vs. captions, but also videos with two audio tracks (main audio and director's commentary). Where would that information go in this structure?

P.S. These concerns are part of the reason why I initially used direct references to the subtitle pages in the timedtext API patch, combined with listing such timed text in the videoinfo API. That way, we would be free to name files almost anything, as long as the name was prefixed with the filename and the attributes (kind, format, language etc.) were provided by the ContentHandler.

@brion Are you going to discuss subtitles at the Developer Summit? My proposal for subtitle translation might be ill-timed if these things are going to change a lot in the near future, so if you have plans we could fold these together.

Change 236486 abandoned by TheDJ:
[WIP] TimedText should use ContentHandler

https://gerrit.wikimedia.org/r/236486

TheDJ removed TheDJ as the assignee of this task. (Sep 18 2017, 1:39 PM)