Convert TimedText to make use of ContentHandler
Open, Medium, Public

Assigned To

None

Authored By

	TheDJ
	Sep 6 2015, 8:27 PM

Description

We should rewrite the current TimedText implementation to make use of ContentHandler.

TimedTextContent:

subtitle format definition
current format
current type (captions/subtitles/chapters etc)
language ?
getDataInFormat( SRT/VTT/SSA )
Possibly also separate wikitext block for license ?
Validation ?
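
The TimedTextContent idea above, with its getDataInFormat( SRT/VTT/SSA ) accessor, could look roughly like the following. This is only a sketch in Python rather than MediaWiki PHP: the Cue class, the field names and the millisecond representation are assumptions made for illustration, and SSA output is left unimplemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cue:
    """One subtitle cue; times are in milliseconds (an assumption of this sketch)."""
    start_ms: int
    end_ms: int
    text: str


def _timestamp(ms: int, sep: str) -> str:
    # SRT separates the milliseconds with ",", WebVTT with ".".
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


@dataclass
class TimedTextContent:
    """Hypothetical content object holding the type, language and cues."""
    kind: str = "subtitles"  # captions/subtitles/chapters etc.
    language: Optional[str] = None
    cues: List[Cue] = field(default_factory=list)

    def get_data_in_format(self, fmt: str) -> str:
        fmt = fmt.lower()
        if fmt == "srt":
            blocks = [
                f"{i}\n{_timestamp(c.start_ms, ',')} --> "
                f"{_timestamp(c.end_ms, ',')}\n{c.text}"
                for i, c in enumerate(self.cues, start=1)
            ]
            return "\n\n".join(blocks) + "\n"
        if fmt == "vtt":
            blocks = [
                f"{_timestamp(c.start_ms, '.')} --> "
                f"{_timestamp(c.end_ms, '.')}\n{c.text}"
                for c in self.cues
            ]
            return "WEBVTT\n\n" + "\n\n".join(blocks) + "\n"
        # SSA and anything else is not covered by this sketch.
        raise ValueError(f"unsupported format: {fmt}")
```

Keeping cues as structured data and generating SRT or VTT on output means the stored form never depends on hand-edited timestamps.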

TimedTextContentHandler

serialization
initially keep as wikitext
future convert to json ? Also store language, type and license ?

Editpage

allow you to link with a file ?
allow you to set language and type ? (captions/subtitles/chapters etc)

Move the current page view logic into a 'ViewAction'

Details

	Subject	Repo	Branch	Lines +/-
	[WIP] TimedText should use ContentHandler	mediawiki/extensions/TimedMediaHandler	master	+521 -14

Related Objects

Status	Subtype	Assigned	Task
Open		None	T44364 Can't mark new timed texts as patrolled
Resolved		TheDJ	T89527 Timed Text redirect pages not accessible
Open		None	T134910 Set correct lang and dir attributes the individual subtitles fragments
Open		None	T78509 Templates on TimedText namespace of wikimedia commons don't expand
Open		None	T123232 No documentation can be added to TimedText pages
Open	Feature	None	T51409 No video on previewing TimedText changes
Open	BUG REPORT	None	T304784 <translate> tag should not be parsed in TimedText
Open		None	T111651 Convert TimedText to make use of ContentHandler
Resolved		Reedy	T145732 TimedMediaHandler should not call Article::getContent()

Event Timeline

TheDJ created this task. Sep 6 2015, 8:27 PM

TheDJ claimed this task.

TheDJ raised the priority of this task from to Medium.

TheDJ updated the task description. (Show Details)

TheDJ added a project: TimedMediaHandler.

TheDJ added subscribers: TheDJ, brooke.

Restricted Application added a project: Multimedia. Sep 6 2015, 8:27 PM

Restricted Application added a subscriber: Aklapper.

Change 236486 had a related patch set uploaded (by TheDJ):
[WIP] TimedText should use ContentHandler

https://gerrit.wikimedia.org/r/236486

Restricted Application added a subscriber: Matanya. Sep 6 2015, 8:30 PM

gerritbot added a project: Patch-For-Review. Sep 6 2015, 8:30 PM

Jdforrester-WMF moved this task from Untriaged to Doing on the Multimedia board. Sep 8 2015, 12:32 AM

Jdforrester-WMF moved this task from Doing to Prototyping on the Multimedia board. Sep 8 2015, 3:40 PM

Jdforrester-WMF removed a project: Multimedia. Sep 21 2015, 4:00 PM

Liuxinyu970226 set Security to None. Oct 5 2015, 12:52 PM

Liuxinyu970226 subscribed.

TheDJ moved this task from To sort to Doing on the TimedMediaHandler board. Oct 20 2015, 11:55 AM

TheDJ added a parent task: T43694: Add support for multiple types of subtitle types such as closed captioning, audio descriptions and director comments. Oct 21 2015, 12:29 PM

TheDJ mentioned this in T44364: Can't mark new timed texts as patrolled. Oct 26 2015, 10:23 AM

TheDJ added a parent task: T44364: Can't mark new timed texts as patrolled.

TheDJ added a parent task: T89527: Timed Text redirect pages not accessible. Oct 26 2015, 10:42 AM

TheDJ added a parent task: T63923: Diff of image pages don't have js execute properly even though needed for file history. Oct 26 2015, 10:45 AM

TheDJ removed a parent task: T63923: Diff of image pages don't have js execute properly even though needed for file history.

Krenair subscribed. Oct 26 2015, 12:37 PM

TheDJ added a project: TimedMediaHandler-TimedText. Oct 26 2015, 8:37 PM

TheDJ added a parent task: T116154: Support WebVTT subtitling. Nov 19 2015, 9:28 AM

TheDJ added a parent task: T134910: Set correct lang and dir attributes the individual subtitles fragments. May 11 2016, 6:41 PM

TheDJ added a parent task: T145732: TimedMediaHandler should not call Article::getContent(). Sep 16 2016, 11:24 AM

Reedy added a parent task: T145728: Clean up ContentHandler deprecated functions and hooks. Oct 8 2016, 12:11 PM

Reedy removed a parent task: T145732: TimedMediaHandler should not call Article::getContent().

Reedy added a subtask: T145732: TimedMediaHandler should not call Article::getContent().

Reedy closed subtask T145732: TimedMediaHandler should not call Article::getContent() as Resolved. Oct 8 2016, 4:02 PM

Paladox subscribed. Oct 10 2016, 6:04 PM

Ideas for a JSON serialization content format:

{
  "wikipage": "licenses, categories, doc pages, deletion template, redirects, etc ?",
  "track": {
    "type": "text/mw-srt",
    "kind": "captions|subtitles|descriptions|chapters",
    "language": "nl", // optional
    "label": "", // optional, to be shown in player menus ?
    "content": "subtitle file",
    "stylesheet": "" // optional WebVTT stylesheet here ? delivery ?
  }
}

Currently, type and language are also stored in the page name. What will the impact be if we decouple from that?
Make track a tracks array indexed by language code? Some subtitle formats support multiple language tracks inside the same file, which would make serializing the language attribute a requirement.
Another attribute of subtitle track behavior that is often captured is "forced", meaning that a player should always enable this specific track.
Other metadata that might be relevant: author(s), date, source URL?
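
As a rough illustration of how such a JSON blob might be read and checked, here is a small Python sketch. Which keys are required, the default values, and the function name load_timed_text are all assumptions made for the example; the thread does not settle any of them. The "forced" flag is included only because it is floated above.

```python
import json

# The four kinds listed in the draft format above.
ALLOWED_KINDS = {"captions", "subtitles", "descriptions", "chapters"}


def load_timed_text(blob: str) -> dict:
    """Parse a serialized TimedText blob into a flat dict (hypothetical helper)."""
    data = json.loads(blob)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    track = data.get("track")
    if not isinstance(track, dict):
        raise ValueError("missing 'track' object")
    # Assumption: type, kind and content are mandatory; the rest is optional.
    for key in ("type", "kind", "content"):
        if not isinstance(track.get(key), str):
            raise ValueError(f"track.{key} must be a string")
    if track["kind"] not in ALLOWED_KINDS:
        raise ValueError(f"unknown track kind: {track['kind']}")
    return {
        "wikipage": data.get("wikipage", ""),
        "type": track["type"],
        "kind": track["kind"],
        "language": track.get("language"),  # optional
        "label": track.get("label", ""),  # optional
        "content": track["content"],
        "stylesheet": track.get("stylesheet", ""),  # optional
        "forced": bool(track.get("forced", False)),
    }
```

Note that strict JSON has no comments, so the // annotations in the draft would have to be dropped before a blob like this could be parsed.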

TheDJ added a parent task: T78509: Templates on TimedText namespace of wikimedia commons don't expand. Oct 20 2016, 9:37 AM

TheDJ removed a parent task: T145728: Clean up ContentHandler deprecated functions and hooks. Oct 20 2016, 9:39 AM

TheDJ added a parent task: T123232: No documentation can be added to TimedText pages. Oct 20 2016, 9:42 AM

TheDJ added a parent task: T51409: No video on previewing TimedText changes.

I'm leery of putting the type/kind/language in the content, as that either has to be duplicated in the title or we have to fetch all the content entries to do discovery. Lemme check what kind of metadata we can squeeze into a WebVTT file directly as well. (It'd be awesome if we could make the new style WebVTT-only, converting just the old-style pages when transitioning, but I'm not sure how easy that'll be.)

Having spent much of today poking around the manually-maintained .srt.* pages from the TimedText namespace trying to figure out why literally half of them fail to convert using the captioning/captioning composer package, my inclinations are slightly changed.

(Note there are currently about 3750 .srt pages in TimedText namespace on Commons; I haven't counted any non-Commons local subtitle pages.)

I don't think we should be exposing entire low-level .srt or .vtt files for editing; they're too fragile. I see extra whitespace, missing whitespace, extra leading zeros in milliseconds, and missing leading zeros in milliseconds, as well as cues with out-of-order numbering and cues with wrong end times, both due to cut-and-paste errors.

I recommend that the internal format reflect WebVTT's data model (possibly as a literal text blob, possibly as a JSON object model), with output to flat WebVTT files via the PHP web API (and SRT output if we need to keep the old mwembed player running a while longer; output is easier to manage).

If a good cue-oriented editor with player integration isn't going to happen immediately, we should at least have good input validation that points to syntax errors and prevents them from being saved into the system.
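
A minimal sketch of that kind of save-time validation, in Python and for .srt input only, might look like this. The function name validate_srt, the strict two-digit/three-digit timestamp pattern and the error wording are assumptions for illustration; it targets the problems described above (malformed millisecond fields, out-of-order cue numbers, end times that precede start times).

```python
import re
from typing import List

# Strict SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm (strictness is an assumption).
TIMING = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def validate_srt(text: str) -> List[str]:
    """Return human-readable errors with 1-based line numbers; empty if clean."""
    errors: List[str] = []
    lines = text.replace("\r\n", "\n").split("\n")
    expected = 1
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # blank lines separate cues
            continue
        # First line of a cue block: the sequence number.
        if lines[i] != str(expected):
            errors.append(
                f"line {i + 1}: expected cue number {expected}, found {lines[i]!r}"
            )
        expected += 1
        i += 1
        # Second line: the timing declaration.
        match = TIMING.match(lines[i]) if i < len(lines) else None
        if match is None:
            errors.append(f"line {i + 1}: malformed timing line")
        else:
            groups = match.groups()
            if _to_ms(*groups[4:]) <= _to_ms(*groups[:4]):
                errors.append(f"line {i + 1}: cue does not end after it starts")
        # Remaining lines up to the next blank line are cue text.
        while i < len(lines) and lines[i].strip():
            i += 1
    return errors
```

An edit page could refuse to save while this list is non-empty and show the messages next to the offending lines.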

Conversion out of .srt should probably be done as part of the transition to a dedicated TimedText ContentHandler, and may require manual cleanup after an automated pass. Supporting multiple backend formats with automatic conversion should probably not be a goal, as it's just messy.

Open questions to resolve:

Should the backend blob be a flat WebVTT file or a JSON blob with cues as arrays, and possibly extended metadata? (Metadata could be encoded in WebVTT, so one doesn't lock out the other)
if we're dropping the .srt extension, this gives a chance to change names -- switch from TimedText:Foo.webm.en.srt to TimedText:Foo.webm/en ?
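
To make the renaming idea concrete, here is a small Python sketch that maps an old-style title to the proposed subpage form. The regular expression and the function name to_subpage_title are assumptions for illustration, and the sketch covers language only, not track kind.

```python
import re

# Old-style title: TimedText:<file name>.<language code>.srt
LEGACY_TITLE = re.compile(
    r"^TimedText:(?P<file>.+)\.(?P<lang>[A-Za-z][A-Za-z0-9-]*)\.srt$"
)


def to_subpage_title(legacy: str) -> str:
    """Convert e.g. TimedText:Foo.webm.en.srt to TimedText:Foo.webm/en."""
    match = LEGACY_TITLE.match(legacy)
    if match is None:
        raise ValueError(f"not an old-style TimedText title: {legacy}")
    return f"TimedText:{match['file']}/{match['lang'].lower()}"
```

For example, to_subpage_title("TimedText:Foo.webm.en.srt") gives "TimedText:Foo.webm/en".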

TheDJ mentioned this in T44495: Subtitle translation with Translate extension. Oct 31 2016, 12:26 PM

JeanFred subscribed. Oct 31 2016, 12:45 PM

if we're dropping the .srt extension, this gives a chance to change names -- switch from TimedText:Foo.webm.en.srt to TimedText:Foo.webm/en ?

I'd like this, but I note that in theory you can have multiple tracks in the same language. Consider subtitles vs. captions, but also videos with two audio tracks (main audio and director's commentary). Where would that information go in this structure?

P.S. These concerns are part of the reason why I had initially used direct references to the subtitle pages in the timedtext API patch, combined with listing such timed text in the videoinfo API. That way, we would be free to name files almost anything, as long as the name was prefixed with the filename and the attributes (kind, format, language, etc.) were provided by the ContentHandler.

@brion Are you going to discuss subtitles at the Developer Summit? My proposal for subtitle translation might be ill-timed if these things are going to change a lot in the near future, so if you have plans we could fold these together.

Nikerabbit mentioned this in T151958: Annotations at Wikidev'17. Dec 7 2016, 3:21 PM

Change 236486 abandoned by TheDJ:
[WIP] TimedText should use ContentHandler

https://gerrit.wikimedia.org/r/236486

TheDJ removed TheDJ as the assignee of this task. Sep 18 2017, 1:39 PM

TheDJ moved this task from Doing to TimedText on the TimedMediaHandler board. May 8 2019, 11:21 AM

Jdforrester-WMF removed a project: Patch-For-Review. Jun 6 2019, 9:51 PM

Aklapper removed a parent task: T66031: [DO NOT USE] Video subtitle support [superseded by #TimedMediaHandler-TimedText]. Jun 24 2019, 9:16 AM