
displaytitle page props should contain html in representation
Open, Medium, Public

Description

The HTML form of the display title is available in data-parsoid:

Π01 class

<meta property="mw:PageProp/displaytitle" content="Π01 class" data-parsoid='{"src":"{{DISPLAYTITLE:&amp;Pi;&lt;sup>0&lt;/sup>&lt;sub style=\"margin-left:-0.5em\">1&lt;/sub> class}}","a":{"content":"Π01 class"},"sa":{"content":"&amp;Pi;&lt;sup>0&lt;/sup>&lt;sub style=\"margin-left:-0.5em\">1&lt;/sub> class"},"dsr":[0,78,null,null]}'/>

However, data-parsoid is stripped for template-generated content:

'Til Death

<meta property="mw:PageProp/displaytitle" content="'Til Death" about="#mwt4"/>

See the expected results at:
https://en.wikipedia.org/w/api.php?action=query&prop=info&inprop=displaytitle&titles=1983+World+Artistic+Gymnastics+Championships%7C%27Til+Death%7C%CE%A001+class&format=jsonfm
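
For reference, the query above can be rebuilt programmatically. This is a minimal sketch using only the Python standard library; the helper name displaytitle_query_url is hypothetical, and the parameters are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in the task description.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def displaytitle_query_url(titles, fmt="jsonfm"):
    """Build an action=query URL that asks for the displaytitle of each title.

    Hypothetical helper for illustration; it only assembles the URL and
    performs no network request.
    """
    params = {
        "action": "query",
        "prop": "info",
        "inprop": "displaytitle",
        # Multiple titles are separated with "|" in a single parameter.
        "titles": "|".join(titles),
        "format": fmt,
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = displaytitle_query_url(
    ["1983 World Artistic Gymnastics Championships", "'Til Death", "Π01 class"]
)
print(url)
```

Passing fmt="json" instead of the default gives the machine-readable variant of the same response.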

The current content is only useful for plain-text cases like:

IPhone

<meta property="mw:PageProp/displaytitle" content="iPhone" about="#mwt3"/>
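
To illustrate what a Parsoid HTML client can currently recover, here is a minimal sketch that reads the plain-text content attribute of the mw:PageProp/displaytitle meta tag. It uses Python's standard html.parser; the class and function names are hypothetical, and the input strings are the snippets quoted above.

```python
from html.parser import HTMLParser


class DisplayTitleParser(HTMLParser):
    """Collects the content of <meta property="mw:PageProp/displaytitle">."""

    def __init__(self):
        super().__init__()
        self.display_title = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta .../>) are also routed here by the
        # default handle_startendtag implementation.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "mw:PageProp/displaytitle":
            self.display_title = attr_map.get("content")


def extract_display_title(html):
    """Return the display title found in html, or None if there is none."""
    parser = DisplayTitleParser()
    parser.feed(html)
    parser.close()
    return parser.display_title


print(extract_display_title(
    '<meta property="mw:PageProp/displaytitle" content="iPhone" about="#mwt3"/>'
))
```

For the 'Til Death snippet the same function returns only the plain string, so the italic markup is not recoverable from this attribute.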

Event Timeline

Arlolra created this task. Jan 6 2016, 6:58 PM
Arlolra raised the priority of this task from to Medium.
Arlolra updated the task description.
Arlolra added a project: Parsoid.
Arlolra added subscribers: Arlolra, Bianjiang.
Restricted Application added a subscriber: Aklapper. Jan 6 2016, 6:58 PM
ssastry added a subscriber: ssastry. Jan 6 2016, 7:02 PM

Any semantic information that is needed in the HTML should be surfaced outside of data-parsoid, even if it is already present there, because data-parsoid is considered private and we should retain the freedom to change its format and contents without worrying about breaking Parsoid HTML clients.

cscott added a subscriber: cscott. Jan 6 2016, 7:38 PM

Do we really want to include styling in the value here? It's just being used for italics in the example, but I'm not completely convinced that <meta property="mw:PageProp/displaytitle" content="<i>'Til Death</i>"> is actually helpful here. If we were to include HTML, we'd want to include a small subset of HTML, not blindly copy through any <span> tags corresponding to internal template markers, etc.

Stripping HTML is "as designed" here... although we can discuss whether the design should be changed.
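
As a rough illustration of the "small subset of HTML" idea, the sketch below keeps only an allowlisted set of inline tags and drops everything else, including attributes and wrapper <span> tags. The allowlist, the decision to strip all attributes, and every name in the code are assumptions made for illustration; this is not how Parsoid behaves, and nothing in this task settles which tags a real design would keep.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist, for illustration only.
ALLOWED_TAGS = {"i", "b", "sup", "sub"}


class TitleSubsetFilter(HTMLParser):
    """Re-serializes a title fragment, keeping only allowlisted tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes (style, about, ...) are dropped in this sketch.
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept and re-escaped so the output stays valid HTML.
        self.parts.append(escape(data, quote=False))


def filter_display_title(fragment):
    """Return fragment with every tag outside ALLOWED_TAGS removed."""
    title_filter = TitleSubsetFilter()
    title_filter.feed(fragment)
    title_filter.close()
    return "".join(title_filter.parts)


print(filter_display_title("<span about=\"#mwt4\"><i>'Til Death</i></span>"))
```

Whether dropping the style attribute is acceptable is itself a design question: in the Π01 class example the negative margin on <sub> is part of how the title is rendered.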

I was using "extracting displaytitle" as an example for my statement "we constantly have backfill requirements, with more and more development happening around APIs".

From an information extraction point of view, a really hard problem is the "C#" language:

https://en.wikipedia.org/w/api.php?action=query&prop=info&inprop=displaytitle&titles=C_Sharp_(programming_language)&format=jsonfm

where we do expect to get the string "C#", as it is mentioned in {{Correct title}}. It would be better if Parsoid could help.

@cscott:
For the styling, from the articles I've seen so far, style by itself often implies semantics.
e.g. according to Wikipedia's style guide [1], an italic title usually implies that the underlying article is about a "major work" rather than a general name. So it's better to have a way to keep such information (e.g. if there is a concern about using <i> in a plain string, maybe you can introduce an additional field such as is_italic to save it).

[1] https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Titles#Major_works

This is required for us to be able to preview the title of a document in VE, for example if I add {{italic title}} to my page, and want to preview the output using Parsoid...

LGoto moved this task from Needs Triage to Backlog on the Parsoid board. Sat, Feb 15, 9:42 PM
Restricted Application added a subscriber: Liuxinyu970226. Sat, Feb 15, 9:42 PM