Simple html formatting within Wikidata labels
Open, Lowest, Public

Description

This has been mentioned several times on the French Wikipedia (last time here): the simple HTML formatting usually used in {{DISPLAYTITLE}} for article titles should be usable in Wikidata labels. For instance, there should be some way to enter a superscript in the Wikidata interface, have it displayed in the title of the item and wherever the item appears in other items, and have it passed to Wikipedia for display ({{#property:}} or mw.wikibase.getEntity etc.) as <sup>.

The problem affects all languages that use superscripts in the standard spelling of some terms. In French, for instance, the numbering for royalty and nobility starts with "Ier" (feminine "Ire"), with the "er" ("re") part in superscript. Streets called "rue François Ier" always have a superscript "er" on the street sign, and there is no other way to write it in French; "Ier" without a superscript looks like a non-existent name or word "ier". There have been isolated attempts to use the special Unicode characters for (phonetic) superscripts, and even for (math) italics, in Wikidata labels. These characters are difficult to type, and they render badly: the phonetic superscript series, for instance, was introduced into Unicode in several parts at different times, and most browsers and systems use fonts that show differences between those parts, such as the superscript e and r not sitting at the same height. So this is not a solution.

Event Timeline

The development team has indicated that they have no intention of adding that feature; see T43749: Italic do not work on item's title.
You may use the characters ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ instead (note there is no q).

@Oliv0 and @Bugreporter: ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ are not meant to be used as typographic superscripts. They are modifier letters, meant for very specific phonetic uses.

See http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#page=36
Only those superscript or subscript forms that have specific usage in IPA, the
Uralic Phonetic Alphabet (UPA), or other major phonetic transcription systems are
encoded.

See also http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#page=38
Superscript modifier letters are intended for cases where the letters carry a specific meaning,
as in phonetic transcription systems, and are not a substitute for generic styling mechanisms
for superscripting of text, as for footnotes, mathematical and chemical expressions,
and the like.

One should not use characters ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ for typographic superscripts!
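The point about modifier letters can be checked against the Unicode character database. The following is a minimal Python sketch, not part of the original discussion; the helper name `describe` is made up for illustration. It lists the code point, name, and general category of the two characters used in "Iᵉʳ".

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# "\u1d49" is the raised e and "\u02b3" the raised r from the list above.
for row in describe("\u1d49\u02b3"):
    print(row)
```

Both characters report the general category `Lm` (modifier letter), and their code points sit far apart (U+1D49 versus U+02B3), which is consistent with the remark that the series was added to Unicode in several parts.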

You may use the characters ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ instead (note there is no q).

This is a very very very bad idea.
The goal is to improve the formatting, not to worsen it (François Ier is correct, although not perfect; François Iᵉʳ is not correct).

I just removed all the IPA superscript code points wherever I found them.

To answer the request for a collection of use cases by @daniel in T43749#475002, what I had in mind so far is French abbreviations like Napoléon I<sup>er</sup>, and titles like 1922 <i>Encyclopædia Britannica</i>, but more uses of simple html can certainly be found by other users.

There are several use cases for superscript in various languages:

I propose creating a new datatype for wikitext. We could then create new properties for labels in wikitext (plain labels would still be needed).

Labels are not statements, they do not have data types. They are just plain text.

Adding a wikitext (or other markup) datatype would allow you to provide statement values with markup, but would do nothing for labels, descriptions, or aliases.

I'm not convinced that the benefit this would have really outweighs the numerous complications it causes. However, if we decide to support markup in labels (and descriptions and aliases), I vote to:

  • Only allow a very restricted set of HTML tags. Perhaps <sup>, <sub>, <tt>, and <i>.
  • Definitely disallow all block level elements, all media elements, all kinds of links, all metadata.
  • Not allow any attributes, especially no style, class, or id attributes.
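To make these three rules concrete, here is a minimal Python sketch of an allowlist filter. It is only an illustration under stated assumptions, not how Wikibase does or would implement it: the function name `sanitize_label` and the allowlist of `sup`, `sub`, `tt`, and `i` are hypothetical, and anything that is not a bare allowed tag (including any tag carrying attributes) is escaped and shown verbatim.

```python
import html
import re

# Hypothetical allowlist: bare inline tags only, no attributes.
ALLOWED_TAGS = {"sup", "sub", "tt", "i"}

# Matches only attribute-free tags such as <i> or </sup>.
_TOKEN_RE = re.compile(r"<(/?)([a-z]+)>")


def sanitize_label(raw):
    """Keep allowed, properly nested tags; escape everything else."""
    out = []
    open_tags = []
    pos = 0
    for m in _TOKEN_RE.finditer(raw):
        # Text between tags is always escaped.
        out.append(html.escape(raw[pos:m.start()], quote=False))
        closing, tag = m.group(1), m.group(2)
        if tag in ALLOWED_TAGS and not closing:
            open_tags.append(tag)
            out.append(m.group(0))
        elif tag in ALLOWED_TAGS and open_tags and open_tags[-1] == tag:
            open_tags.pop()
            out.append(m.group(0))
        else:
            # Disallowed or unbalanced tag: show it verbatim.
            out.append(html.escape(m.group(0), quote=False))
        pos = m.end()
    out.append(html.escape(raw[pos:], quote=False))
    # Close any allowed tags that were left open.
    out.extend(f"</{t}>" for t in reversed(open_tags))
    return "".join(out)


print(sanitize_label("Napoléon I<sup>er</sup>"))
print(sanitize_label('<a href="x">link</a> and <i style="color:red">styled</i>'))
```

The first call passes the superscript through unchanged; the second escapes both the link and the styled tag, because neither is a bare allowed tag.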

A big problem is escaping. Do we want it to be possible to have a label that is literally "<sub>foo</sub>", with the markup not being interpreted, but shown verbatim? If so, how? Should we use &lt;sub&gt;foo&lt;/sub&gt;? If we do that, we need to escape all occurrences of "&" in any label, which may confuse existing clients that expect labels to be plain text.
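The compatibility concern can be illustrated with a short Python sketch. The function names are hypothetical and this is not an existing Wikibase API; it only shows what entity-escaping would do to stored labels, and what a client that still treats labels as plain text would then display.

```python
import html


def encode_literal(label):
    """Escape a label so that markup characters are stored as entities."""
    return html.escape(label, quote=False)


def decode_literal(stored):
    """What a markup-aware client would do to recover the literal text."""
    return html.unescape(stored)


for label in ["<sub>foo</sub>", "AT&T"]:
    stored = encode_literal(label)
    # A legacy client prints `stored` as-is and would show the entities.
    print(f"literal: {label!r}  stored: {stored!r}  decoded: {decode_literal(stored)!r}")
```

Note that even a label with no markup at all, such as "AT&T", changes its stored form once "&" has to be escaped, which is the part that would surprise existing plain-text clients.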

Transitioning from "plain text" to "limited html markup" is going to be tricky. I see no clean way to do this, at least none that wouldn't break compatibility with existing clients.

An alternative to escaping would be to have a flag that tells Wikibase whether a label contains markup or not. That way, it would always be clear when escaping (and unescaping) needs to be applied. This would mean changing the JSON serialization of labels. We currently use this kind of structure:

"labels":{"eo":{"language":"eo","value":"George Washington"},
  "pl":{"language":"pl","value":"George Washington"},
  "fr":{"language":"fr","value":"George Washington"}
}

We would then need something like

"labels":{"eo":{"language":"eo","value":"George Washington"},
  "pl":{"language":"pl","value":"George Washington"},
  "fr":{"language":"fr","value":"George Washington I<sup>er</sup>","markup":"html"}
}

That's also a breaking change, and may confuse clients, but perhaps not in a totally terrible way.
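A rough Python sketch of how a client might consume such a flag follows. The `"markup"` key is the hypothetical field from the example above, and the helper names are made up for illustration; a label without the flag is treated as plain text, as today.

```python
import html
import json
import re

# Hypothetical serialization with the proposed "markup" flag.
SERIALIZED = """
{"labels": {
  "eo": {"language": "eo", "value": "George Washington"},
  "fr": {"language": "fr", "value": "George Washington I<sup>er</sup>", "markup": "html"}
}}
"""

_TAG_RE = re.compile(r"</?[a-z]+>")


def label_as_html(entry):
    """Return a value that is safe to embed in HTML."""
    if entry.get("markup") == "html":
        return entry["value"]  # already markup; assumed sanitized on save
    return html.escape(entry["value"], quote=False)


def label_as_plain(entry):
    """Return a plain-text form, e.g. for search or recent changes."""
    if entry.get("markup") == "html":
        return html.unescape(_TAG_RE.sub("", entry["value"]))
    return entry["value"]


labels = json.loads(SERIALIZED)["labels"]
for lang, entry in labels.items():
    print(lang, "|", label_as_html(entry), "|", label_as_plain(entry))
```

A client that ignores the flag would still get a readable string, but with the tags shown literally, which matches the "breaking, but perhaps not terrible" assessment.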

@daniel My preference is to keep labels as plain strings, and add a new wikitext datatype, so that we can create a new property "label in wikitext". Raw labels are still needed for searching and for display on pages like recent changes (just like article titles/page names).

@Bugreporter when and where would that property be used? Wikibase wouldn't know about it, and would not use it instead of the regular label.

To me that seems like a very different request: having a datatype that supports (limited) markup (wikitext or HTML or whatever), vs. supporting markup in labels. These two may seem similar at first glance, but they are very different from a technical perspective, and also semantically: labels are editorial content originated by the Wikidata community, while statements are supposed to represent claims made by authorities documented in reliable sources. Allowing wikitext properties would thus not address this request as currently phrased.

So before discussing this any further, I suggest we make clear which semantics are desired, and what the intended use cases are.

For use cases of wikitext datatype see T141764: Add new datatypes for wikitext. For this bug, it can be used in infoboxes.

I don't think there should be markup in labels (not easy to search), and adding a wikitext datatype meets the needs.

@Oliv0 would markup in statement values (see T141764) address the use cases you had in mind when filing this? If so, we can close this task, I think. Supporting markup in labels would be *much* more disruptive than adding a new data type.

@daniel Markup in statement values is a different question (it has been raised a few times on frwiki, e.g. for image labels, which frequently use wikitext). This task is about what is displayed in Wikipedia by {{#property:}} or mw.wikibase.getEntity etc., that is labels, and about a very restricted use of HTML as you suggested.
If a "label with wikitext" multilingual wikitext property, when its value exists, could be used by {{#property:}} etc. instead of the label value, this could do the job but it would be a strange data structure, with the displayed label info coming both from the item label and from the value of a property.

@Oliv0 I'd prefer a Lua module to implement the "show statement value instead of label if it exists" logic. This logic could be adjusted on-wiki, if needed. {{#property:}} should be reserved for the "low level" direct access. The module call can easily be wrapped in a template, so you would just have {{pretty-prop:P1234}} in the wikitext.
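The fallback logic would live in a Lua module on-wiki; purely to illustrate the idea, here is the same logic as a minimal Python sketch over entity JSON. Everything specific in it is an assumption: the function name `pretty_label` is made up, "P1234" is the placeholder property ID from the comment above, and the statement value is assumed to carry `text` and `language` fields like a monolingual text value.

```python
import html


def pretty_label(entity, lang, markup_property="P1234"):
    """Prefer a markup statement value in `lang`; fall back to the label."""
    for statement in entity.get("claims", {}).get(markup_property, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak.get("datavalue", {}).get("value", {})
        if value.get("language") == lang:
            return value["text"]
    label = entity.get("labels", {}).get(lang)
    # Plain labels are escaped so the result is always HTML.
    return html.escape(label["value"], quote=False) if label else None


# Hypothetical entity data, for illustration only.
entity = {
    "labels": {
        "fr": {"language": "fr", "value": "Napoléon Ier"},
        "en": {"language": "en", "value": "Napoleon I"},
    },
    "claims": {
        "P1234": [
            {"mainsnak": {
                "snaktype": "value",
                "datavalue": {"value": {"text": "Napoléon I<sup>er</sup>",
                                        "language": "fr"}},
            }}
        ]
    },
}

print(pretty_label(entity, "fr"))  # uses the statement value
print(pretty_label(entity, "en"))  # falls back to the plain label
```

Keeping this in a module rather than in {{#property:}} means the preference order can be changed on-wiki without touching the low-level access path.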

I can see that limited markup in labels would be nice for some use cases, but introducing it now opens a whole can of ugly worms in terms of compatibility.

@daniel Yes, that would be satisfying for Wikipedia users, but maybe not for Wikidata users who would like to have some way to enter a label in the user interface with the proper typography (French I<sup>er</sup>, Spanish 3.<sup>er</sup> etc.) and see it displayed in the title of the item and as a property value in other items.
As a comparison for English-language users, this is as if for some technical reason they could not use in Wikidata labels a character which is used in English typography but is not really pronounced, for instance the quote character ("): labels would still be quite readable, but users would like to get this fixed.

thiemowmde triaged this task as Lowest priority. Sep 5 2016, 4:02 PM

Is this going anywhere? There is still a need to be able to display an accurate version of many labels. Chemical and mathematical formulas need subscripts and superscripts. Species names need italics. For a concrete example, see https://www.wikidata.org/wiki/Q92315294 where the title should be displayed as "Reinstatement of <i>Ptilotus parviflorus</i> (Lindl.) F.Muell. (Amaranthaceae)".

Could we use a qualifier like "named as" or "stated as" to make this work?

Or maybe this is "title in HTML" P6833, which I just stumbled across? Wikidata sure is opaque.