
Add new datatypes for wikitext
Open, Medium, Public

Description

We may create a new datatype for wikitext. Values may contain simple HTML formatting (bold, italics, etc.) and links (internal, external), but not templates, parser functions, magic words or most tags. The text may be:

  1. Untranslated wikitext
  2. Monolingual wikitext
  3. Multilingual wikitext (most usually)
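The restriction described above (simple formatting and links allowed; templates, parser functions, magic words and most tags rejected) could be checked before a value is saved. The following is only a rough sketch: the tag allowlist, the regular expressions and the name `find_violations` are assumptions made for illustration and are not part of any existing Wikibase code.

```python
import re

# Hypothetical allowlist; the task only says "simple html formatting".
ALLOWED_TAGS = {"b", "i", "sub", "sup", "small"}

TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
MAGIC_WORD_RE = re.compile(r"__[A-Z]+__")


def find_violations(text):
    """Return a list of reasons why `text` is outside the restricted subset."""
    problems = []
    # Templates, parser functions and variables all start with "{{".
    if "{{" in text:
        problems.append("template, parser function or variable ({{...}})")
    # Behaviour switches such as __NOTOC__.
    if MAGIC_WORD_RE.search(text):
        problems.append("behaviour switch (__...__)")
    # Any HTML-like tag that is not on the allowlist.
    for match in TAG_RE.finditer(text):
        name = match.group(1).lower()
        message = f"tag <{name}> is not allowed"
        if name not in ALLOWED_TAGS and message not in problems:
            problems.append(message)
    return problems
```

A value such as `H<sub>2</sub>O` passes with an empty list, while `{{Infobox}}` or `<ref>x</ref>` each produce one reported problem.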

This may be exported in three ways:

  1. Wikitext as used in Wikimedia projects (e.g. [[c:|Commons]] is a '''Wikimedia''' project.) (Default display in the Wikidata interface; requires T42128 to be resolved first)
  2. Wikitext as used in third-party MediaWiki projects (e.g. [[commons:|Commons]] is a '''Wikimedia''' project.)
  3. HTML (e.g. <a href="https://commons.wikimedia.org/wiki/">Commons</a> is a <b>Wikimedia</b> project.)
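The three export forms listed above can be illustrated with a small converter. This is a minimal sketch under stated assumptions: the two mapping tables and the function names are hypothetical, only prefixed links with a label are handled, and a real implementation would read the interwiki table and use a proper parser rather than regular expressions.

```python
import html
import re
from urllib.parse import quote

# Hypothetical mapping tables, filled in only for the example from the task.
INTERWIKI_URLS = {"c": "https://commons.wikimedia.org/wiki/"}
THIRD_PARTY_PREFIXES = {"c": "commons"}

# [[prefix:target|label]]; the target may be empty, as in [[c:|Commons]].
LINK_RE = re.compile(r"\[\[([a-z]+):([^|\]&<>]*)\|([^\]]+)\]\]")


def to_third_party_wikitext(text):
    """Rewrite Wikimedia-internal prefixes into third-party interwiki prefixes."""
    def repl(m):
        prefix = THIRD_PARTY_PREFIXES.get(m.group(1), m.group(1))
        return f"[[{prefix}:{m.group(2)}|{m.group(3)}]]"
    return LINK_RE.sub(repl, text)


def to_html(text):
    """Render prefixed links, bold and italics as HTML."""
    # Escape first so stray markup cannot inject HTML. quote=False keeps
    # apostrophes intact, which the bold/italic rules below rely on.
    out = html.escape(text, quote=False)

    def link(m):
        base = INTERWIKI_URLS.get(m.group(1))
        if base is None:
            return m.group(0)  # unknown prefix: leave the wikitext untouched
        target = quote(m.group(2).replace(" ", "_"))
        return f'<a href="{base}{target}">{m.group(3)}</a>'

    out = LINK_RE.sub(link, out)
    out = re.sub(r"'''(.+?)'''", r"<b>\1</b>", out)  # bold before italics
    out = re.sub(r"''(.+?)''", r"<i>\1</i>", out)
    return out
```

Applied to the stored value `[[c:|Commons]] is a '''Wikimedia''' project.`, `to_third_party_wikitext` yields the second form and `to_html` yields the third form shown in the list.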

Use cases:

  1. Several usage note properties (https://www.wikidata.org/wiki/Property:P41 for example)
  2. Pages like https://www.wikidata.org/wiki/Wikidata:Tools/External_tools , which are proposed to be converted to items
  3. File descriptions on Commons may contain links to other pages
  4. See T139573: Simple html formatting within Wikidata labels

Event Timeline

Another use case, discussed several times and already in use on frwiki, is image legends, which frequently use wikitext on Wikipedia.

Bugreporter renamed this task from Add a new datatype for wikitext to Add new datatypes for wikitext. (Feb 15 2017, 7:53 PM)

We are still waiting for T86517 to be able to add chemical nomenclature to Wikidata, but that won't be possible without simple formatting.

Formatting (like <sub>, <sup>, <i> and <small>) is an inseparable part of chemical systematic names, and the need to use formatting is clearly indicated in the IUPAC nomenclature recommendations. The inability to add fully correct chemical names would be a significant step backwards, especially given that many other chemistry databases provide fully correct names (cf. the ChEBI database, for example: (R)-methyl phenyl sulfoxide). It is understandable that labels, descriptions and aliases are not meant for this, because they are just plain text, but properties should allow the addition of correct data.

It is therefore important to:

  1. use simple formatting with multilingual and monolingual text datatypes or
  2. provide a different way to indicate which parts of the multilingual/monolingual values should be formatted by the end user and how, e.g. by using a regex in a specific property that (a) would format the value in WD and (b) could be used by the end user of the data.
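To make the second option more concrete, here is a rough sketch of how a regex stored alongside a plain-text value might be applied by a data consumer. Everything here is an assumption for illustration: the function name `apply_format_rule`, the idea that a rule is a single pattern plus a tag, and the example patterns themselves are not taken from any existing property.

```python
import html
import re


def apply_format_rule(value, pattern, tag="i"):
    """Wrap every non-empty match of `pattern` in `value` with <tag>...</tag>.

    The plain-text value is HTML-escaped, so only the rule adds markup.
    """
    parts = []
    last = 0
    for m in re.finditer(pattern, value):
        if m.start() == m.end():
            continue  # ignore empty matches
        parts.append(html.escape(value[last:m.start()]))
        parts.append(f"<{tag}>{html.escape(m.group(0))}</{tag}>")
        last = m.end()
    parts.append(html.escape(value[last:]))
    return "".join(parts)


# Hypothetical rule: italicise a stereodescriptor R or S inside parentheses.
name = apply_format_rule("(R)-methyl phenyl sulfoxide", r"(?<=\()[RS](?=\))")

# Hypothetical rule: italicise a taxon name inside a paper title.
title = apply_format_rule(
    "Anti-complement Activity of Constituents from the Stem-Bark of Juglans mandshurica",
    r"Juglans mandshurica",
)
```

The stored value stays plain text, and the same rule could in principle drive both the display in WD and the formatting done by a reuser, at the cost of every consumer having to implement the rule language.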

Using the second option has some limitations. For example, right now we are losing information about which parts of a title should be italic, which is quite important when reusing the titles of scientific papers, e.g. Anti-complement Activity of Constituents from the Stem-Bark of //Juglans mandshurica//. Because of this I'm not able to reuse WD data, as the imported title won't be typographically correct. Does this seem a minor problem? Maybe, but it is a problem that decides whether to use the title or not.

I am considering implementing a prototype for this (as part of the MathSearch extension, which is not deployed on WMF wikis). I am worried this might cause a heavy load on the current SPARQL endpoint backed by Blazegraph.

Does anyone know if one can mark a datatype as exempt from the SPARQL endpoint?

An alternative to this datatype would be to store wikitext on the wiki in templates and develop an API to get the template parameters together with the Wikidata item data. Looking at the Structured Data on Commons project, it seems that the data stored on-wiki and the data stored in the media information are disjoint. I think it would be useful to get the image's description value and the rest of the media information via one endpoint. If it were easy to read and write template parameters via an API, such an API would probably already exist. Thus, I think this alternative is better but harder (too hard) to implement.