Datatype for chemical formulae on Wikidata
Closed, Declined (Public)

Description

In the discussion about a datatype for mathematical expressions on Wikidata, we also mentioned chemical formulae.
Chemical formulae can now be distinguished from mathematical formulae;
see for example https://en.wikipedia.org/w/index.php?title=Coefficient&type=revision&diff=704647842&oldid=704112401
After the experience with the math datatype, it would be relatively straightforward to add a datatype for chemical formulae as well.

Event Timeline

Physikerwelt raised the priority of this task from to Low.
Physikerwelt updated the task description.
Physikerwelt added projects: Math, Wikidata.

Change 270476 had a related patch set uploaded (by Physikerwelt):
WIP: Chemistry datatype for Wikidata

https://gerrit.wikimedia.org/r/270476

done https://www.wikidata.org/wiki/Property_talk:P274#mhchem I hope this will be seen.
@mkroetzsch related to the discussion in our F2F meeting today: Do you have an idea how to identify the significant experts on this topic early in the process?

I really wonder if the introduction of all kinds of specific markup languages in Wikidata is the right way to go. We could just have a Wikitext datatype, since it seems that Wikitext became the gold standard for all these special data types recently. Mark-up over semantics. By this I mean that the choice of format is focussed on presentation, not on data exchange. I am not an expert in chemical modelling (but then again, is anyone in this thread?), but it seems that this mark-up centric approach is fairly insufficient and naive.

I am also missing the requirements analysis. How many infoboxes are currently using any chemical formula with this special markup at all? If you look at a page like https://en.wikipedia.org/wiki/Ethanol, you see there is no such markup in the whole page. Neither is there in https://en.wikipedia.org/wiki/Photosynthesis. Who really needs this in Wikidata? Aren't there many other forms of notation in chemistry (and biology) that would be equally important?

There are some really fundamental datatypes currently missing, notably multi-lingual texts and geo shapes. This is the level on which datatypes are useful. Presentational things can be done by gadgets, as we already have for URL links shown with IDs. There is no need to codify this in the data model. Communities can solve these simple problems already without changing datatypes of existing properties (which is costly since existing applications and tools need to be updated each time).

It is also notable that https://www.wikidata.org/wiki/Property:P274 has better format documentation than what is proposed here (at least there is no documentation in this thread of what the proposed format actually consists of). They even define a regular language for the possible content. Their direct, text-based formatting is preferable in many cases.
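To illustrate what "a regular language for the possible content" buys in practice, here is a minimal Python sketch that checks a value against the pattern quoted later in this thread. It assumes that pattern is the P274 format constraint; the helper name `is_valid_p274` is made up for this example and is not part of Wikidata or any tool.

```python
import re

# Pattern as quoted later in this thread (assumed to be the P274 constraint).
# It is split over two raw strings only to keep the line short.
P274_PATTERN = re.compile(
    r"([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*"
    r"(\)?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·\(?[-0-9.]*n?\)?)?)+"
)


def is_valid_p274(value: str) -> bool:
    """Return True if the whole string matches the quoted pattern."""
    return P274_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # Unicode subscripts are accepted, ASCII digits are not.
    for candidate in ("C₂H₆O", "C2H6O", "ethanol"):
        print(candidate, is_valid_p274(candidate))
```

Note that such a check only says whether a string is well-formed; it does not by itself say what the string means.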

I had a look at Ethanol and found C2H6O; Photosynthesis also had chemical formulae but no infoboxes. The intention of the ce tag, and also of this new datatype, is to introduce more semantics. With the extra information that C is part of a chemical formula, it is clear that C stands for https://en.wikipedia.org/wiki/Carbon, which is not clear if it is just a string datatype.

The advantage of a data type vs. a property is that a service can enhance the input data with additional information, which can thereafter be used by third-party services.

Moreover, I do not get the point about the datatypes that are really needed. Can you provide a like that justifies the multilingual text for example?

Re chemical markup for semantics: this is true for Wikitext, where you cannot otherwise know that "C" is carbon. It does not apply to Wikidata, where you already get the same information from the property used. Think of P274 as a way of putting text into "semantic markup" on Wikipedia.

In general, one should not confuse the task of adding semantic markup to a wiki text with what we do in Wikidata. We did the former in Semantic MediaWiki, and this approach never made it into Wikipedia. The communities in Wikipedia prefer readability of the source text over semantic markup, and the decision therefore was to move "semantic" information to a separate place, Wikidata.

The advantage of a data type vs. a property is that a service can enhance the input data with additional information, which can thereafter be used by third-party services.

You can do this in any case. Services already run on all kinds of property values on Wikidata. If you are talking about an enhanced UI that provides in-value annotation, then I don't see what exactly you refer to. Is anybody developing/designing/planning such a UI? I don't think this is needed for chemicals, where entity recognition would be fairly trivial to do automatically, so I would not spend any effort on this.

Can you provide a like that justifies the multilingual text for example?

Missing word in sentence? Answers for both possible interpretations:

  • Use case: needed for all translated texts (e.g., slogans/mottos of organisations, quotes of people, usage notes for properties, ...)
  • Technical need: The type requires a new value type, since its content is structurally distinct from all datatypes we have. Related to this, it requires a new UI, new JSON structures, and a new RDF encoding. Can't be done in a gadget.

Ok, we currently do not have a UI for the chemistry tags. However, I think parsing strings of the form

([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*(\)?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·\(?[-0-9.]*n?\)?)?)+

is much harder than using structured data.
I think writing a SPARQL Query that searches for the nucleon masses of atoms which co-occur with at least 6 carbon atoms is not trivial.
That H is Q556 can be obtained via the element symbol relation P246. But still, I would not call parsing that property to a structured data format a trivial task.
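For the simplest case, the string-parsing route under discussion can be sketched in Python as follows. This is only an illustration under strong assumptions: the helper `element_counts` is hypothetical, and it handles flat formulae only (one- or two-letter element symbols with optional Unicode subscript counts). Brackets, charges, isotopes, hydrates and the other constructs allowed by the pattern above are not handled, which is where the real difficulty lies.

```python
import re
from collections import Counter

# Map Unicode subscript digits (U+2080..U+2089) to ASCII digits,
# because int() does not accept subscript characters.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")

# One element symbol followed by an optional subscript count.
TOKEN = re.compile(r"([A-Z][a-z]?)([₀-₉]*)")


def element_counts(formula: str) -> Counter:
    """Count atoms per element symbol in a flat formula such as 'C₂H₆O'."""
    counts = Counter()
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}: {formula!r}")
        symbol, digits = match.groups()
        counts[symbol] += int(digits.translate(SUBSCRIPTS)) if digits else 1
        pos = match.end()
    return counts


if __name__ == "__main__":
    print(element_counts("C₂H₆O"))
    # Filter for "at least 6 carbon atoms" on the client side.
    print(element_counts("C₆H₁₂O₆")["C"] >= 6)
```

A filter like this runs on the client after fetching the values; it does not make the same condition expressible inside a SPARQL query.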

With regard to the technical need for the multilingual text, one could also model a slogan as an item, which has translation capabilities. Overall, the question of property vs. datatype/value type is a controversial one.

Re parsing strings: You are skipping the first step here. The question is not which format is better for advanced interpretation, but which format is specified at all. Whatever your proposal is, I have not seen any syntactic description of it yet. If -- in addition to having a specified syntax -- it can also be parsed for more complex features, that's a nice capability. But let's maybe start by saying what the proposed "structured data" format actually is.

Re multilingual text: There is nothing controversial about the technical aspects you refer to. One could make all complex datatypes we have into items and only use primitive datatypes instead. There are many reasons against this, so we decided to have datatypes instead for representing complex value objects.

It is a pity that you are not trying to explain what you propose but focus on attacking current proposals instead (how Wikidata editors store chemical formulas now, how the multilingual text datatype was planned). Since you are not providing many details, I have now also reviewed the gerrit patch. There is nothing in there that would enable you to search for "the nucleon masses of atoms which co-occur with at least 6 carbon atoms". The input text is simply sent to a LaTeX formatter, like mathematical markup. There is no semantic interpretation or structured data there at all. There is also no syntax specification in this part of the code, so it seems the specification is "whatever the current version of MediaWiki does with text in ce-tags". All the doubts I raised in my first post remain valid.

First of all, I'd like to clarify that my comparison to other proposals and state-of-the-art properties should not be seen as an attack.

However, I understand that I should write down what exactly MediaWiki does with the math and ce tags respectively.

The main difference between simple datatypes and the new math and ce datatypes, which can be used to describe structures, is that the input syntax is internally preprocessed before being exposed to third parties. That is:

for Math:      texvc  -> MathML
for Chemistry: mhchem -> MathML

That we currently do not have Content MathML output can be seen as a bug. But we could simply enable it in the future without changing the specification of the output format. Whether MathML is the best output format for chemical sum formulae is certainly debatable.
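To give a rough idea of what MathML output for a sum formula could look like, here is a small Python sketch. It is not what the Math extension or mhchem actually emits; the helper `formula_to_mathml` is hypothetical, it accepts only flat ASCII input such as the `C2H6O` inside `\ce{C2H6O}`, and it produces Presentation MathML only.

```python
import re
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# One element symbol followed by an optional ASCII count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_to_mathml(formula: str) -> str:
    """Render a flat sum formula such as 'C2H6O' as Presentation MathML."""
    math = ET.Element("math", xmlns=MATHML_NS)
    row = ET.SubElement(math, "mrow")
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}: {formula!r}")
        symbol, count = match.groups()
        # Put the symbol inside <msub> only when a count follows it.
        parent = ET.SubElement(row, "msub") if count else row
        # Element symbols are set upright, hence mathvariant="normal".
        ET.SubElement(parent, "mi", mathvariant="normal").text = symbol
        if count:
            ET.SubElement(parent, "mn").text = count
        pos = match.end()
    return ET.tostring(math, encoding="unicode")


if __name__ == "__main__":
    print(formula_to_mathml("C2H6O"))
```

Markup like this records how the formula is laid out, not which element each symbol denotes, which is the gap the Content MathML remark above refers to.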

I'll close this bug for now and reopen it if I have better documentation and a convincing prototype.

Change 270476 abandoned by Physikerwelt:
[mediawiki/extensions/Math@master] WIP: Chemistry datatype for Wikidata

Reason:

https://gerrit.wikimedia.org/r/270476