
Make it possible to query for math values
Open, Medium, Public

Description

As an ... I want to ... in order to ...

Problem:
Currently, we only export the MathML representation of a formula to RDF/Wikidata Query Service, not the original TeX string (T126349). This makes it impossible to efficiently query for statements with a certain math value – you have to use REGEX (CONTAINS is not enough, because the TeX annotation embedded in the MathML is not identical to the original TeX string). It also means that any external application that wants to use the TeX string based on the RDF export must use an XML parser to analyze the MathML and extract the TeX annotation.

Example:
Current JSON of Pythagorean theorem:

{
  "mainsnak": {
    "snaktype": "value",
    "property": "P2534",
    "datavalue": {
      "value": "c^2=a^2+b^2",
      "type": "string"
    },
    "datatype": "math"
  },
  "type": "statement",
  "id": "Q11518$fe4a5b31-4f7e-29a3-06da-25f0186d9a42",
  "rank": "normal"
}

Current TTL of the same item:

wd:Q11518 wdt:P2534 "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\" alttext=\"{\\displaystyle c^{2}=a^{2}+b^{2}}\">\n  <semantics>\n    <mrow class=\"MJX-TeXAtom-ORD\">\n      <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n        <msup>\n          <mi>c</mi>\n          <mrow class=\"MJX-TeXAtom-ORD\">\n            <mn>2</mn>\n          </mrow>\n        </msup>\n        <mo>=</mo>\n        <msup>\n          <mi>a</mi>\n          <mrow class=\"MJX-TeXAtom-ORD\">\n            <mn>2</mn>\n          </mrow>\n        </msup>\n        <mo>+</mo>\n        <msup>\n          <mi>b</mi>\n          <mrow class=\"MJX-TeXAtom-ORD\">\n            <mn>2</mn>\n          </mrow>\n        </msup>\n      </mstyle>\n    </mrow>\n    <annotation encoding=\"application/x-tex\">{\\displaystyle c^{2}=a^{2}+b^{2}}</annotation>\n  </semantics>\n</math>"^^<http://www.w3.org/1998/Math/MathML> .

wd:Q11518 p:P2534 s:Q11518-fe4a5b31-4f7e-29a3-06da-25f0186d9a42 .

s:Q11518-fe4a5b31-4f7e-29a3-06da-25f0186d9a42 a wikibase:Statement,
                wikibase:BestRank ;
        wikibase:rank wikibase:NormalRank ;
        ps:P2534 "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\" alttext=\"{\\displaystyle c^{2}=a^{2}+b^{2}}\">\n  <semantics>\n    <mrow class=\"MJX-TeXAtom-ORD\">\n      <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n        <msup>\n          <mi>c</mi>\n          <mrow class=\"MJX-TeXAtom-ORD\">\n            <mn>2</mn>\n          </mrow>\n        </msup>\n        <mo>=</mo>\n        <msup>\n          <mi>a</mi>\n          <mrow class=\"MJX-TeXAtom-ORD\">\n            <mn>2</mn>\n          </mrow>\n        </msup>\n        <mo>+</mo>\n        <msup>\n          <mi>b</mi>\n          <mrow class=\"MJX-TeXAtom-ORD\">\n            <mn>2</mn>\n          </mrow>\n        </msup>\n      </mstyle>\n    </mrow>\n    <annotation encoding=\"application/x-tex\">{\\displaystyle c^{2}=a^{2}+b^{2}}</annotation>\n  </semantics>\n</math>"^^<http://www.w3.org/1998/Math/MathML> .
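To illustrate the last point of the problem statement, here is a minimal Python sketch of what a consumer of the current export has to do to get at any TeX string: parse the MathML literal and read the annotation element. The function name `tex_annotation` is made up for this illustration, and `MATHML` is simply the unescaped literal from the TTL above.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Unescaped value of the wdt:P2534 literal shown in the TTL above.
MATHML = r"""<math xmlns="http://www.w3.org/1998/Math/MathML" display="block" alttext="{\displaystyle c^{2}=a^{2}+b^{2}}">
  <semantics>
    <mrow class="MJX-TeXAtom-ORD">
      <mstyle displaystyle="true" scriptlevel="0">
        <msup>
          <mi>c</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mn>2</mn>
          </mrow>
        </msup>
        <mo>=</mo>
        <msup>
          <mi>a</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mn>2</mn>
          </mrow>
        </msup>
        <mo>+</mo>
        <msup>
          <mi>b</mi>
          <mrow class="MJX-TeXAtom-ORD">
            <mn>2</mn>
          </mrow>
        </msup>
      </mstyle>
    </mrow>
    <annotation encoding="application/x-tex">{\displaystyle c^{2}=a^{2}+b^{2}}</annotation>
  </semantics>
</math>"""


def tex_annotation(mathml: str) -> Optional[str]:
    """Return the text of the application/x-tex annotation, or None if absent."""
    root = ET.fromstring(mathml)
    # The default namespace applies to every element, so the tag must be qualified.
    for annotation in root.iter(f"{{{MATHML_NS}}}annotation"):
        if annotation.get("encoding") == "application/x-tex":
            return annotation.text
    return None


if __name__ == "__main__":
    # Prints the annotation, which differs from the stored value c^2=a^2+b^2.
    print(tex_annotation(MATHML))
```

Even after this extra parsing step, the result is the annotated `{\displaystyle c^{2}=a^{2}+b^{2}}`, not the `c^2=a^2+b^2` from the JSON, which is why CONTAINS on the stored value does not work.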

Screenshots/mockups:

BDD
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

Open questions:

Notes:
See T126349#2014848 for more information.

Side note: the “math” datatype currently seems to be missing completely from the RDF Dump Format page – once we’ve implemented this, we should rectify that.

See also the discussion on the request a query page and contact the development team.

Note from Lucas: I think we should add full value nodes for math values, with a structure a bit like this:

wd:Q4115189 a wikibase:Item;
  # ...
  wdt:P2534 "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\" alttext=\"{\\displaystyle a}\">\n  <semantics>\n    <mrow class=\"MJX-TeXAtom-ORD\">\n      <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n        <mi>a</mi>\n      </mstyle>\n    </mrow>\n    <annotation encoding=\"application/x-tex\">{\\displaystyle a}</annotation>\n  </semantics>\n</math>"^^<http://www.w3.org/1998/Math/MathML>.

wd:Q4115189 p:P2534 wds:Q4115189-af122b69-484b-2edd-1af8-f0a691b05039.

wds:Q4115189-af122b69-484b-2edd-1af8-f0a691b05039 a wikibase:Statement;
  ps:P2534 "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\" alttext=\"{\\displaystyle a}\">\n  <semantics>\n    <mrow class=\"MJX-TeXAtom-ORD\">\n      <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n        <mi>a</mi>\n      </mstyle>\n    </mrow>\n    <annotation encoding=\"application/x-tex\">{\\displaystyle a}</annotation>\n  </semantics>\n</math>"^^<http://www.w3.org/1998/Math/MathML>;
  psv:P2534 wdv:d961720c22709f7991be5be0ddf51c88.

wdv:d961720c22709f7991be5be0ddf51c88 a wikibase:MathValue;
  wikibase:mathML "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\" alttext=\"{\\displaystyle a}\">\n  <semantics>\n    <mrow class=\"MJX-TeXAtom-ORD\">\n      <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n        <mi>a</mi>\n      </mstyle>\n    </mrow>\n    <annotation encoding=\"application/x-tex\">{\\displaystyle a}</annotation>\n  </semantics>\n</math>"^^<http://www.w3.org/1998/Math/MathML>;
  wikibase:mathTeX "a"^^<http://latex.example/TODO>.
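As a rough sketch of how such a value node could be queried, the following Python helper builds a SPARQL query for an exact TeX string. Everything specific to the proposal is hypothetical: `wikibase:mathTeX` and `psv:P2534` for math values are not exported today, and the helper names `sparql_string_literal` and `build_math_query` are invented here. Because the datatype IRI is still a TODO in the proposal, the sketch compares `STR(?tex)` rather than a typed literal.

```python
PREFIXES = (
    "PREFIX p: <http://www.wikidata.org/prop/>\n"
    "PREFIX psv: <http://www.wikidata.org/prop/statement/value/>\n"
    "PREFIX wikibase: <http://wikiba.se/ontology#>\n"
)


def sparql_string_literal(value: str) -> str:
    # TeX is full of backslashes, so they must be escaped before quoting.
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_math_query(tex: str, prop: str = "P2534") -> str:
    # Assumption: value nodes carry the original input under wikibase:mathTeX,
    # as in the structure proposed above. This is not part of the current export.
    literal = sparql_string_literal(tex)
    return (
        PREFIXES
        + "SELECT ?item WHERE {\n"
        + f"  ?item p:{prop} ?statement .\n"
        + f"  ?statement psv:{prop} ?valueNode .\n"
        + "  ?valueNode wikibase:mathTeX ?tex .\n"
        + f"  FILTER(STR(?tex) = {literal})\n"
        + "}\n"
    )


if __name__ == "__main__":
    print(build_math_query("c^2=a^2+b^2"))
```

Once the datatype IRI is settled, the typed literal could presumably be placed directly in the triple pattern instead of a FILTER, which is what would make the lookup efficient.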

Event Timeline

Restricted Application added a subscriber: Aklapper.

This would also be useful for Wikipedia equations. The TeX string in the alttext attribute is not very useful; I hope we get rid of this modification soon, see T188879.

Esc3300 updated the task description.

@Lucas_Werkmeister_WMDE with original TeX string, you mean the userInputTex. It has been shown that this representation is not suitable for Math Information Retrieval tasks. Also, note that you need to install texvc in order to render the userInputTex. Reading the request a query page, I wonder what the underlying information need of the user was. Does it fall into the "what is the name of the formula ..." category?

I think it might fall into the category of "being able to search for a formula without knowledge of the arcane ways in which it was mangled downstream of data entry".

@Physikerwelt with original TeX string I mean what we actually store in the database, what we return in non-RDF formats, and what you see when you edit the statement. That is, in my opinion, the actual value of the statement. Unless we apply texvc normalization when saving the statement, that is just another output format to me.

I also don’t understand why texvc would be necessary to render the TeX – shouldn’t it be valid on its own? (Whether it’s safe to render the value directly is a different question, of course, but I don’t think texvc is the only solution to that, and in any case there might be third-party Wikibase installations that don’t need to worry about malicious commands in statement values.)

;-) I think everything falls into this category. See Table 4 for a list of examples that are being considered as exemplary information needs of users.

There’s nothing stopping us from exporting the texvc version in addition to the original TeX string, of course:

wdv:d961720c22709f7991be5be0ddf51c88 a wikibase:MathValue;
  wikibase:mathML "<math xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\" alttext=\"{\\displaystyle a}\">\n  <semantics>\n    <mrow class=\"MJX-TeXAtom-ORD\">\n      <mstyle displaystyle=\"true\" scriptlevel=\"0\">\n        <mi>a</mi>\n      </mstyle>\n    </mrow>\n    <annotation encoding=\"application/x-tex\">{\\displaystyle a}</annotation>\n  </semantics>\n</math>"^^<http://www.w3.org/1998/Math/MathML>;
  wikibase:mathTeX "a"^^<http://latex.example/TODO>;
  wikibase:mathTexvc "{\\displaystyle a}"^^<https://www.mediawiki.org/wiki/Texvc/TODO>.

(Edit: except that it’s a bit redundant, I suppose… but really, the MathML is long enough that I don’t think the two TeX strings make a big difference.)

If it's a problem having both formats, I think only the input format should be exported.

You may want to re-read the comment by @mkroetzsch at T126349#2014848

The format problem might be the reason why this is the only datatype where the number of statements seems to regress. BTW it seems that https://grafana.wikimedia.org doesn't track that.

Vvjjkkii renamed this task from Export original TeX string of math values in RDF to e5baaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
matej_suchanek renamed this task from e5baaaaaaa to Export original TeX string of math values in RDF. Jul 2 2018, 8:11 AM
matej_suchanek raised the priority of this task from High to Needs Triage.
matej_suchanek updated the task description.

Maybe a simple way to implement this could be a psn: triple with the string.

I think we should not make this more complicated than necessary, and should decide either to use the MathML standard or not. If so, one could also add another encoding, e.g., <annotation encoding=\"application/x-texvc\">a</annotation>. Moreover, the additional displaystyle is an artifact produced by the Math extension. There is an ongoing effort to make the texvc input conform more closely to standard LaTeX. By the way, with MathML you can query for identifiers like s with contains <mi>s</mi>, whereas the same search on the input form would also return symbols like \sin...
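A small Python sketch of the identifier point made in this comment, for illustration only. The MathML fragments are hand-written and simplified (an assumption, not actual output of the Math extension), and both helper names are invented.

```python
# Hand-written, simplified MathML fragments; real Math extension output
# carries more wrapper elements and attributes.
MATHML_S = "<mrow><mi>s</mi><mo>=</mo><mi>v</mi><mi>t</mi></mrow>"
MATHML_SIN = "<mrow><mi>sin</mi><mi>x</mi></mrow>"

# Corresponding input strings.
TEX_S = "s=vt"
TEX_SIN = r"\sin x"


def mentions_identifier_mathml(mathml: str, identifier: str) -> bool:
    # Identifiers are wrapped in <mi> elements, so the match is unambiguous.
    return f"<mi>{identifier}</mi>" in mathml


def mentions_identifier_tex_naive(tex: str, identifier: str) -> bool:
    # A plain substring search cannot tell the identifier s from the s in \sin.
    return identifier in tex


if __name__ == "__main__":
    print(mentions_identifier_mathml(MATHML_S, "s"))      # identifier s present
    print(mentions_identifier_mathml(MATHML_SIN, "s"))    # only sin, not s
    print(mentions_identifier_tex_naive(TEX_SIN, "s"))    # false positive on \sin
```

This only covers substring search for identifiers; it says nothing about exact matching of a whole formula, which is the use case in the task description.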

outputting the input string "IS" simple

Esc3300 renamed this task from Export original TeX string of math values in RDF to Export original TeX string of math-datatype values in RDF/on Wikidata Query Server. Jun 26 2019, 12:29 PM

I think we should not make this more complicated than necessary, and should decide either to use the MathML standard or not. If so, one could also add another encoding, e.g., <annotation encoding=\"application/x-texvc\">a</annotation>.

Adding more encodings inside the MathML doesn’t help at all with the problem stated in the task description, which is that it’s inefficient to search for statements with a certain value.

OK, you mean "exact string matching for the input LaTeX source code" (i.e., a + b is different from a+b). That's correct. However, would it not be better to use a more generic approach for that? For example, how would I search for a certain javascript input value, or a picture, or whatever?

For the sake of clarity, maybe it should be mentioned that the presence of a psn: triple won't impact users of wdt: or ps: triples.

It is easy to perform some sanitization or conversion on the original string if you need it.

It is impossible to get back the original, human-readable LaTeX string from some sanitized version.

Moreover, the current sanitization introduces errors, so the user is forced to input a wrong LaTeX string in order to get the desired rendering on MediaWiki, and every other application that tries to render this string with LaTeX or MathJax will fail.

OK, you mean "exact string matching for the input LaTeX source code" (i.e., a + b is different from a+b). That's correct. However, would it not be better to use a more generic approach for that? For example, how would I search for a certain javascript input value, or a picture, or whatever?

I don’t know what you mean by “a certain javascript input value”, but to search for a certain picture you would look for its Special:FilePath URL (example). Of course, this isn’t exactly the same value as in the non-RDF representation, but it’s a fairly simple transformation. I’m not aware of any other data type where this transformation is anywhere near as complex as the TeX→MathML processing that’s currently done for math values.
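To make "a fairly simple transformation" concrete, here is a hedged Python sketch that turns a Commons file name into a Special:FilePath IRI. The helper name `commons_file_iri` is invented, and plain percent-encoding of the file name is an assumption; the escaping Wikibase actually applies may differ in details.

```python
from urllib.parse import quote

FILE_PATH_BASE = "http://commons.wikimedia.org/wiki/Special:FilePath/"


def commons_file_iri(file_name: str) -> str:
    # Assumption: the file name is percent-encoded as a single path segment.
    return FILE_PATH_BASE + quote(file_name, safe="")


if __name__ == "__main__":
    print(commons_file_iri("Example image.jpg"))
```

Compared with this one-liner, going from the stored TeX string to the exported MathML requires the whole rendering pipeline, which cannot reasonably be reproduced inside a query.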

For the sake of clarity, maybe it should be mentioned that the presence of a psn: triple won't impact users of wdt: or ps: triples.

Neither would the presence of psv: triples as proposed in the task description. I don’t think psn: is a very good fit for this – MathML is arguably more “normalized” than TeX(vc).

It is easy to perform some sanitization or conversion on the original string if you need it.

It is impossible to get back the original, human-readable LaTeX string from some sanitized version.

Moreover, the current sanitization introduces errors, so the user is forced to input a wrong LaTeX string in order to get the desired rendering on MediaWiki, and every other application that tries to render this string with LaTeX or MathJax will fail.

I don’t see how that’s related to this task.

Somehow I think the task as it's currently worded above is too complicated to solve the initial problem. The idea is that an input of

c^2=a^2+b^2

leads to a triple that contains only:

c^2=a^2+b^2

as it would be if the datatype were string.

This doesn't need to lead to a change to any other triples generated by the datatype. Using psn: seems like a standard way of doing so. Other suggestions are welcome. Maybe jura: can do ;)

Shall I create a separate ticket for that or update the task description?
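A minimal Python sketch of the simpler variant suggested in this comment: one extra triple whose object is the input string, serialized as a plain Turtle string literal. Using psn: for this is only the suggestion made in this thread, not an existing export, and the helper names are invented.

```python
def turtle_string_literal(value: str) -> str:
    # Escape backslashes first, then quotes and line breaks.
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def input_string_triple(statement_node: str, prop: str, tex: str) -> str:
    # Hypothetical: psn:<prop> carrying the unmodified input string.
    return f"{statement_node} psn:{prop} {turtle_string_literal(tex)} ."


if __name__ == "__main__":
    print(
        input_string_triple(
            "s:Q11518-fe4a5b31-4f7e-29a3-06da-25f0186d9a42",
            "P2534",
            "c^2=a^2+b^2",
        )
    )
```

The existing wdt: and ps: triples would stay exactly as they are; only this one line per statement would be added.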

This doesn't need to lead to a change to any other triples generated by the datatype.

The current proposal in the task description doesn’t lead to any changes in other triples either. It also adds a triple containing only the original input, but as part of a full value node rather than in psn:. (And since it’s a full value node, it includes the MathML again, that’s why the example perhaps looks a bit bloated.)

It is not a problem to export the userInputTex unless people start to use it. At that point, our efforts to move the input format towards more standard-conformant LaTeX (for instance by deprecating \and or \or, which conflict with regular LaTeX commands and sometimes lead to strange behavior in a standard LaTeX setup) will conflict with the interests of potential users of that new format. Currently, there is no other place where the userInputTex is displayed or used.

This doesn't need to lead to a change to any other triples generated by the datatype.

The current proposal in the task description doesn’t lead to any changes in other triples either. It also adds a triple containing only the original input, but as part of a full value node rather than in psn:. (And since it’s a full value node, it includes the MathML again, that’s why the example perhaps looks a bit bloated.)

Do you prefer that I update it or shall I make a separate ticket?

[..]

[..]

It is impossible to get back the original, human-readable LaTeX string from some sanitized version.

[..]

I don’t see how that’s related to this task.

Given WMF's commitment to open source, WMF probably would want to give users access to the source string. Adding an additional triple would be a way to do this.

Lydia_Pintscher renamed this task from Export original TeX string of math-datatype values in RDF/on Wikidata Query Server to Make it possible to query for math values. Aug 29 2019, 12:27 PM

By the way, I checked if it’s possible to use WikibaseCirrusSearch as a workaround (haswbstatement:P2534=…), but it looks like we don’t index mathematical values.

Hey @Lucas_Werkmeister_WMDE, I tried to reorganize the task desc a bit with our usual template. Would you have a look, add what's missing or clean up things? Then we can move it to ready to estimate :) Thanks!

Gehel triaged this task as Medium priority. Sep 15 2020, 8:02 AM