Enable Chem support in TexVC(PHP) for MathML generation
Open, Needs Triage, Public

Description

There are references defined in this test.
Also, a new test was added (see Patchsets) which contains most of the LaTeX formulae from the mhchem specification.

Fix the output of TexVCMMLGen so that the chem TeX inputs generate correct MathML output.

The test cases are very simple; it might make sense to have more sophisticated test cases.

References for Implementation:

Event Timeline

Change 894565 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Add new chem test and cases

https://gerrit.wikimedia.org/r/894565

Change 895160 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Render basic chemical equations and formulae

https://gerrit.wikimedia.org/r/895160

Stegmujo renamed this task from Fix Chem support for TexUtilTest to Enable Chem support. (Mar 10 2023, 8:52 AM)
Stegmujo updated the task description.
Stegmujo renamed this task from Enable Chem support to Enable Chem support in TexVC(PHP) for MathML generation. (Mar 10 2023, 8:57 AM)

So, what exactly do you need? An mhchem to MathML parser, right?
I wrote a TypeScript parser, so I would say I have a bit of experience. https://github.com/mhchem/mhchemParser

The challenge is the possibility of nesting of mhchem and LaTeX math. E.g. mhchem in math $V_{\ce{H2O}}$ or math in mhchem $\ce{NaOH(aq,$\infty$)}$ (or math in mhchem in math in ...).
I see a few options.
(1) Firstly, the main parser calls the mhchem parser which returns a special data structure the main parser understands. (1a) This could include an "and please parse this part again" part – or (1b) the mhchem parser can call the math parser recursively.
(2) Secondly, the mhchem parser could return math LaTeX syntax and the main parser would reparse the return of the mhchem parser (and potentially call mhchem a second or third time).
(3) Thirdly, the mhchem parser could be a "pre-compiler" that is run before the main parser, and converts all chem syntax (even the already visible nested ones) to math LaTeX syntax. This would not work with macros that expand to chemistry syntax.
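
To make option (3) concrete, here is a minimal Python sketch of such a "pre-compiler" pass. It only locates top-level \ce{...} groups by brace matching and hands their content to a translator. The names find_ce_spans and precompile are hypothetical, and the translate callback is a stand-in for a real mhchem parser, which would also have to handle any \ce nested inside $...$ in the content it receives. Escaped braces and "\ce {" with a space are not handled here.

```python
from typing import Callable


def find_ce_spans(tex: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of top-level \\ce{...} groups in tex."""
    spans = []
    i = 0
    while True:
        start = tex.find("\\ce{", i)
        if start == -1:
            return spans
        depth = 0
        j = start + 3  # index of the opening brace of \ce{
        while j < len(tex):
            if tex[j] == "{":
                depth += 1
            elif tex[j] == "}":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if depth != 0:
            raise ValueError("unbalanced braces after \\ce")
        spans.append((start, j + 1))
        i = j + 1  # continue after this group, so nested \ce stays inside it


def precompile(tex: str, translate: Callable[[str], str]) -> str:
    """Replace every top-level \\ce{...} by translate(content)."""
    out = []
    pos = 0
    for start, end in find_ce_spans(tex):
        out.append(tex[pos:start])
        out.append(translate(tex[start + 4:end - 1]))  # content without \ce{ and }
        pos = end
    out.append(tex[pos:])
    return "".join(out)


# Stand-in translator (assumption): a real one would be the mhchem parser.
def fake_translate(body: str) -> str:
    return "\\mathrm{" + body + "}"


print(precompile(r"V_{\ce{H2O}}", fake_translate))  # V_{\mathrm{H2O}}
```

Because only top-level groups are matched, the nested case (mhchem in math in mhchem) is left to the translator, which is where options (1b) and (2) differ.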

@mhchem Thank you very much for your help.

So, what exactly do you need? An mhchem to MathML parser, right?

Exactly. We need an mhchem-to-MathML parser written in PHP.

I wrote a TypeScript parser, so I would say I have a bit of experience. https://github.com/mhchem/mhchemParser

This looks very promising.

We have a parser (but no functional renderer yet):

https://github.com/wikimedia/mediawiki-extensions-Math/blob/f5d56be83b6ae0a77e3882700ffa10bc9f642782/src/TexVC/parser.pegjs#L247-L297

However, this was translated from the old renderer and might have reproduced the deprecated syntax. I remember we had discussions that some renderings in Wikipedia were wrong. So we need to ensure we do not repeat that mistake this time.

Does your parser output MathML or just plain LaTeX?

I think that, as a first step, we need a comprehensive test set so that we can verify that our final MathML is equivalent to the output produced by MathJax or KaTeX. Do you all (@mhchem, @NSoiffer, @Stegmujo) agree?

The challenge is the possibility of nesting of mhchem and LaTeX math. E.g. mhchem in math $V_{\ce{H2O}}$ or math in mhchem $\ce{NaOH(aq,$\infty$)}$ (or math in mhchem in math in ...).
I see a few options.
(1) Firstly, the main parser calls the mhchem parser which returns a special data structure the main parser understands. (1a) This could include an "and please parse this part again" part – or (1b) the mhchem parser can call the math parser recursively.

We cannot make any external calls. We need to reimplement everything in PHP. Unfortunately, our experience is that the maintenance overhead in production is much higher than the initial implementation effort.

(2) Secondly, the mhchem parser could return math LaTeX syntax and the main parser would reparse the return of the mhchem parser (and potentially call mhchem a second or third time).
(3) Thirdly, the mhchem parser could be a "pre-compiler" that is run before the main parser, and converts all chem syntax (even the already visible nested ones) to math LaTeX syntax. This would not work with macros that expand to chemistry syntax.

This will be the easiest option if we don't have accessibility for chemistry in MathML. I don't think we have such macros. In the Wikipedia context, one cannot define macros; if a few built-in macros exist, they could be built into the pre-compiler.

I suggest these test cases. https://github.com/mhchem/mhchemParser/blob/master/test/test.html

Re (1), I did not mean external calls, but calls to a PHP class. This class would essentially be part of the main parser, because it would follow the parser's API.

Re (3): So, this would basically be a PHP implementation of https://github.com/mhchem/mhchemParser. This shouldn't be too much work. I could do it during my next vacation.
mhchemParser is a small parsing engine that operates on a complex data structure that defines its behaviour. This will make the conversion very straightforward. In contrast to your current implementation, the grammar definition is not part of the control path, but of the data structure.
Please have a look at https://github.com/mhchem/mhchemParser/blob/master/test/test.html, to see if this really is what you need. A drawback of this approach (mhchem → LaTeX → MathML conversion) is the fact that it does not produce minimum MathML. The main reason is the enforced vertical alignment of all subscripts and superscripts, which is needed for LaTeX, for typographical reasons. https://github.com/mhchem/MathJax-mhchem/issues/23 contains a lengthy discussion.

Hello @mhchem, thanks for your help.

I had a brief look at the expected outputs of the TypeScript parser in the mentioned test and tried parsing these expected outputs with TexVC(PHP).

To support processing the TeX output of mhchemParser, the following statements seem necessary to implement in TexVC(PHP):

  • "mathchoice"
  • "smash": for example, for frac, "<mpadded height="0" depth="0">" can be used as the surrounding element
  • "mskip"
  • "mkern"
  • and some arrows like "\longrightleftharpoons"

Tentatively speaking, I could add these elements to the parsing grammar and the parsing functions in PHP.
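
As an illustration of the "smash" and "mkern" items, here is a small Python sketch using the standard xml.etree.ElementTree module. The mpadded wrapper follows the suggestion above; rendering \mkern as an mspace with a width in em (using TeX's 18mu = 1em) is an assumption about the desired output, not something TexVC(PHP) is known to do. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def smash(inner: ET.Element) -> ET.Element:
    """Wrap an element so that it takes up no height or depth (\\smash)."""
    mpadded = ET.Element("mpadded", {"height": "0", "depth": "0"})
    mpadded.append(inner)
    return mpadded


def mkern_to_mspace(mu: float) -> ET.Element:
    """Render \\mkern<mu>mu as an mspace; 18mu correspond to 1em in TeX."""
    return ET.Element("mspace", {"width": f"{mu / 18:.3f}em"})


# Example: a smashed fraction 1/2
frac = ET.Element("mfrac")
ET.SubElement(frac, "mn").text = "1"
ET.SubElement(frac, "mn").text = "2"
print(ET.tostring(smash(frac), encoding="unicode"))
print(ET.tostring(mkern_to_mspace(3), encoding="unicode"))
```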

drawback of this approach (mhchem → LaTeX → MathML conversion) is the fact that it does not produce minimum MathML

In the LaTeX produced by mhchemParser, would any information be missing that is needed to produce valid MathML?

About the nested chem formulas:

Just the basics of how TexVC(PHP) does the parsing:

  1. A TeX math expression is read by a grammar and validated.
  2. A parse tree is created; in PHP this is a nested object structure that distinguishes the nested elements of a math expression.
  3. To generate MathML, the parse tree is traversed from the root, and for each element a parser function is called which creates the corresponding MathML.

{\displaystyle K_{c}={\frac {\ce {[{CH_{3}CO_{2}}^{-}][{H_{3}O}^{+}]}}{\ce {[{CH_{3}CO_{2}H}][{H_{2}O}]}}}}

This would be an example of a formula occurring on Wikipedia for a nested expression with regular TeX and a chem statement inside.
The chem statements are recognized by the current grammar of TexVC(PHP) and corresponding parse tree elements are created for each statement.
I think it is possible to adapt these tree elements, so they can store the raw TeX expression.

When rendering MathML from the obtained parse tree, an mhchem-php component could then be used to translate the TeX inside each nested chem element to
regular TeX. This can then be interpreted by a recursive call of the TexVC(PHP) parser. With the 'translated' regular TeX, valid MathML can then be generated.

Example:

K_{c}={\frac {\ ... } }

This can be parsed by TexVC already, since it is not chem related.

\ce {[{CH_{3}CO_{2}}^{-}][{H_{3}O}^{+}]}

This element would translate to a "Fun1" element in the parse tree.
By checking its properties, TexVC-PHP can recognize programmatically that it is a chem element.
When this is recognized, mhchemParser in PHP would be called, which then generates 'translated' regular TeX.
With a recursive call of the TexVC parsing procedure for the element, MathML can then be generated.

For the other ce element the processing would be the same.
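
The flow described above (chem nodes keep their raw TeX, get translated, and the result is parsed and rendered recursively) could be sketched roughly like this in Python. All class and function names are hypothetical, and both the parser and the chem translator are toy stand-ins, not the TexVC(PHP) or mhchemParser implementations.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Literal:
    text: str


@dataclass
class Chem:
    raw_tex: str  # raw content of a \ce{...} element, stored unparsed


@dataclass
class Group:
    children: list = field(default_factory=list)


def render(node, parse: Callable, chem_to_tex: Callable[[str], str]) -> str:
    """Traverse the tree from the root and emit MathML for each node."""
    if isinstance(node, Literal):
        return f"<mi>{node.text}</mi>"
    if isinstance(node, Chem):
        # Translate chem syntax to regular TeX, then parse and render it
        # with a recursive call of the same procedure.
        translated = chem_to_tex(node.raw_tex)
        return render(parse(translated), parse, chem_to_tex)
    if isinstance(node, Group):
        inner = "".join(render(c, parse, chem_to_tex) for c in node.children)
        return f"<mrow>{inner}</mrow>"
    raise TypeError(f"unknown node: {node!r}")


# Toy stand-ins (assumptions): one literal per character, identity translation.
def toy_parse(tex: str) -> Group:
    return Group([Literal(ch) for ch in tex])


def toy_chem_to_tex(raw: str) -> str:
    return raw


tree = Group([Literal("K"), Chem("HO")])
print(render(tree, toy_parse, toy_chem_to_tex))
# <mrow><mi>K</mi><mrow><mi>H</mi><mi>O</mi></mrow></mrow>
```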

In conclusion, and after a more in-depth look at the TexVC grammar, this approach would probably require too much effort, since "container" elements would be required for all chem environments in a formula. This does not match the current grammar implementation, which checks the input and creates granular chem nodes in the tree. The container elements would compromise the texvc checking features.

Maybe we can add an example. How about

https://github.com/mhchem/mhchemParser/blob/58d5f6e1c65550bac87459c6b8b8a3215ef246c8/test/test.html#L90

Assume we had the following wikitext

<chem>CH4 + 2 $\left( \ce{O2 + 79/21 N2} \right)$</chem>

Currently, this would not be supported and one would need to write

<chem>CH4 + 2 \begin{math}\left( \ce{O2 + 79/21 N2} \right)\end{math}</chem>

We currently have the following internal tree

[
  [
    "MHCHEM",
    [
      "\\ce"
    ],
    [
      "CURLY",
      [
        [
          "CHEM_WORD",
          [
            "LITERAL",
            [
              "C"
            ]
          ],
          [
            "CHEM_WORD",
            [
              "LITERAL",
              [
                "H"
              ]
            ],
            [
              "CHEM_WORD",
              [
                "LITERAL",
                [
                  "4"
                ]
              ],
              [
                "LITERAL",
                [
                  ""
                ]
              ]
            ]
          ]
        ],
        [
          "LITERAL",
          [
            " "
          ]
        ],
        [
          "CHEM_WORD",
          [
            "LITERAL",
            [
              "+"
            ]
          ],
          [
            "LITERAL",
            [
              ""
            ]
          ]
        ],
        [
          "LITERAL",
          [
            " "
          ]
        ],
        [
          "CHEM_WORD",
          [
            "LITERAL",
            [
              "2"
            ]
          ],
          [
            "LITERAL",
            [
              ""
            ]
          ]
        ],
        [
          "LITERAL",
          [
            " "
          ]
        ],
        [
          "CHEM_WORD",
          [
            "DOLLAR",
            [
              [
                "LR",
                [
                  "("
                ],
                [
                  ")"
                ],
                [
                  [
                    "MHCHEM",
                    [
                      "\\ce"
                    ],
                    [
                      "CURLY",
                      [
                        [
                          "CHEM_WORD",
                          [
                            "LITERAL",
                            [
                              "O"
                            ]
                          ],
                          [
                            "CHEM_WORD",
                            [
                              "LITERAL",
                              [
                                "2"
                              ]
                            ],
                            [
                              "LITERAL",
                              [
                                ""
                              ]
                            ]
                          ]
                        ],
                        [
                          "LITERAL",
                          [
                            " "
                          ]
                        ],
                        [
                          "CHEM_WORD",
                          [
                            "LITERAL",
                            [
                              "+"
                            ]
                          ],
                          [
                            "LITERAL",
                            [
                              ""
                            ]
                          ]
                        ],
                        [
                          "LITERAL",
                          [
                            " "
                          ]
                        ],
                        [
                          "CHEM_WORD",
                          [
                            "LITERAL",
                            [
                              "7"
                            ]
                          ],
                          [
                            "CHEM_WORD",
                            [
                              "LITERAL",
                              [
                                "9"
                              ]
                            ],
                            [
                              "CHEM_WORD",
                              [
                                "LITERAL",
                                [
                                  "/"
                                ]
                              ],
                              [
                                "CHEM_WORD",
                                [
                                  "LITERAL",
                                  [
                                    "2"
                                  ]
                                ],
                                [
                                  "CHEM_WORD",
                                  [
                                    "LITERAL",
                                    [
                                      "1"
                                    ]
                                  ],
                                  [
                                    "LITERAL",
                                    [
                                      ""
                                    ]
                                  ]
                                ]
                              ]
                            ]
                          ]
                        ],
                        [
                          "LITERAL",
                          [
                            " "
                          ]
                        ],
                        [
                          "CHEM_WORD",
                          [
                            "LITERAL",
                            [
                              "N"
                            ]
                          ],
                          [
                            "CHEM_WORD",
                            [
                              "LITERAL",
                              [
                                "2"
                              ]
                            ],
                            [
                              "LITERAL",
                              [
                                ""
                              ]
                            ]
                          ]
                        ]
                      ]
                    ]
                  ]
                ]
              ]
            ]
          ],
          [
            "LITERAL",
            [
              ""
            ]
          ]
        ]
      ]
    ]
  ]
]

Can we change texvc in a way that all the CHEM_ tokens disappear and regular texvc tokens appear instead? I mean something like this:

[
  [
    "CURLY",
    [
      [
        "FUN1nb",
        [
          "\\mathrm"
        ],
        [
          "CURLY",
          [
            [
              "LITERAL",
              [
                "C"
              ]
            ],
            [
              "LITERAL",
              [
                "H"
              ]
            ]
          ]
        ]
      ],
      [
        "DQ",
        [
          "CURLY",
          [
            [
              "FUN1",
              [
                "\\vphantom"
              ],
              [
                "CURLY",
                [
                  [
                    "LITERAL",
                    [
                      "A"
                    ]
                  ]
                ]
              ]
            ]
          ]
        ],
        [
          "CURLY",
          [
            [
              "CURLY",
              [
                [
                  "LITERAL",
                  [
                    "4"
                  ]
                ]
              ]
            ]
          ]
        ]
      ],
      [
        "CURLY",
        []
      ],
      [
        "LITERAL",
        [
          "+"
        ]
      ],
      [
        "CURLY",
        []
      ],
      [
        "LITERAL",
        [
          "2"
        ]
      ],
      [
        "LITERAL",
        [
          "\\,"
        ]
      ],
      [
        "LR",
        [
          "("
        ],
        [
          ")"
        ],
        [
          [
            "FUN1nb",
            [
              "\\mathrm"
            ],
            [
              "CURLY",
              [
                [
                  "LITERAL",
                  [
                    "O"
                  ]
                ]
              ]
            ]
          ],
          [
            "DQ",
            [
              "CURLY",
              [
                [
                  "FUN1",
                  [
                    "\\vphantom"
                  ],
                  [
                    "CURLY",
                    [
                      [
                        "LITERAL",
                        [
                          "A"
                        ]
                      ]
                    ]
                  ]
                ]
              ]
            ],
            [
              "CURLY",
              [
                [
                  "CURLY",
                  [
                    [
                      "LITERAL",
                      [
                        "2"
                      ]
                    ]
                  ]
                ]
              ]
            ]
          ],
          [
            "CURLY",
            []
          ],
          [
            "LITERAL",
            [
              "+"
            ]
          ],
          [
            "CURLY",
            []
          ],
          [
            "CURLY",
            [
              [
                "LITERAL",
                [
                  "\\textstyle "
                ]
              ],
              [
                "FUN2",
                [
                  "\\frac"
                ],
                [
                  "CURLY",
                  [
                    [
                      "LITERAL",
                      [
                        "7"
                      ]
                    ],
                    [
                      "LITERAL",
                      [
                        "9"
                      ]
                    ]
                  ]
                ],
                [
                  "CURLY",
                  [
                    [
                      "LITERAL",
                      [
                        "2"
                      ]
                    ],
                    [
                      "LITERAL",
                      [
                        "1"
                      ]
                    ]
                  ]
                ]
              ]
            ]
          ],
          [
            "LITERAL",
            [
              "\\,"
            ]
          ],
          [
            "FUN1nb",
            [
              "\\mathrm"
            ],
            [
              "CURLY",
              [
                [
                  "LITERAL",
                  [
                    "N"
                  ]
                ]
              ]
            ]
          ],
          [
            "DQ",
            [
              "CURLY",
              [
                [
                  "FUN1",
                  [
                    "\\vphantom"
                  ],
                  [
                    "CURLY",
                    [
                      [
                        "LITERAL",
                        [
                          "A"
                        ]
                      ]
                    ]
                  ]
                ]
              ]
            ],
            [
              "CURLY",
              [
                [
                  "CURLY",
                  [
                    [
                      "LITERAL",
                      [
                        "2"
                      ]
                    ]
                  ]
                ]
              ]
            ]
          ]
        ]
      ]
    ]
  ]
]

(generated via

./bin/texvcjs --usemhchem --debug --info --output=tree '{\mathrm{CH}{\vphantom{A}}_{{4}} {}+{} 2\,\left(  \mathrm{O}{\vphantom{A}}_{{2}} {}+{} {\textstyle\frac{79}{21}}\,\mathrm{N}{\vphantom{A}}_{{2}} \right) }'

)
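
As a rough first step towards replacing the CHEM_ tokens, the nested CHEM_WORD chains of the current tree can be flattened back into plain text, which an mhchem parser could then translate into regular TeX. A minimal Python sketch, assuming the list layout shown in the first tree above; the function name is hypothetical, and DOLLAR children (nested math) are deliberately not handled.

```python
def chem_word_text(node: list) -> str:
    """Concatenate the literals of a nested CHEM_WORD chain."""
    tag = node[0]
    if tag == "LITERAL":
        return node[1][0]
    if tag == "CHEM_WORD":
        # A CHEM_WORD holds a head node and the rest of the word.
        return chem_word_text(node[1]) + chem_word_text(node[2])
    raise ValueError(f"unsupported node type: {tag}")


# The "CH4" word, copied from the current internal tree
ch4 = [
    "CHEM_WORD",
    ["LITERAL", ["C"]],
    [
        "CHEM_WORD",
        ["LITERAL", ["H"]],
        ["CHEM_WORD", ["LITERAL", ["4"]], ["LITERAL", [""]]],
    ],
]
print(chem_word_text(ch4))  # CH4
```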

I was staring at https://github.com/mhchem/mhchemParser/blob/master/src/mhchemParser.ts and it seems (to some degree) similar to what texvc does; however, it is standalone, without a dependency on pegjs.

Change 905651 had a related patch set uploaded (by Physikerwelt; author: Physikerwelt):

[mediawiki/services/texvcjs@master] Prototype: Short circuit mhchem

https://gerrit.wikimedia.org/r/905651

Hello @mhchem, I just wanted to inform you that I am currently working on a port of mhchemParser from TypeScript to PHP. I would be happy to get some help from you if you're interested. Let me know if you're available, and we can figure out the details of how we can collaborate on this.

I don't think it changes anyone's plans, but please be aware that MathML 4 is adding an "intent" attribute to aid in speech generation. The MathML draft spec has some text, but it is still changing, so nothing needs to be done now. There will likely be some defined intent values related to chemistry that should be used when ce gets translated to MathML. Most likely these will be some "intent properties" such as "chemical-element" and maybe "chemical-formula" and "chemical-equation".

As an example, \ce{H2O} might generate:

<math>
  <mrow intent=':chemical-formula'>
    <mmultiscripts>
      <mi mathvariant='normal' intent=':chemical-element'>H</mi>
      <mn>2</mn>
      <none/>
    </mmultiscripts>
    <mi mathvariant='normal' intent=':chemical-element'>O</mi>
  </mrow>
</math>

Three things:

  1. I hand wrote this, so there are likely mistakes
  2. The names "chemical-formula" and "chemical-equation" came from discussions with the W3C chemistry community group but haven't really been discussed in the Math WG.
  3. I used mmultiscripts rather than msub because it will cause all sub/superscripts to align. Doing this allows much simpler output than what the ce macro for MathJax currently produces.

Using these intents allows speech generators to produce speech tailored to chemistry rather than trying to guess whether something is chemistry. MathCAT has over 2,000 lines of code plus 1,000 - 2,000 lines of tests (mhchem's output, which is the test input, can be quite verbose) trying to make that guess. I doubt other systems will bother, and if chemistry intents are used, they won't need much code to generate good speech. Potentially it also makes braille generation easier: there are special rules for chemistry in braille saying 'capitalize each letter individually', i.e. don't use word capitalization for things like "CO".
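
For illustration, MathML of the shape shown in the hand-written example could be generated along these lines. This is a Python sketch using xml.etree.ElementTree; the function name and the (symbol, count) input format are hypothetical, and the intent values are the provisional names mentioned above, which may still change.

```python
import xml.etree.ElementTree as ET


def chemical_formula(parts) -> ET.Element:
    """Build <math> for a list of (element symbol, count or None) pairs."""
    math = ET.Element("math")
    mrow = ET.SubElement(math, "mrow", {"intent": ":chemical-formula"})
    for symbol, count in parts:
        attrs = {"mathvariant": "normal", "intent": ":chemical-element"}
        if count is None:
            ET.SubElement(mrow, "mi", attrs).text = symbol
        else:
            # mmultiscripts children: base, subscript, superscript
            scripts = ET.SubElement(mrow, "mmultiscripts")
            ET.SubElement(scripts, "mi", attrs).text = symbol
            ET.SubElement(scripts, "mn").text = str(count)
            ET.SubElement(scripts, "none")  # no superscript
    return math


water = chemical_formula([("H", 2), ("O", None)])
print(ET.tostring(water, encoding="unicode"))
```

Each element symbol gets its own mi here, so something like "CO" would come out as two tagged elements rather than one identifier.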

Hello @mhchem, I just wanted to inform you that I am currently working on a port of mhchemParser from TypeScript to PHP. I would be happy to get some help from you if you're interested. Let me know if you're available, and we can figure out the details of how we can collaborate on this.

@Stegmujo, it seems you have more time than I have. Feel free to start the work. Feel free to contact me at mhchem@MartinHensel.de.

  3. I used mmultiscripts rather than msub because it will cause all sub/superscripts to align. Doing this allows much simpler output than what the ce macro for MathJax currently produces.

@NSoiffer, If you have the TexVC command that produces mmultiscripts, the mhchem parser can use that, maybe behind a feature toggle. mhchem's output is so complex, because it has to use phantoms to work against some of TeX's alignment magic. Of course, all output of that new command must look nice in all rendering modes.

@mhchem: mmultiscripts is an area where MathML learned from TeX. I can't speak to what TexVC will do, but if there was some way to get the information that something is a chemical element (mhchem -> MathML output often puts things like "CO" into a single mi, so that needs to change) along with somehow tagging the whole \ce{...} as a chemical equation, that would at least tell AT that this mess is a chemical equation instead of forcing them to guess.

In the Math WG, we envision that some system will take advantage of encouraging users to use macros so the info can get transferred. So if TeXVC added some macros that you could target, then TeXVC could use them to generate MathML with intent names and properties... at least I think it could.

Change 923597 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Create prototype of MHChemParser in PHP

https://gerrit.wikimedia.org/r/923597

Change 894565 abandoned by Stegmujo:

[mediawiki/extensions/Math@master] Add new chem test and cases for MML generation

Reason:

These testcases are most probably obsolete and have been replaced by testcases within: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/923597

https://gerrit.wikimedia.org/r/894565

Change 923597 merged by jenkins-bot:

[mediawiki/extensions/Math@master] Implement mhchemParser in PHP

https://gerrit.wikimedia.org/r/923597

Change 965077 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Add rendering to MathmL visual results to MathMLTest

https://gerrit.wikimedia.org/r/965077

Change 965218 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Fix mkern and mskip in MMLmhchemTest

https://gerrit.wikimedia.org/r/965218

Change 965528 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Fix issues in MMLmhchemTest with braces

https://gerrit.wikimedia.org/r/965528

Change 965218 abandoned by Stegmujo:

[mediawiki/extensions/Math@master] Fix mkern and mskip in MMLmhchemTest

Reason:

not necessary

https://gerrit.wikimedia.org/r/965218

Change 965077 merged by jenkins-bot:

[mediawiki/extensions/Math@master] Add rendering to MathmL visual results to MathMLTest

https://gerrit.wikimedia.org/r/965077

Change 965528 merged by jenkins-bot:

[mediawiki/extensions/Math@master] Fix issues in MMLmhchemTest with braces and macro longrightleftharpoons

https://gerrit.wikimedia.org/r/965528

Change 965686 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Fix renderings in MMLmhchemTest

https://gerrit.wikimedia.org/r/965686

Change 967187 had a related patch set uploaded (by Stegmujo; author: Stegmujo):

[mediawiki/extensions/Math@master] Add a possibility for creating a test wikipage for chemical formulas

https://gerrit.wikimedia.org/r/967187

Change 967187 merged by jenkins-bot:

[mediawiki/extensions/Math@master] Add a possibility for creating a test wikipage for chemical formulas

https://gerrit.wikimedia.org/r/967187