Page MenuHomePhabricator

Adjust texvcjs to prevent whitespace modifications in ce-tags
Closed, ResolvedPublic

Description

texvcjs adds whitepsaces and brakets that do not change the way regular latex code is rendered but are interpreted by the mhchem package

Event Timeline

The description is quite correct. mhchem input syntax is not LaTeX syntax.

  • $\ce{^227_90Th+}$
  • $\ce{X_$i$^$x$}$

Whitespace has a syntactic meaning.

  • $\ce{H-C}$ differs from $\ce{H - O} differs from $\ce{H-}$ (- becoming a bond, a minus operator or negative charge)

Even braces have a synactic meaning.

Manual and test drive at https://mhchem.github.io/MathJax-mhchem/ (This is the new version 3.0.0, uploaded today and not available over the MathJax CDN yet.)

So, my suggestion for texvcjs would be to not alter the argument of the \ce command in any way.

There is little texvcjs could test without re-implementing the mhchem parser. Having balanced pairs of {...} and $...$ are the only things that come to mind.

Regards,

The author of mhchem

@mhchem bypassing the texvc security check is not an option. There are mainly two options to adress the the problem.

  1. Altering the way how texvcjs processes the ce input step by step. A first change could be 1.1 "Do not normalize WS", a second step 1.2 "Do not add or remove curlies?".
  2. Specify a new set of grammar rules for the syntax permitted inside the ce argument. This new grammar rules would need to get a security check. Maybe @csteipp can elaborate on the options?

@Physikerwelt Seems I have to learn quite a lot here. I did not even know that texvc's purpose is security. Can you point me to documentation? What threats are you protecting against by removing whitespace as input for a JS package?

  1. sounds good as a first step. This could solve most of the issues, I guess.
  1. I am happy to help. First I need to understand what we are targeting against. Then I would need some helping pointers to the grammar rules.

@mhchem thank you very much. Do you already have a set of test cases? This would simplify development. The next step would be to implement the grammar changes. Since this is a major change, and we are doing a major update to the grammar, we could consider to add additional semantic in that phase two. For instance we could add links to the wikidata concepts assoicated with the symbols.
Implementing that in texvcjs is as simple as might ever get. All that has to be done is to replace for instance He with \href{He}{https://www.wikidata.org/wiki/Q560}
I know that incorporating this change will increase the time until this bug is resolved, but we would have a really cool new feature, which micht be an additional motivation for users to apply the new syntax.

Hi @Physikerwelt! Please be aware that the Grammar I provided is for checking validity of the input, only. I assumed that you are interested in "active" LaTeX tokens (like ^ _ & $ { } \macros etc.), so I focussed on these. But I simplified much by merging letters and digits. So the grammar isn't really a semantic grammar. It doesn't give you the typographic parsing of \ce{CO2 + C -> 2 CO}. In fact, for this particular example, having no active token, the grammar would just find spaces, letters and non-letters and say that it is valid (= whitelisted) input.

To focus on active characters was just my assumption, not knowing what security threats we are trying to tackle here. With more input from your side, I might have to rework the validity grammar a bit.

But this is not the only simplification I made. I also invented something I called a "strict and simplified syntax". This is nothing that existed before. Its purpose was to dramatically simplify our validity check.
(This "strict and simplified syntax" might become the common feature set between LaTeX/mhchem and MathJax/mhchem. Currently, MathJax/mhchem 3.0.0 is very forgiving. So much that I probably never will be able to get LaTeX/mhchem on par.)

Re-implementing a complete syntactical parsing would be huge effort I wouldn't recommend to do. You might want to take a look at https://github.com/mhchem/MathJax-mhchem/blob/master/mhchem/unpacked/mhchem.js. The first 1300 lines are the parser (lines 100 to 200 = parser engine, then 1000 lines full of grammar hidden in regexps, state changes, variable manipulations and a few JavaScript conditions).

Your auto-link idea sounds good at first. But then, how should we know that Sb2O3 has a Wikipedia article, but Hg^2+ has not. Even more, mhchem does a typographical parsing. It doesn't know whether M^{..}_i is equivalent to M^{2.}_i or not. So we wouldn't even know the link target.

For test cases, I would start with the examples at the documentation (https://mhchem.github.io/MathJax-mhchem/). I can provide more, if need be.

To come up with a completely different idea. MathJax/mhchem takes string input and transforms it into a LaTeX string that is then rendered by MathJax. This is almost 'pure' LaTeX with just 9 macros added by mhchem (mostly arrows). Would it need your security needs if you check this LaTeX equivalent instead of the original string?

mh... this issue seems to be stalled; I think we should split this up into subissues that can be fixed with ony a few commits.

I think it would be good to teach texvjs not to modify the ws at all. Also in the alt tags the space modifications and additional brakets do not really improve upon the original input. For example is E=mc^{2} easier to read than just E=mc^2 or is {\frac {1}{2}} really better than just keeping \frac{1}{2}?
What do the others think?

Change 331906 had a related patch set uploaded (by Physikerwelt):
Maintain whitespaces in chem tags

https://gerrit.wikimedia.org/r/331906

Thanks for the effort. Please be aware that this is not just about whitespace. While whitespace might be the most important issue here, there are a lot of other cases where texvcjs makes it impossible to enter mhchem syntax, e.g. $\ce{^227_90Th+}$.

I see three options at the moment

  1. develop a full grammar that handles every case --> a lot of effort unliely to happen
  2. extend the grammar of texvc so that the inside of \ce is treated just like \mbox or other literals --> some work but doable
  3. just process ce tags as they are --> the proposed patch (almost no effort c.f https://gerrit.wikimedia.org/r/#/c/331906/2/routes/mathoid.js)

@mobrovac what is your suggestion?

To have all options on the table (but I don't know if I would like it):

  1. duplicate the mhchem parser into the texvcjs code base

(The core of MathJax/mhchem is a parser that takes mhchem/ce syntax and converts it into LaTeX syntax. So instead of having "input -> texvcjs -> mhchem.js -> MathJax", one could have "input -> mhchem.js -> texvcjs -> MathJax)

While option (1) would be the correct way to go, I understand that the effort that needs to be put into it is non-trivial. I would propose to go with option (2). For option (3), as I noted on the patch, formula duplication can become problematic, but if we settle on any other option, we can live with it in production for a short while as a work-around. Finally, option (4) may look interesting now, but code copy/pasta will surely bite us in the long run. Thoughts? @Physikerwelt , could you estimate the effort needed for option (2) ?

Sorry if this is a silly question but why is texvc still necessary/desirable when using mathoid?

@mobrovac (2) might take much longer than I expect. I'll do some preliminary experiments and come up with an estimate thereafter. Currently, it's hard for me to differentiate between https://github.com/DRMF/texvcjs (which has been altered by the DRMF students quite a bit) and the upstream version.

@Pkra the question is not silly at all, but somehow offtopic.

Change 331906 abandoned by Physikerwelt:
Maintain whitespaces in chem tags

Reason:
We are developing a grammar for the ce contents. https://github.com/manfredschaefer/texvcjs

https://gerrit.wikimedia.org/r/331906

If you tell me what you're up to, I could give some feedback. You can contact me via e-mail.

@mhchem thank you. @manfredschaefer is developing a grammar for accepting the mhchem commands according to your template see https://github.com/manfredschaefer/texvcjs While this requires some effort I think it's the best solution to the problem.

@Physikerwelt @manfredschaefer Could you please contact me via e-mail? I think I could contribute with some valuable hints.

@mhchem I have some questions regarding the grammar template:
The list for the possible single macros is not complete. Is there some documentation with all literals that should be accepted?

For color the existing color checks should be reused. Which color specification are supported by mhchem? texvcjs accepts named colors(e.g. \color{red} or \color[named]{red}) as well as gray, rgb, RGB, and CMYK colors (e.g. \color[RGB]{255,0,0}). The color specification starting with a backslash (\color{\red}) is not accepted by texvcjs.

The rest of the grammar is implemented in https://github.com/manfredschaefer/texvcjs.

@manfredschaefer
You mean the line <single macro> ::= '\alpha' | '\delta' | '\mu' | '\eta' | '\gamma' ! ToDo - Complete the list? Just insert all lower and uppercase greek letters.

For \color, the color code is just passed through. MathJax has two modes, so I included both in my syntax: \color{red}{H2O}and \color{\red}{H2O}. The first argument is not altered in any way by \ce. MathJax does not know the [named]and [RGB] syntax. I wonder how you handle that?

MathJax does not know the [named]and [RGB] syntax. I wonder how you handle that?

Actually, it does, but for historic reasons in a separate extension (until v3), cf. http://docs.mathjax.org/en/latest/tex.html#color.

And mathoid loads that particular extension.

To be honest, I am not sure yet, what the best solution would be. As \color was an undocumented feature anyway, I would leave it out of texvcjx, for now.

As \color was an undocumented feature anyway, I would leave it out of texvcjx, for now.

texvc already supports \color, cf. https://meta.wikimedia.org/wiki/Help:Displaying_a_formula#Color

@Pkra Sorry for the confusion. One can use \color, of course. But the option to use it unescaped inside \ce, like in \ce{\color{red}{H2}O}, that's an undocumented feature of mhchem for MathJax. It is undocumented, rarely used, not well tested, possibly inconsistent between different implementations (MathJax, LaTeX, color.js etc.) I'd just omit it for the time being. One can always escape to math mode \ce{$\color{red}{\ce{H2}}$O}.

that's an undocumented feature of mhchem for MathJax.

Thanks for the clarification. I didn't know that.

Not sure if its quite related to this bug but

<math chem>\ce{\log_{10}(x)}</math>

creates a render error. This is a minimal example I've found which creates an error. The original expression in the IC50 article was

<ce>pIC_{50} = -\log_{10} (IC_{50}) </ce>

introduced in https://en.wikipedia.org/w/index.php?title=IC50&diff=next&oldid=718812540. It looks like it originally parsed OK but some some software change broke it later.

oh no. This seems to be a regression. We updated the MathJax version last week and also enabled the new grammar based check for \ce tags T159735.
The input <ce>pIC_{50} = -\log_{10} (IC_{50}) </ce> should throw a warning, since it's not correct after the new grammar of @mhchem and @manfredschaefer since log is a math command, but not an error.
So from the grammar the input should be accepted at least it looks very similar to the test cases.
I'll look into it a bit deeper...

This is an interesting usage of mhchem's \ce. It is semantically wrong, but the result might be okay.

  1. -\log inside \ce would be read as "begins with a bond", instead of "minus". On the other hand, the minus glyph is used for a bond, so the output should be acceptable. Anyway, I would strongly recommend to not include mathematics inside \ce, even if they might look right. (I'd also change the test cases if it was for me to decide.)
  2. As for IC_{50}, the result will be correct, but semantically, it is not a chemical formula, i.e. it does not mean a molecule consisting of Iodine and Carbon. But I am fine if you use \ce for it.
  3. pIC_{50}, however, is parsed as "stoichiometric number p, followed by IC50". Depending on the version of mhchem, the rendering might differ a bit from the desired output. I just checked and found that I could add a special rule for pIC50 (and friends) in one of the next versions of mhchem, in order to cover this unexpected edge case.

Do these comments help?

I guess this is not going to be a big problem. It likely the only example on Wikipedia which was quickly detected by inclusion in https://en.wikipedia.org/wiki/Category:Articles_with_math_render_errors

It might be better to tighten the documentation to explain what is allowed.

Thanks for the new parser. It works fine on Wikipedia.

When we will switch away from the legacy version of mhchem to a current one (see T184355), we will be very good.

There is just one major gap with your parser, that I could find. This is inline math. The following examples fail because of the wrong $...$ treatment. (I am pretty sure that is is mentioned in the grammar.)

  • <chem>$n$ H2O</chem>
  • <chem> A ->[$x$][$x_i$] B</chem>
  • <chem> NaOH(aq,$\infty$)</chem>
  • <chem> Fe(CN)_{$\frac{6}{2}$}</chem>
  • <chem> X_{$i$}^{$x$}</chem>
  • <chem> X_$i$^$x$</chem>

Question: \ce seems to be limited to <chem>. It looks to me, we are missing the use case of mixing math and chemisty, like in <math>T_{\ce{H2O}}</math>. How would you go about this?

This gap is known. The problem is that the $ sign should always be escape (I think that’s for security reasons). To use inline math one need to surround the part with \begin{math}...\end{math}.

I think the limitation to <chem> might be due to the fact that the mhchem option is only passed to the parser for that tag.

How can we resolve these gaps?

For the $ issue, what about this? You do not escape $ to \$ (within <chem>), but you escape it to some special Unicode character, for instance U+0091 (PU1 Private Use One). I would adapt mhchem accordingly. (I assume you check the content within $ with the normal security rules.)

For the other issue, what could we do about this? <math>T_{\ce{H2O}}</math> will probably be something, users will ask for.

<math chem>T_{\ce{H2O}}</math> works. (Now, we just have to solve the $ issue and apply an update (legacy=false)).

I'm happy to review any pull requests... I have the complete list of all formulae used in the WMF wikis now. Thus, changing the grammar is less risky than it used to be.
https://github.com/wikimedia/texvcjs/pulls

The original intent of the bug is resolved. However, it seems that there are still some problems. It would be awesome if we could generate individual test cases for the individual problems and file specific bugs. Those can be merged within a time frame from 1-2 days. And go live with the next regular deploy. Trying to solve everything at once would take months and maybe be less effective.

To file a specific bug, please visit
https://wikimedia.org/api/rest_v1/#!/Math/post_media_math_check_type
write down the request e.g.

curl -X POST --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json' -d 'q=A%20%2B%20B%20-%3E%5B%5Cmathrm%7Ba%7D%5D%20C%20%2B%20D' 'https://wikimedia.org/api/rest_v1/media/math/check/chem'

the response

{
  "success": true,
  "checked": "A+B->[\\mathrm {a} ]C+D",
  "requiredPackages": [],
  "identifiers": [
    "A",
    "B",
    "\\mathrm{a}",
    "C",
    "D"
  ],
  "endsWithDot": false
}

and the expected response.

@Physikerwelt The documentation at https://en.wikipedia.org/wiki/Help:Displaying_a_formula#Chemistry states "Please note that there are still major issues with mhchem support in MediaWiki." and links to this task. Should that be updated now this task is resolved?

Let me observe that despite this ticket being "resolved", most of the examples in the first post still do not render properly.

@mhchem That the ticket is resolved does not mean that every visitor of Wikipedia gets the perfect experience. It means that the input that is passed to mathjax is not badly modified before rendering. Otherwise please provide test cases as described above:

To file a specific bug, please visit
https://wikimedia.org/api/rest_v1/#!/Math/post_media_math_check_type
write down the request e.g.

curl -X POST --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json' -d 'q=A%20%2B%20B%20-%3E%5B%5Cmathrm%7Ba%7D%5D%20C%20%2B%20D' 'https://wikimedia.org/api/rest_v1/media/math/check/chem'

the response

{
  "success": true,
  "checked": "A+B->[\\mathrm {a} ]C+D",
  "requiredPackages": [],
  "identifiers": [
    "A",
    "B",
    "\\mathrm{a}",
    "C",
    "D"
  ],
  "endsWithDot": false
}

and the expected response.

I would very much appreciate that. I really want to resolve the problems but I need concrete examples to make (incremental) changes. I also developed a simple testing frontend https://mathoid.beta.math.wmflabs.org/ that visualizes the effects of the texvc modifications. However, I had a hard time finding examples using this frontend.

@mhchem note. Now, as we have found the Community Group Math, we are in a position to perform more severe modifications to the rendering pipeline.

So just taking the first mention in this task.

$\ce{H-C}$ differs from $\ce{H - O}$. We can take as reference the output from the mhchem github page. https://mhchem.github.io/MathJax-mhchem/

image.png (370×652 px, 16 KB)

there is a clear difference in the the rendered output of the two formula.

If we do this in wikipedia

curl -X POST "https://wikimedia.org/api/rest_v1/media/math/check/chem" -H  "accept: application/json" -H  "Content-Type: multipart/form-data" -F "q=\ce{H-C}\ce{H - C}"
{
"success": true,
"checked": "{\\ce {H-C}}{\\ce {H - C}}",
"requiredPackages": [
  "mhchem"
],
"identifiers": [],
"endsWithDot": false
}

With hash
x-resource-location: cd4760b5999de62defd7c97cb841e905e896b6db

rendering this with

curl -X GET "https://wikimedia.org/api/rest_v1/media/math/render/png/cd4760b5999de62defd7c97cb841e905e896b6db" -H  "accept: image/svg+xml"

image.png (42×146 px, 1 KB)

or as png

image.png (28×125 px, 1 KB)

Both formula render the same.

@SalixAlba This example clearly shows that texvcjs does NOT modify the spacing. The rendering is the same with or without texvc. Also the test page confirms, that this is the same behavior as with the latest development version.

Screenshot_2020-05-06 mathoid test-page.png (657×506 px, 14 KB)

The screenshot difference that your browsers renders if one regards the test-page with the Common HTML layout (and not in the MathML/SVG Layout) however clearly shows more space in the second example. If I visit the page with MathML rendering I see the following output

Screenshot_2020-05-06 Manual for mhchem for MathJax, mhchem for KaTeX.png (190×260 px, 5 KB)

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mstyle mathcolor="#a33e00">
    <mrow class="MJX-TeXAtom-ORD">
      <mrow class="MJX-TeXAtom-ORD">
        <mi mathvariant="normal">H</mi>
      </mrow>
      <mrow class="MJX-TeXAtom-ORD">
        <mo>&#x2212;<!-- − --></mo>
      </mrow>
      <mrow class="MJX-TeXAtom-ORD">
        <mi mathvariant="normal">C</mi>
      </mrow>
    </mrow>
    <mrow class="MJX-TeXAtom-ORD">
      <mrow class="MJX-TeXAtom-ORD">
        <mi mathvariant="normal">H</mi>
      </mrow>
      <mrow class="MJX-TeXAtom-ORD">

      </mrow>
      <mo>&#x2212;<!-- − --></mo>
      <mrow class="MJX-TeXAtom-ORD">

      </mrow>
      <mrow class="MJX-TeXAtom-ORD">
        <mi mathvariant="normal">C</mi>
      </mrow>
    </mrow>
  </mstyle>
</math>

Which shows that Firefox 75 does not add space for the empty mrow elements around the second <mo>&#x2212;<!-- − --></mo>.

If I choose the mml output from the rest server

https://wikimedia.org/api/rest_v1/media/math/render/mml/cd4760b5999de62defd7c97cb841e905e896b6db

The all spacing info is removes

<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="{\ce {H-C}}{\ce {H - C}}">
  <semantics>
    <mrow>
      <mrow class="MJX-TeXAtom-ORD">
        <mtext>H</mtext>
        <mrow class="MJX-TeXAtom-ORD">
          <mo>
            &#x2212;
            <!-- − -->
          </mo>
        </mrow>
        <mtext>C</mtext>
      </mrow>
      <mrow class="MJX-TeXAtom-ORD">
        <mtext>H</mtext>
        <mrow class="MJX-TeXAtom-ORD">
          <mo>
            &#x2212;
            <!-- − -->
          </mo>
        </mrow>
        <mtext>C</mtext>
      </mrow>
    </mrow>
    <annotation encoding="application/x-tex">{\ce {H-C}}{\ce {H - C}}</annotation>
  </semantics>
</math>

(I also not the elements are <mtext>H</mtext> rather than <mi mathvariant="normal">H</mi> but not important).

If I examine the source from the git hub page in Firefox/Chrome developer tools I get a more complex html, (just showing the second H - O for brevity)

	 <span class="texatom" id="MathJax-Span-5481" style="color: rgb(163, 62, 0);">
	  <span class="mrow" id="MathJax-Span-5482" style="color: rgb(163, 62, 0);">
	   <span class="texatom" id="MathJax-Span-5483" style="color: rgb(163, 62, 0);">
	    <span class="mrow" id="MathJax-Span-5484" style="color: rgb(163, 62, 0);">
		 <span class="mi" id="MathJax-Span-5485" style="font-family: MathJax_Main; color: rgb(163, 62, 0);">H</span>
		</span>
	   </span>
	   <span class="texatom" id="MathJax-Span-5486" style="color: rgb(163, 62, 0);">
	    <span class="mrow" id="MathJax-Span-5487" style="color: rgb(163, 62, 0);">
		</span>
	   </span>
	  <span class="mo" id="MathJax-Span-5488" style="font-family: MathJax_Main; padding-left: 0.222em; color: rgb(163, 62, 0);">−</span>
	  <span class="texatom" id="MathJax-Span-5489" style="padding-left: 0.222em; color: rgb(163, 62, 0);">
	   <span class="mrow" id="MathJax-Span-5490" style="color: rgb(163, 62, 0);"></span>
	  </span>
	  <span class="texatom" id="MathJax-Span-5491" style="color: rgb(163, 62, 0);">
	   <span class="mrow" id="MathJax-Span-5492" style="color: rgb(163, 62, 0);">
	    <span class="mi" id="MathJax-Span-5493" style="font-family: MathJax_Main; color: rgb(163, 62, 0);">O</span>
	   </span>
	  </span>
	 </span>

which clearly has some padding info.

This seems to point to a difference between the client side and server side MathJax rendering.

@SalixAlba Thank you for your analysis. This increases my motivation to work on T237516 in the upcoming hackathon weekend.

A have to admit that the wrong rendering of \ce{H - C} (as if it were \ce{H-C}) is not the fault of texvcjs removing spaces, any more. It is because the mhchem "legacy" version is still being used. (So, this would indeed be another ticket. Looking forward to switching to MathJax3 T237516).
But formulas with $ are still being modified by texvcjs, as mentioned in the first comment.