Page MenuHomePhabricator

Deprecate texvc commands conflicting with LaTeX commands used in common packages
Open, MediumPublic

Description

  • remove them from Visual editor
  • add a warning if those commands are used

List of commands (and standard replacement):
$ \$
% \%
\and \land
\or \lor
\part \partial
\ang \angle
\C \Complex
\H \mathbb{H}
\bold \mathbf
\Bbb \mathbb

In addition, \pagecolor has no effect and should be removed.

Details

Related Gerrit Patches:
mediawiki/services/mathoid : masterImplement deprecation warnings
mediawiki/extensions/Math : masterremove problematic texvc from VE suggestions

Event Timeline

Physikerwelt triaged this task as Medium priority.Jun 21 2018, 5:12 AM
Physikerwelt created this task.
Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptJun 21 2018, 5:12 AM

Change 440737 had a related patch set uploaded (by Physikerwelt; owner: Debenben):
[mediawiki/extensions/Math@master] remove problematic texvc from VE suggestions

https://gerrit.wikimedia.org/r/440737

Change 440737 merged by jenkins-bot:
[mediawiki/extensions/Math@master] remove problematic texvc from VE suggestions

https://gerrit.wikimedia.org/r/440737

Physikerwelt updated the task description. (Show Details)Jun 21 2018, 8:27 AM

Thank you @mobrovac I think the next step is to add a warning to texvc if any of the commands above are used in the same way as we did it for invalid mhchem2 markup

Izno added a subscriber: Izno.Jun 22 2018, 5:22 PM
Vvjjkkii renamed this task from Deprecate texvc commands conflicting with LaTeX commands used in common packages to pjaaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: Aklapper, gerritbot.
WhitePhosphorus renamed this task from pjaaaaaaaa to Deprecate texvc commands conflicting with LaTeX commands used in common packages.Jul 1 2018, 2:26 AM
WhitePhosphorus lowered the priority of this task from High to Medium.
WhitePhosphorus updated the task description. (Show Details)
WhitePhosphorus added subscribers: Aklapper, gerritbot.

In https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Math/+/442124/ I added the functionality to add a new tracking category for pages that use <ce/> tags with the old syntax.

@Physikerwelt How exactly does that work? What are these "warning information from restbase"?

if the expression is not valid according to the grammar rules in texvcjs, it issues a warning. this might not resolve all problems but it will make it easier to understand the current rendering...

Debenben moved this task from Incoming to Doing on the Math board.Nov 11 2018, 7:17 PM

I think we should also deprecate exist as alias for exist.

@Physikerwelt what do you mean with "exist as alias for exist"?

While except for some protected pages, all deprecated syntax we could find is replaced. However we currently have no way to make sure that we found everything. I did some random sampling, looking for math introduced by templates and did not find anything else, so it looks good. I think we could do one more check and then go ahead and make the deprecated math produce errors.

Sorry I wanted to write exists.

I am currently working on a tracking category for all new pages that use the deprecated macros. I will let you know when it's done.

I checked this today. And it, unfortunately, does not work as expected.

`
curl -X POST "https://en.wikipedia.org/api/rest_v1/media/math/check/tex" -H  "accept: application/json" -H  "Content-Type: multipart/form-data" -F "q=%"

does not show any warnings. I need to look at this again:-(

Replacing \exist with \exists sounds reasonable, it should be no problem. My database dump search found 1302 pages in all projects with \exist. Without the texvc changes there might be other things creating problems which we are not aware of and could be fixed with a bot, so I would like to wait for the error categories such that we do not need to edit pages twice.

Oh no. I just realized that I did only implement the tracking categories for the mhchem-deprecation pages. But not the deprecated texvc syntax. I will start this today and see how far I can get.

So first I documented the deprecations in the texvc package documentation. https://github.com/ag-gipp/texvc-latex/commit/72835b8e4636ab0ad821edaf2b72dda627d1e5d0

Change 571100 had a related patch set uploaded (by Physikerwelt; owner: Physikerwelt):
[mediawiki/services/mathoid@master] Implement deprecation warnings

https://gerrit.wikimedia.org/r/571100

@Debenben the mathoid side looks good for now. For the math extension change, I would be happy to receive a suggestion for a tracking category name. I hope I get the mathoid change merged during the upcoming week so I can work on the math extension change next weekend.

@Physikerwelt Thank you very much for creating the pull request, it is definitely a very good fist step in the right direction. I guess nobody really cares about the name of the category, we could simply use "Pages that use a deprecated format of the math tags" in analogy to the chem tracking category. I am really happy with the pull request and surprised to see how much work this was but at the same time also a bit disappointed because I was hoping for more.

As far as I can see, we still use texvc to replace e.g. \omicron with \mathrm{o} although MathJax would render \omicron just fine. This means \mathrm{o} is not italic as it should be and cannot be changed with \mathit \mathbf etc. because unknown to the user \mathrm is already specified and e.g. we still remove whitespaces although MathJax would render whitespaces just fine.

I understand that this task is just the error category and it is already very helpful to have this error category for the deprecated commands above because then we can tell if a math formula is hidden in some template which would take too long to expand in the database-dump search. I was hoping though, it could also be triggered for things like my the exaggerated example

<math>a=\sqrt[\begin{align}[did you know that \exp=\frac{1}{2}? & be careful adding other stuff, it might break the layout]\\b^2\\&\\\end{align}+c^2]</math>

which for several reasons would fail without texvc modifications. I guess the problem with this might be that it cannot be easily integrated into mediawiki, since it would probably need two rendering calls, one with and one without texvc modifications. An alternative could be to just create such a rendering mode in mathoid server or I could try to create my own local mathoid server on toollabs that can render without texvc modifications and then push all formulas from my dumps and look at the return code. With this we cannot find 100% of those problems in real time, but we would still be able to find the majority and get an idea if we have to worry about those.

@Physikerwelt Thank you very much for creating the pull request, it is definitely a very good fist step in the right direction. I guess nobody really cares about the name of the category, we could simply use "Pages that use a deprecated format of the math tags" in analogy to the chem tracking category.

That's great.

I am really happy with the pull request and surprised to see how much work this was but at the same time also a bit disappointed because I was hoping for more.
As far as I can see, we still use texvc to replace e.g. \omicron with \mathrm{o} although MathJax would render \omicron just fine. This means \mathrm{o} is not italic as it should be and cannot be changed with \mathit \mathbf etc. because unknown to the user \mathrm is already specified and e.g. we still remove whitespaces although MathJax would render whitespaces just fine.

Maybe, it did not render it well before but does so today. I will open a separate task for that. T245343 I hope this is a one-line fix.

I understand that this task is just the error category and it is already very helpful to have this error category for the deprecated commands above because then we can tell if a math formula is hidden in some template which would take too long to expand in the database-dump search. I was hoping though, it could also be triggered for things like my the exaggerated example
<math>a=\sqrt[\begin{align}[did you know that \exp=\frac{1}{2}? & be careful adding other stuff, it might break the layout]\\b^2\\&\\\end{align}+c^2]</math>
which for several reasons would fail without texvc modifications.

I lost you. Are you saying it should fail or it should not? This example is quite complex. For me personally it is easier to discuss with minimal examples. https://texfaq.org/FAQ-minxampl is a good explanation for what I mean with this term. If we have a LaTeX document and the current rendering in MW next to each other, we have a good basis to improve texvcjs step by step until we are confident. Removing texvcjs entirely is not possible due to security constraints. However, what I would like to change is to run texvcjs on save so that the content of the math tex is identical to the tex string that is rendered. This would also optimize the overall performance of MediaWiki.

I guess the problem with this might be that it cannot be easily integrated into mediawiki, since it would probably need two rendering calls, one with and one without texvc modifications. An alternative could be to just create such a rendering mode in mathoid server or I could try to create my own local mathoid server on toollabs that can render without texvc modifications and then push all formulas from my dumps and look at the return code. With this we cannot find 100% of those problems in real time, but we would still be able to find the majority and get an idea if we have to worry about those.

I did not 100% understand what you plan, but I have set up a testing environment with a clone of MW pages that use math. See https://en.beta.math.wmflabs.org/wiki/Main_Page here you can replace en by any language code. While this is preliminary a research environment I think it might be also a good playground for testing major modification prior to switching to production.

I lost you. Are you saying it should fail or it should not?

yes, it should fail and if it renders it should render different.

The example was meant to illustrate that there is not one small problem but several small problems which can potentiate, most notably:

  • The align and aligned environment are different. One difference is that aligned can take optional arguments like t and b for top and bottom alignment in square brackets and align not, therefore align should not be replaced with aligned.
  • A minimal example like "a \sqrt[b] c" should stay like it is and not become "{\sqrt[{b}]{c}}" because it makes it hard to read.
  • If you combine it with the additional problem of detecting where the optional argument begins and where it ends you somehow end up with "\sqrt {[}..." in the more complicated example.

I have not investigated why in this particular case the optional argument is destroyed. If you really want to make sure that you can detect things like those optional arguments right, you have replicate the macro expansion like LaTeX or MathJax does it when it parses the expression (e.g. hardcode into texvc what kind of expressions and environments can take what kind of arguments) and I do not see the point of this since texvcjs does not render it. It also does not provide helpful error messages to the user therefore it should not reject the input. MathJax already has the ability to reject input and provide helpful error messages, I do not see the point of replicating that either.

I lost you. Are you saying it should fail or it should not?

yes, it should fail and if it renders it should render different.
The example was meant to illustrate that there is not one small problem but several small problems which can potentiate, most notably:

  • The align and aligned environment are different. One difference is that aligned can take optional arguments like t and b for top and bottom alignment in square brackets and align not, therefore align should not be replaced with aligned.

We can change that. This seems straight forward.

  • A minimal example like "a \sqrt[b] c" should stay like it is and not become "{\sqrt[{b}]{c}}" because it makes it hard to read.

I think this is very annoying. I tried to change it long ago, but I was not successful in getting this plain tex inspired extra curlies removed. Maybe I will try again to get this change merged. On the other hand, it is just hard to read, not a real error. Or?

  • If you combine it with the additional problem of detecting where the optional argument begins and where it ends you somehow end up with "\sqrt {[}..." in the more complicated example.

Not exactly. This should be possible with pegjs rules.

I have not investigated why in this particular case the optional argument is destroyed. If you really want to make sure that you can detect things like those optional arguments right, you have replicate the macro expansion like LaTeX or MathJax does it when it parses the expression (e.g. hardcode into texvc what kind of expressions and environments can take what kind of arguments) and I do not see the point of this since texvcjs does not render it. It also does not provide helpful error messages to the user therefore it should not reject the input. MathJax already has the ability to reject input and provide helpful error messages, I do not see the point of replicating that either.

To operate independently of MathJaxs (or whatever future rendering engine) problems and possible security exploits texvc was introduced in 2003. This security feature can not be dropped without broader discussion beyond the Math functionality horizon. In conclusion, texvcjs might be replaced by another security and whitelisting guard, but it won't be dropped :-(

For the more complex example "\sqrt{[} ..." is a real error. In the simple example "{\sqrt[{b}]{c}}" is only hard to read, not a real error. Still, if the example is more complex, keeping spaces and newlines is even more important. For comparison: In C++ you can also remove newlines, scramble white-spaces, add unnecessary scopes and rename variables such that the output is still correct, but not human-readable anymore.

I am not saying that validating LaTeX code is impossible. However it is not as simple as applying some rules. Is for example "\color{\mycolor}" a valid LaTeX expression?

You cannot say, because you do not know if or how \color and \mycolor has been defined. If you define \color by including the color package you still cannot tell unless you know if or how \mycolor was definied. If you have defined "\newcommand{\mycolor}{blabla}" you still cannot tell, because you have to know how the color macro was modified, i.e. if someone has written "\definecolor{blabla}{rgb}{0,0,0}" and if you want to be really accurate you also need the definition of the \definecolor macro at the point where this was written, so you still cannot tell. The point I want to make is that you have to do all the macro expansions the same way like in LaTeX and therefore would need to replicate all the code in all the applicable LaTeX packages in order to achieve that.

I guess most people know, but just to clarify what I mean with "{\sqrt[{b}]{c}}" not being a real error: In LaTeX it depends on the definition of \sqrt and where it is placed. With the normal definition it removes extra spaces around \sqrt which in this case is not a problem because there would be no extra spacing anyway. In other situations, e.g. for relations like "a\xrightarrow\alpha b" or "a\stackrel\alpha= b" you still get wrong spacing in Wikipedia due to those extra braces.

There is no \newcommand in MediaWiki and wrong spacing is hard to fix, if the correct spacing is unknown. A sperate issue that lists the commands would be helpful. \sqrt is from the class fun_ar1 https://github.com/wikimedia/texvcjs/blob/master/lib/texutil.js#L647-L697 and stackrel from fun_ar2 \xrightarrow from fun_ar1opt this is quite a large list of functions. I wonder that it has never been discovered that the spacing is wrong due to additional brackets. Please note that the spacing is marginally different in LaTeX, MathJax, and MathML.

The list of functions where the spacing gets wrong with additional braces is everything that is defined like a unary or binary operator because those get a spacing of \medmuskip. If you place them in braces they become a subformula with spacing rules of an ordinary math expression. By default you have block layout in LaTeX where the size of all spaces between letters and words depends on how much needs to fit into a particular line.

If you search for \stackrel in enwiki https://en.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%2F%5C%5Cstackrel%2F&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 you will see that people just insert spaces manually. Workarounds like inserting the spaces manually take the average editor half a second. If they see that other editors also insert those spaces they just copy what the others do and think that the behavior is intentional.

wrong spacing is hard to fix,

There is a very easy fix: Not modify the source code at all. With this you can be sure that you do not create errors.

If you want to modify the source code the only other genuine fix is parsing everything like in LaTeX or Mathjax in order to see what characters and macros insert what spacing at the position where they are placed.

The list of functions where the spacing gets wrong with additional braces is everything that is defined like a unary or binary operator because those get a spacing of \medmuskip. If you place them in braces they become a subformula with spacing rules of an ordinary math expression. By default you have block layout in LaTeX where the size of all spaces between letters and words depends on how much needs to fit into a particular line.
If you search for \stackrel in enwiki https://en.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%2F%5C%5Cstackrel%2F&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 you will see that people just insert spaces manually. Workarounds like inserting the spaces manually take the average editor half a second. If they see that other editors also insert those spaces they just copy what the others do and think that the behavior is intentional.

wrong spacing is hard to fix,

There is a very easy fix: Not modify the source code at all. With this you can be sure that you do not create errors.
If you want to modify the source code the only other genuine fix is parsing everything like in LaTeX or Mathjax in order to see what characters and macros insert what spacing at the position where they are placed.

Mathjax and LaTeX spacing is very different. That is why it is hard to fix.

Let's focus on getting work done rather than discussing fundamentals of computer science.

Spacing in LaTeX is fine, spacing in Mathjax also. We actually do not need to fix spacing, we just have to ensure that the source code that we pass to Mathjax is still correct. The spacing is just one effect those brackets can have, in principle they can change everything, for example "\sin\limits_a^b" is correct, "{\sin}\limits_a^b" does not compile.

In my view this is a fundamental problem, it is not about those two macros. I can remember that at some point texvc inserted brackets around normal equal signs. The problem was reported and then just this one problem, brackets around equal signs was fixed. If you add an underline like "a\underline = b" then you still have the same problem. Additionally in some cases people have the opposite problem now: "bei zusammengesetzten Symbolen funktioniert dies jedoch nicht richtig..." https://de.wikipedia.org/wiki/Hilfe:TeX#Andere_Zeichen

The way to fix this is getting rid of all those modifications.

I agree with Debenden here. Sanity-checking the TeX code before passing it off to MathJax or whatever other formatter is being used: probably a good idea. Trying to mung that code without having full knowledge of the syntax accepted by LaTeX and/or MathJax: a recipe for bad formatting and worse workarounds by users trying to achieve the formatting they expect. Just don't do it.

OK. I think we understand the LaTeX commands allowed in MediaWiki quite well. And to rewrite another sanity checker that does not modify the output in any way would require quite a bit of time, probably cause new problems and would need to get a quite intensive security review, which might also need additional time. Thus I am proposing to fix what we have. In 2015 at NIST we got rid of the additional curlies

https://github.com/DRMF/texvcjs/commit/0690737efcc41d2c6112dc0adf84e636ea4acbf7

and also other additional whitespaces

https://github.com/DRMF/texvcjs/commit/959995d84d22ae833230865d4c7cc5050a6eed05

added by texvc(js), to improve the readability of the checked LaTeX code. Not surprisingly this produced better rendering results with LaTeXML (another alternative to MathJax and LaTeX), At that time I tried to apply the same change to Wikimedia's version of texvcjs. However, at that time we had two rendering engines LaTeX to produce PNG images and MathJax to generate MathML + SVG images. Since we wanted to have them look as similar as possible, we could not apply the changes to texvcjs as the extra brackets were already part of earlier LaTeX sanity checking code texvc

https://github.com/ag-gipp/texvc-ocaml/blob/master/lexer.mll

Now, since we have switched to MathJax rendering in the backend alone the only argument to not applying the changes is that current manual corrections of the spacing will make it worse in some situations. For permanent links to old versions of Wikipages this won't be fixable. If we as Wikimedia Community User Group decide that this is an acceptable price to get improved spacing in the current version I can implement that. However, we should also decide on a strategy to communicate that breaking change.

Is it not possible to use the existing sanity checker, but then if it accepts the input, discard its output and re-use the input as-is?

Physikerwelt added a comment.EditedWed, Feb 19, 6:53 AM

@DavidEppstein to do that one must show that sane texvcjsoutput implies sane input. In general this does not seem obvious to me.

Is sanity not as simple as checking that all characters are printable-ascii and all macros are from a given whitelist? (Assuming the whitelist doesn't include dangerous stuff like \csname) We don't care that it really be valid LaTeX — invalid coding that merely causes MathJax or whatever to issue an error message is still non-problematic.

OK. I think we understand the LaTeX commands allowed in MediaWiki quite well.

No, you don't. mhchem is completely unusable in MediaWiki. That is because texvc abuses arguments to the command and alters them in weird ways. Just taking the examples of the manual, 90% of it are not working. You made tiny improvement, but still, my package remains unusable with texvc.

If someone would put in the effort to make texvc a proper parser that really understands the input, I would be fine. But that doesn't seem to happen. I made the effort and created a validation syntax, four years ago, but I haven't seen any progress since. So, I have to agree to those who say: Get rid of texvc's input-changing behavior.

By the way, I am still waiting for an answer to my question: What exactly is the attack vector texvc tries to eliminate? What bad things could an attacker do if their input was handed to MathJax unmodfied?

Presumably part of the issue is that LaTeX renders on the server and slow-rendering LaTeX could DOS the server. Timeouts could help, but quickly detecting problematic LaTeX would (if implemented well) save more time and make DOS more difficult. There's some discussion of this issue at https://www.usenix.org/system/files/login/articles/73506-checkoway.pdf which, while mostly aimed at actual LaTeX installations (where file I/O could be problematic) also address online viewing situations such as this one.

I think it would be good to focus on:

Deprecate texvc commands conflicting with LaTeX commands used in common packages

Offtopic:
@mhchem thank you for your grammar rules. They ended up here https://github.com/wikimedia/texvcjs/blame/master/lib/parser.pegjs#L221-L276 three years ago. Feel free to submit amendments to them as pull requests.

Regarding the breaking changes, we have:
(1) version histories
(2) current pages where math would not render
(3) current pages where math would render differently

Concerning (1) we tried to communicate this as much as possible when asking for permission for the Texvc2LaTeXBot. We asked people to join T195861 if they want to have a say, now we should point them to https://meta.wikimedia.org/wiki/Wikimedia_Community_User_Group_Math and the mailing-list. Additionally every editsummary of the bot links to https://www.mediawiki.org/wiki/Extension:Math/Roadmap. I think the best communication strategy would be to update that page. Most of the content is currently created by me, but I am not happy with it, please feel free to do a complete rewrite with your ideas and if we do not agree on them we know what to discuss.

Concerning (2) I was hoping for a tracking category that would indicate "this page will no longer render after the breaking changes" and those breaking changes in my view should not just include removal of some macros but removal of all modifications, setting mhchem legacy option to false and possibly also update to MathJax 3.0, however I am not sure if this is possible because I guess it would need two rendering calls? Also I do not understand the tracking category https://en.wikipedia.org/wiki/Category:Pages_that_use_a_deprecated_format_of_the_chem_tags Does it indicate that the formula will not render after legacy=false or after removing texvc modifications or both? I have the impression there should be more than those 63 pages in the enwiki category. Is there a name for the category which works for all projects?

Concerning (3) we should go with the suggestion to put a version that renders with those breaking changes on https://en.beta.math.wmflabs.org/wiki/Main_Page to get an overview and maybe we can identify some patterns which are affected. @Physikerwelt If you send me a login for the beta-math-wiki I would create pages with all mchem from the database dumps. Even better would be to send it over the mailing-list, then everyone can test.

After all we can - as a last resort - always tell the bot to edit problematic formulas and replace the source code in the math tags with the source code produced by texvcjs which will fix all errors due to texvcjs modification removal.