
Remove texvc
Open, Needs Triage · Public

Description

Since texvc is not needed any more, unmaintained and causing numerous bugs, it should be removed.

Event Timeline

Debenben created this task. · Mar 4 2018, 9:21 PM
Restricted Application added a subscriber: Aklapper. · Mar 4 2018, 9:21 PM
Reedy added a subscriber: Reedy. · Mar 5 2018, 1:18 AM

texvc needs to be removed from mediawiki-config first (if the config is still using it in deployment, or if any user preferences select a rendering method that uses texvc). Then it can be cleaned up from Puppet (removing the packages from servers).

Then it can be stripped out of the extension.

I just added, as examples, some problems that would be solved. Of course there are many more (all tickets that are tagged with texvc and those that don't have Phabricator tickets yet, because people have given up reporting them).

Reedy removed a subscriber: Reedy. · Apr 4 2018, 2:05 PM

@TheDJ technically it is. However, even if we removed texvc, we would still have texvcjs. I got the impression that the intention is to remove texvcjs as well, which would have to be carefully considered with regard to security, and it would also be a major change to wikitext.

TheDJ added a comment. · Edited · Apr 4 2018, 3:10 PM

@Debenben So the intent of this ticket is to remove user input validation completely, so that users can add whatever they want (including invalid LaTeX), and to make it the responsibility of the rendering engine to validate the input?

[edit] Oh and as a side effect, to allow different rendering engines to potentially allow different subsets of input, so we give up on the requirement that <math> is consistent and renderer independent.

The problem is that texvcjs does not prevent the user from adding invalid LaTeX. There are mainly two cases where MathJax is more tolerant than LaTeX: it allows unescaped % (which would start a comment in LaTeX), and LaTeX primitives with multiple arguments are treated like normal macros and don't need extra brackets (a historic relic in LaTeX). Neither of these cases is corrected or rejected by texvcjs. What texvcjs mainly does is destroy or reject valid LaTeX code, or make it harder to read.

It would be a nice feature to have something that detects these edge cases and rejects invalid LaTeX code (or a simple bot that calls LaTeX and lists all equations where it fails to render), but on a priority list that should be less important than a rendering service which actually works.

Debenben added a comment. · Edited · Apr 4 2018, 10:15 PM

I have to correct my earlier statement that MathJax does not treat % as a comment: it does. Only Wikipedia's own texvc is different. This is a very bad feature because everyone uses LaTeX and nobody knows texvc. An unescaped % should add the page to an error category so that it can be corrected everywhere.

For those that are unfamiliar with writing mathematical articles I'll provide an example of what I mean by destroying valid latex code: Try rendering

<math>\begin{align}
a + b &= c \\  [c] &= \mathrm m
\end{align}</math>

which is valid latex markup (assuming the align environment has been defined properly e.g. by including the amsmath package). The source code returned by texvcjs is

{\begin{aligned}a+b&=c\\[c]&=\mathrm {m} \end{aligned}}

This is not just unreadable; it is also wrong to remove the space after \\ because that turns [c] into an optional argument. This is in contrast to the m, where nobody would write both spaces and brackets. Furthermore, the align environment was changed to aligned, which behaves slightly differently.

Those changes are usually hidden from the editors, so what they see for the standard SVG fallback is a confusing, partially untranslated error message (here the German version):

Fehler beim Parsen (Konvertierungsfehler. Der Server („https://wikimedia.org/api/rest_“) hat berichtet: „Cannot get mml. TeX parse error: Bracket argument to \\ must be a dimension“): {\displaystyle {\begin{aligned}a+b&=c\\[c]&=\mathrm {m} \end{aligned}}}

Or, in the case of PNG, the error message tells the user to check their LaTeX installation. It even prints the original, correct LaTeX code, thereby hiding the cause of the error completely:

Fehler beim Parsen (PNG-Konvertierung fehlgeschlagen. Bitte die korrekte Installation von LaTeX und dvipng überprüfen (oder dvips + gs + convert)): \begin{align} a + b &= c \\ [c] &= \mathrm m \end{align}

This is only one small example of a very common behaviour that is unique to Wikipedia's texvc. In case it is not convincing, I can provide more examples. Everywhere else it works correctly, and in case there is a true syntax error, one usually gets a meaningful error message in one's language of choice that says exactly what is wrong.
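The whitespace problem from the example above could be flagged mechanically before any normalization is applied. The following is only a minimal Python sketch, assuming that whitespace between \\ and a following [ must be preserved; the function name find_risky_linebreaks is hypothetical and not part of texvc or texvcjs:

```python
import re

# A LaTeX line break (\\) followed by whitespace and then "[".
# Removing that whitespace can turn "[...]" into an optional argument of \\.
# The lookbehind avoids matching in the middle of a longer backslash run.
RISKY_BREAK = re.compile(r"(?<!\\)\\\\\s+\[")


def find_risky_linebreaks(tex: str) -> list[int]:
    """Return the offsets of every \\\\ whose trailing whitespace is significant."""
    return [match.start() for match in RISKY_BREAK.finditer(tex)]


if __name__ == "__main__":
    source = r"a + b &= c \\  [c] &= \mathrm m"
    print(find_risky_linebreaks(source))
```

A normalizer could refuse to touch whitespace at the reported offsets instead of collapsing it.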

@TheDJ Now that the rendering part of texvc is gone, maybe we can move forward on this one. Sorry in advance if the following explanation is too lengthy or trivial:

LaTeX is not easy to validate because it is based on macros. For example, \somemacro{\othermacro} is valid iff \somemacro is defined in the given environment, takes zero or one mandatory argument and, in the case of one argument, accepts \othermacro, which in turn has to be defined with zero mandatory arguments... Because you don't want to define all macros yourself, there are popular packages like amsmath which you can include. Any proper validation has to know all macro definitions, primitives and characters that can be used. The easiest way to build a validation would be to take XeLaTeX or LuaLaTeX, include a subset of supported packages and submit the input as a tex file (otherwise LaTeX commands could become command-line arguments).
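To make the last idea concrete, here is a rough Python sketch of such a validation. It is a sketch under stated assumptions, not a reviewed implementation: the package whitelist, the list of display environments and the function names build_document and validate are made up for illustration, and it assumes a TeX Live style engine that accepts the -interaction, -halt-on-error and -no-shell-escape options. The input is written to a file and never passed on the command line:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical whitelist of supported packages (an assumption for illustration).
ALLOWED_PACKAGES = ("amsmath", "amssymb")
# amsmath display environments that must not be nested inside \[ ... \].
DISPLAY_ENVS = ("align", "gather", "multline", "equation")


def build_document(formula: str) -> str:
    """Wrap the formula in a minimal document; only outer whitespace is trimmed."""
    body = formula.strip()
    starts_display = any(
        body.startswith(f"\\begin{{{env}}}") or body.startswith(f"\\begin{{{env}*}}")
        for env in DISPLAY_ENVS
    )
    if not starts_display:
        body = f"\\[{body}\\]"
    preamble = "\n".join(f"\\usepackage{{{pkg}}}" for pkg in ALLOWED_PACKAGES)
    return (
        "\\documentclass{article}\n"
        f"{preamble}\n"
        "\\begin{document}\n"
        f"{body}\n"
        "\\end{document}\n"
    )


def validate(formula: str, engine: str = "xelatex"):
    """Return (ok, engine_output), or None if the engine is not installed."""
    if shutil.which(engine) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "formula.tex").write_text(build_document(formula), encoding="utf-8")
        result = subprocess.run(
            [engine, "-interaction=nonstopmode", "-halt-on-error",
             "-no-shell-escape", "formula.tex"],
            cwd=tmp, capture_output=True, text=True, errors="replace", timeout=60,
        )
    return result.returncode == 0, result.stdout
```

The engine output returned by validate could be shown to the editor as the error message, since it already names the failing line and macro.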

Different rendering engines, e.g. MathJax vs. KaTeX, support different subsets of packages. Usually they also allow certain features to be disabled selectively. Most importantly: they all respect the common rules of the LaTeX language, well... all except one. The odd one out is called Wikipedia or texvc. Texvc is not rejecting things; it is modifying the source code in inexplicable ways without any reason at all. In particular:

  • The Wikipedia-editor generally knows best how to make the source code human readable so don't change that
  • Never remove whitespaces, depending on the environment (like in the example above) they may be significant
  • Optional arguments are a basic feature of LaTeX, don't falsely create them (like in the example above) or destroy them, e.g. changing \cfrac[l]{a}{b} to {\cfrac {[}{l}}]{a}{b} is wrong
  • Starred names for alternative versions are another basic feature, don't break them (e.g. changing \operatorname*_a to \operatorname {*} _{a} is wrong)
  • Don't escape unescaped %, it denotes a comment and don't escape unescaped $
  • Support a well defined subset of commands e.g. everything defined in the amsmath package without exceptions.
  • In case the input is incorrect, provide a proper error message (simplest solution: just use the one given by the rendering engine). For example \begin{align}123@56\end{align} should just render 123@56. Because texvc has problems with @, you get 'unknown function "\begin{align}"'.
  • Modifications to parts of an equation should never influence the rendering of completely unrelated parts, (e.g. for the mhchem package texvc is jumping between two different syntax rules, both of them incorrect, depending on whether the input contains seemingly random, secret magic words or symbols).

Summary: A validation that rejects input and returns a proper error message is desirable and is probably most easily implemented by disabling certain MathJax features. A "validation" that modifies the source code like texvc is incompatible and not desirable.

TheDJ added a comment. · May 22 2018, 4:33 PM

I suggest you take this up with @Physikerwelt and the security team as I don't have the days available to figure out what can and cannot be done about this.

TheDJ removed a subscriber: TheDJ. · May 22 2018, 4:33 PM

@Physikerwelt I guess there are two things that need to be done:

  1. We need to escape $ and % in the wikitext in case they are not escaped. Are there other modifications where the current rendering could be desired and would change? For the article namespace in the English and German Wikipedia I did this by downloading the latest of those huge outdated XML dumps and searching them with awk, because the XML parsers were extremely slow. I would prefer a method where I don't have to download those dumps for all Wikimedia projects. Is there a database of all original LaTeX strings on Wikimedia projects that you could give me or my tool-labs account access to? Then I could do that.
  2. As long as those maction things are turned off I don't expect any security issues since MathJax is used by a lot of other websites (math.stackexchange, GitHub...). However I am not an expert on this, so we should request a security review. What version of software would need this review: MathJax v3, MathJax, MathJax-node, Mathoid-Mathjax-node, Mathoid...?
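For the first point, the search for unescaped $ and % could look roughly like the following Python sketch. It assumes the wikitext of a page is already available as a string (how to obtain it without the dumps is exactly the open question), and the name find_unescaped is hypothetical. A character counts as unescaped when it is preceded by an even number of backslashes:

```python
import re

# Contents of <math ...>...</math> tags in wikitext.
MATH_TAG = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL | re.IGNORECASE)
# "%" or "$" preceded by an even number (possibly zero) of backslashes.
UNESCAPED = re.compile(r"(?<!\\)(?:\\\\)*([%$])")


def find_unescaped(wikitext: str) -> list[tuple[str, list[str]]]:
    """Return (formula, offending characters) for each formula that needs fixing."""
    hits = []
    for tag in MATH_TAG.finditer(wikitext):
        formula = tag.group(1)
        chars = [m.group(1) for m in UNESCAPED.finditer(formula)]
        if chars:
            hits.append((formula, chars))
    return hits


if __name__ == "__main__":
    page = r"ok: <math>50\% of a</math>, needs fixing: <math>50% of $b</math>"
    for formula, chars in find_unescaped(page):
        print(chars, formula)
```

The same function could be run over a dump or a database export page by page, collecting the hits into a list for manual or bot-assisted correction.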

@Debenben Now that we have stopped using texvc in production, we are technically in a position to change anything we can imagine. However, the things you mention require a lot of work, so we should plan wisely before we start the implementation and make sure it suits the community's needs. I would therefore suggest that we form a committee that discusses the changes we want to make to the <math/> tag before we start with the implementation.

Note that I did not invent texvc. For the last five years, I have just been making sure to maintain backward compatibility.

Sounds great, I am happy to discuss the changes further.

Debenben moved this task from Incoming to Next-up on the Math board. · Nov 11 2018, 7:18 PM