HTML tags shouldn't be escaped before passing to MathJax
Closed, ResolvedPublic

Description

Or "&" is rendered as "amp;", "<" as "lt;" and ">" as "gt;".


Version: unspecified
Severity: critical

bzimport added a project: Math.Via ConduitNov 22 2014, 12:21 AM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz36059.
liangent created this task.Via LegacyApr 18 2012, 8:33 AM
liangent added a comment.Via ConduitApr 18 2012, 8:36 AM

(In reply to comment #0)

Or "&" is rendered as "amp;", "<" as "lt;" and ">" as "gt;".

Should be "If '&' is expected there, '&' is eaten and the rest is shown as plain text. Otherwise it triggers an error and the whole line is not rendered."

Schnark added a comment.Via ConduitApr 18 2012, 9:46 AM

Reproducible on mw.org, but not on http://leuksman.com/mw
On leuksman.com <math>a<b</math> renders correctly, while on mw.org a box with a&lt;b (or \displaystyle a&lt;b, depending on where you insert the formula) is displayed.

liangent added a comment.Via ConduitApr 18 2012, 9:52 AM

(In reply to comment #2)

Reproducible on mw.org, but not on http://leuksman.com/mw
On leuksman.com <math>a<b</math> renders correctly, while on mw.org a box with
a&lt;b (or \displaystyle a&lt;b, depending on where you insert the formula) is
displayed.

Guessing it's related to Tidy.

liangent added a comment.Via ConduitApr 18 2012, 9:55 AM

(In reply to comment #2)

Reproducible on mw.org, but not on http://leuksman.com/mw
On leuksman.com <math>a<b</math> renders correctly, while on mw.org a box with
a&lt;b (or \displaystyle a&lt;b, depending on where you insert the formula) is
displayed.

By the way, on leuksman.com, before the MathJax transformation is applied, raw HTML (with escaped &lt; stuff) is displayed making it not so pretty.

bzimport added a comment.Via ConduitApr 18 2012, 10:45 AM

mal.malego wrote:

Come on. This has been resolved over two years ago in my mathJax user script (en.wikipedia.org/wiki/User:Nageh/mathJax). Why does it reappear in the MediaWiki code? I wished the devs would give a little bit more feedback about what they are doing when they are reusing my code but then cut away stuff out of what seems pure ignorance. If you want to try a working MathJax implementation, try my user script.

TheDJ added a comment.Via ConduitApr 18 2012, 6:59 PM

@Nageh, please do not jump to conclusions that quick. You might not have realized it, but looking into it, it seems your script was actually creating invalid HTML for the "math/tex" script element. The content is not HTML escaped.

The Wikipedia developers (in this case, brion and 2 volunteer patch contributors (one of them me)) are accustomed to writing certain things in certain ways. Apparently one of the developers added escaping to the HTML element creation, because that's what he's used to doing. That has brought forward that actually all the time the script element was not properly created and read out again. Now it's written correctly, but of course reading has broken, which is why this ticket was filed. Stuff like that happens, it's just part of the development cycle.

From en.wikipedia with Nageh MathJax script:
<script type="math/tex" id="MathJax-Element-408">\displaystyle u'' + p(x)u' + q(x)u=f(x),\quad x>a </script>

Should be:
<script type="math/tex" id="MathJax-Element-408">\displaystyle u'' + p(x)u' + q(x)u=f(x),\quad x&gt;a </script>

bzimport added a comment.Via ConduitApr 19 2012, 4:10 PM

mal.malego wrote:

Fair enough, and obviously my criticism was unwarranted. Sorry for my attack.

At the same time this is actually something that MathJax should be expected to take care of. Even in the standard installation of MathJax, no matter whether you write < or &lt; the symbol will be put unescaped into the script element. I will report this as a bug to the MathJax devs.

bzimport added a comment.Via ConduitApr 19 2012, 9:58 PM

mal.malego wrote:

I really should have thought further before posting my last message. There is nothing wrong with leaving the <, >, and & symbols un-escaped because the maths text will be added as a text node(!) to the DOM, and thus will NOT be interpreted by the HTML parser and will NOT create invalid HTML, and my script was NOT broken. ;)

bzimport added a comment.Via ConduitApr 23 2012, 8:35 PM

mal.malego wrote:

As I said, comments by others are always being ignored. Now 1.20wmf1 has been deployed on the English Wikipedia, and all the TeX code is broken. < gets mangled to &lt; and then mangled again to &amp;lt;. Sigh.

MZMcBride added a comment.Via ConduitApr 24 2012, 7:50 PM

(In reply to comment #9)

As I said, comments by others are always being ignored. Now 1.20wmf1 has been
deployed on the English Wikipedia, and all the TeX code is broken. < gets
mangled to &lt; and then mangled again to &amp;lt;. Sigh.

This sounds rather serious. Do you have a link to an example of such breakage?

bzimport added a comment.Via ConduitApr 25 2012, 7:11 PM

mal.malego wrote:

Just selected "Leave it as TeX" in the Preferences->Appearance menu. Then open any page that includes TeX code with any of <, >, or &, and view the source (or the text that is displayed). Example page: http://en.wikipedia.org/wiki/Decimal_representation . I have implemented a work-around for the mathJax user script for the moment.

MZMcBride added a comment.Via ConduitApr 26 2012, 3:53 AM

(In reply to comment #11)

Just selected "Leave it as TeX" in the Preferences->Appearance menu. Then open
any page that includes TeX code with any of <, >, or &, and view the source (or
the text that is displayed). Example page:
http://en.wikipedia.org/wiki/Decimal_representation . I have implemented a
work-around for the mathJax user script for the moment.

Ah, that makes more sense. This is only a problem for people with that particular user preference set. So instead of seeing (for example) "$ r_n\leq x &lt; r_n+\frac{1}{10^n}.\, $", people should be seeing "$ r_n\leq x < r_n+\frac{1}{10^n}.\, $"? Is that correct?

This alone wouldn't be a high priority. Are you saying that the escaping is also messing up the MathJax user gadget?

bzimport added a comment.Via ConduitApr 26 2012, 11:45 AM

mal.malego wrote:

Yes, that is correct. Take a look at the HTML source and you'll notice that the reason you see &lt; is because the ampersand is encoded as &amp;, which is followed by "lt;" as a normal text. However, the HTML source should simply contain &lt; which would get rendered as <.

Also, this change has only been introduced with 1.20wmf1 so it's a quite disappointing to hear that reverting this flawed change isn't a high priority.

The change was also messing up the mathJax user script, but I have implemented a work-around. So whether you decide to fix this or not, I don't know, but I'm not the only one to note that the MediaWiki devs community is pretty remote from its users (see comments at sections http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#New_.22diff.22_view_is_horrible_and_illegible and http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Change_notifcations_to_editors ).

SalixAlba added a comment.Via ConduitMay 3 2012, 5:11 PM

I've now switched to the experimental mathJax and this bug shows itself in the differential equations example in Help:Formula http://en.wikipedia.org/wiki/Help:Formula#Differential_equation. Rather than the correct output its displaying

\displaystyle  u'' + p(x)u' + q(x)u=f(x),\quad x&gt;a

with the &gt; instead of >.

SalixAlba added a comment.Via ConduitMay 3 2012, 5:12 PM

Add to block list for 31406 as Help:Formula broken.

jhsoby added a comment.Via ConduitMay 3 2012, 5:17 PM
  • Bug 36485 has been marked as a duplicate of this bug. ***
SalixAlba added a comment.Via ConduitMay 4 2012, 7:56 AM

It also messing with arrays and matrices
\begin{matrix}
x & y \\
z & w
\end{matrix}
This is converted this to &amp; before passing to MathJax so it is rendering as
x amp;y
z amp;w
This is more serious as all matrices, case statements, multiline equations, and tables are broken. See http://en.wikipedia.org/wiki/Help:Formula#Fractions.2C_matrices.2C_multilines for a whole section of broken examples.

He7d3r added a comment.Via ConduitMay 4 2012, 9:09 AM

In terms of articles containing formulas, I would consider the appropriate "Severity" for this bug is "blocker" for the use of MathJax on Wikipedia, since every formula containing <, >, or & is broken.

Setting the fields accordingly, since this is consistent with the description of the fields at [[mw:Bug management/Bugzilla usage#Priority]]. I would set also "priority=highest" but I'm not sure about the availability of devs for fixing this... So I'll leave it to some developer to check the appropriate priority for this bug.

See also reports on

  • [[Wikipedia talk:WikiProject Mathematics#MathJax]]
  • [[de:Portal Diskussion:Mathematik#Mathjax wird getestet!]]
He7d3r added a comment.Via ConduitMay 4 2012, 9:16 AM

(In reply to comment #18)

See also reports on

  • [[Wikipedia talk:WikiProject Mathematics#MathJax]]
  • [[de:Portal Diskussion:Mathematik#Mathjax wird getestet!]]

and
https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=490606575#TeX_broken

Schnark added a comment.Via ConduitMay 5 2012, 8:28 AM
  • Bug 36491 has been marked as a duplicate of this bug. ***
Schnark added a comment.Via ConduitMay 15 2012, 8:56 AM
  • Bug 36842 has been marked as a duplicate of this bug. ***
Schnark added a comment.Via ConduitMay 21 2012, 7:25 AM
  • Bug 36977 has been marked as a duplicate of this bug. ***
TheDJ added a comment.Via ConduitJun 2 2012, 9:52 AM

Fixed the double escaping, but the generated DOM is still unescaped, which I don't really like.

That probably requires changes in MathJax though.

brion added a comment.Via ConduitJun 2 2012, 10:00 AM

Changeset link?

He7d3r added a comment.Via ConduitJun 4 2012, 11:20 AM

I think it is change I6d548d06 (https://gerrit.wikimedia.org/r/#/c/9739/)

SalixAlba added a comment.Via ConduitJun 4 2012, 1:51 PM

Not sure what the desired output for the MW_MATH_SOURCE should be. This will not parsed by MathJax so we end up with illegal and untreated html. Perhaphs better to pass it to htmlspecialchars().

SalixAlba added a comment.Via ConduitJun 4 2012, 3:11 PM

Just noticed that MathJax suports \lt and \gt for < and >. These could solve the problem with < and >.
http://www.onemathematicalcat.org/MathJaxDocumentation/TeXSyntax.htm#L

The & used in matrices and arrays need not be a error
According to http://www.w3.org/TR/html5/tokenization.html#data-state "& " is legal and interpreted as Ampersand then space.

There are a few subtitles which we might need to watch for. \& is a literal ampersand, \>is an alternate medium space. I don't know if html entities are allowed but they seem to work x &#x229D; y gives x circled minus y.

bzimport added a comment.Via ConduitJun 8 2012, 5:40 PM

mal.malego wrote:

The data is content of a script(!) element, of course unescaped entities are legal there! Do you escape your < and > signs in javascript code???

liangent added a comment.Via ConduitJun 8 2012, 6:12 PM

Is it really so easy to fix? gerrit 10708.

liangent added a comment.Via ConduitJun 8 2012, 6:32 PM

Marking as fixed. There're already two patches which are independently worked out and almost the same. Reopen if they don't fix this issue.

bzimport added a comment.Via ConduitJun 9 2012, 11:01 AM

mal.malego wrote:

Yes, it is extremely easy to fix. That's why I didn't understand the cold shoulders I got.

Anomie added a comment.Via ConduitJun 24 2012, 6:03 PM

(In reply to comment #30)

Marking as fixed. There're already two patches which are independently worked
out and almost the same. Reopen if they don't fix this issue.

Is it really appropriate to mark it as fixed when the patches haven't been reviewed or merged yet?

SalixAlba added a comment.Via ConduitJun 24 2012, 6:24 PM

I don't think the fix is quite correct as it introduces problems for the source option. What I think you want is:

		if( $this->mode == MW_MATH_SOURCE ) {
			# No need to render or parse anything more!
			# New lines are replaced with spaces, which avoids confusing our parser (bugs 23190, 22818)
			return Xml::element( 'span',
				$this->_attribs(
					'span',
					array(
						'class' => 'tex',
						'dir' => 'ltr'
					)
				),
				'$ ' . str_replace( "\n", " ",  htmlspecialchars($this->tex) ) . ' $'
			);
		}
		if(  $this->mode == MW_MATH_MATHJAX ) {
			# No need to render or parse anything more!
			# New lines are replaced with spaces, which avoids confusing our parser (bugs 23190, 22818)
			return Xml::element( 'span',
				$this->_attribs(
					'span',
					array(
						'class' => 'tex',
						'dir' => 'ltr'
					)
				),
				'$ ' . str_replace( "\n", " ", $this->tex ) . ' $'
			);
		}

You want to call htmlspecialchars in MW_MATH_SOURCE but not MW_MATH_MATHJAX. You might also want to mention the bug in the comments.

liangent added a comment.Via ConduitJun 24 2012, 6:29 PM

(In reply to comment #33)

You want to call htmlspecialchars in MW_MATH_SOURCE but not MW_MATH_MATHJAX.

I don't understand why. It will be further escaped by Xml::element(), so you'll see still see double-escaped TeX in source mode.

SalixAlba added a comment.Via ConduitJun 24 2012, 6:45 PM

Good point. Its worth checking we get legal html in source mode when its reviewed.

liangent added a comment.Via ConduitJul 12 2012, 5:34 AM

already merged

Add Comment