MathML with input <math>hello</math> causes tidy to die
Closed, Declined
Public

Description

If the user selects MathML rendering for texvc, input text such as
<math>hello</math> will generate what tidy considers to be XML with "serious
errors". This causes severe corruption of the page display, on all those
discussion pages where people are using tidy-dependent signatures.


Version: 1.6.x
Severity: normal
URL: http://en.wikipedia.org/w/index.php?title=Wikipedia:Requests_for_adminship/Oleg_Alexandrov&oldid=23500666

Details

Reference
bz3504

The output text for <math>hello</math> is:

<p><math
xmlns='http://www.w3.org/1998/Math/MathML'><mi>h</mi><mi>e</mi><mi>l</mi><mi>l</mi><mi>o</mi></math>
</p><p><br />
</p>
<!-- Tidy found serious XHTML errors -->

j.niesen wrote:

proposed patch, not too nice

An ugly way to resolve this bug is to strip out all the <math> ... </math> tags
before sending the HTML to tidy and to plug them back in afterwards. The
attached patch follows this strategy. I've only tested it with external tidy,
because I couldn't get internal tidy to work.

Attached: patch3504.txt
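
The strip-and-restore strategy described above can be sketched roughly as follows. This is not the attached patch (which targets MediaWiki's PHP code); it is a minimal Python illustration. The placeholder format, the function names, and the `run_tidy` callable are all hypothetical, and the sketch assumes the tidy step passes HTML comments through unchanged.

```python
import re

# Matches a whole <math ...>...</math> element, including across newlines.
MATH_RE = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)

# Hypothetical placeholder format; assumed to survive the tidy step untouched.
PLACEHOLDER_RE = re.compile(r"<!--MATH-PLACEHOLDER-(\d+)-->")


def strip_math(html):
    """Replace each <math> element with a numbered placeholder.

    Returns the stripped text and the list of removed elements.
    """
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return "<!--MATH-PLACEHOLDER-%d-->" % (len(saved) - 1)

    return MATH_RE.sub(stash, html), saved


def restore_math(html, saved):
    """Put the saved <math> elements back in place of their placeholders."""
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], html)


def tidy_preserving_math(html, run_tidy):
    """Run a tidy-like cleanup (any callable str -> str) on everything except <math>."""
    stripped, saved = strip_math(html)
    return restore_math(run_tidy(stripped), saved)
```

With this arrangement the cleanup step never sees the MathML, so it has nothing to reject; the cost is that the MathML itself goes out unchecked.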

j.niesen wrote:

Please disregard the patch. I did not test it properly against the CVS version,
and something changed in the code.

j.niesen wrote:

I think the patch in comment #2 works, but it depends on my patch from bug 5344.

sspecter wrote:

I've made a working (but unstable) implementation with MathML on my wiki here.

Here's an example: http://www.sspecter.com/wiki/index.php/Ajuda:ASCIIMath-Sintaxe
It's in Portuguese, but you can see it working.

I solved that by implementing MathML as an extension (it is probably not parsed
by tidy, only by the sanitizer) and naming the extension tag <asciimath>, so
that it does not conflict with <math> from MathML.

My solution works but is unstable, because the sanitizer likes to generate
malformed tags inside my well-formed MathML, and my XHTML pages crash. It only
happens in some cases, such as a wiki list (*) + <asciimath>, or <asciimath>
with blank lines inside it. But I believe these problems will still happen with
comment #2's solution.

j.niesen wrote:

(In reply to comment #5)

> I solved that by implementing MathML as an extension (it is probably not
> parsed by tidy, only by the sanitizer) and naming the extension tag
> <asciimath>, so that it does not conflict with <math> from MathML.

I think that the sanitizer - specifically, Sanitizer::removeHTMLtags() - does
not touch extension tags and <math> tags, though it is hard to tell anything
from Parser.php. Furthermore, it seems that tidy is not enabled on your site.

> My solution works but is unstable, because the sanitizer likes to generate
> malformed tags inside my well-formed MathML, and my XHTML pages crash. It only
> happens in some cases, such as a wiki list (*) + <asciimath>, or <asciimath>
> with blank lines inside it.

As I said, I doubt it is the sanitizer or tidy that generates the malformed
tags. Your problem may be that the MathML you generate contains newlines.
This confuses the parser. Try replacing all the newline characters with spaces.
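
The newline replacement suggested above could look something like the following. It is a Python illustration only (the extension in question would be PHP), and `flatten_newlines` is a hypothetical helper name.

```python
import re


def flatten_newlines(markup):
    """Collapse each run of CR/LF characters into a single space."""
    return re.sub(r"[\r\n]+", " ", markup)
```

Applied to the generated MathML before it is handed back to the parser, this also removes any blank lines inside the markup.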

> But I believe these problems will still happen with comment #2's solution.

Is this just a guess, or do you have an example in which the patch does not
work? It seems to work fine on http://wiki.blahtex.org/ .

brion added a comment. Jun 3 2006, 11:43 PM

I'm going to WONTFIX this.

We're going to be ditching tidy (with its bugginess, overhead, and
annoying features) once the internal HTML normalizer is fixed up,
which it soon will be (in progress, bug 5497).

With normalization working on our own output, we don't have to
worry about tidy choking on extension output.
