
Parser: HTML table syntax (e.g. <td>) should not be parsed when inside <pre>
Closed, Declined (Public)

Description

Author: tim.trent

Description:
Description copied from support desk page (though please treat as a bug report,
not a plea for help):

Mixing Wikitable and HTML syntax

I am setting up a new wiki using the current stable version. Under the terms of
the GFDL I am copying a little (attributed) information from Wikipedia. This
includes a template which works perfectly there, but does not on my new
implementation.

I have tracked the problem down to the original template author mixing wikitable
pipe syntax and html table syntax. Ignoring the fact that this is poor
practice, I need to know what I must do to make it work on my new wiki, please.

To distill the problem I have an example:

<pre>
{|
|-
<td>
Wiki table including conventional table syntax
</td>
|-
|}

</pre>

This creates

<pre>
<td> Wiki table including conventional table syntax </td>
</pre>

in the finished article.

If I do the same on Wikipedia I just get
<pre>
Wiki table including conventional table syntax
</pre>
This is the intended end result, and thus highly desirable.


This is regrettably simple to reproduce on our pretty much vanilla installation.


Version: 1.17.x
Severity: normal
URL: http://www.mediawiki.org/wiki/Project:Support_desk#Mixing_Wikitable_and_HTML_syntax

Details

Reference
bz8948

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:31 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz8948.
bzimport added a subscriber: Unknown Object (MLST).

ayg wrote:

I vaguely recall that this has something to do with whether Tidy is enabled.
Try enabling Tidy (even if it's not installed, I think it keys off $wgUseTidy
somehow) and see if it works.

tim.trent wrote:

That is a valid workaround, yes. But the concept of running Tidy every time a
page is rendered is, at best, "unusual". What it shows is that the code produced
is buggy. This ought to be a simple issue to resolve, and, if resolved
correctly, will have no negative impact on everyone who has been constrained to
run Tidy.

ayg wrote:

Correct, this is almost certainly a bug and should be fixed. But that fact is
necessary to figure out where this is happening and/or reproduce it on a local
install.

tim.trent wrote:

The challenge with a bug like this is that it wastes an inordinate amount of
time to locate, to identify, and to track down while trying to install and
configure a wiki. Reproducing it is dead simple. You turn tidy off and use my
examples, which are the simplest illustration of the problem. Since it is
repeatable in a controlled circumstance the bug has to be within the area that
parses the table pipes (0.9 probability).

We need to ignore the fact that people mix HTML and piped syntax, and
concentrate on the simple issue that it escapes the lt and the gt - a bizarre
behaviour.

This is illegal wiki syntax; fix all templates using such constructs or they'll
break when we fix the bug.

ayg wrote:

Why is it illegal wiki syntax?

tim.trent wrote:

(In reply to comment #5)

This is illegal wiki syntax; fix all templates using such constructs or they'll
break when we fix the bug.

You know, since this illegal wiki syntax pervades wikipedia, and since I just
simplified it to show you here, I genuinely do not care one way or the other. I
do care about the tone of that message, though. I'm glad I bothered to report it.

And if it is illegal wiki syntax, why does it render a correct table at all with
Tidy turned on?

The bug is the bug.

ayg wrote:

(In reply to comment #7)

And if it is illegal wiki syntax, why does it render a correct table at all with
Tidy turned on?

I believe Brion perceives the bug to be that the table does render with Tidy on,
rather than not rendering with Tidy off. I don't know why, though.

tim.trent wrote:

There is nothing, anywhere, to state that the syntax is illegal. It is obvious
insanity to mix the syntax, but wikis allow insane people to edit.

If there is to be rigid syntax (not arguing one way or the other) then that
syntax needs to be parsed for the wiki-editor and rejected at submit time.

It's illegal because | and <td> are different things, as '' and <i> are.
But there's some disagreement on it, so we haven't yet made the fix to the tidy
mode to operate properly. :)

The difference in behavior with tidy on and off is a known problem due to the
way tidy is run at a high level while the built-in HTML nesting sanitizer is run
on smaller chunks. (A known problem for some time.)

It's possible that we'll change the built-in sanitizer to behave more like the
way we use tidy, which is IMHO sloppy and ugly and dangerous, but would remain
backwards-compatible with the existing bogus templates.
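
To illustrate the chunk-versus-page difference described above, here is a toy sketch in Python. It is not MediaWiki's sanitizer or Tidy; the `sanitize` function, the `TABLE_ONLY` set, and the rule "escape `<td>`, `<th>`, `<tr>` unless a `<table>` is already open" are all assumptions made up for this example. It only shows how the same nesting check can give different results depending on how much of the page it sees at once.

```python
import html
import re

TAG = re.compile(r"<(/?)(\w+)[^>]*>")
TABLE_ONLY = {"td", "th", "tr"}  # hypothetical: tags only allowed inside <table>

def sanitize(fragment: str) -> str:
    """Toy nesting check: escape table-only tags seen outside a <table>."""
    depth = 0  # number of <table> elements currently open in this fragment
    out, pos = [], 0
    for m in TAG.finditer(fragment):
        out.append(fragment[pos:m.start()])
        closing, name = m.group(1), m.group(2).lower()
        if name == "table":
            depth += -1 if closing else 1
            out.append(m.group(0))
        elif name in TABLE_ONLY and depth <= 0:
            # No enclosing <table> visible in this fragment: treat as stray.
            out.append(html.escape(m.group(0)))
        else:
            out.append(m.group(0))
        pos = m.end()
    out.append(fragment[pos:])
    return "".join(out)

cell = "<td>Wiki table including conventional table syntax</td>"

# Small chunk: the cell is checked on its own, so <td> looks stray.
chunk_result = "<table><tr>" + sanitize(cell) + "</tr></table>"

# Whole page: the same check sees the enclosing <table> first.
page_result = sanitize("<table><tr>" + cell + "</tr></table>")

print(chunk_result)
print(page_result)
```

In this model the chunk-wise run escapes the lt and the gt of the cell tags, while the whole-page run leaves them alone, which matches the symptom reported with Tidy off versus on.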

tim.trent wrote:

Since it is a known problem, and since it is not documented anywhere, or at
least anywhere the slightest bit obvious, then it at least should be documented
in a substantially better manner.

The argument about | vs <td> and '' vs <i> is interesting. But, since ''
generates <i> or conceivably <em> (I have not checked), and also generates the
closing tag, it seems to me that one could say with some validity that | is, in
this circumstance, equivalent to <td>.

'' and <i> are different because '' may turn into <i> or </i> depending on
context - or even stay '', for example if there's nothing else in the
paragraph - or turn into <b> if followed by another '. The same is true for the
table syntax, I suppose - it becomes very hard to parse the wiki-style tables
when you at the same time try to respect HTML markup.
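
A minimal sketch of that context dependence, in Python. This is a toy model and not the MediaWiki parser: the `italicize` function is hypothetical, it ignores the ''' bold case entirely, and the "leave a lone '' untouched" rule is just one reading of the comment above.

```python
def italicize(line: str) -> str:
    """Toy model: alternate '' between <i> and </i>; leave a lone '' alone."""
    parts = line.split("''")
    if len(parts) == 2:
        # Exactly one '' marker: nothing to pair it with, so keep it literal.
        return line
    out = [parts[0]]
    is_open = False
    for part in parts[1:]:
        # The same two characters become an opening or a closing tag
        # depending on what came before them on the line.
        out.append("</i>" if is_open else "<i>")
        is_open = not is_open
        out.append(part)
    if is_open:
        out.append("</i>")  # close a dangling <i> at end of line
    return "".join(out)

print(italicize("a ''b'' c"))
print(italicize("''"))
```

The point is only that '' has no fixed HTML equivalent, whereas a literal <i> always means the same thing.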

And yes, this probably should be documented somewhere.

ayg wrote:

I'm pretty sure that it's not possible to open an element with wikimarkup and
close it with HTML or vice versa, anywhere, so the analogy to '' is perhaps not
apt. All the wikitext parser has to do is ignore stuff inside tables that doesn't
look like table rows/cells/etc., and let the sanitizer/Tidy deal with it if it
isn't actually table rows/cells/etc. On the other hand, trying to match stuff
like "''Foo</i>bar''baz" would require significantly complicating the wikitext
parser (well, the wikitext regex replacements :P).

It can be convenient to mix wikitables and HTML tables. For instance, lots of
pipes might occur (or potentially occur, for a template) somewhere inside a
table cell, and you avoid any problems with those by using HTML markup for that
one cell.

EN.WP.ST47 wrote:

Tested on live: the code still works on Wikipedia at 1.17wmf1, and still does not work on a local wiki at 1.16.2.

I don't see this bug in the latest master version of MediaWiki, with or without Tidy enabled.

Everything inside <pre>...</pre> is HTML escaped.
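
As a rough illustration of that last point, the following Python sketch escapes whatever sits between <pre> and </pre>. It is an assumption-laden stand-in, not MediaWiki code: `escape_pre` and the `PRE` pattern are made up here, and real <pre> handling (attributes, nesting, nowiki) is more involved.

```python
import html
import re

PRE = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)

def escape_pre(wikitext: str) -> str:
    """Escape HTML special characters inside each <pre>...</pre> block."""
    return PRE.sub(
        lambda m: "<pre>" + html.escape(m.group(1), quote=False) + "</pre>",
        wikitext,
    )

rendered = escape_pre("<pre><td>cell</td></pre>")
print(rendered)
```

Under this model a <td> written inside <pre> is shown literally rather than treated as a table cell.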