
Parser generates malformed HTML with a list inside a table
Open, Low, Public

Description

Author: ioeth.trocolar

Description:
It has been brought to my attention at http://en.wikipedia.org/wiki/Wikipedia:TW/BUG#TW-B-0250 that Twinkle (and [[WP:FRIENDLY|Friendly]] as well, of course) is having problems editing pages that use editnotices. I think I've tracked the problem down to malformed XML (HTML) being returned from the server, although I cannot figure out what the cause would be other than a bug in the MediaWiki software. That being the case, I'm coming here to confirm. I'll use the talk page of User:Amalthea (http://en.wikipedia.org/wiki/User_talk:Amalthea) to demonstrate what I'm talking about.

First, have a look at http://en.wikipedia.org/wiki/User_talk:Amalthea/Editnotice. Pretty standard, right? And when you do a "View Page Source" have a look at the HTML generated by the {{tmbox}} template, in particular at the end of the unordered list (<ul> tag). Nothing wrong there.

Now, have a look at the same section of HTML when you edit Amalthea's talk page (http://en.wikipedia.org/w/index.php?title=User_talk:Amalthea&action=edit). What I'm seeing is that the </td> tag from the {{tmbox}} template is somehow being placed before the </li></ul> tags that close the unordered list. Immediately following the </li></ul> line is a line with </tr> to close the table row. This is clearly incorrect and is, I believe, causing the XML parser to bomb out when doing Twinkle or Friendly functions to pages where this occurs.

If I just use one line of plain text in an editnotice, the corresponding HTML on the page where it is displayed is correct. However, if I use a bulleted list, the MediaWiki software incorrectly places the </div> tag exactly as it was doing to the </td> tag in the previous example. It seems that there might be a problem specifically with unordered lists in editnotices. Have a look at editing my user talk page (http://en.wikipedia.org/w/index.php?title=User_talk:Ioeth&action=edit) for an example.
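To make the failure mode concrete, here is a minimal sketch of why a strict XML parser gives up on the misnesting described above. The two fragments are hypothetical, modelled on the description (a `</td>` emitted before `</li></ul>`), and are not copied from the live pages; Twinkle itself uses a browser-side XML parser, so Python's `xml.etree.ElementTree` only stands in for "any strict XML parser" here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments modelled on the report, not copied from the live page.
WELL_FORMED = "<table><tr><td><ul><li>A</li><li>B</li></ul></td></tr></table>"
# Same content, but </td> is emitted before the list is closed.
MISNESTED = "<table><tr><td><ul><li>A</li><li>B</td></li></ul></tr></table>"


def parses_as_xml(fragment: str) -> bool:
    """Return True if a strict XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(WELL_FORMED))
print(parses_as_xml(MISNESTED))
```

A browser's tag-soup HTML parser recovers from the second fragment silently, which is presumably why the problem is only noticed by tools that parse the page as XML.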

Details

Reference
bz17486

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 10:28 PM)
bzimport set Reference to bz17486.
bzimport added a subscriber: Unknown Object (MLST).

rockmfr wrote:

Reduced test case:
http://en.wikipedia.org/wiki/MediaWiki:Editnotice-2-RockMFR-sandbox
http://en.wikipedia.org/w/index.php?title=User:RockMFR/sandbox&action=edit

Is Tidy not enabled on editnotices?

I think we can work around the original problem on enwiki by changing the spacing at MediaWiki:Editnotice-3.

rockmfr wrote:

(In reply to comment #2)
Actually, in this case, we would probably need to add spacing at Template:Tmbox.

ayg wrote:

(In reply to comment #1)

Is Tidy not enabled on editnotices?

I think it's not, in fact, since IIRC they're inserted using JavaScript to avoid caching.

Amalthea.wikimedia wrote:

FWIW, this also happens with other MediaWiki messages, like with http://en.wikipedia.org/w/index.php?title=MediaWiki:Newarticletext&oldid=277222693
Similar setting there, a list inside a table cell. This makes me think that we deliver quite a lot of pages with invalid HTML.

Simetrical, only some specialized editnotices are placed through JavaScript when it can't be done through the MediaWiki system, currently only on disambiguation pages I think, and there seem to be plans for BLP pages.

MediaWiki: messages are not tidied, and it is expected that those who modify them don't create invalid HTML. Also, isn't this bug unnecessary after r48276?

Amalthea.wikimedia wrote:

No, and No.

First, r48276 only removed per-page edit notices from some namespaces. We still have per page notices in the namespaces that allow subpages, we have namespace-wide editnotices, and we have user-space editnotices.

Second, this bug is about the fact that MediaWiki syntax can create invalid XHTML. This doesn't usually shine through since articles are tidied, but if you put a list inside a table cell

{|
|
*A
*B
|}

into either an editnotice or a MediaWiki message, it will create invalid XHTML, and it isn't tidied up.
MediaWiki syntax shouldn't ever do that (not even if the syntax is used incorrectly). In my opinion, the tags used in MediaWiki are also part of the MediaWiki language, and one shouldn't be able to create invalid XHTML through them either. We have this layer of abstraction for a reason.

davidg wrote:

This might be related to the problem that we cannot put tables inside unordered lists. There's probably a bug for that already, but here goes:

This usually doesn't work:

<pre>

* List item 1. {{some template with a table in}}
* List item 2.

</pre>

Probably since it is equivalent to this, which also doesn't work:

<pre>

* List item 1. {| class="wikitable"
|-
| Cell 1.
|}
* List item 2.

</pre>

Or this, which also doesn't work:

<pre>

* List item 1. <table class="wikitable">
<tr>
<td> Cell 1. </td>
</tr>
</table>
* List item 2.

</pre>

But if using HTML wikimarkup and only using a single line of code in the cell, then it works:

<pre>

* List item 1. <table class="wikitable"><tr><td> Cell 1. </td></tr></table>
* List item 2.

</pre>

But most templates are not a single line, and we cannot use wiki table markup to build a table on a single line. That is, this never works:

<pre>
{| class="wikitable" |- | | Cell 1. |}
</pre>

(In reply to comment #7)

This might be related to the problem that we can not put tables inside
unordered lists.

Not really.

I cannot use the "rollback" or "rollback vandal" functions on Bengali Wikipedia. When I do, I get the following message: "Reverting page: couldn't grab element "editform", aborting, this could indicate failed response from the server." Apologies if this has been reported before or elsewhere.

<pre>
{|class="wikitable" |- || Cell 1.1 || Cell 1.2 |- || Cell 2.1 || Cell 2.2 |}
</pre>

If only this could work as intended! This has been a limitation for a long time. The nightmare of newlines in wikitables should end sometime, notably because the above syntax is not ambiguous about the role of each pipe delimiter. (Note that I corrected your example to use double pipes where appropriate to start ALL cells that are on the same wiki-syntax line; the fact that the wiki syntax uses a single pipe for cells starting at a newline just means that newline plus single pipe is equivalent to a double pipe.)

But there still remains a problem with header cells: they start with exclamation marks, whose effect extends over the whole line (even if other cells on the same wiki line are separated by pipes instead), but should not extend past the row. If a single-line syntax for wikitables (without newlines) is adopted, every cell that must be a header should begin with its own exclamation marks. So:

<pre>
{|class="wikitable"
|-
! HeadCell 1.1 || HeadCell 1.2
|-
| Cell 2.1 || Cell 2.2
|}

</pre>

could be safely converted to single line with:

<pre>
{|class="wikitable" |- !! HeadCell 1.1 !! HeadCell 1.2 |- || Cell 2.1 || Cell 2.2 |}
</pre>

and not:

<pre>
{|class="wikitable" |- !! HeadCell 1.1 || HeadCell 1.2 |- || Cell 2.1 || Cell 2.2 |}
</pre>

(because HeadCell 1.2 should no longer be a header cell but a normal cell)

What I suggest: if the line containing "{|" further contains a "|-" row start tag (or even just "||" without the first row start tag, or "|+", which initiates a table caption), the rest of the table will be interpreted using the single-line syntax. Newlines will no longer be significant for discriminating between single and double pipes, or between single and double exclamation marks; they will still be valid in the content, but will be interpreted as a regular paragraph continuation or separator within the cell.

So:

<pre>
{|class="wikitable"|+

caption |- || Cell 1.1 || Cell
1.2 |- || Cell 2.1 || Cell 2.2 |}
</pre>

will be valid, but the "caption" will be embedded within a <p> paragraph (because of the double newline, as in normal wiki syntax), and "Cell 1.2" will be part of the same HTML block, just continued on the next line.
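The detection rule suggested above (a line with "{|" that also contains "|-", "||" or "|+" switches the table to single-line parsing) can be sketched roughly as follows. This is only an illustration of the proposal, not anything the MediaWiki parser does; the function name `opens_single_line_table` is made up for this example, and the check ignores complications such as pipes inside attribute values.

```python
import re

# Markers that, per the proposal, select single-line parsing when they appear
# on the same line as the opening "{|": row start, cell separator, caption.
_SINGLE_LINE_MARKERS = re.compile(r"\|-|\|\||\|\+")


def opens_single_line_table(line: str) -> bool:
    """Return True if the line opens a table and selects the proposed single-line syntax."""
    start = line.find("{|")
    if start == -1:
        return False  # no table opener on this line
    rest = line[start + 2:]  # everything after the "{|" opener
    return _SINGLE_LINE_MARKERS.search(rest) is not None


print(opens_single_line_table('{|class="wikitable" |- || Cell 1.1 || Cell 1.2 |}'))
print(opens_single_line_table('{|class="wikitable"'))
```

Under this rule an ordinary multi-line table, whose first line carries only attributes, keeps its current meaning, which is the compatibility argument made for the proposal.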

Numbered and unnumbered lists (with "*" and "#"), as well as indented blocks (starting with ":"), also need fixing to allow attributes and safe continuation on the next line. I also suggest adding support for the "||" separators (only one pipe: no attribute present; two pipes: there are attributes, with the text after them; the item text can be continued on the following lines in both cases). The "line-safe" list items and indented blocks would be recognized ONLY if there is NO space between the "*/#/:" initiator and the pipe:

<pre>
*|attribute1="value1"| bulleted text
|attribute2="value2"| bulleted text
#|attribute3="value3"| numbered text (1.)
#|
numbered text (2.)
#|numbered
text (3.)
**#:|attributes...|indented text

:|attribute=value...| Indented text which
continues on the next line.

:|| Another indented text which
continues on the next line.
</pre>

In addition, it should also be possible to create single line lists with the same syntax (where a pipe immediately follows the initial list tag), with more free separation in the wiki code:

<pre>
#|| item 1 || item 2 ||start="4"| item 4

item 5item6

</pre>

The list above would only be terminated by a double newline (or by another reserved character such as = for ==section headings==), and the support for HTML attributes in lists would allow easy renumbering, or would allow specifying alternate numbering systems.

The same logic should also apply to section headings, to add support for attributes (there must be no space between the last "=" of the init tag, and the first pipe), and also to allow them to be continued on multiple lines, or terminated by the "=" signs on a separate line, without having to write HTML tags; this would also preserve the possibility of displaying the [edit] link, which disappears when <h1>..<h6> elements are explicitly used, but also the possibility to remove it selectively where appropriate for a single heading (by using a custom MediaWiki attribute using the syntax of XML/XHTML namespace prefixes, here "mw:"). For example:

<pre>

| attribute=value | Section heading

|| Subsection

heading ===
This subsection is editable separately.

| mw:noedit | Another subsection

This Subsection is not editable separately.

</pre>

Well, the above comments on improving the syntax of lists, indented blocks and section headings should probably be part of a request for enhancement (because they have possible compatibility issues in some articles).
But the improvement for tables should still be part of this bug, as it doesn't cause compatibility issues: pipes have no function in the current first line of a table starting with "{|".

Another alternative for single line tables would be to recognize them only if they start with "{||" without any space within the 3 characters, before its attributes, in which case tables would be recognized even in the middle of a wiki-line.

cbm.wikipedia wrote:

Still causing problems:

http://en.wikipedia.org/w/index.php?title=MediaWiki:Newarticletext&diff=336851551&oldid=329585835

Is there a way to tell which messages could be run through tidy? It's a little much to expect naive local wiki admins to verify XHTML validity themselves.

We should provide them an error when they produce invalid XHTML if tidy is disabled for the messages.

ayg wrote:

What, like "this wikitext happened to produce malformed XML because our parser is broken, so please figure out the necessary hacks to make our parser decide to output well-formed content because we can't be bothered to fix it"? It's not reasonable at all to have a fatal error here, anyway, because almost all admins won't care in the slightest about XML well-formedness, and they should have the right to place their own convenience over well-formedness. Why don't we run Tidy on these fragments at save time instead of output time, maybe?

(The tools in question should switch to using html5lib instead of XML tools, incidentally, at least if it's available for their language yet. It's meant to be a drop-in replacement for common XML libraries and will make errors like these nonfatal.)

(In reply to comment #14)

What, like "this wikitext happened to produce malformed XML because our parser
is broken, so please figure out the necessary hacks to make our parser decide
to output well-formed content because we can't be bothered to fix it"? It's
not reasonable at all to have a fatal error here, anyway, because almost all
admins won't care in the slightest about XML well-formedness, and they should
have the right to place their own convenience over well-formedness. Why don't
we run Tidy on these fragments at save time instead of output time, maybe?

We could cache a tidied html message, but that might produce problems with the
substitutions. We would need a different caching for plural, {{PAGENAME}} etc..

I think we do run Tidy at "save time", since after saving the admin will go to
the view of Mediawiki:whatever.
If when they look at it they got a message "This message produces malformed xml
when showed directly, please fix" that would be an improvement over having^W
expecting admins manually check by themselves the xhtml wellformedness.

If they don't know how to fix it, they can ask somewhere like
[[Wikipedia:Village_pump_(all)#Technical]] or [[Wikipedia:MediaWiki_messages]].
At least they can remember that message they broke when their layout is broken
because they forgot a </div>
Or have it easier to find by others.

ayg wrote:

You're right, save-time tidying won't work. If we put up a warning for well-formedness errors, we'd have to make sure a) it's easy to ignore, b) it only hits the right messages. There are almost certainly some messages where an XML fragment is perfectly reasonable to have -- a tag could legitimately be opened in one message and closed in another. So it would require some work.

As I say, the long-term solution is to avoid XML like the plague. It's just too fragile.
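For illustration, the non-fatal warning discussed in this exchange could look roughly like the sketch below. It assumes the rendered HTML of a message is already available as a string; the function name `well_formedness_warning` and the warning text are invented for this example and are not MediaWiki code. A real check would also have to cope with HTML named entities such as `&nbsp;`, which a plain XML parser rejects, and with messages that deliberately open a tag closed by another message.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_warning(rendered_html: str) -> Optional[str]:
    """Return a warning string if the fragment is not well-formed XML, else None."""
    try:
        # Wrap in a dummy root so fragments with several top-level elements parse.
        ET.fromstring(f"<root>{rendered_html}</root>")
    except ET.ParseError as err:
        # Report the problem instead of raising: the editor may choose to ignore it.
        return f"This message produces malformed XML when shown directly: {err}"
    return None


print(well_formedness_warning("<ul><li>A</li></ul>"))
print(well_formedness_warning("<div><ul><li>A</div></li></ul>"))
```

Returning a string rather than raising keeps the check easy to ignore, which is the first requirement stated above.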

I was thinking in a box below the message text. I don't think we have messages where
unclosed tags are reasonable. At most, it would be an unsupported hack.

Matching only wikitext messages is an important point. It shouldn't warn on HTML
messages or on JavaScript. Attached to that identification, the editing admins could
also get a message on edit telling them the available syntax.
I think that knowledge is already used for translatewiki?

As I say, the long-term solution is to avoid XML like the plague. It's just
too fragile.

The problem is not XML but having a syntax which we are unable to map into
the target language.

cbm.wikipedia wrote:

Re comment 13 "We should provide them an error when they produce invalid XHTML if tidy is
disabled for the messages."

That is made substantially more difficult by the use of parser functions inside messages. The error I reported in comment 12 only showed up for nonexistent pages in the Talk: namespace, which is why it sat unfixed for a long time. There's not going to be any practical way to test every possible control flow path through the message for XML validity.

Re comment 17, "As I say, the long-term solution is to avoid XML like the plague."

The proposed solution of using html5lib on the client side would help, if it accepts slightly malformed input. But of course the sort of malformed code we are serving wouldn't be well-formed HTML5 either (that is, I doubt it would pass a validator). A better long-term solution would be to fix our system to have validatable output. Of course this is not a simple thing to achieve, which is why we usually use tidy to force the output to be correct.

No offense intended, but the code on [[MediaWiki:Newarticletext]] is insane, from the first #ifexist
trying to find if you forgot a closing bracket, to the anon talk page detection
Although you can fool it with http://en.wikipedia.org/w/index.php?title=User_talk:007/2&action=edit :)

The parser does strange stuff with the default[1] message. If there is only one line break in it, things are fine. If there are more than one line breaks in it, things start to get funky. See bug 19226.

Furthermore, anything you guys choose to do to the default messages, including adding parser functions, tables and what not, may or may not work. If it doesn't work, find a different solution.

[1] Default content for MediaWiki:Newarticletext:
You have followed a link to a page that does not exist yet.
To create the page, start typing in the box below (see the [[{{MediaWiki:Helppage}}|help page]] for more info).
If you are here by mistake, click your browser's '''back''' button.

ayg wrote:

(In reply to comment #17)

I was thinking in a box below the message text. I don't think we have messages
where
unclosed tags are reasonable. At most, it would be an unsupported hack.

We have so many messages that I wouldn't bet on that. But we can raise the warning anyway, if it's only a warning.

Matching only wikitext messages is an important point. It shouldn't warn on
html
messages or a javascript.

Why not on HTML messages too, if they're not tidied?

The problem is not XML but having a syntax which we are unable to map into
the target language.

XML is a necessary although not sufficient condition for this problem to be worth worrying about. If everyone used HTML5 parsers, the consumer would fix the markup in a standard way and it wouldn't be a big problem (although ideally we'd fix it anyway).

(In reply to comment #18)

The proposed solution of using html5lib on the client side would help, if it
accepts slightly malformed input.

html5lib will parse an arbitrary stream of bytes into a DOM. In practice, it will fix simple nesting errors invisibly -- <div><span>Foo</div></span> will become <div><span>Foo</span></div>. You can test out an HTML5 parser by getting a copy of Firefox 3.6 or later, setting html5.enable to true in about:config, loading the URL

data:text/html,<!doctype html><div><span>Foo</div></span>

and inspecting the resulting DOM from Firebug. This transformation is standardized in HTML5, so the DOM output is well-defined, and it's meant to be a close match to what all browsers already do anyway. Of course the markup won't validate, but that's not as big an issue as breaking bots that users rely on.

Another thing we could consider eventually is using html5lib ourselves instead of Tidy. If it's fast enough (there's a C++ version that Mozilla uses), we could pass all our output through it to parse and reserialize. It's kind of silly for us to do this when it could more easily be done by the client, though.
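As a rough illustration of the kind of repair described here, the sketch below fixes simple misnesting with a tag stack on top of Python's standard `html.parser`. It is deliberately much simpler than the HTML5 parsing algorithm and is not html5lib: it drops attributes, does not know about void elements such as `<br>`, and has none of the special handling HTML5 gives to formatting elements. The names `NestingFixer` and `fix_nesting` are made up for this example.

```python
from html import escape
from html.parser import HTMLParser


class NestingFixer(HTMLParser):
    """Re-serialize a fragment, closing open elements in stack order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # pieces of the repaired output

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")  # attributes dropped for brevity

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching open element: ignore it
        # Close everything opened after the matching element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def fix_nesting(fragment: str) -> str:
    fixer = NestingFixer()
    fixer.feed(fragment)
    fixer.close()
    # Close anything still open at the end of the input.
    while fixer.open_tags:
        fixer.out.append(f"</{fixer.open_tags.pop()}>")
    return "".join(fixer.out)


print(fix_nesting("<div><span>Foo</div></span>"))
```

For the example given above, this turns `<div><span>Foo</div></span>` into `<div><span>Foo</span></div>`, which matches the result described for an HTML5 parser in this simple case.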

tl;dr
If this is only about lists inside tables, could someone provide a test case at translatewiki.org? All previous ones have been deleted. I think this was fixed.

No, this is about tables inside lists (or any wiki feature that requires a single-line syntax: this applies to unnumbered lists with "*" converted to ul/li, numbered lists with "#" converted to ol/li, and definition lists starting with ";" and ":" converted to dl/dt/dd).
Lists inside tables have no problems.
I really think that wiki tables should have a syntax that removes the single-line constraint: I suggested a syntax starting with "{||" instead of just "{|" to mean that everything up to the closing "|}" would be a table, that all cells MUST be initiated by "||" for td or "!!" for th, and that new rows start with "|-", independently of where newlines are found in the wiki source, so that both single-line and multi-line wiki syntax for tables would be accepted and recognized, such as:

* list item
* another list item containing a table: {||class="wikitable" |-valign="top" !!scope="row"|header cell of 1st row ||align="right"|1st data cell |-valign="top" !!scope="row"|header cell of 2nd row ||align="right"|2nd data cell |}
* another list item containing a table broken multiple lines: {||class="wikitable"
|-valign="top" !!scope="row"|header cell of

1st row ||align="right"|
1st data cell

|-valign="top" !!scope="row"|header cell of

2nd row ||align="right"|
2nd data cell

* another list item containing a table: {||class="wikitable"
|+align="center"|Table caption
|-valign="top" !!scope="row"|header cell ||align="right"|data cell
|-valign="top" !!scope="row"|header cell ||align="right"|2nd data cell
|}

When parsing this syntax, all cell contents are parsed as if they were already at the beginning of a new block, meaning that line breaks only create new paragraphs where there are two consecutive line breaks. If there is no double line break, the cell content will not be converted into a paragraph (with the "p" element); however, if there is any double line break in that content, all paragraphs including the first one will be converted to "p" elements in that cell content. Whitespace in cell contents or captions should continue to be compressed, and single line breaks converted to compressible whitespace.

In this syntax, the "!!" cell separator would not convert cells starting with "||" on the same line into header cells (I think this is a bad inconsistency of the wiki syntax, which also forces us to break the line when a header cell is followed by a data cell, and does not help make the syntax compact and easily readable, as in the last list item above).

The parser should also take care NOT to break a list item where it contains a new explicit HTML block element that requires an explicit closure (for example div, blockquote, pre, and HTML or wiki tables). This requires a change in how wiki lines starting with "*", "#", ";" and ":" are parsed: the parser will have to read further lines to find the end of the embedded elements.

Also, there's still the lack of support for adding attributes to list items and to their containing list element.

Why can't we have the syntax like:

{* attributes for the unnumbered list container (ul)
* simple list item
* another list
item
* attributes for the list item| bulleted list item content whose
content includes...

multiple paragraphs...
* attributes for the list item| bulleted list item content whose content
is split on multiple lines (whitespace compressed by the parser)
*}

Note: attributes on list items, separated by "*"-lines are optional: the first "|" delimits where attributes are terminated before the visible content. We then gain also more freedom in adding empty lines between list items, because whitespaces at end of an item are always trimmed.

Idem for numbered lists using "{#" and "#}":

{# attributes for the numbered list container (ol)
# attributes for the list item| numbered list item content
# attributes for the list item| numbered list item content
#}

Or for definition lists using "{;" and ";}":

{; attributes for the definition list container (dl)
; attributes for the defined term| defined term content
; attributes for the list item| numbered list item content
;}

This would also simplify the design of lots of templates, with less problems when transcluding them in lists (notably when a text-like template requires some special formatting to support some advanced typography, or special layouts emulating some text features using positioned block elements).

Finally, I have long been waiting for support for styling column groups ("col", "colgroup") and row groups ("thead", "tbody", "tfoot"), notably because it greatly simplifies the editing of long or large tables, and also offers additional accessibility features (such as *auto* scrollable column groups or row groups when the display size is narrow, while preserving the alignment of fixed header/footer cells on the sides) and allows printing tables with the fixed header cells duplicated on each page.

Amalthea.wikimedia wrote:

Tim Starling, DieBuche: Part of the issue was resolved, the wikicode from Comment 6 works now. I can still reproduce it with the old editnotice mentioned in the OP though. I've placed a minimal test case at http://test.wikipedia.org/wiki/MediaWiki:Editnotice-2-Amalthea-bug17486 which can be checked in the page source of http://test.wikipedia.org/w/index.php?title=User:Amalthea/bug17486&action=edit

In brief:

* CCCCCC
* DDDDDD

results in

<ul><li> CCCCCC
</li><li> DDDDDD

Written like this it's understandable that it may go wrong, it is less obvious if the table is built by a template with something like <td>{{{text}}}</td> like w:en:Template:fmbox does.

Philippe Verdy: tldr. Generic wiki syntax improvements are certainly off topic here though.

This is really invalid HTML too so changing the subject.

Note that the intent now is to deprecate the support for valid XHTML, but we still have a problem for HTML5 with the (more lenient) HTML5 parsing rules.

We still need to find a way to ease the integration of templates that may generate tables (or similar) within our old wiki syntax based on single-line prefixes (this concerns our syntax for numbered lists, bulleted lists, definition lists/indented blocks, tables, as well as horizontal rules, based on the first character of lines in "{|!-;:#*", as well as double newlines for creating new paragraphs).

I still think that we should have an alternate way to avoid the syntactic limitations introduced by newlines and specific parsing at the beginning of lines, to allow more flexibility and fewer problems when parsing them.

Our current syntax for tables is the most problematic one, forcing us to use ugly syntax in templates and causing nightmares when we want to integrate them (e.g. in navigation templates and infoboxes).

But bulleted lists and numbered lists still suffer from the lack of support for adding attributes (e.g. in numbered lists we still cannot set the initial number, and we cannot specify classes or styles).

We should be able to use:

  • item1

*|attributes...| item2

and also allow generation using an explicit list initiator:

{*attributes for the list...

item1
attributes for item2item2item3

item4
}

as if it were a table containing a single row where each item is a cell, except that newlines are treated here like other whitespace, so it is equivalent to:

{*attributes for the list...

item1
attributes for item2item2
item3
item4

}

or to:

{*|attribute...||item1|attributes for item2|item2||item3||item4}

This last syntax shows that it allows easy syntaxes in templates. Such syntax will remain integrable into another list, or indented block or in a table cell, for example here with embedded numbered lists:

{#

item1
item2
{#
||item2.1
||item2.2
}
item3
item4

}

In such syntax, all newlines are treated like whitespaces, and whitespaces are trimmed, allowing free form for indenting in wiki sources, and easier syntaxes for templates. The previous example could as well be compacted into a single line, with all "cells" fully trimmed:

And optional attributes are specifiable everywhere if needed (between the doubled pipes in this example).

The alternative is to just allow the HTML5 syntax and improve its parsing in MediaWiki so that it can really be used everywhere our old syntax has problems (HTML5 does not force us to close all tags, which is good for wiki editing; even if it is a bit verbose, that should not be a problem for creating complex templates like infoboxes, designed by a few competent contributors). But MediaWiki still does not treat HTML tags like its old "simplified" syntax for the equivalent tags.

Example of limitation, these are not recognized as one list:

<ul style="...">
* item1
* item2
</ul>

Second example with similar limitation:

* item1
<li> item2
* item3 <li> item4

If those limitations were solved, we would have fewer problems generating content by mixing the best of the HTML syntax, where it solves problems in templates, with the "simpler" inherited wiki syntax. We would also no longer have to suffer the current nightmare of newlines.

(In reply to comment #28)

Note that the intent now is to deprecate the support for valid XHTML, but we
still have a problem for HTML5 with the (more lenient) HTML5 parsing rules.

XHTML 1.0 is dead, there is no deprecation, we do not support XHTML 1.0 at all anymore.

However we are NOT deprecating well-formed XML output. We still intend for parser and interface output to be well-formed XML when $wgWellFormedXml = true; is set. We also try to support XHTML5 when you set $wgMimeType = 'application/xhtml+xml';. And even when well formed XML is false we still want to output non-malformed HTML.

And THIS bug is about mixed table/list WikiText creating invalid output that isn't tidied up by the parser when Tidy is not enabled.

NOT about some syntax improvements to WikiText you want. Please stop talking about those here and create a real bug for them. You've been warned about this already.

My comment was on topic simply because the malformed output is caused by incorrect specification about how distinct content elements can be safely embedded into each other.

And the whole topic is about this issue: the basic wiki syntax interacts very badly with the HTML (or XML) syntax based on *explicit* closure of tags (or wiki syntaxes). The current parsing rules contradict each other, and we constantly have to find tricks to avoid these issues and incorrect output (which may parse as valid HTML5 but is not what was intended, and will be wrong XHTML5 anyway).

Note that I did not discuss XHTML 1.0. HTML5 is still intended to have a valid XHTML representation, so that XHTML5 can be parsed by either an XML parser or an HTML5 parser (generating a compatible DOM structure with either parser).

All our issues are in fact created when inserting content from utility templates (this reduces their reusability or forces them to use very ugly tricks, or ugly parameters where they are used, and this does not make them simpler to use in articles).

I maintain that wiki syntaxes should be fully integrated with the HTML syntax under the same content model (offering to users the choice between them, using HTML where the wiki syntax is too limited, but without breaking parsing rules; the wiki syntax should then only be a purely *local* shorthand of the HTML syntax, everything being generated with knowledge of the HTML DOM, even if the syntax generated will also be compatible with XML/XHTML parsers).

(In reply to comment #30)

My comment was on topic simply because the malformed output is caused by
incorrect specification about how distinct content elements can be safely
embedded into each other.

And the whole topic is about this issue: the basic wiki syntax interacts very
badly with the HTML (or XML) syntax based on *explicit* closure of tags (or
wiki syntaxes). The current parsing rules contradict each other, and we
constantly have to find tricks to avoid these issues and incorrect output
(which may parse as valid HTML5 but was in fact not the one intended and will
be wrong XHTML5 anyway).

Specification and mixing custom WikiText syntaxes with HTML are irrelevant. We're supposed to fail silently when bad WikiText is used and output valid HTML even when given crap, not output malformed markup.

This WikiText:

* List item 1. <table class="wikitable">
<tr>
<td> Cell 1. </td>
</tr>
</table>
* List item 2.

Outputs this:
<ul><li> List item 1. <table class="wikitable">
</li></ul>
<tr>
<td> Cell 1. </td>
</tr>
</table>
<ul><li> List item 2.
</li></ul>

There's a </li></ul> right after the <table class="wikitable">, which leaves the <tr> and <td> elements outside of a table. That's invalid.
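To make the mismatch concrete, here is a minimal sketch in Python (not MediaWiki code) that walks the output above with the standard html.parser module and reports the first close tag that does not match the most recently opened tag. The names NestingChecker and first_nesting_error are made up for this illustration, and the check assumes strict XML-style nesting: it knows nothing about HTML5 optional end tags or void elements.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Keep a stack of open tags and remember the first mismatched close tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, innermost last
        self.first_error = None  # (closing tag, innermost open tag, line number)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.first_error is not None:
            return  # only the first problem is of interest here
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            innermost = self.stack[-1] if self.stack else None
            self.first_error = (tag, innermost, self.getpos()[0])


def first_nesting_error(markup):
    """Return None if every close tag matches, else the first mismatch."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.first_error


# The parser output quoted above, copied verbatim.
OUTPUT = """<ul><li> List item 1. <table class="wikitable">
</li></ul>
<tr>
<td> Cell 1. </td>
</tr>
</table>
<ul><li> List item 2.
</li></ul>
"""

print(first_nesting_error(OUTPUT))
```

For this input the reported mismatch is the </li> that arrives while <table> is still the innermost open element, which is exactly the point where an XML parser gives up.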

This bug has nothing to do with integrating the WikiText list syntax and HTML table markup. The fix for this issue is simply making sure that the garbage we output for this invalid input is still well-formed markup.

Try inserting that garbage output back into a wiki page:
<ul><li> List item 1. <table class="wikitable">
</li></ul>
<tr>
<td> Cell 1. </td>
</tr>
</table>
<ul><li> List item 2.
</li></ul>

This is essentially the same garbage that the user gives us. But this time the parser outputs:
<ul><li> List item 1. <table class="wikitable">
&lt;/li&gt;&lt;/ul&gt;
<tr>
<td> Cell 1. </td>
</tr>
</table>
<ul><li> List item 2.
</li></ul></li>
</ul>

While there is a minor validity issue in the fact that we have a string of text inside of a <table> but outside of a cell -- fixing that would probably be a separate bug -- that aside the markup is still well-formed XML. Tags are properly paired up, there is the same number of each, and they are closed in the correct order. When output into an XHTML5 page parsed with an XML parser this will work and won't give you an XML parse error.
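The difference between the two outputs can be checked with any XML parser. The following is a rough sketch using Python's xml.etree.ElementTree, not anything MediaWiki itself runs. The helper name is_well_formed is invented here, and the fragments are wrapped in a <div> only because an XML document needs a single root element.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in one root."""
    try:
        ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError:
        return False
    return True


# Output for the original WikiText: the list is closed inside the open table.
FIRST_OUTPUT = """<ul><li> List item 1. <table class="wikitable">
</li></ul>
<tr>
<td> Cell 1. </td>
</tr>
</table>
<ul><li> List item 2.
</li></ul>
"""

# Output when that markup is fed back in: the stray close tags are escaped.
SECOND_OUTPUT = """<ul><li> List item 1. <table class="wikitable">
&lt;/li&gt;&lt;/ul&gt;
<tr>
<td> Cell 1. </td>
</tr>
</table>
<ul><li> List item 2.
</li></ul></li>
</ul>
"""

print(is_well_formed(FIRST_OUTPUT))   # False
print(is_well_formed(SECOND_OUTPUT))  # True
```

This only tests well-formedness. It says nothing about HTML validity, so the text sitting directly inside the <table> in the second output still passes.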