
Multiline tags in lists should be output more intelligently
Open, Low, Public

Description

Parsertest

Related to bug 5497:

*Some enumeration<div style="clear: both; color: red">
This text should be red
</div>

produces
<ul><li>Some enumeration<div style="clear: both; color: red">
</li></ul>
<p>This text should be red
</p>
</div>

"fixed" by tidy the wrong way:
<ul><li>Some enumeration<div style="clear: both; color: red"></div>
</li></ul>
<p>This text should be red
</p>

which affects several templates.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=1581

Attached:

Details

Reference
bz9996

Related Objects

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:40 PM
bzimport set Reference to bz9996.
bzimport added a subscriber: Unknown Object (MLST).

michaeldaly wrote:

Your first example is not legal HTML. The fix by Tidy is correct.

You cannot overlap tag starts and ends:

<tagA>
<tagB>
</tagB>
</tagA>

is legal.

<tagA>
<tagB>
</tagA>
</tagB>

is not legal
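
The nesting rule above can be checked mechanically with a stack. A minimal sketch in Python, assuming a simplified tag pattern (the name is_properly_nested is made up for illustration, and void elements such as <br> are not handled):

```python
import re

# Simplified tag matcher: captures an optional "/" and the tag name.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def is_properly_nested(markup):
    """Return True if every closing tag matches the most recently opened tag."""
    stack = []
    for match in _TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)  # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False  # closing tag that does not match the innermost open tag
    return not stack  # anything still open is also an error


legal = is_properly_nested("<tagA><tagB></tagB></tagA>")
overlapping = is_properly_nested("<tagA><tagB></tagA></tagB>")
```

With the two examples above, the first call reports proper nesting and the second does not.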

ayg wrote:

The illegal behavior was not being requested. See the parser test. It's questionable what behavior is acceptable here, given that attributes may include borders or floats or who knows what. The best solution is to not provide illegal markup in the first place.

Note that parser tests are run with Tidy off, and so parser tests for this are probably pointless. Furthermore, this is an upstream issue, so it should probably stay closed anyway.

The MW output is not legal HTML. The Tidy fix is wrong: it yields legal XHTML, but not the expected output.
Cases that Tidy is able to fix are not so bad, but illegal tags that Tidy screws up are IMHO more important to fix in the parser (as Tidy will stay).

Of course, what should be done is to not produce illegal markup, which is the reason for the parser test.
My view is that when, inside a list, a block-level element such as <div> is found on the same line, the list should be closed before outputting the <div>, i.e. what preg_replace("/(\*|#)(.*)(<div)/i", "\\1\\2\n\\3", $WikiText) would do at the beginning of parsing.
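
As an illustration only, here is a Python sketch of that preprocessing step. It is not MediaWiki code: the helper name split_div_from_list_line is hypothetical, and unlike the preg_replace call the pattern here is anchored to the start of a line (an assumption, so that a stray * or # in running text is left alone).

```python
import re

# Assumption: only lines that START with list markers are rewritten,
# and only the first <div on such a line is moved to its own line.
_LIST_DIV = re.compile(r"^([*#]+)(.*?)(<div\b)", re.IGNORECASE | re.MULTILINE)


def split_div_from_list_line(wikitext):
    """Insert a newline before a <div> that opens on a list line."""
    return _LIST_DIV.sub(r"\1\2\n\3", wikitext)


source = (
    '*Some enumeration<div style="clear: both; color: red">\n'
    "This text should be red\n"
    "</div>"
)
result = split_div_from_list_line(source)
```

After this step the <div> starts on its own line, so the list item ends before the block element begins.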

Any reasons to have lists of one-line <div>s?

brion added a comment. May 22 2007, 2:04 PM

The general issue is that wiki lists are line-based markup, so a <div> that spans multiple lines is not considered legal, and the results are undefined.

Combine that with the ugly multi-pass parser, and it doesn't always come out pretty. :)

ayg wrote:

Probably an ideal solution would be to have multiline tags not terminate the list item until the tag is terminated, as expected. I.e., it should produce <li><div>...</div></li>. This is probably not something anyone wants to implement with the current parser, however.

Phrased that way, this seems to be a duplicate of bug 1581, which is a special case (and the most important one). Probably makes most sense to dupe it to that.
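
The behavior described above (a multiline tag keeps its list item open until the tag is closed) can be modeled with a toy sketch in Python. It assumes that only <div> is tracked and that only single-level * lists exist; collect_list_items is a hypothetical name, and nothing here reflects how the real parser is structured.

```python
import re
from typing import List

_OPEN = re.compile(r"<div\b[^>]*>", re.IGNORECASE)
_CLOSE = re.compile(r"</div\s*>", re.IGNORECASE)


def collect_list_items(wikitext: str) -> List[str]:
    """Group source lines into list items, keeping a line-spanning <div> inside its item."""
    items: List[str] = []
    depth = 0  # <div> elements opened but not yet closed in the current item
    for line in wikitext.splitlines():
        if depth > 0:
            # Still inside an unclosed <div>: the line belongs to the current item.
            items[-1] += "\n" + line
        elif line.startswith("*"):
            items.append(line[1:].lstrip())
        else:
            break  # the first non-list line ends the list in this toy model
        depth += len(_OPEN.findall(line)) - len(_CLOSE.findall(line))
        depth = max(depth, 0)
    return items


items = collect_list_items(
    '*Some enumeration<div style="clear: both; color: red">\n'
    "This text should be red\n"
    "</div>\n"
    "*next item"
)
```

Here the first item holds all three source lines of the <div>, and the second item is "next item".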

Punting this to the new parser Brion has under development.

  • Bug 28691 has been marked as a duplicate of this bug.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

+need-review

Reedy added a comment. Nov 20 2011, 5:34 PM

Killing both patch and need-review

It's a diff adding a parser test, which is there to show the failure. We could commit it now, but then we'd have failing parser tests showing up.

So it only wants committing once this bug is supposed to be fixed.

  • Bug 33918 has been marked as a duplicate of this bug.

(In reply to Aryeh Gregor from comment #5)

Probably an ideal solution would be to have multiline tags not terminate the
list item until the tag is terminated, as expected.

This would also help bug 58429, which hits bug 1115 as well.

Danny_B removed a subscriber: wikibugs-l-list.
Izno added a subscriber: Izno.

Remex outputs the same as Tidy did.

cscott added a comment. Jan 8 2020, 7:18 PM

Implicit ending of list items by newlines is probably too wide-spread in existing wikitext to change now.
But T230683: New syntax for multiline list items / talk page comments and the end of T114432: [RFC] Heredoc arguments for templates (aka "hygienic" or "long" arguments) discuss ways to escape the newline where desired.

It's a shame that the {{!CRLF}} template containing &#xa; does not work anymore :(

T134469 is related but, as @cscott noted, it is not going to be a simple thing to fix and roll out, since it will probably require a whole lot of linting to clean up pages to prevent breakage.

The clear intent in this example is to insert a floating element inside a list item, not outside of it.

There is a limitation here in the wiki syntax for list items (using asterisks at the beginning of a wiki source line): the item is not explicitly terminated by a closing tag. But the floating element is generated by a template.

MediaWiki is unable to "extend" the scope of a list item in its own syntax (marked with asterisks), or of other similar items (starting with colons or semicolons for definition lists, or with hashes for numbered lists), when the item contains other block elements written on the same wiki source line. It just looks for the first newline that follows and then breaks the embedded HTML element.

The only way to solve that is to not use the Mediawiki syntax for list items whose content is not purely inline but embeds block elements (such as "div" or "p", or any explicitly tagged list using "ul", "dl" or "ol" elements, but as well "h1"..."h6" even if they should not be used inside lists).

This is not a problem of "Tidy", but a problem of the Mediawiki source parser which has limitations on its specific syntax for basic list items (i.e. a severe limitation of the MediaWiki syntax).

Unfortunately, this limitation of the legacy MediaWiki syntax affects various pages, including most talk pages that use ":" to indent replies inside pseudo-definition lists (that is, talk pages not using the controversial LiquidThreads extension, which has its own bugs in correctly displaying threads with many replies and in tracking changes). These are not really definition lists, because they usually do not start with a leading ";" line for the defined term (converted to a "dt" element in a "dl" list); they contain only "dd" elements in a broken definition list, which should rather be converted to "blockquote" elements when there's no prior ";".

Maybe we should deprecate the MediaWiki syntax for such cases, and make sure the parser works correctly on source pages that use explicit ul/ol/dl elements containing li/dt/dd elements.
MediaWiki should also evolve to add direct support for the new HTML5 content elements (which are better for semantics than the few legacy HTML4 block/inline elements), as well as support for table elements (notably captions, colgroups and rowgroups).

If we want to keep the legacy MediaWiki syntax for lists, maybe we could:

  • add some "magic keyword" {{__LIST__|....}} that can be used before the 1st list item (on a separate line) or at the start of the content of the 1st item to indicate additional attributes for the list that follows (e.g. the list type, the starting number or number style for numeric lists, or additional style and class), or __LONGITEM__ inside the line of a list item (after one or more leading ":" or ";" or "*" or "#") to instruct the MediaWiki parser NOT to stop parsing the list item at the next newline but to extend it to the content of any embedded block element starting on the same line (but not necessarily terminated there);
  • or just add a generic {{__ATTR__|...}} that could be used to give custom attributes to the other legacy MediaWiki syntaxes that are not properly delimited and still cannot take attributes (including "gallery" items, for example; only the gallery container has syntactic support for attributes).
  • Both would allow creating safe templates that can still be inserted in pages containing the legacy syntaxes.

Adding support for HTML5 content elements (like "section" or "summary") would be useful as well and treated like blocks, not limited to just "p", "div" and "blockquote".

Another possibility would be to define a magic keyword like __TIDY__ that could be used as a "meta-element" inside template to indicate that what is generated by the template will be tidied only in the scope of this template, which would then generate some replaceable hidden element.

The main page including the template would then be tidied with the replaceable pseudo-element still not replaced, but then a final pass would replace the pseudo-element which was already HTML-tidied, almost "as is" as plain HTML without additional Mediawiki reparsing.

The clear intent in this example is to insert a floating element inside a list item, not outside of it.

The last section of T114432 proposes:

:: <<< any
== wiki text ==
markup you like
* including embedded lists
<div style="clear: both; color: red">including html-ish tags</div>
>>>
  • or just add a generic {{__ATTR__|...}} that could be used to give custom attributes to the other legacy MediaWiki syntaxes that are not properly delimited and still cannot take attributes (including "gallery" items, for example; only the gallery container has syntactic support for attributes).

In T230658#5786980 the following is proposed:

:: {{#attr|id=foo|class=bar}}

which would apply to the list item. (In general I'd like to standardize on {{#....}} for new magic words/parser functions, instead of the crazy quilt of different syntaxes used historically; see T204370: Behavior switch/magic word uniformity.)

Another possibility would be to define a magic keyword like __TIDY__ that could be used as a "meta-element" inside template to indicate that what is generated by the template will be tidied only in the scope of this template, which would then generate some replaceable hidden element.

See T114445: [RFC] Balanced templates, which uses {{#balance}} for this purpose (again, T204370: Behavior switch/magic word uniformity).

tl;dr: you're brilliant! These are all fantastic ideas!

:: <<< any
== wiki text ==
markup you like
* including embedded lists
<div style="clear: both; color: red">including html-ish tags</div>
* list continued here
>>>

Not really, because this breaks the bulleted list in two parts. Does not work with multilevel lists:

:: <<< any
== wiki text ==
markup you like
* 1st item
** including embedded sublists
<div style="clear: both; color: red">including html-ish tags</div>
** sublist continued here
* 2nd item
>>>

You may need to nest the heredoc syntax to get the effect you're looking for, if I understand correctly.

Verdy_p added a comment. Edited Jan 10 2020, 7:09 PM

Also note the special case of numbered lists: it is frequently necessary to split a numbered list into several parts, to show only an extract, or to number the items differently (e.g. with letters, with roman numerals, or starting at a value other than 1). This cannot be specified on list items, only as attributes of the list container. MediaWiki provides no syntax for the container of the list itself, only a syntax for individual list items: it infers the presence of containers (ul, ol, dl) from the sequential generation of individual items (li, dt, dd, encoded with the wiki syntax), and then groups them "magically".

List containers should also be stylable (class and style attributes, as well as id for anchors, without the hacks with empty span elements currently used by anchoring templates).

Also, I suppose that "heredoc" is the syntax you show using <<< and >>>. Fine, but where is the doc? Was it announced, or is it a proposal in a specific plugin for MediaWiki tested on some wikis but not deployed on Wikimedia wikis? And it still does not give attributes (style/class/id) to the list containers themselves.

# item 1
# <<< item 2
whatever you put in here doesn't break the list

# even another numbered list >>>
# item 3

Heredoc syntax is in T114432: [RFC] Heredoc arguments for templates (aka "hygienic" or "long" arguments). Not yet implemented, but RFC is approved and @Arlolra's written a first draft patch. Extension to lists is mentioned at the tail end of that RFC and is being discussed in T230683: New syntax for multiline list items / talk page comments.

Styling is currently discussed in T230658: Syntax for list item attributes with a placeholder proposal in T230658#5786980 which would look like this:

# {{#attr:value=3}} This list starts at item 3.
# This would be numbered 4

See https://developer.mozilla.org/en-US/docs/Web/HTML/Element/li#attr-value -- note that we can't use the start attribute on the <ol> because the parent <ol> item doesn't have any representation in wikitext. Luckily we don't seem to need it.

Changing the type is done via CSS; probably Extension:TemplateStyles would be the best mechanism for that, since the type attribute on the li only affects that one item. You'd use the CSS list-style-type property, although again it's a little unfortunate we can't set the class on the parent <ol>/<ul>/<dl>, just on individual list items. This could be made to work though:

<div style="list-style-type: upper-roman">
# roman
# numerals!
</div>

We'd need to set body { list-style-type: decimal } ol { list-style-type: inherit } in our default site html (the default value is ol { list-style-type: decimal }, which otherwise overrides the style set in the containing <div>) but this could be done.

An alternative would be to make this work:

<ol type="I">
# roman numerals
</ol>

But currently that doesn't do what you expect: it makes roman numerals an element in a doubly-nested list; that is, the # starts a brand new list instead of being adopted into the open empty <ol>.

Verdy_p added a comment. Edited Jan 10 2020, 9:41 PM

<div style="list-style-type: upper-roman">
# roman
# numerals!
</div>

only works for a single level of lists. Sublists (##) that are numbered in a different style (1st level: upper roman, 2nd level: numeric, 3rd level: alphabetic, and so on; a common style for texts of laws and for a lot of books on Wikisource and in similar documents) would still need attributes. Other sublists may also be horizontal rather than vertical, or surrounded by boxes.

So yes, we need a way to specify attributes for the list containers themselves, even if for now we have no wiki syntax for delimiting them, as they are "magically" inferred using basic assumptions.

Using a div container only works for the 1st level, and it's an HTML hack (a styled "div" containing an "ol" containing "li", rather than just a styled "ol" containing "li").

I would prefer this:

# {{#attr:style="list-style-type: upper-roman"}}
# roman
# numerals!

I.e. using an empty list item at the start of the list with styles/attributes to apply to it (the empty item will be dropped, but the attributes kept for the container). And it would work as well for sublists, without breaking the main list.

# {{#attr: value=10}}
# item 10
# item 11
## {{#attr: style="list-style-type: upper-roman"}}
## item I (part of item 11)
## item II (part of item 11)
# item 12

Note that above, two lines do NOT generate any list item, as they contain only attributes and no other content. There is no need to specify any custom separator like "|" between the attributes and the contents of a list item... MediaWiki just has to drop empty list items: it keeps the attributes for the list container ONLY when there's no content in the 1st line, and keeps the attributes for the list item in all other cases.

This means: place the styles and other attributes for the list container in a separate leading line with NO content (except ignorable blanks and HTML comments and the "#attr:" magic).

Also, with the "heredoc" syntax, there is no longer any need to count the leading "#" for sublists (or for indenting replies on talk pages, which can also be styled the same way with a leading empty item):

# {{#attr: value=10}}
# item 10
# item 11 <<<
# {{#attr: style="list-style-type: upper-roman"}}
# item I (part of item 11)
# item II (part of item 11)
>>>
# item 12

And it would be very smart for templates if they could isolate their generated lists in "heredoc" elements, so that templates would be transcludable at various levels (they could also generate multiple paragraphs/blocks inside list items):

# {{#attr: value=10}}
# item 10
# item 11 <<<

There are two items in the sublist below:

# {{#attr: style="list-style-type: upper-roman"}}
# item I (part of item 11)
# item II (part of item 11)

>>>
# item 12

Extension:TemplateStyles is really a much better solution here. It will let you set up styles for ol, ol ol, ol ol ol, etc in whatever style alternation you prefer. The type attribute directly on list items is deprecated in HTML5, and inline style attributes are also being discouraged in wikitext in favor of proper encapsulation in a (scoped) stylesheet via TemplateStyles.

So while it's interesting to think about how {{#attr}} interacts with the wikitext-invisible list containers, this isn't a strong use case for changing how (say) empty-list-item elimination works (done by Remex these days) because there are better ways to handle this particular use case.

Styling ol, ol ol, ol ol ol, etc. does not work for the general case. It's much preferable to be able to give attributes (notably id and class) for list items and list containers (style can then be made on classes, id's can be used as anchors).

Managing the content and its separation from style is much easier that way (and does not require TemplateStyles at all, which is anyway limited to CSS styles and covers no other attributes, including semantic attributes or accessibility attributes).

Tell me more about semantic attributes or accessibility attributes.

At the moment I'd rather associate properties with list containers by making something like this work:

<ul class="foo" id="bar">
* item
</ul>

and making list item syntax smart enough to adopt the items into the open container when present instead of starting a new one. That seems cleaner (to me) than adding a new corner case where attributes of an empty list element get adopted into the parent container -- magic behavior like that seems risky: what if the contents of the items are automatically generated (from a template argument, say) and just *happen* to occasionally be empty? Then we'd get unexpected application of list item properties to the parent container, but only in the buggy case where the list item happened to be empty. There are a lot of these unexpected surprises already in wikitext, and I would be very reluctant to add another one.

Verdy_p added a comment. Edited Jan 10 2020, 11:44 PM

Yes, using explicit HTML tags for list containers can be a solution too.... if it works (for now no, because, like you say, the wikisyntax opens a new embedded container)

The following case still does not work as expected:

* item 1
* item 2 <ul>
* item 2.1
* item 2.2
</ul>
* item 3

just like this one:

* item 1
* item 2 <div>
This is a paragraph in item 2.

This is another paragraph in item 2.
</div>
* item 3

The problem here is to recognize where the parser must terminate the bulleted item: it only searches for the next newline in the wikisource and then expects it to be the end of the item.
However a generic "<<<" anywhere in that line would mean that it would suspend the parsing, and use a separate subparser to read it up to the ">>>", and then treat all the embedded content as if it was inline (when the second parser will return, it will return a replaceable element). The first parser would then continue after the ">>>" to continue searching the newline.
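
The "replaceable element" idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the design from the RFC: protect_heredocs and the placeholder format are invented here, and nested <<< >>> pairs are simply kept inside the outermost span.

```python
def protect_heredocs(text):
    """Replace each top-level <<< ... >>> span with a placeholder token.

    Returns the protected text and the list of stashed span contents, so a
    line-based pass can run on the protected text without seeing the
    newlines inside the spans.
    """
    out = []
    stash = []
    depth = 0   # nesting level of <<< ... >>>
    start = 0   # index just after the outermost <<<
    i = 0
    while i < len(text):
        if text.startswith("<<<", i):
            if depth == 0:
                start = i + 3
            depth += 1
            i += 3
        elif depth > 0 and text.startswith(">>>", i):
            depth -= 1
            if depth == 0:
                stash.append(text[start:i])
                out.append("\x00HEREDOC%d\x00" % (len(stash) - 1))
            i += 3
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    if depth > 0:
        # Unterminated heredoc: keep the raw text rather than dropping it.
        out.append(text[start - 3:])
    return "".join(out), stash


source = (
    "# item 1\n"
    "# <<< item 2\n"
    "whatever you put in here doesn't break the list\n"
    "\n"
    "# even another numbered list >>>\n"
    "# item 3"
)
protected, stash = protect_heredocs(source)
list_lines = protected.split("\n")
```

The line-based pass then sees exactly three list lines, and the stashed content can be handed to a separate parser instance.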

This is more difficult to do if the parser has to consider many possible opening tags (p, div, ul, ol, dl, gallery, blockquote...) but not inline tags (b, i, u, wikilinks, images, q, br...), this list also being extensible to the case of template transclusions (starting with "{{" but whose parameters may also be split over separate source lines).

Take the example of galleries: their items are also not easy to edit when we have to include templates, as everything must fit on the same line. The same applies to references: it's not easy to break up the text of the reference, even though that content will actually be moved elsewhere in the page and only a simple link with a note marker will be inserted inline in the list item.

The same applies to table cell contents, also limited by newlines in the wikisource.

Using the "<<<" is a good way to avoid many complications in the wiki parser. It isolates the content that may be on separate source lines, including HTML comments... It's a clean and generic way to provide an "escape". But of course it requires supporting multiple parser instances on the same page.

But Mediawiki's parser uses an old simpler strategy using a single non-recursive parser for some passes. There's no concept of a "stack of parsers" (or stack of saved/restored parser states), recursion is used only for templates transclusion, but it is broken in the 1st pass where the legacy MediaWiki syntax breaks the content on newlines in the wiki source, without knowing if that breaking is relevant for the correct level of analysis.

And anyway we still lack proper support for HTML5 semantic elements (section, header, footer, article...) and even very basic HTML4 elements (like caption, etc.)

Even "a" elements could be accepted (what needs to be limited is the content of the "href" attribute, plus the other attributes that are also forbidden in supported elements, notably event handlers): MediaWiki should be able to parse them directly without necessarily requiring "wikification".

It would speed up the generation of pages with other editing tools (not necessarily made specifically for MediaWiki), and some basic HTML could be automatically converted to the wiki syntax when saving. Users could then contribute directly using standard text processors: they could copy-paste the HTML. The editor would rescan the pasted HTML and reduce it automatically to a wiki form, or would warn the user about unsupported features that still need to be fixed manually before saving. This is notably a complication for editing tables, which are notoriously difficult in MediaWiki when there are a lot of other styles and complex structures like colspans and rowspans. Users would be able to work on tables within Excel or LibreOffice and copy-paste the result, even if it gets automatically reconverted to the wiki syntax when possible.

For accessibility attributes and semantic attributes, see the HTML5 and WAI references from W3C. I won't comment more about them. Accessibility attributes include tab orders, keyboard shortcuts, title attributes, etc.

This is not just a question of style attributes for CSS. There are also other useful attributes, including data attributes (e.g. for sorting data in tables or lists with a correct sort order that does not break on the presentation of numbers: without them, "1 234" unexpectedly sorts between "1" and "2" in French instead of after "999", or "1,234" unexpectedly sorts after "999" instead of between "1" and "2"). Plus other content classifiers (like role, lang, dir, scope, etc.) and behavior attributes (like target).
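
The sorting problem is easy to reproduce. In this small Python sketch the numeric keys stand in for whatever a data attribute would carry; the pairing of display text with a key is an assumption made for illustration, not an existing MediaWiki feature described in this task.

```python
# (display text as written in French, hypothetical machine-readable sort key)
cells = [("999", 999.0), ("1 234", 1234.0), ("2", 2.0), ("1", 1.0)]

# Sorting on the displayed text compares characters, not numbers.
by_text = sorted(text for text, _ in cells)

# Sorting on the separate key gives the numeric order readers expect.
by_key = [text for text, _ in sorted(cells, key=lambda cell: cell[1])]
```

Sorting by text places "1 234" between "1" and "2", while sorting by the key places it after "999".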

LGoto moved this task from Needs Triage to Backlog on the Parsoid board. Feb 15 2020, 12:06 AM
ssastry moved this task from Backlog to Feature requests on the Parsoid board. Mar 10 2020, 12:27 AM