λ (master) echo ":* 123\n;* 456" | php maintenance/run parse
<dl><dd><ul><li>123</li> <li>456</li></ul></dd></dl>
λ (master) echo ":* 123\n;* 456" | php maintenance/run parse --parsoid
<section data-mw-section-id="0" id="mwAQ"><dl id="mwAg"><dd id="mwAw"><ul id="mwBA"><li id="mwBQ">123</li></ul></dd> <dt id="mwBg"><ul id="mwBw"><li id="mwCA">456</li></ul></dt></dl> </section>
Description
Event Timeline
I think the root cause is here:
https://gerrit.wikimedia.org/g/mediawiki/core/+/ab50bb3847ed28ca9b7b29797519745dbe0a0863/includes/parser/BlockLevelPass.php#241
The legacy parser normalizes both : and ; to the same character, :, whereas Parsoid doesn't.
Is this workaround something we also want to keep in Parsoid? In theory, Parsoid is semantically more correct.
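As a rough illustration of the normalization described above, here is a minimal Python sketch. It is not the MediaWiki code: `same_prefix` is a hypothetical helper that only models the idea of treating ; as : when comparing the list prefixes of two consecutive lines.

```python
def same_prefix(prev: str, cur: str, normalize: bool) -> bool:
    """Report whether two list prefixes are considered identical.

    normalize=True models the legacy behaviour described in this task
    (';' is treated as ':' before comparing); normalize=False compares
    the raw prefixes. Hypothetical helper, not MediaWiki or Parsoid code.
    """
    if normalize:
        prev = prev.replace(";", ":")
        cur = cur.replace(";", ":")
    return prev == cur


# Prefixes of the two lines from the task description: ":* 123" and ";* 456".
print(same_prefix(":*", ";*", normalize=True))   # True
print(same_prefix(":*", ";*", normalize=False))  # False
```

With normalization the two prefixes compare equal, which is consistent with the legacy output putting both items in one <ul>. Without it they differ, which is consistent with Parsoid closing the <dd> and opening a <dt>. How Parsoid actually makes that decision is not modelled here.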
I think we might want to decline this unless we see it's producing a lot of visual differences
Searching insource:/\n";*"[^\n]*\n":*"/ in the article namespace on enwiki returns only a handful of results, none of which really seem all that intentional.
Untagging WIP and unassigning because it looks like this ticket is only for future reference and it's not actionable.
Here is a different example which shows up as a real difference in enwiki navboxes, but where I think Parsoid's rendering is correct and also better.
$ cat /tmp/lwt
;a
:b
;c
:d
::e
;f
:g
::h
$ php maintenance/parse.php < /tmp/lwt
<dl><dt>a</dt> <dd>b</dd> <dt>c</dt> <dd>d <dl><dd>e</dd></dl></dd></dl>
<dl><dt>f</dt> <dd>g <dl><dd>h</dd></dl></dd></dl>
$ php maintenance/parse.php --parsoid < /tmp/lwt
.. edited output to strip extraneous attributes & section wrappers ..
<dl><dt>a</dt> <dd>b</dd> <dt>c</dt> <dd>d <dl><dd>e</dd></dl></dd> <dt>f</dt> <dd>g <dl><dd>h</dd></dl></dd></dl>
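The structural difference between the two outputs above is that the legacy parser emits two outermost <dl> elements where Parsoid emits one. A small Python sketch using the standard library's html.parser can make that visible; `TopLevelDlCounter` and `count_top_level_dl` are hypothetical names introduced for illustration, and the two strings are the outputs pasted above.

```python
from html.parser import HTMLParser


class TopLevelDlCounter(HTMLParser):
    """Counts <dl> elements that are not nested inside another <dl>."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0      # current <dl> nesting depth
        self.top_level = 0  # number of outermost <dl> elements seen

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            if self.depth == 0:
                self.top_level += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "dl":
            self.depth -= 1


def count_top_level_dl(html: str) -> int:
    parser = TopLevelDlCounter()
    parser.feed(html)
    parser.close()
    return parser.top_level


# Outputs copied from the comment above (Parsoid output already stripped
# of extra attributes and section wrappers).
legacy = (
    "<dl><dt>a</dt> <dd>b</dd> <dt>c</dt> <dd>d <dl><dd>e</dd></dl></dd></dl> "
    "<dl><dt>f</dt> <dd>g <dl><dd>h</dd></dl></dd></dl>"
)
parsoid = (
    "<dl><dt>a</dt> <dd>b</dd> <dt>c</dt> <dd>d <dl><dd>e</dd></dl></dd> "
    "<dt>f</dt> <dd>g <dl><dd>h</dd></dl></dd></dl>"
)

print(count_top_level_dl(legacy))   # 2
print(count_top_level_dl(parsoid))  # 1
```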
Compare https://en.wikipedia.org/wiki/Template:Android_tablets?useparsoid=1 and https://en.wikipedia.org/wiki/Template:Android_tablets?useparsoid=0
Here is another example where Parsoid's interpretation of definition list handling is better: ;*a :b renders as you would intuitively expect in Parsoid (the : binds to the * and is not interpreted as a <dd> tag). This leads to better rendering on https://en.wikipedia.org/wiki/User%3ACarlossanchezbeltran%2FChoose_an_Article (compare Parsoid vs legacy).
Some thoughts and I'm not sure if they're actionable or useful even:
- The use of an unordered list in a <dt> I would generally treat as sus, as it were. I can think of little legitimate reason (ignoring the reason that is talk pages, which don't tend to use dt anyway) for one to put an unordered list in a definition term. (The standard provides for multiple definition terms per definition.) Most current uses today should probably just use '''content''' that I can see just eyeballing things.
- The use of the constructs in that navbox is... more or less in the sus category as well. I've seen the pattern, but : item\n:: sublist item is also just bad wikitext at the end of the day, when the definite intent is sublists (and not talk pages). (Preferably these would be : item\n:* sublist item.) This search mostly just seems to illustrate how trivial it is to use the character incorrectly for indentation....
Pretty sure there's nothing to do with that information. We have users with at-minimum uninformed intent here, no matter whether it ends up in HTML better.
NB I don't get why you inserted quotation marks here (though a search without them ends up the same).
Oh, double quotes are another escape mechanism. Compare that to what you would otherwise have had to write, insource:/\n;\*[^\n]*\n:\*/, to escape the * metacharacter.
https://www.mediawiki.org/wiki/Help:CirrusSearch#Regular_expression_searches
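The two escaping styles discussed above can be tried offline as a rough sketch with Python's re module. This is an assumption-laden stand-in: Python's regex dialect is not the one CirrusSearch uses and has no "..." quoting, so re.escape is used here to play the role of the double quotes. The sample wikitext is made up for illustration.

```python
import re

# Backslash-escaped form, as written in the comment above.
escaped = r"\n;\*[^\n]*\n:\*"

# Python's re has no "..." quoting; re.escape serves the same purpose
# of making ';*' and ':*' literal strings.
quoted_equivalent = r"\n" + re.escape(";*") + r"[^\n]*\n" + re.escape(":*")

# Hypothetical sample: a ';*' line immediately followed by a ':*' line.
sample = "intro\n;* term item\n:* definition item\n"

for pattern in (escaped, quoted_equivalent):
    match = re.search(pattern, sample)
    print(match is not None)
```

Both patterns match the sample, since they describe the same thing: a line starting with ;* directly followed by a line starting with :*.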