Investigate DOM with <li> nodes not found in <ol>/<ul> elements
Closed, Resolved | Public | Bug Report

Description

Error
normalized_message
[error/html2wt] Input DOM is not well-formed. Top-level <li> found that is not nested in <ol>/<ul>
 LI-node: <li id="mwAY0" data-object-id="576">2009.10.30 <span class="mw-image-border noviewer" typeof="mw:Transclusion mw:Image" about="#mwt36" id="mwAY4"
exception.trace
Impact

html2wt doesn't abort at this point. It just ignores the <li> and proceeds with the rest of the DOM, so at worst this causes dirty diffs of the page.

Notes

Details

Request URL
https://hu.wikipedia.orghttp//hu.wikipedia.org/w/rest.php/hu.wikipedia.org/v3/transform/pagebundle/to/wikitext/MOL_Liga_2009%E2%80%932010

Event Timeline

ssastry changed the subtype of this task from "Production Error" to "Bug Report". (Aug 3 2021, 3:55 AM)
Arlolra triaged this task as Medium priority. (Aug 3 2021, 7:50 PM)
Arlolra moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.

The wikitext source of the https://hu.wikipedia.org/wiki/MOL_Liga_2009%E2%80%932010 page contains a <li></li> pair and a <li> without a closing tag, neither of which is nested in an <ol>/<ul>.

Said content was added as new content in https://hu.wikipedia.org/w/index.php?title=MOL_Liga_2009%E2%80%932010&type=revision&diff=9586766&oldid=9281944

parse.php --wt2wt and parse.php --selser do not show dirty diffs in that area, so it seems it roundtrips properly.

If @Arlolra confirms what I'm saying here, I'll fix the wikitext on the wiki so that it doesn't pop up in the logs anymore.

Sounds good.

A couple things to note while I'm here. To reproduce the log, you need a case like,

<li> hi

* ho

since it comes from trying to get the bullet list for the nested list, not just from the presence of a top level list item.
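
To make that concrete, here is a rough sketch of the idea in JavaScript. It is a simplified model on plain objects, not Parsoid's actual code: the names `el`, `BULLET_FOR` and `listBullets` are made up for illustration, and the assumption is that bullets are collected by walking up from the nested list item through its ancestors.

```javascript
// Simplified model, NOT Parsoid's code: nodes are plain objects.
function el(name, ...children) {
  const node = { name, parent: null, children };
  for (const child of children) child.parent = node;
  return node;
}

const BULLET_FOR = new Map([['ul', '*'], ['ol', '#']]);

// Walk up from a list item and collect one bullet per enclosing <li>.
// An <li> on the way that has no <ul>/<ol> ancestor is reported and skipped.
function listBullets(item, errors) {
  let bullets = '';
  for (let node = item; node !== null; node = node.parent) {
    if (node.name !== 'li') continue;
    let list = node.parent;
    while (list !== null && !BULLET_FOR.has(list.name)) list = list.parent;
    if (list !== null) {
      bullets = BULLET_FOR.get(list.name) + bullets;
    } else {
      errors.push('Top-level <li> found that is not nested in <ol>/<ul>');
    }
  }
  return bullets;
}

// Models <body><li>hi<ul><li>ho</li></ul></li></body>.
const inner = el('li');
const outer = el('li', el('ul', inner));
el('body', outer);

const errors = [];
const bullets = listBullets(inner, errors);
console.log(bullets, errors);
```

In this model the bullets for the inner item still come out as a single `*`, and the one complaint is raised about the outer <li>, which is only visited because it is an ancestor of the nested list.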

Also, the unclosed list item sucks up a lot of the page, which makes the DOMCompat::getOuterHTML output in the log quite verbose:
https://github.com/wikimedia/parsoid/blob/master/src/Html2Wt/DOMHandlers/DOMHandler.php#L224

Hmmmmm, I'm not sure I understand your point about the reproduction 🤔

My understanding of that piece of code, outside of its context, is that it would also trigger if these <li> do not have a parent at all, and I think that this is what happens here. For the record, the initial wikitext is as follows:

=== Statisztikák ===
Legtöbb gól egy mérkőzésen:
<li>2009.11.24 {{zászló|HUN}} [[Újpesti TE jégkorong szakosztály|Újpesti TE]] - {{zászló|ROM}} [[SCM Fenestela 68 Braşov|SCM Brassó]] 11:7 </li>
Legnagyobb arányú győzelem:<br>
<li>2009.10.30 {{zászló|ROM}} [[SCM Fenestela 68 Braşov|SCM Brassó]] - {{zászló|HUN}} [[Dunaújvárosi Acélbikák]] 1:10
== A bajnokság végeredménye ==

It's worth noting that "bare" <li> are not fixed up by the HTML TreeBuilder algorithm:

> d = (new DOMParser()).parseFromString('<li>foo', 'text/html')
> d.body.outerHTML
"<body><li>foo</li></body>"

These were almost certainly fixed up by the "old" HTML Tidy, though. So it's not an "error" per se to see them in the wt2html output or in the html2wt input. We could either: (a) tweak our tree builder to always tidy up this case so we never emit it in our output, and/or (b) get rid of the logs on html2wt input since it is expected and not an error.
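
For what it's worth, option (a) could look roughly like the sketch below. This is a hypothetical tidy-up pass over a simplified tree of plain objects, not Parsoid's tree builder and not a reproduction of what HTML Tidy did. The name `wrapBareListItems` and the choice to wrap in a <ul> are assumptions made for illustration.

```javascript
// Hypothetical tidy-up pass over plain objects ({ name, children });
// NOT Parsoid's tree builder. Adjacent bare <li> nodes get one <ul> wrapper.
function wrapBareListItems(parent) {
  const isList = parent.name === 'ul' || parent.name === 'ol';
  const tidied = [];
  let run = null; // the <ul> currently collecting adjacent bare <li> nodes
  for (const child of parent.children) {
    wrapBareListItems(child); // fix descendants first
    if (child.name === 'li' && !isList) {
      if (run === null) {
        run = { name: 'ul', children: [] };
        tidied.push(run);
      }
      run.children.push(child);
    } else {
      run = null; // any other node ends the current run
      tidied.push(child);
    }
  }
  parent.children = tidied;
  return parent;
}

const node = (name, ...children) => ({ name, children });

// Models <body><p></p><li></li><li></li><ul><li></li></ul></body>.
const tidiedBody = wrapBareListItems(
  node('body', node('p'), node('li'), node('li'), node('ul', node('li')))
);
const shape = tidiedBody.children.map((c) => `${c.name}:${c.children.length}`);
console.log(shape);
```

Note that in a real DOM any text between two bare list items would end the run, so a page like this one would get a separate wrapper per item.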

"Hmmmmm, I'm not sure I understand your point about the reproduction 🤔"

Try,

> echo "<li>hi" | php bin/parse.php --wt2wt
<li>hi

vs

> echo -e "<li>hi\n*ho" | php bin/parse.php --wt2wt
[error/html2wt] Input DOM is not well-formed. Top-level <li> found that is not nested in <ol>/<ul>
 LI-node: <li data-object-id="1">hi
<ul data-object-id="2"><li data-object-id="3">ho</li></ul></li>
<li>hi
*ho

Huh. Indeed. (Well, that happens on master, but not on my annotation branch, which is INCREDIBLY suspicious >_< ).
Thanks! :)

In the meantime I edited the Hungarian wiki to clean up the wikitext.

@cscott you wanna do either (a) or (b) within this phab?

"(Well, that happens on master, but not on my annotation branch, which is INCREDIBLY suspicious >_< )."

Probably changed very recently in,
https://github.com/wikimedia/parsoid/commit/41b25569a4b673925fbc2e44c41fe2f5c4f5b407

Oh cool, thanks a lot for pointing that out!

"We could either: (a) tweak our tree builder to always tidy up this case so we never emit it in our output, and/or (b) get rid of the logs on html2wt input since it is expected and not an error."

Keep in mind what I'm saying in T287931#7259759: the error isn't about finding a stray top-level list item. It comes from encountering one while trying to reconstruct the bullet list, which presumably has different expectations about what is well-formed.

The behaviour has been investigated; there are indeed assumptions of Parsoid that are violated by this page, but the consensus is that it's rare/corner-case enough to fix the warning by editing the wikitext that triggers it.