
[error/html2wt] Input DOM is not well-formed. Top-level <li> found that is not nested in <ol>/<ul>\n LI-node
Open, Needs Triage · Public · PRODUCTION ERROR

Description

There are quite a few of these (5K in the last 30 days). Unhelpfully, they show up as %{message} in Logstash, because it handles long messages poorly.

Find reqId in Logstash

Sample message (I've cut off lots more HTML):

{"@timestamp":"2025-03-24T13:35:38.978048+00:00","@version":1,"host":"mw-api-ext.eqiad.main-685cd4f6fd-hrphh","message":"[error/html2wt] Input DOM is not well-formed. Top-level <li> found that is not nested in <ol>/<ul>\n LI-node: <li id=\"mwAZ4\" data-object-id=\"461\"> 4 Consequences of the Rules\neach subdivided into Dionysian (Julian) and Gregorian.<meta typeof=\"mw:Placeholder/StrippedTag\" id=\"mwAZ8\" data-object-id=\"462\"/>\n\n<dl id=\"mwAaA\" data-object-id=\"463\"><dd id=\"mwAaE\" data-object-id=\"464\"><span data-mw-selser-wrapper=\"\" data-object-id=\"4124\">In each case, Julian should precede Gregorian; and the introduction should say so.</span></dd></dl>\n\n<dl id=\"mwAaI\" data-object-id=\"465\"><dd id=\"mwAaM\" data-object-id=\"466\"><span data-mw-selser-wrapper=\"\" data-object-id=\"4125\">If copyright permits, there should be legible images of the necessary material as in the Book of Common Prayer</span><span typeof=\"mw:DisplaySpace\" id=\"mwAaQ\" data-object-id=\"467\"> </span><span data-mw-selser-wrapper=\"\" data-object-id=\"4126\">: for Gregorian, that amounts to one paragraph plus three pages. SNIP

Obviously this needs better logging at a minimum. Not sure whether the fix belongs in Parsoid or in MediaWiki's Logstash plugin; probably both. LogstashFormatter should enforce a max length; Parsoid probably shouldn't put the whole HTML in the message.
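To illustrate the max-length idea, here is a minimal sketch in Python. It is not the LogstashFormatter or Parsoid code; the names `truncate_message`, `format_record`, and `MAX_MESSAGE_CHARS`, and the 2000-character limit, are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical cap; the task does not specify what the limit should be.
MAX_MESSAGE_CHARS = 2000


def truncate_message(message: str, limit: int = MAX_MESSAGE_CHARS) -> str:
    """Cap a log message at `limit` characters and note how much was dropped."""
    if len(message) <= limit:
        return message
    dropped = len(message) - limit
    return f"{message[:limit]}... [truncated {dropped} chars]"


def format_record(record: dict, limit: int = MAX_MESSAGE_CHARS) -> str:
    """Serialize a log record to JSON with a length-capped "message" field."""
    capped = dict(record)  # shallow copy so the caller's record is not mutated
    capped["message"] = truncate_message(str(record.get("message", "")), limit)
    return json.dumps(capped)


if __name__ == "__main__":
    record = {
        "host": "example-host",
        "message": "[error/html2wt] Input DOM is not well-formed. " + "<li>" * 5000,
    }
    print(len(format_record(record)))
```

Capping in the formatter protects the log pipeline from any oversized message, whatever its source, while the caller can still reduce noise at the origin by logging an identifier for the offending node rather than its full HTML.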

Event Timeline

Change #1130628 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] html2wt: Get rid of non-actionable logging + use sensible fallbacks

https://gerrit.wikimedia.org/r/1130628