
Parsoid drops closing tags when normalizing misnested tags, leaving wikitext that looks like unbalanced HTML
Closed, Resolved · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.
kostajh added subscribers: Catrope, kostajh.

@Trizek-WMF do you know if this is a regression?

If so, @Catrope maybe it was caused by T209114 ?

@Trizek-WMF do you know if this is a regression?

This is the first time I've seen this.

Catrope moved this task from Needs Discussion to External on the Growth-Team board.

This is a normalization bug in Parsoid. The output technically isn't incorrect, but it's unhelpful.

Note that Nemo Le Poisson's signature is not fully valid HTML, because it has misnested tags: &nbsp;'''[[User:Nemo Le Poisson|<span style="color:orange">Nemo</span>]] <sup><small style="border-bottom:1px solid #000;">[[Discussion utilisateur:Nemo Le Poisson|<span style="font-variant:small-caps; color:blue">Discuter</span>]]'''</small></sup> 28 avril 2019 à 18:10 (CEST).

Specifically, the <sup> and <small> tags are opened inside the bolding triple single quotes (''') but closed outside them. An intuitive way for Parsoid to normalize this would be to change '''</small></sup> to </small></sup>''', so that the tags are closed in the right order. Instead, it just removes </small></sup>. The result works more or less by accident: closing the bolded text also implicitly closes the unclosed tags inside it, so parsing it produces the correct result. It's not very nice, though, because the small and sup tags have no closing tags at all, and human readers get confused by their absence.
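To make the "intuitive" normalization concrete, here is a minimal Python sketch. It is not how Parsoid works (Parsoid operates on a DOM, not on regexes); `move_closers_inside_bold` is a hypothetical helper, and it naively assumes that any ''' directly followed by closing tags is the end of a bold run.

```python
import re

# Hypothetical illustration only, not Parsoid code.
# Matches a ''' that is immediately followed by one or more closing HTML tags.
_CLOSERS_AFTER_BOLD = re.compile(r"'''((?:</[A-Za-z][A-Za-z0-9]*>)+)")


def move_closers_inside_bold(wikitext: str) -> str:
    """Move closing tags that directly follow ''' to just before it.

    Assumption: the matched ''' closes a bold run. A real normalizer would
    have to check that against the parsed structure.
    """
    return _CLOSERS_AFTER_BOLD.sub(r"\1'''", wikitext)


if __name__ == "__main__":
    sample = "'''<sup><small>Discuter'''</small></sup> 28 avril 2019"
    print(move_closers_inside_bold(sample))
```

On a simplified version of the signature, this turns '''</small></sup> into </small></sup>''' and leaves the opening ''' alone, because the opening one is followed by an opening tag rather than a closing one.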

A second normalization bug that I noticed is that [[Fichier:Vogelherdhöhle,in ihr fand man die ältesten Kunstwerke der Menschheit - panoramio.jpg|180px]] got normalized to [[Fichier:Vogelherdhöhle,in ihr fand man die ältesten Kunstwerke der Menschheit - panoramio.jpg|180px|lien=Fichier:Vogelherdh%C3%B6hle,in_ihr_fand_man_die_%C3%A4ltesten_Kunstwerke_der_Menschheit_-_panoramio.jpg]] (lien= is French for link=). This probably happens because Parsoid fails to recognize that the image name and the link target are the same except for URL encoding.
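As a rough illustration of the comparison that seems to be missing, the following Python sketch treats two titles as equal when they differ only by percent-encoding and by underscores versus spaces. The function names are hypothetical and this is not Parsoid's title handling, which also deals with namespaces, capitalization and more.

```python
import unicodedata
from urllib.parse import unquote


def canonical_title(title: str) -> str:
    """Decode percent-escapes, map underscores to spaces, normalize to NFC.

    Hypothetical helper for illustration; real MediaWiki title
    normalization has more rules than this.
    """
    decoded = unquote(title)  # %C3%B6 -> ö (decoded as UTF-8)
    spaced = decoded.replace("_", " ").strip()
    return unicodedata.normalize("NFC", spaced)


def same_title(file_name: str, link_target: str) -> bool:
    """Return True if both strings name the same page under these rules."""
    return canonical_title(file_name) == canonical_title(link_target)


if __name__ == "__main__":
    name = ("Fichier:Vogelherdhöhle,in ihr fand man die ältesten "
            "Kunstwerke der Menschheit - panoramio.jpg")
    link = ("Fichier:Vogelherdh%C3%B6hle,in_ihr_fand_man_die_%C3%A4ltesten_"
            "Kunstwerke_der_Menschheit_-_panoramio.jpg")
    print(same_title(name, link))
```

With a check along these lines, the lien= parameter would be recognized as redundant and could be omitted when serializing.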

There's also a bunch of whitespace that gets dropped, which isn't supposed to happen. I'm not sure why, because I thought Parsoid was able to preserve whitespace at least in very simple cases, like headings (== Foo ==). In this diff, much of the heading whitespace is removed, but not all of it: some headings lose only their trailing whitespace and keep their leading whitespace.
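One way to pin down which headings are affected is to compare the padding around each heading title before and after the round trip. The Python sketch below is only a diagnostic aid under simplifying assumptions: both versions have the same number of lines, and headings sit on a single line. All names are hypothetical.

```python
import re

# Captures the padding on each side of a one-line heading title,
# e.g. "== Foo ==" -> level "==", leading " ", title "Foo", trailing " ".
_HEADING = re.compile(r"^(={1,6})([ \t]*)(.*?)([ \t]*)\1[ \t]*$")


def heading_padding(line):
    """Return (title, leading_ws, trailing_ws), or None if not a heading."""
    m = _HEADING.match(line)
    if m is None:
        return None
    return m.group(3), m.group(2), m.group(4)


def changed_heading_padding(before: str, after: str):
    """List (old_line, new_line) pairs where only heading padding changed.

    Assumption: lines correspond one-to-one between the two versions.
    """
    changes = []
    for old, new in zip(before.splitlines(), after.splitlines()):
        a, b = heading_padding(old), heading_padding(new)
        if a and b and a[0] == b[0] and a[1:] != b[1:]:
            changes.append((old, new))
    return changes


if __name__ == "__main__":
    before = "== Foo ==\ntext\n== Bar =="
    after = "== Foo==\ntext\n== Bar =="
    print(changed_heading_padding(before, after))
```

In the example, only the first heading is reported, since it lost its trailing whitespace but kept its leading whitespace.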

Catrope renamed this task from "Flow talk page manager strips HTML when converting a page" to "Parsoid drops closing tags when normalizing misnested tags, leaving wikitext that looks like unbalanced HTML". Sep 26 2019, 11:23 PM
LGoto triaged this task as Low priority. May 1 2020, 4:20 PM
LGoto moved this task from Backlog to Needs Investigation on the Parsoid board.

When you say "normalization" do you mean the "scrubWikitext" option?

Arlolra claimed this task.
Arlolra subscribed.

When you say "normalization" do you mean the "scrubWikitext" option?

No, I mean --wt2wt on the command line,

&nbsp;'''[[User:Nemo Le Poisson|<span style="color:orange">Nemo</span>]] <sup><small style="border-bottom:1px solid #000;">[[Discussion utilisateur:Nemo Le Poisson|<span style="font-variant:small-caps; color:blue">Discuter</span>]]'''</small></sup> 28 avril 2019 à 18:10 (CEST).

turns into,

&nbsp;'''[[User:Nemo Le Poisson|<span style="color:orange">Nemo</span>]] <sup><small style="border-bottom:1px solid #000;">[[Discussion utilisateur:Nemo Le Poisson|<span style="font-variant:small-caps; color:blue">Discuter</span>]]''' 28 avril 2019 à 18:10 (CEST).

as described. I'm going to say the reasoning in T42438#465151 still applies, and dropping these in edit contexts is acceptable. All the dirtying here looks like it came from a selser (selective serialization) failure that's hopefully resolved elsewhere.

A second normalization bug that I noticed is that ...

This one is fixed now,

> echo "[[Fichier:Vogelherdhöhle,in ihr fand man die ältesten Kunstwerke der Menschheit - panoramio.jpg|180px]]" | php bin/parse.php --domain fr.wikipedia.org --wt2wt
[[Fichier:Vogelherdhöhle,in ihr fand man die ältesten Kunstwerke der Menschheit - panoramio.jpg|180px]]