Page MenuHomePhabricator

Page corruption in reply tool caused by unclosed ref tag
Closed, ResolvedPublic

Description

Unclosed ref tags cause havoc. Recent examples:

If Parsoid doesn't support this, then we should prevent replying in our tool (like we do for pages with fostered content). There might not be an easy way to detect it right now, though.

Event Timeline

Looking at the first diff, the original wikitext seems to be something like:

* Comment one. <ref> the preceding ref is unclosed
** other stuff
== a heading ==
* More comments with self-closed <ref/>
** Finally a comment with a full <ref>here comes the close tag!</ref> stuff after

Everything between the first heading after the <ref> and final </ref> is deleted in the edit, and the original <ref> is no-wiki'ed:

* Comment one. <nowiki><ref></nowiki> the preceding ref is unclosed
** other stuff
</ref>stuff after

So next step is to figure out what the HTML looked like before/after this use of the reply tool, to figure out whether Parsoid or DiscussionTools or both are at fault. Suspicious features: (a) the heading is being treated as terminating the <ref> somehow? It may be that section editing is implicated somehow, (b) the original open <ref> tag is left in the wikitext, but <nowiki>ed, (c) The closing <ref> tag is left in the wikitext, even though it is now unpaired with an open ref.

I couldn't reproduce this bug exactly with a simpler test case, but here's a minimal example that behaves suspiciously similar: https://test.wikipedia.org/wiki/T301213a

a<ref>b<ref><ref>c</ref>

(imagine that this is a mistake in wikitext; someone forgot a /, the second <ref> should be </ref>)

Legacy parser displays an error and no references: https://test.wikipedia.org/wiki/T301213a?action=render

image.png (175×715 px, 7 KB)

Parsoid displays a reference and no errors: https://test.wikipedia.org/api/rest_v1/page/html/T301213a

image.png (263×679 px, 3 KB)

And trying to edit the reference in visual mode produces <nowiki> tags that look suspiciously like the diff:

image.png (2×3 px, 296 KB)

If that is the source of the error, can we simply fix the pages and be done with it? We obviously won't recover cleanly form all possible errors and unless this is a common scenario, my inclination is to say that it is not worth the time that might be spent debugging and fixing it in Parsoid.

I think the issue is that the unclosed ref tag resulted in a huge amount of content being deleted from the page: https://fr.wikipedia.org/?diff=190533941

I don't particularly care if Parsoid renders this differently, but whatever Parsoid chooses to do should at least round-trip in some fashion which preserves rather than deletes most of the text.

The original question was whether there was a way that DiscussionTools could identify this case, eg via a lint like they do for unclosed tables, to warn the user and prevent them from using DiscussionTools on this page.

I figured out why this happens, it's not a Parsoid issue. And it's not even related to unclosed <ref> tags.

Minimal test case involves a reference with two comments in it followed by other content, an automatically generated reference list, and a section with no replies at the end of the page:

https://test.wikipedia.org/wiki/Talk:T301213a

1a<ref>b [[User:Matma Rex|Matma Rex]] 00:01, 16 February 2022 (UTC)
2:c [[User:Matma Rex|Matma Rex]] 00:02, 16 February 2022 (UTC)
3d
4</ref>
5== e ==
6f [[User:Matma Rex|Matma Rex]] 00:03, 16 February 2022 (UTC)
image.png (422×824 px, 43 KB)

Note how the comments are rendered in different order than they are in the source. This tricks the reply tool into thinking that comments "b" and "c" are replies to "f". When we try to add a reply to "f", it is therefore placed below the existing replies, in the middle of the reference list. That wouldn't cause this issue by itself (Parsoid would ignore the extra content in the wrong place, and the edit would not go through); however, there is also content in the reference following the replies, so the reference list item gets split in two to make that possible, and all content after the split is list – in this case, "d" disappears (and the new comment isn't even posted): https://test.wikipedia.org/w/index.php?title=Talk:T301213a&diff=500716&oldid=500715&diffmode=source

We should probably just ignore all comments in references list… Even when they don't cause this bug, nothing good can come from that.

Change 764427 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Don't detect comments within references

https://gerrit.wikimedia.org/r/764427

Change 764427 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Don't detect comments within references

https://gerrit.wikimedia.org/r/764427

This looks good to me.
Using the minimal test case. Added in the order s, t, u, v, w, x for more context.

Screenshot 2022-03-07 at 19.49.32.png (1×2 px, 182 KB)
Screenshot 2022-03-07 at 19.51.42.png (1×2 px, 286 KB)

Comments t and u are not treated as replies to x. See:
Screenshot 2022-03-07 at 19.58.00.png (680×2 px, 151 KB)