Page MenuHomePhabricator

Parsoid corruption (loss of content, removed newlines) when editing pages with fostered content (often caused by unclosed table inside a list item `:{|`)
Open, LowPublic

Description

Parsoid removes all newlines from the page when editing a table inside a list item :{|.

Example: (in this edit, I typed "a" after the "test" in the a section)
https://en.wikipedia.org/w/index.php?title=User:Matma_Rex/sandbox&diff=930049246&oldid=930049140&diffmode=source

My guess is that the list item puts Parsoid into a single-line context, which tries to render the rest of the page as if it was a single list item, with no newlines.

Event Timeline

All the content in the table in your example is found in a fosterable position.

If your table looked like,

:{|
|

Parsoid probably wouldn't have a problem with it, since it does have protection to avoid single-line context constraints in tables,
https://github.com/wikimedia/parsoid/blob/master/src/Html2Wt/DOMHandlers/TableHandler.php#L29-L31

It's the fostered content that ends up being serialized in the list item that results in the single line.

In general, editing fostered content is error prone.

Being able to corrupt the entire document just because of some unbalanced wikitext elsewhere on the page seems less than ideal...

Maybe we should consider marking fostered content such that it is uneditable?

Maybe we should consider marking fostered content such that it is uneditable?

Maybe? I think that something needs to be done about this. If I introduce a wikitext error in one part of the page, then IMO you should still be able to edit other parts of the page without being blocked by my error, and you should be able to fix my wikitext error.

I'm not convinced this should be invalid -- it's a pretty bad page-corruption issue if bad wikitext goes through Parsoid.

Fostered markup and dealing with the mess it creates has been one of the biggest sources of complexity in the parsoid codebase. We handle a lot of scenarios already but in general, it is hard to catch all issues.

And, in this scenario, the table markup is bogus ... but I suspect this is just one more hack editors are using to get around the limitation of not being able to embed multi-line content in lists. And, they are relying on the fact that the entire content will get fostered, and the empty table will be invisible. So, we'll have to add special hacks to handle this special editor hack.

But one option is to block editing of such tables / list items. But, if we want to limit it to just this scenario vs. blocking editing of all tables that have fostered content, we'll need special detection at which point we might as well handle it. Alternatively, we can just mass disable editing of tables with fostered content. LintErrors already identifies all pages that have this problem. https://en.wikipedia.org/wiki/Special:LintErrors says about 38K instances on enwiki .. which is about 0.1% of pages?

Thoughts? And, perhaps that discussion should be a different ticket.

I'm not convinced this should be invalid -- it's a pretty bad page-corruption issue if bad wikitext goes through Parsoid.

Parsoid does not guarantee it will preserve every broken pattern in use. But, if it is a common enough pattern, we add hacks / workarounds. So, I wouldn't say "bad wikitext" in general.

In any case, my previous comment lays out some options for this particular scenario. How commonly is this hack used on talk pages?

I think what makes this case unique is that the failure is so catastrophic. The entirety of the page (or just after the table?) gets corrupted during the edit.

I think what makes this case unique is that the failure is so catastrophic. The entirety of the page (or just after the table?) gets corrupted during the edit.

Is it just that list item pr everything else on the page beyond the list item? Reading the title and description, I got the impression it is just the table in that list item and hence is limited.

In any case, what are your thoughts on T240280#5737072 and also any info about how common this is?

We'll separately chat about this to see if there is a simple enough fix to detect this scenario.

Is it just that list item pr everything else on the page beyond the list item? Reading the title and description, I got the impression it is just the table in that list item and hence is limited.

The linked edit by Bartosz shows the entire rest of the page getting corrupted.

In this example a lot of the page is corrupted because the previous edit added an unclosed table.

(re-opening as the discussion is active)

Is it just that list item pr everything else on the page beyond the list item? Reading the title and description, I got the impression it is just the table in that list item and hence is limited.

The linked edit by Bartosz shows the entire rest of the page getting corrupted.

In this example a lot of the page is corrupted because the previous edit added an unclosed table.

In both these cases, if you look at the rendering, you will notice that adding that unclosed table wikitext effectively breaks the rendering of the entire page from that point on, i.e. the rest of the page is pulled into the table as fostered content .. and the now empty table is deleted. So, from there, a few questions.

  1. What is the use case that benefits from that brokenness? So, assuming you don't really want to support that use case, one possible solution would be for us to suppress (visual) editability of those tables and associated fostered content which serves as an explicit indicator to editors to fix the broken markup before proceeding. But, I guess the question I asked earlier is if we should make it a more general suppression of editability of table with fosterable content for the same reason - get editors to fix the broken markup. In Parsoid land, we want to remove all these hacks, not add more of them, if we can help it.
  2. But, assuming you want to support this use case from some reason, and you don't want Parsoid to suppress editability either, we can investigate more.

So, yes, I agree Parsoid shouldn't do nasty dirtying of pages with broken markup, but it is a question of what the best strategy here is. Do we want to pile on more code hacks or do we want to disincentivize reliance on such mark up? If this is an uncommon use case, we are willing to say: please fix the markup - we won't support this use case.

From david: “The ability to enter :{| into any page and have it strip all newlines the next time Parsoid interacts with said page” isn’t great!

This is relevant to talk pages, since : is auto-inserted by the post-a-reply tool to get proper indentation. So the user "just" made a little typo when trying to create a table in the reply, and suddenly the whole page gets corrupted.

From david: “The ability to enter :{| into any page and have it strip all newlines the next time Parsoid interacts with said page” isn’t great!

This is relevant to talk pages, since : is auto-inserted by the post-a-reply tool to get proper indentation. So the user "just" made a little typo when trying to create a table in the reply, and suddenly the whole page gets corrupted.

I am not trying to be difficult, but

  • If someone enters a "{|" after a ":" (accidentally or not), it immediately breaks the rendering of the page (Parsoid not involved here). So, it will require a revert / fix of this broken syntax. That is why I was hesitant to prioritze this highly for Parsoid. But, yes, I get the notion that ppl can do source wikitext comment even in its broken form and have the revert happen later .. but that affordance isn't there with Parsoid.
  • "{|" after a ":" is much less likely to be a typo and much more likely to be a deliberate insertion since it requires a specific unusual combination to occur at the start of a line (vs. elsewhere on the line).

Overall, I continue to feel this is not an important blocker given the above observations. But given how strongly a number of you seem to feel about it, we'll take a look at it and see what is reasonable on Parsoid's end.

I am not sure there is an easy fix for this on the Parsoid end. The crux of the matter can be demonstrated with this snippet:

[subbu@earth:~/work/wmf/parsoid] cat /tmp/x.html
<h2>"native wikitext" representation exists for the table-in-list embedding below</h2>
<dl><dd><table><tr><td>x</td></tr></table></dd></dl>

<h2>No "native wikitext" representation exists for the table-in-list embedding below</h2>
<dl><dd>a<table><tr><td>x</td></tr></table></dd></dl>

[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --html2wt < /tmp/x.html
== "native wikitext" representation exists for the table-in-list embedding below ==

: {|
|x
|}

== No "native wikitext" representation exists for the table-in-list embedding below ==

: a {| |x |}

The second form is what is generated when you add the bare {| which then causes the entire rest of the page to become "fostered content" and is hoisted into the list item before the table (verified by the rendering). This is a more generic HTML -> Wikitext problem where Parsoid has to render the table with literal HTML tags. Even if we assume Parsoid could do this, this will generate a mass dirty diff of the page where a lot of native wikitext markup will be converted to literal HTML tags. I am not sure editors will like that either.

So, from our point of view, this is a WONTFIX, unless it is clear why the mass-dirty-diff solution is better.

That said, if we had support for multiline list items, then, Parsoid could use that to reduce the dirty-diffs by using that to embed the table or whatever else needs to be embedded.

But, once again, that said, I cannot see a scenario where editors will prefer this solution to simply reverting the original mistake of breaking the page by inserting the {| needlessly.

The other alternative that we had recommended early in the comment thread of this ticket is for Parsoid to mark the page / the entire blob as uneditable to prevent any corruption (so that Parsoid can preserve the brokenness intact), but even that solution fails the desrability test.

So, at least to me, it appears that perhaps the only desirable solution here is to have editors revert the wikitext breakage.

To be clear, supporting arbitrary HTML is part of Parsoid's long-term HTML -> wikitext roadmap ( we originally antiicpated having that functionality in place well before now, but there was no pressing demand for it for us to prioritize that work ), and so eventually Parsoid will generate valid serialized output for HTML like this by switching to native HTML tags as required. But, that approach won't help here.

In particular, switching to native HTML tags would be a huge dirty diff on a typical talk page, because you need to switch the tags both on the current list item but also all parent lists, and thus all their items and maybe subitems -- or else break the parent list and start a new list, which is a non-faithful serialization and would cause a WP:LISTGAP but which might still reasonably fall under the scrubWikitext functionality.

matmarex renamed this task from Parsoid removes all newlines from the page when editing a table inside a list item `:{|` to Parsoid corruption (loss of content, removed newlines) when editing pages with fostered content (often caused by unclosed table inside a list item `:{|`).Mar 2 2020, 10:48 PM
matmarex removed a parent task: T233443: [Epic] Reply Tool.
LGoto triaged this task as Low priority.Mar 27 2020, 4:18 PM
LGoto moved this task from Backlog to Feature requests on the Parsoid board.