Page MenuHomePhabricator

[Research spike] Why does posting a new comment delete `</span>` from a previous one?
Open, Needs TriagePublic

Description

This task is about answering three questions related to the scenario described below...

Questions

  • Why does parsoid strip the </span> tag from the wikitext snippet [1]?
  • Assuming we deem it important to fix this issue, what would be involved with doing so?
  • Meta: are issues like this one resolvable on a case-by-case basis or is it a fundamental limitation of Parsoid?

Scenario

  1. Post the following wikitext snippet to this talk page, using the new replying workflow or by using "Edit source". See this example.
  2. Notice: "red test." is published to the talk page and word "red" is bolded and the color red.
  3. Reply to the edit you made in "1." using the reply tool (it doesn't seem to matter what content you include in this reply).
  4. ⚠️ Notice: the red styling spills down to the end of the section

Additional context

For additional context, this issue was initially reported on-wiki, here: Topic:Vcwvt3bq03o5gv8h


  1. Wikitext snippet: ::{{#if:x|<span style="color:red;|<span style="}} font-weight: bold">red</span> test.

Event Timeline

Fwiw, it seems wrapping the snippet [1] in <nowiki>...</nowiki> prevents the </span> from being stripped, as you'd expect. Although, it seems odd that those tags also re-introduce == New section == which had previously been deleted. See: diff.


  1. ::{{#if:x|<span style="color:red;|<span style="}} font-weight: bold">red</span> test.

The "interleaved" {{#if:…}} and <span …> apparently causes Parsoid not to recognize the <span …> tag, so the closing </span> tag no longer has a matching opening tag, which is impossible to represent in HTML – so when that part of the page is edited (e.g. by adding another comment nearby), it gets lost.

You can see this when you compare the normal PHP Parser rendering (https://en.wikipedia.org/w/index.php?title=User:Matma_Rex/sandbox&oldid=935163905) to the Parsoid rendering (https://en.wikipedia.org/api/rest_v1/page/html/User%3AMatma_Rex%2Fsandbox/935163905 or https://en.wikipedia.org/w/index.php?title=User:Matma_Rex/sandbox&oldid=935163905&veaction=edit)

I think Parsoid has workarounds in place for some other things that can't be represented in HTML, but clearly not this one. I don't know if that's some fundamental limitation or just something they didn't find important enough to spend time on.

I really cannot imagine a reason why anyone would want to post that (i.e., unescaped) on wiki in the first place. It's the kind of thing on enwiki that would get reverted as a "test edit" without anyone thinking twice.

I think Parsoid has workarounds in place for some other things that can't be represented in HTML, but clearly not this one. I don't know if that's some fundamental limitation or just something they didn't find important enough to spend time on.

Generally speaking this is behavior we seek to lint out from articles and remove support from in the core/legacy parser, rather than add workarounds to Parsoid. When we make exceptions it's usually only because reliance on the bad behavior is so widespread that deprecation will take a long time, not that the behavior won't eventually be deprecated and removed.

This is an instance of a general pattern of what I'll call "token gluing". The legacy parser operates in multiple phases, and so markup which is not a logical token in the first passes may get magically "glued together" and appear to work in the later passes of parsing. But this behavior is extremely fragile, and tends to introduce further edge cases and inconsistencies. For example, the legacy preprocessor is among the first passes, and because it doesn't recognize the glued tokens brace/bracket/tag matching will occur in a different way in later passes of the legacy parser. T54661 contained a large number of these oddities before I fixed them in the preprocessor; T172306 shows another similar case I haven't fixed yet. We're slowly getting rid of them. Parsoid uses a single-pass tokenizer, so generally doesn't expose token gluing.

The example can/should be trivially rewritten to avoid the use of token gluing. Instead of:

::{{#if:x|<span style="color:red;|<span style="}} font-weight: bold">red</span> test.

write

::<span style="{{#if:x|color:red;|}}font-weight: bold">red</span> test.

or

::{{#tag:span|red|style={{#if:x|color:red;|}}font-weight: bold}} test.

or even

::<span style="font-weight:bold"><span {{#if:x|style="color:red"}}>red</span></span> test.

@matmarex, @Whatamidoing-WMF, @cscott , @Esanders, and @ssastry: thank you for adding context in this ticket, and in other channels, about when you might expect to see wikitext like this [1] posted to talk pages and for outlining how Parsoid works in a way that explains the sequence outlined on-wiki here: Topic:Vcwvt3bq03o5gv8h.

Considering there is Parsoid-compatible syntax [2][3][4][5] that achieves the same output as the problematic string and we have yet to find examples where this string is being used in practice, I'm going to do two things:

  1. Near-term: remove this ticket as blocking the release of v1.0 of the replying tool (see: T235923)
  2. Longer-term: assuming we do not discover instances where the problematic string is being used on talk pages [6], decline this ticket.

  1. ⚠️ Problematic string: ::{{#if:x|<span style="color:red;|<span style="}} font-weight: bold">red</span> test.
  2. ::<span style="{{#if:x|color:red;|}}font-weight: bold">red</span> test.
  3. ::{{#tag:span|red|style={{#if:x|color:red;|}}font-weight: bold}} test.
  4. ::<span style="font-weight:bold"><span {{#if:x|style="color:red"}}>red</span></span> test.
  5. ::{{#if:x|<span style="color:red; font-weight: bold">|<span style="font-weight: bold">}}red</span> test.
  6. On-wiki discussion: Topic:Vcwvt3bq03o5gv8h
ppelberg edited projects, added VisualEditor; removed VisualEditor (Current work).
ppelberg moved this task from To Triage to Blocked on the VisualEditor board.