
Close tags are stripped
Closed, Declined (Public)

Description

$ echo '</b>' | php maintenance/parse.php
<p>&lt;/b&gt;
</p>
$ echo '</b>' | tests/parse.js
<body data-parsoid="{}"><p data-parsoid='{"dsr":[0,4,0,0]}'></p>
</body>

The lonely close tag is stripped by Parsoid, but it is sanitized (treated as literal non-markup text) by the PHP parser.


Version: unspecified
Severity: normal

Details

Reference
bz52760

Event Timeline

bzimport raised the priority of this task to Low. (Nov 22 2014, 2:08 AM)
bzimport added a project: Parsoid.
bzimport set Reference to bz52760.

[subbu@earth tests] echo "</b>" | node parse --editMode false
<body data-parsoid="{}"><p data-parsoid='{"dsr":[0,4,0,0]}'><meta typeof="mw:Placeholder/StrippedTag" data-parsoid='{"src":"</b>","name":"B","dsr":[0,4,null,null]}'></p>
</body>

We just need to find a way of converting these non-editmode stripped tags into plain text in certain situations.

(02:15:50 PM) subbu: that is because the tree builder removes it.
(02:16:06 PM) subbu: and we recognize the stripping with a dom analysis and add that meta-tag.
(02:16:23 PM) subbu: but we could instead add a text-version of the stripped tag in some cases like this.
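The idea in the chat excerpt above could be sketched roughly as follows. This is a minimal illustration, not Parsoid's implementation: the DOM is modeled as plain objects, and the function name `strippedTagsToText` and the node shape are hypothetical. Only the `mw:Placeholder/StrippedTag` typeof and the `src` field of `data-parsoid` are taken from the output shown above.

```javascript
// Hypothetical node shapes (assumptions for this sketch):
//   { type: 'element', name, attrs, children }
//   { type: 'text', value }
function strippedTagsToText(node) {
  if (node.type !== 'element') {
    return node; // text nodes are returned unchanged
  }
  const children = node.children.map((child) => {
    const isStrippedTagMeta =
      child.type === 'element' &&
      child.name === 'meta' &&
      child.attrs.typeof === 'mw:Placeholder/StrippedTag';
    if (isStrippedTagMeta) {
      // data-parsoid records the original source of the stripped tag in "src".
      const dp = JSON.parse(child.attrs['data-parsoid']);
      return { type: 'text', value: dp.src };
    }
    return strippedTagsToText(child); // recurse into other elements
  });
  return { ...node, children };
}
```

Whether and when such a conversion should happen ("in certain situations") is left open here; the sketch converts every placeholder it finds.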

Change 78842 had a related patch set uploaded by Cscott:
Improve parser test for bug 52760 (close tags are being stripped).

https://gerrit.wikimedia.org/r/78842

(In reply to comment #1)

> [subbu@earth tests] echo "</b>" | node parse --editMode false
> <body data-parsoid="{}"><p data-parsoid='{"dsr":[0,4,0,0]}'><meta typeof="mw:Placeholder/StrippedTag" data-parsoid='{"src":"</b>","name":"B","dsr":[0,4,null,null]}'></p>
> </body>
>
> We just need to find a way of converting these non-editmode stripped tags into plain text in certain situations.

In my testing that is not what the PHP parser & tidy are doing, so this would be a change of content semantics.

Cleaning up stray close tags when nearby content is edited is a good thing in my opinion. Selective serialization ensures that end tags in unmodified parts of the page are preserved to avoid dirty diffs. Simply re-inserting stray end tags based on StrippedTag info is not safe in the presence of editing, and making it safe would add a lot of complexity for little gain.

For these reasons I am closing this as WONTFIX. Please reopen this bug if there are cases where the PHP parser renders stray end tags as text, but we don't.
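To illustrate the selective serialization argument above: unmodified nodes reuse their original source range, so a stray end tag inside an untouched region survives without any StrippedTag-based re-insertion. This is only a sketch under simplifying assumptions, not Parsoid's serializer: the `modified` flag, the `selectiveSerialize` name, and the `serialize` callback are hypothetical, and `dsr` is reduced to its first two entries (start and end source offsets).

```javascript
// Sketch: rebuild wikitext from top-level nodes, reusing original source
// for nodes that were not edited. Assumes each node carries dsr = [start, end]
// and that the nodes cover the source contiguously.
function selectiveSerialize(nodes, originalSrc, serialize) {
  return nodes
    .map((node) => {
      if (!node.modified && node.dsr) {
        // Unmodified: copy the original source verbatim (no dirty diff).
        return originalSrc.slice(node.dsr[0], node.dsr[1]);
      }
      // Modified: serialize from the DOM; stray end tags here are dropped.
      return serialize(node);
    })
    .join('');
}
```

For example, with source `x</b>\n\ny`, an unmodified first paragraph spanning offsets 0 to 7 is emitted with its `</b>` intact, while an edited second paragraph is regenerated by the `serialize` callback.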

Reopening. See the bug description for an example, as well as https://gerrit.wikimedia.org/r/78842

Interesting ... so, tidy bites us again? http://en.wikipedia.org/wiki/User:Ssastry/bug_52760 says that gwicke is right.

Huh, weird. The PHP parser is definitely emitting the escaped text. How is tidy getting to it to remove it? Hmm.

According to gwicke, "there is a different PHP cleanup pass in the parser that might do the &lt; escaping. that pass is enabled when tidy is not enabled."
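The observable effect of such an escaping pass on the example from the bug description can be sketched as below. This is not the PHP parser's code, which is not shown in this thread; `escapeStrayCloseTags` is a hypothetical name, and the sketch ignores void elements, comments, and attribute values containing `<` or `>`.

```javascript
// Sketch: escape close tags that have no matching open tag, leave the rest alone.
function escapeStrayCloseTags(html) {
  const open = []; // stack of currently open tag names
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>/g;
  return html.replace(tagPattern, (match, slash, name, rest) => {
    const tag = name.toLowerCase();
    if (!slash) {
      // Open tag: remember it unless it is self-closing (<foo/>).
      if (!rest.trim().endsWith('/')) {
        open.push(tag);
      }
      return match;
    }
    const idx = open.lastIndexOf(tag);
    if (idx === -1) {
      // No matching open tag: emit the close tag as literal text.
      return match.replace(/</g, '&lt;').replace(/>/g, '&gt;');
    }
    open.length = idx; // close this tag and anything still open inside it
    return match;
  });
}
```

Given the input `</b>`, this yields `&lt;/b&gt;`, matching the `parse.php` output in the description; a balanced `<b>x</b>` passes through unchanged.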

Parsoid attempts to be consistent with the tidy-enabled behavior of the PHP parser. See bug 52899 for a better way to document/enforce these behaviors in parserTests.

Change 78842 merged by jenkins-bot:
Improve parser test for bug 52760 (close tags are being stripped).

https://gerrit.wikimedia.org/r/78842