
Parsoid should represent arbitrary comment data in its DOM.
Closed, Resolved, Public

Description

Wikitext can represent arbitrary data inside a comment by the use of HTML entity escapes, for example:

<!----&gt;--> contains -->.

However, we can't represent that as an HTML comment. Currently (T94055) we sanitize the comment by converting a variety of bad characters to spaces, which then fails to round-trip (but selser, the selective serializer, usually saves us).

We can do better: we should HTML-entity escape the contents of the comment, so that the wikitext <!----&gt;--> serializes as the HTML <!--&#45;&#45;&gt;-->.

Note that, because the HTML5 parser ends a comment at the first --> and treats a comment that starts with > or -> as empty, we need to entity-escape both the dash character (-) and >. We must also escape &, since it is the escape character.
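The escaping rule described here can be sketched in a few lines. This is only an illustration in Python, not the code from the Parsoid patch: the function names encode_comment_data and decode_comment_data are hypothetical, and the choice of the decimal form &#45; for the dash simply follows the example in this description.

```python
import html


def encode_comment_data(text: str) -> str:
    """Escape text so it can be stored safely as HTML comment data.

    Hypothetical helper: escapes only '&', '-' and '>'.
    """
    # '&' must go first, otherwise the entities added below would be
    # escaped a second time.
    return (
        text.replace("&", "&amp;")
        .replace("-", "&#45;")
        .replace(">", "&gt;")
    )


def decode_comment_data(data: str) -> str:
    """Reverse encode_comment_data by decoding HTML character references."""
    # html.unescape makes a single pass, so "&amp;gt;" becomes "&gt;"
    # and is not decoded any further.
    return html.unescape(data)


encoded = encode_comment_data("-->")
print(encoded)                       # &#45;&#45;&gt;
print(decode_comment_data(encoded))  # -->
```

Because every & in the input is escaped, any & left in the encoded string starts one of the three entities above, so decoding is unambiguous. The encoded string also contains no literal - or >, so it cannot terminate the surrounding HTML comment early.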

Note also that this interpretation of HTML comment data is not strictly given in the HTML5 or DOM specs. The content of a comment node is a raw DOMString, available via Comment#data, and is not escaped or interpreted in any way, since it is not intended to be viewed or to have semantics. It might be truer to HTML semantics to use a <span typeof="mw:comment"> node for wikitext comments, in which case the entity escape mechanism would fall out more naturally. Still, using entity escapes in comment data appears to be a reasonable thing to do, and in tune with the spirit of the HTML and DOM specs. The cost is that the user must manually escape and unescape the comment contents, since the DOM does not provide a native means to do so.

Event Timeline

cscott claimed this task.
cscott raised the priority of this task from to Medium.
cscott updated the task description.
cscott added projects: Parsoid, Parsoid-DOM.
cscott added subscribers: cscott, Arlolra.

Oh -- since we use the length of comment nodes for DSR, the DSR code also needs to be tweaked to account for the escaping.
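To illustrate why the DSR computation is affected, here is a small Python sketch. It is not Parsoid's DSR code: the helper names are hypothetical, the encoder is the same three-character escape assumed elsewhere in this task, and the sample comment a-b is assumed to appear verbatim in the wikitext source.

```python
COMMENT_OPEN = "<!--"
COMMENT_CLOSE = "-->"


def encode_comment_data(text: str) -> str:
    # Hypothetical encoder: escape '&' first, then '-' and '>'.
    return (
        text.replace("&", "&amp;")
        .replace("-", "&#45;")
        .replace(">", "&gt;")
    )


def naive_width(dom_data: str) -> int:
    # Width derived from the length of the DOM comment node's data,
    # which is only correct when the data matches the source text.
    return len(COMMENT_OPEN) + len(dom_data) + len(COMMENT_CLOSE)


def source_width(wikitext_content: str) -> int:
    # Width of the comment as it actually appears in the wikitext source.
    return len(COMMENT_OPEN) + len(wikitext_content) + len(COMMENT_CLOSE)


wikitext_content = "a-b"  # source text: <!--a-b-->
dom_data = encode_comment_data(wikitext_content)

print(dom_data)                        # a&#45;b
print(naive_width(dom_data))           # 14
print(source_width(wikitext_content))  # 10
```

The escaped data is longer than the source text, so a width taken directly from the node's data overshoots. The offset has to be computed from the source form of the comment instead.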

Change 202058 had a related patch set uploaded (by Cscott):
T95039: encode/decode arbitrary data in comments.

https://gerrit.wikimedia.org/r/202058

Change 202058 merged by jenkins-bot:
T95039: encode/decode arbitrary data in comments.

https://gerrit.wikimedia.org/r/202058

ssastry removed a project: Patch-For-Review.
ssastry set Security to None.
ssastry removed a subscriber: gerritbot.