Page MenuHomePhabricator

HTML entities in CommentNode must be manually escaped/unescaped
Closed, ResolvedPublic1 Estimated Story Points

Description

In order to allow "bad" sequences like "-->" in comment data, the contents of a CommentNode should have HTML entities unescaped for display, and then escaped before serialization.

See T95039 for more details and context.

Concretely, a wikitext comment like:

<!-- hello & --&lt; bad contents, dude! -->

will be represented by Parsoid in its DOM as:

<!-- hello &amp; &#45;&#45;&lt; bad contents, dude! -->

But the DOM CommentNode#data property gives the raw DOMString. The HTML entities need to be manually unescaped before displaying the comment contents to the user, and then the characters -, >, and & must be entity-escaped before giving any edited comment contents to document.createComment.

(Note that - is usually considered an HTML-safe character, and may not be escaped by a standard minimal entity encoder.)

Event Timeline

cscott assigned this task to Esanders.
cscott raised the priority of this task from to Medium.
cscott updated the task description. (Show Details)
cscott added a project: VisualEditor.
cscott added subscribers: Arlolra, cscott, Aklapper, Catrope.
cscott set Security to None.

Once T95039 is merged, - will start appearing as &#x2D; in comments until VisualEditor is updated. > and & as well, but they are perhaps less common.

cscott raised the priority of this task from Medium to High.Apr 6 2015, 8:17 PM
cscott added subscribers: Nikerabbit, santhosh.

Raised priority because of user-visible entity escaping which would be visible once T95039 lands.

@Catrope, @Esanders, do we need to hold off deploying T95039 until this task is complete in VE?

@Nikerabbit, @santhosh: This change affects you as well, if you care about what comments look like.

We should be covered by the VE fix. I don't think we handle or display comments anywhere else in Flow.

Raised priority because of user-visible entity escaping which would be visible once T95039 lands.

@Catrope, @Esanders, do we need to hold off deploying T95039 until this task is complete in VE?

Yes.

Change 202311 had a related patch set uploaded (by Cscott):
Encode and decode HTML entities in comment nodes

https://gerrit.wikimedia.org/r/202311

@Nikerabbit, @santhosh: This change affects you as well, if you care about what comments look like.

We do not care about comments as of now.

Change 202311 merged by jenkins-bot:
Encode and decode HTML entities in comment nodes

https://gerrit.wikimedia.org/r/202311

Jdforrester-WMF renamed this task from HTML entities in CommentNode must be manually escaped/unescaped. to HTML entities in CommentNode must be manually escaped/unescaped.Apr 8 2015, 7:39 AM
Jdforrester-WMF closed this task as Resolved.
Jdforrester-WMF edited a custom field.
Jdforrester-WMF moved this task from Nominated to Done on the VisualEditor 2014/15 Q4 blockers board.

@cscott you mention that <!-- foo -- bar --> is invalid XML (which is annoying, but I believe you), but is <!-- foo - bar --> invalid? If single hyphens are allowed then we should allow them too as they are going to be far more common than the double hyphen and as the comment is going to be read in source mode as well we should avoid escaping where possible.

Change 202711 had a related patch set uploaded (by Esanders):
Don't encoding single hyphens in comments, just doubles

https://gerrit.wikimedia.org/r/202711

From what I can see single hyphens are valid so I have provided the above fix.

Change 202711 merged by jenkins-bot:
Encode as few characters in comments as possible

https://gerrit.wikimedia.org/r/202711

T95039 is now deployed in Parsoid.

If a few encoded characters in comments sufficiently annoy your users, you might cherry-pick https://gerrit.wikimedia.org/r/202311 for earlier deploy.

https://gerrit.wikimedia.org/r/202711 isn't worth cherry-picking, it is only relevant for standalone VE.