Page MenuHomePhabricator

Protect Parsoid-generated token-names, about-ids, typeofs from conflicts with user-provided names.
Open, MediumPublic


User can enter html tags or provide tag attributes or attribute values that Parsoid (and downstream clients) treat specially.

Ex: #mwt3 for about-ids, or mw:Object/* for typeofs or <template> tags or any of the special attributes/types found on

If Parsoid doesn't detect these and escape/handle them somehow, Parsoid will mislead clients like VisualEditor and/or incorrectly serialize them.

Version: unspecified
Severity: normal



Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:27 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz48772.

There are three interconnected issues:

  1. <span prefix="mw: http://evil.bad">

is valid wikitext, which would create malformed Parsoid DOM. We should sanitize the wikitext (but that has to happen *before* we create the DOM, since otherwise we can't tell which prefix attributes are good and which are evil.)

  1. VE needs to prevent users from authoring content which sets prefix attributes, etc. Currently it does so, but it would be nice to make Parsoid more robust against malformed DOM, and/or to add layers of protection so that front ends aren't solely responsible for sanitizing user input.
  1. Longer term we should probably think about use cases where the user wants to deliberately author RDFa markup on their content, and ensure that they are able to do so in a safe way.

This bug is primarily about #1 (the short term issue) and I'll tackle it tomorrow.

  1. Is not an issue, as prefix is not allowed in the PHP sanitizer (and not in our sanitizer either).
  1. Is something for later. The risk here is mainly crashes during serialization.
  1. Should be supported. We only want to protect our own values where necessary. The typeof attribute for example is multi-valued, so we only need to strip mw:-prefixed user-supplied values. The about attribute on the other hand is single-valued, so we need to override user-supplied values unconditionally where necessary.

[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]

Change 74586 had a related patch set uploaded by Arlolra:
Protect Parsoid-generated attributes.

[Discussed this on IRC with subbu]

To support case 3 from comment 1 and 2, rather than *strip* the attributes, I suggest encapsulating them instead. For example, if the user authors:

<div about="" typeof="foaf:Person">

<span property="foaf:name">Albert Einstein</span>
<span property="foaf:givenName">Albert</span>


..instead of stripping the RDFa, we're just going to rename the attributes, to:

<div data-user-about="" data-user-typeof="foaf:Person">

<span data-user-property="foaf:name">Albert Einstein</span>
<span data-user-property="foaf:givenName">Albert</span>


That ensures that the user content is safe for Parsoid and VE to manipulate. In the future we can re-enable the RDFa in the output with a DOM post-processing ("render") pass. For example, the user content:

<div about="" typeof="foaf:Person">
{{Albert Einstein}}

might have all sorts of Parsoid RDFa markup inside the {{Albert Einstein}} expansion, which the about="" attribute would screw up the context for. But for rendering to the web, the Parsoid cruft could be elided and the user's RDFa markup unwrapped.

If we do go that route, the fixup for rendering would have to be in client-side JS since we may not use different cached content for editing vs rendering -- or so it seems to me at this time without any sense of how user RDFa markup will need to be supported in MW content.

Or we could write a DOM postprocessor stage that was a bit smarter about which RDFa it unwrapped. ie, it would unwrap the first example from comment 5, but not the final example there, since it would be able to see that there's parsoid RDFa inside the div which would be broken. Or other options.

Change 74586 merged by jenkins-bot:
Protect Parsoid-generated attributes.

Patch landed as suggested in comment 5. However, Subbu mentioned that, "we still need to deal with complexity that templates bring."

Ex: <div {{echo|about}}="foo">bar</div>

One more thing that would be good to fix:

Currently, the tokenizer unconditionally escapes typeof and about attributes to data-x-typeof and data-x-about. But, if RDFa attrs from wikitext will also get escaped in this manner. Not sure if RDFa is actually used in wikitext, but that support does exist in MW. So, the escaping could be refined to leave alone any typeof and about attrs that wont confuse Parsoid and clients.

Related FIXME comments from the sanitizer (ext.core.Sanitizer.js) from that I am pasting here:

// SSS FIXME: There is a test in mediawiki.environment.js that doles out
// and tests about ids. There are probably some tests in mediawiki.Util.js
// as well. We should move all these kind of tests somewhere else.
// Bypass RDFa/whitelisting checks for Parsoid-inserted attrs
// Safe to do since the tokenizer renames about/typeof attrs.
// unconditionally. FIXME: The escaping solution in tokenizer
// maybe aggressive. There is no need to escape typeof strings
// that or about ids that don't resemble Parsoid types/about ids.

Any fix made here should also verify that the sanitization code in sanitizeTagAttrs properly strips dangerous values from unescaped typeofs even when Parsoid updates them with parsoid typeofs.

Ex: <tag typeof="dangerous-text">..</tag> gets updated by Parsoid parser to <tag typeof="mw:ParsoidTypeof dangerous-text">..</tag> which should then be handled by the sanitizer to yield <tag typeof="mw:ParsoidTypeof">...</tag>
Arlolra set Security to None.