
Refactor sanitizer to work on the DOM too
Open, Medium, Public

Description

Currently we always serialize to wikitext and re-parse that to HTML, which runs the sanitizer on the token stream to ensure that our final HTML does not cause bad things to happen.

Soon both we and the Flow team want to store HTML from the VisualEditor directly, without first serializing to wikitext. This means that we need to perform the sanitization on the HTML instead of the token stream. For performance, sanitizing on the way in would be preferable. We should, however, support re-sanitization when new issues are discovered. This could potentially be coupled with the versioning discussed in bug 52937. A new sanitizer could bump the version number, and the upgrade path would then run the new sanitizer on old HTML (and probably update the storage with the newly sanitized version).
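
A minimal sketch of how such a version-gated upgrade path might look, written in Python purely for illustration. The names `StoredHtml`, `SANITIZER_VERSION` and `load_sanitized` are hypothetical and not part of Parsoid or of the design in bug 52937; the sanitizer itself is passed in as a plain function.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical version counter: bumped whenever the sanitizer rules change.
SANITIZER_VERSION = 2


@dataclass
class StoredHtml:
    """Stored HTML plus the sanitizer version that last processed it."""
    html: str
    sanitizer_version: int


def load_sanitized(record: StoredHtml,
                   sanitize: Callable[[str], str],
                   current_version: int = SANITIZER_VERSION) -> StoredHtml:
    """Return a record that is sanitized at least at current_version.

    Records already at the current version are returned untouched, so the
    normal read path pays no sanitization cost. Older records are run
    through the new sanitizer; the caller may then write the result back
    to storage.
    """
    if record.sanitizer_version < current_version:
        return StoredHtml(sanitize(record.html), current_version)
    return record
```

With this shape, content is sanitized once on the way in, and re-sanitized lazily only after a version bump.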


Version: unspecified
Severity: normal

Details

Reference
bz52941

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:51 AM
bzimport added a project: Parsoid-DOM.
bzimport set Reference to bz52941.
marcoil added a project: Parsoid.
marcoil set Security to None.

Note that tokens are equivalent to SAX events. This suggests the following option for hooking up the current sanitizer to process a DOM:

  • Traverse the input DOM and emit SAX events
  • Feed the SAX events through the sanitizer
  • Feed the resulting tokens / SAX events back to an HTML / XML DOM tree builder
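
The three steps above can be sketched roughly as follows. This is an illustration in Python using the standard library's `xml.dom.minidom`, not Parsoid code. The event tuples, the `ALLOWED_TAGS` / `ALLOWED_ATTRS` whitelists and all function names are assumptions made for the example, and the filtering rules are deliberately simplistic compared to the real sanitizer.

```python
from xml.dom import Node, minidom

# Hypothetical whitelists, for illustration only.
ALLOWED_TAGS = {"div", "p", "b", "i", "a", "span"}
ALLOWED_ATTRS = {"href", "title", "class"}


def dom_to_events(node):
    """Step 1: traverse a DOM subtree and yield SAX-like events."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield ("start", node.tagName, dict(node.attributes.items()))
        for child in node.childNodes:
            yield from dom_to_events(child)
        yield ("end", node.tagName)
    elif node.nodeType == Node.TEXT_NODE:
        yield ("text", node.data)
    # Other node types (comments etc.) are dropped in this sketch.


def sanitize_events(events):
    """Step 2: filter the event stream, one event at a time."""
    skip_depth = 0  # > 0 while inside a disallowed element
    for ev in events:
        kind = ev[0]
        if skip_depth:
            # Track nesting so the whole disallowed subtree is dropped.
            if kind == "start":
                skip_depth += 1
            elif kind == "end":
                skip_depth -= 1
            continue
        if kind == "start":
            _, tag, attrs = ev
            if tag.lower() not in ALLOWED_TAGS:
                skip_depth = 1
                continue
            clean = {}
            for name, value in attrs.items():
                if name.lower() not in ALLOWED_ATTRS:
                    continue
                if (name.lower() == "href"
                        and value.strip().lower().startswith("javascript:")):
                    continue
                clean[name] = value
            yield ("start", tag, clean)
        else:
            yield ev


def events_to_dom(events):
    """Step 3: rebuild a DOM from the (sanitized) event stream."""
    doc = minidom.Document()
    stack = [doc]
    for ev in events:
        if ev[0] == "start":
            el = doc.createElement(ev[1])
            for name, value in ev[2].items():
                el.setAttribute(name, value)
            stack[-1].appendChild(el)
            stack.append(el)
        elif ev[0] == "end":
            stack.pop()
        else:
            stack[-1].appendChild(doc.createTextNode(ev[1]))
    return doc


def sanitize_dom(doc):
    """Chain the three steps for a parsed document."""
    return events_to_dom(sanitize_events(dom_to_events(doc.documentElement)))
```

Because the middle step only sees a stream of events, an existing token-based sanitizer could in principle be dropped in there unchanged; the cost is one extra traversal and one tree rebuild per document.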