
Fix mw:DisplaySpace to match PHP "armorFrenchSpaces"
Open, Normal · Public


We handle "space before colon" as a special case in urltext in Parsoid's pegTokenizer.pegjs, but that is fundamentally incorrect: the mw:DisplaySpace is really a result of "french space armoring"; see Id8cdb887182f346acab2d108836ce201626848af, T5158: Parser inserts invalid &nbsp; in the middle of style attribute (French spaces), and T13874: Enforced &nbsp; breaks inline CSS with !important.

We should update Parsoid to match Sanitizer::armorFrenchSpaces().
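For reference, here is a minimal Python sketch of the armoring rule as I understand it. It is not a copy of Sanitizer::armorFrenchSpaces(): the two patterns (a space before one of `? : ; ! % » ›` when that character is not followed by a word character, and a space after `«` or `‹`) are my assumption about what the PHP code does, and the name `armor_french_spaces` is hypothetical. Python's `\w` may also differ from PCRE's at the edges.

```python
import re

NBSP = "\u00a0"

# Assumed rule 1: a space before French punctuation, unless that
# punctuation is immediately followed by a word character.
_BEFORE_PUNCT = re.compile(r" (?=[?:;!%»›](?!\w))")
# Assumed rule 2: a space right after an opening guillemet.
_AFTER_GUILLEMET = re.compile(r"([«‹]) ")


def armor_french_spaces(text: str, space: str = NBSP) -> str:
    """Replace 'French' spaces in text with a non-breaking space."""
    # Callables are used as replacements so that `space` is inserted literally.
    text = _BEFORE_PUNCT.sub(lambda m: space, text)
    return _AFTER_GUILLEMET.sub(lambda m: m.group(1) + space, text)
```

Note that this rule is not limited to colons, which is why the tokenizer special case only covers part of it. Under the assumed patterns, `color:red !important` is left alone because the `!` is followed by a word character.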

Event Timeline

cscott created this task. Jun 21 2018, 3:34 PM
Restricted Application added a subscriber: Aklapper. Jun 21 2018, 3:34 PM
ssastry triaged this task as Normal priority. Jun 26 2018, 3:38 PM
ssastry moved this task from Backlog to Read Views on the Parsoid board.
Vvjjkkii renamed this task from Fix mw:DisplaySpace to match PHP "armorFrenchSpaces" to oiaaaaaaaa. Jul 1 2018, 1:02 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from oiaaaaaaaa to Fix mw:DisplaySpace to match PHP "armorFrenchSpaces". Jul 2 2018, 7:29 AM
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Presumably T106561 is related.

@ssastry suggested in comments on that Parsoid should probably do this as a post-processing step on the DOM, instead of trying to do this in the tokenizer. That sounds reasonable to me.
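To make the DOM post-processing idea concrete, here is a rough Python sketch using `xml.dom.minidom` as a stand-in for Parsoid's DOM. The armoring patterns are my assumption about the PHP rule, and `armor` and `armor_text_nodes` are hypothetical names, not Parsoid APIs. The point is only that a pass over text nodes never touches attribute values, which is the failure mode described in T5158 and T13874.

```python
import re
from xml.dom import minidom

NBSP = "\u00a0"
# Assumed armoring patterns; see Sanitizer::armorFrenchSpaces() for the real ones.
_BEFORE_PUNCT = re.compile(r" (?=[?:;!%»›](?!\w))")
_AFTER_GUILLEMET = re.compile(r"([«‹]) ")


def armor(text):
    """Apply the assumed French-space rule to a plain string."""
    text = _BEFORE_PUNCT.sub(NBSP, text)
    return _AFTER_GUILLEMET.sub(lambda m: m.group(1) + NBSP, text)


def armor_text_nodes(node):
    """Recursively armor text nodes; attributes are never visited."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            child.data = armor(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            armor_text_nodes(child)


doc = minidom.parseString('<p style="margin: 0 ;">Qui : <b>moi !</b></p>')
root = doc.documentElement
armor_text_nodes(root)
```

After the pass, the two text nodes contain non-breaking spaces while the `style` attribute is unchanged. This sketch looks at each text node in isolation, so a space and its punctuation that sit in adjacent nodes (for example across a `<b>` boundary) would not be armored; a real pass would have to decide how to treat that case.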

From T106561, it appears that our strategy so far (for the space before colons) ends up leaving empty span tags in some VE edits, probably related to some copy/paste operation that isn't preserving the "this is a french space" metadata.

We could potentially run this post-processing step "in reverse" on the DOM before html2wt, removing explicit &nbsp; where they would be inserted automatically by the french-space algorithm. Then we wouldn't have to add explicit <span> tags at all, except perhaps in unusual corner cases (TBD what those might be).
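A sketch of what running the step "in reverse" could look like, in Python. Both directions are shown so the round trip can be checked. The patterns are my assumption about the armoring rule, and `armor` and `unarmor` are hypothetical names. The reverse step only turns a non-breaking space back into a plain space where the forward step would re-insert it, so any other non-breaking space survives and would still need an explicit &nbsp; in the wikitext.

```python
import re

NBSP = "\u00a0"
PUNCT_AHEAD = r"(?=[?:;!%»›](?!\w))"  # assumed armoring context

_ARMOR_BEFORE = re.compile(" " + PUNCT_AHEAD)
_ARMOR_AFTER = re.compile(r"([«‹]) ")
_UNARMOR_BEFORE = re.compile(NBSP + PUNCT_AHEAD)
_UNARMOR_AFTER = re.compile(r"([«‹])" + NBSP)


def armor(text):
    """wt2html direction: plain space -> non-breaking space."""
    text = _ARMOR_BEFORE.sub(NBSP, text)
    return _ARMOR_AFTER.sub(lambda m: m.group(1) + NBSP, text)


def unarmor(text):
    """html2wt direction: drop only the nbsp that armor() would re-add."""
    text = _UNARMOR_BEFORE.sub(" ", text)
    return _UNARMOR_AFTER.sub(lambda m: m.group(1) + " ", text)


rendered = "«\u00a0Qui\u00a0: moi\u00a0!\u00a0»"
wikitext = unarmor(rendered)    # plain spaces again
round_trip = armor(wikitext)    # same as `rendered`
```

One corner case this makes visible: if the original wikitext contained an explicit &nbsp; in an armored position, the reverse step would serialize it as a plain space. The rendering is unchanged, but the wikitext is normalized, which may or may not be acceptable as a dirty diff.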

Reedy edited projects, added Parsoid-Read-Views; removed Parsoid. Sep 17 2018, 7:25 PM
cscott added a comment. Edited Oct 4 2019, 4:53 PM

To restate, I'm proposing that we take the DisplaySpace hack *out* of the tokenizer and instead run it as a DOMPostProcessor pass, with a corresponding preprocessor on the html2wt side to reverse that transformation.

Depending on details of the implementation, we might not even need to surround the &nbsp; with a special <span> at all to have it be reversible.