Page MenuHomePhabricator

Fix mw:DisplaySpace to match PHP "armorFrenchSpaces"
Closed, ResolvedPublic

Description

We handle "space before colon" as a special case in urltext in Parsoid's pegTokenizer.pegjs, but that's actually fundamentally incorrect: the mw:DisplaySpace is actually a result of "french space armoring"; see Id8cdb887182f346acab2d108836ce201626848af and T5158: Parser inserts invalid   in the middle of style attribute (French spaces)/T13874: Enforced   breaks inline CSS with !important.

We should update Parsoid to match Sanitizer::armorFrenchSpaces().

Event Timeline

ssastry moved this task from Needs Triage to Read Views on the Parsoid board.
Vvjjkkii renamed this task from Fix mw:DisplaySpace to match PHP "armorFrenchSpaces" to oiaaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from oiaaaaaaaa to Fix mw:DisplaySpace to match PHP "armorFrenchSpaces".Jul 2 2018, 7:29 AM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

@ssastry suggested in comments on https://gerrit.wikimedia.org/r/441410 that Parsoid should probably do this as a post-processing step on the DOM, instead of trying to do this in the tokenizer. That sounds reasonable to me.

From T106561, it appears that our strategy so far (for the space before colons) ends up leaving empty span tags in some VE edits, probably related to some copy/paste operation that isn't preserving the "this is a french space" metadata.

We could potentially run this post-processing step "in reverse" on the DOM before html2wt, removing explicit &nbsp; where they would be inserted automatically by the french-space algorithm. Then we wouldn't have to add explicit <span> tags at all, except perhaps in unusual corner cases (TBD what those might be).

To restate, I'm proposing that we take the DisplaySpace hack *out* of the tokenizer, and instead run it as a DOMPostProcessor pass, with a corresponding preprocessor in the html2wt side to reverse that transformation.

Depending on details of the implementation, we might not even need to surround the &nbsp; with a special <span> at all to have it be reversible.

This is actually missing functionality in Parsoid; the core feature is "correct" and Parsoid just implemented a broken misunderstanding of how it worked. Core was improved and made more selective about the french-spacing transformation in T197902.

Change 591224 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Move armoring french spaces to a post-processing pass

https://gerrit.wikimedia.org/r/591224

Change 593016 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/core@master] Move french space armoring after doBlockLevels

https://gerrit.wikimedia.org/r/593016

Change 593017 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/core@master] Move french space armoring below language conversion

https://gerrit.wikimedia.org/r/593017

Change 593016 merged by jenkins-bot:
[mediawiki/core@master] Move french space armoring after doBlockLevels

https://gerrit.wikimedia.org/r/593016

Change 593017 merged by jenkins-bot:
[mediawiki/core@master] Move french space armoring below language conversion

https://gerrit.wikimedia.org/r/593017

Change 591224 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move armoring french spaces to a post-processing pass

https://gerrit.wikimedia.org/r/591224

Change 603571 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571

Change 603571 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571