Page MenuHomePhabricator

Fix mw:DisplaySpace to match PHP "armorFrenchSpaces"
Closed, ResolvedPublic

Description

We handle "space before colon" as a special case in urltext in Parsoid's pegTokenizer.pegjs, but that's actually fundamentally incorrect: the mw:DisplaySpace is actually a result of "french space armoring"; see Id8cdb887182f346acab2d108836ce201626848af and T5158: Parser inserts invalid   in the middle of style attribute (French spaces)/T13874: Enforced   breaks inline CSS with !important.

We should update Parsoid to match Sanitizer::armorFrenchSpaces().

Event Timeline

cscott created this task.Jun 21 2018, 3:34 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 21 2018, 3:34 PM
ssastry triaged this task as Medium priority.Jun 26 2018, 3:38 PM
ssastry moved this task from Needs Triage to Read Views on the Parsoid board.
Vvjjkkii renamed this task from Fix mw:DisplaySpace to match PHP "armorFrenchSpaces" to oiaaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from oiaaaaaaaa to Fix mw:DisplaySpace to match PHP "armorFrenchSpaces".Jul 2 2018, 7:29 AM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

Presumably T106561 is related

@ssastry suggested in comments on https://gerrit.wikimedia.org/r/441410 that Parsoid should probably do this as a post-processing step on the DOM, instead of trying to do this in the tokenizer. That sounds reasonable to me.

From T106561, it appears that our strategy so far (for the space before colons) ends up leaving empty span tags in some VE edits, probably related to some copy/paste operation that isn't preserving the "this is a french space" metadata.

We could potentially run this post-processing step "in reverse" on the DOM before html2wt, removing explicit &nbsp; where they would be inserted automatically by the french-space algorithm. Then we wouldn't have to add explicit <span> tags at all, except perhaps in unusual corner cases (TBD what those might be).

Reedy edited projects, added Parsoid-Read-Views; removed Parsoid.Sep 17 2018, 7:25 PM
cscott added a comment.EditedOct 4 2019, 4:53 PM

To restate, I'm proposing that we take the DisplaySpace hack *out* of the tokenizer, and instead run it as a DOMPostProcessor pass, with a corresponding preprocessor in the html2wt side to reverse that transformation.

Depending on details of the implementation, we might not even need to surround the &nbsp; with a special <span> at all to have it be reversible.

Ltrlg added a subscriber: Ltrlg.Nov 22 2019, 4:59 PM
Aklapper edited projects, added Parsoid; removed Parsoid-Read-Views.Feb 29 2020, 5:14 PM
ssastry moved this task from Read Views to Known Differences on the Parsoid board.Feb 29 2020, 5:32 PM

This is actually missing functionality in Parsoid; the core feature is "correct" and Parsoid just implemented a broken misunderstanding of how it worked. Core was improved and made more selective about the french-spacing transformation in T197902.

Arlolra claimed this task.Apr 20 2020, 3:02 PM

Change 591224 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Move armoring french spaces to a post-processing pass

https://gerrit.wikimedia.org/r/591224

Change 593016 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/core@master] Move french space armoring after doBlockLevels

https://gerrit.wikimedia.org/r/593016

Change 593017 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/core@master] Move french space armoring below language conversion

https://gerrit.wikimedia.org/r/593017

Change 593016 merged by jenkins-bot:
[mediawiki/core@master] Move french space armoring after doBlockLevels

https://gerrit.wikimedia.org/r/593016

Change 593017 merged by jenkins-bot:
[mediawiki/core@master] Move french space armoring below language conversion

https://gerrit.wikimedia.org/r/593017

Change 591224 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move armoring french spaces to a post-processing pass

https://gerrit.wikimedia.org/r/591224

Arlolra closed this task as Resolved.Jun 5 2020, 12:29 AM

Change 603571 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571

Change 603571 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571