Maniphest T197902

Be more selective in applying French Space armoring
Closed, Resolved · Public

Assigned To

Authored By

	cscott
	Jun 21 2018, 8:37 PM

Description

French spacing is described by (e.g.) https://www.iwillteachyoualanguage.com/learn/french/french-tips/french-punctuation and https://fr.wikipedia.org/wiki/Ponctuation#En_fran%C3%A7ais

MediaWiki currently tries to ensure that the space added before a punctuation mark is a non-breaking space, but the regular expression it uses is too broad and introduces errors (cf. T5158, T13874).

This task is to improve the rules (adding additional punctuation marks used, e.g., in Swiss French) and to make the armoring more selective, so that it does not apply in situations where "French spacing" is clearly not the intent.
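
For illustration, here is a minimal sketch of the more selective behaviour, using the regular expression quoted in the comments further down this task. The function name `armorFrenchSpacesSketch` is hypothetical and this is not MediaWiki's actual implementation; it only shows that a space before `?` is armored while a space before `!important` (punctuation followed by a word character) is left alone.

```php
<?php
// Hypothetical helper for illustration only, not MediaWiki's real API.
// Replaces the plain space before French punctuation with a non-breaking
// space entity, unless the punctuation is directly followed by a word character.
function armorFrenchSpacesSketch( string $text, string $space = '&#160;' ): string {
    // (\S)      : the space must follow a non-space character
    // (?=...)   : and precede one of ? : ; ! % » ›
    // (?!\w)    : which must NOT be followed by a word character
    return preg_replace( '/(\S) (?=[?:;!%»›](?!\w))/u', "\\1$space", $text ) ?? $text;
}

echo armorFrenchSpacesSketch( 'Vraiment ?' ), "\n";            // Vraiment&#160;?
echo armorFrenchSpacesSketch( 'color: red !important' ), "\n"; // color: red !important
```

The second call is the kind of input behind T5158 and T13874: the `!` is followed by a word character, so the space before it stays a plain space.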

Details

	Subject	Repo	Branch	Lines +/-
	Improve efficiency of french-spacing regexp	mediawiki/core	master	+1 -1
	Don't armor french spaces before punctuation followed by word characters	mediawiki/core	master	+65 -29

Related Objects

Mentioned In: T299478: VisualEditor should display automatically generated non-breaking spaces (support French spacing)
T14752: Space before/after »guillemets« (»/«) converted to non-breaking space (&nbsp;) (French spaces)
T222266: Edge case difference processing templated styles in table cells
T90902: Non-breaking space in header ID breaks anchor
T181441: Percent symbol not preceded by non-breaking space.
T197879: Fix mw:DisplaySpace to match PHP "armorFrenchSpaces"
Mentioned Here: T5158: Parser inserts invalid &nbsp; in the middle of style attribute (French spaces)
T13874: Enforced &nbsp; breaks inline CSS with !important

Event Timeline

cscott created this task. Jun 21 2018, 8:37 PM

Restricted Application added a subscriber: Aklapper. Jun 21 2018, 8:37 PM

matmarex added a project: MediaWiki-Parser. Jun 22 2018, 12:16 AM

Change 441410 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

gerritbot added a project: Patch-For-Review. Jun 22 2018, 8:03 PM

cscott mentioned this in T197879: Fix mw:DisplaySpace to match PHP "armorFrenchSpaces". Jun 22 2018, 10:03 PM

Framawiki subscribed. Jun 23 2018, 5:02 PM

• Vvjjkkii renamed this task from Be more selective in applying French Space armoring to 1haaaaaaaa. Jul 1 2018, 1:03 AM

• Vvjjkkii triaged this task as High priority.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed subscribers: gerritbot, Aklapper.

CommunityTechBot renamed this task from 1haaaaaaaa to Be more selective in applying French Space armoring. Jul 2 2018, 11:11 AM

CommunityTechBot raised the priority of this task from High to Needs Triage.

CommunityTechBot updated the task description.

CommunityTechBot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

CommunityTechBot added subscribers: gerritbot, Aklapper.

Change 441410 merged by jenkins-bot:
[mediawiki/core@master] Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)). Jul 13 2018, 6:00 PM

cscott closed this task as Resolved. Aug 3 2018, 9:34 PM

cscott claimed this task.

cscott mentioned this in T181441: Percent symbol not preceded by non-breaking space. Aug 21 2018, 10:15 PM

cscott mentioned this in T90902: Non-breaking space in header ID breaks anchor. Apr 25 2019, 7:25 PM

cscott mentioned this in T222266: Edge case difference processing templated styles in table cells. May 1 2019, 3:19 PM

The code could be optimized as follows:

Replace:

'/(\S) (?=[?:;!%»›](?!\w))/u' => "\\1$space"

With:

'/(?<=\S) ([?:;!%»›])(?!\w)/u' => "$space\\1"

Before: the engine starts a match at every non-space character from the beginning of the text, fails to find a following space most of the time, and then backtracks.

After: the engine starts a match only at spaces, which are much less frequent than non-space characters.
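
As a quick sanity check that the two forms produce the same output, here is a small sketch that runs both replacements on the same input and compares the results. The helper name `sameResult` and the sample strings are only assumptions for illustration; this compares output on a few inputs and is not a proof of equivalence.

```php
<?php
// Hypothetical helper: true if the original and the proposed pattern
// produce identical output for the given text.
function sameResult( string $text, string $space = '&#160;' ): bool {
    // Original: capture the preceding non-space character and re-emit it.
    $old = preg_replace( '/(\S) (?=[?:;!%»›](?!\w))/u', "\\1$space", $text );
    // Proposed: lookbehind for the non-space character, capture the punctuation.
    $new = preg_replace( '/(?<=\S) ([?:;!%»›])(?!\w)/u', "$space\\1", $text );
    return $old === $new;
}

$samples = [ 'Vraiment ?', 'a : b ; c !', 'color: red !important', '100 %', 'no punctuation here' ];
foreach ( $samples as $sample ) {
    echo $sample, ' => ', sameResult( $sample ) ? 'same' : 'DIFFERENT', "\n";
}
```

Both patterns consume the space itself, so they should agree wherever a space is matched; only the way the neighbouring characters are checked differs.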

Maintenance_bot removed a project: Patch-For-Review. Oct 25 2019, 1:10 AM

Change 546178 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Improve efficiency of french-spacing regexp

https://gerrit.wikimedia.org/r/546178

gerritbot added a project: Patch-For-Review. Oct 25 2019, 1:20 PM

@cscott I see you have improved it even further :)

@Od1n could you do some quick benchmarks to satisfy the reviewer on the patch above?

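// Number of iterations per pattern.
$nb = 10000;

// Input text; swap in the other test strings to reproduce each result set below.
$text = str_repeat( 'lorem ipsum dolor sit amet', 1000 );
$space = '&#160;';

// Before: capture the preceding non-space character and re-emit it.
$t1 = microtime( true );
for ( $i = $nb; $i--; ) {
    preg_replace( '/(\S) (?=[?:;!%»›](?!\w))/u', "\\1$space", $text );
}
// After: assert the preceding non-space character with a lookbehind instead.
$t2 = microtime( true );
for ( $i = $nb; $i--; ) {
    preg_replace( '/(?<=\S) (?=[?:;!%»›](?!\w))/u', $space, $text );
}
$t3 = microtime( true );

// Elapsed seconds: "before" on the first line, "after" on the second.
echo $t2 - $t1;
echo "\n";
echo $t3 - $t2;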

With $text = str_repeat( 'lorem ipsum dolor sit amet', 1000 );
Before: 1.83 seconds
After: 0.64 seconds

With $text = str_repeat( 'lorem : ipsum : dolor : sit : amet', 1000 );
Before: 5.30 seconds
After: 2.95 seconds

With $text = str_repeat( 'lorem: ipsum: dolor: sit: amet', 1000 );
Before: 2.20 seconds
After: 0.69 seconds

@thiemowmde Your benchmarks are above :)

The lookbehind assertion is perfectly fine here: it is evaluated only when a space has been matched, and it looks only one character back. In fact, it is this lookbehind assertion that makes the pattern faster :)