Page MenuHomePhabricator

Be more selective in applying French Space armoring
Closed, ResolvedPublic

Description

French spacing is described by (eg) https://www.iwillteachyoualanguage.com/learn/french/french-tips/french-punctuation and https://fr.wikipedia.org/wiki/Ponctuation#En_fran%C3%A7ais

Mediawiki currently tries to ensure that the space added before a punctuation mark is a non-breaking space, but the regular expression it uses is broad and introduces errors (cf T5158, T13874).

This task is to try to both improve the rules (adding additional punctuation marks used eg in Swiss French) as well as to make it more selective so that it does not apply in situations where it is clear "French spacing" is not the intent.

Event Timeline

Change 441410 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

Vvjjkkii renamed this task from Be more selective in applying French Space armoring to 1haaaaaaaa.Jul 1 2018, 1:03 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from 1haaaaaaaa to Be more selective in applying French Space armoring.Jul 2 2018, 11:11 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: gerritbot, Aklapper.

Change 441410 merged by jenkins-bot:
[mediawiki/core@master] Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

cscott claimed this task.

The code could be optimized as so:

Replace:

'/(\S) (?=[?:;!%»›](?!\w))/u' => "\\1$space"

With:

'/(?<=\S) ([?:;!%»›])(?!\w)/u' => "$space\\1"

Before: The engine matches every non-space character from the beginning, then misses a following space most of the time, then backtracks.

After: The engine matches only spaces from the beginning, which are less frequent than all the non-space characters.

Change 546178 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Improve efficiency of french-spacing regexp

https://gerrit.wikimedia.org/r/546178

@cscott I see you have improved it even further :)

@Od1n could you do some quick benchmarks to satisfy the reviewer on the patch above?

$nb = 10000;

$text = str_repeat( 'lorem ipsum dolor sit amet', 1000 );
$space = '&#160;';

$t1 = microtime( true );
for ($i = $nb; $i--; ) {
    preg_replace( '/(\S) (?=[?:;!%»›](?!\w))/u', "\\1$space", $text );
}
$t2 = microtime( true );
for ($i = $nb; $i--; ) {
    preg_replace( '/(?<=\S) (?=[?:;!%»›](?!\w))/u', "$space", $text );
}
$t3 = microtime( true );

echo $t2 - $t1;
echo "\n";
echo $t3 - $t2;

With $text = str_repeat( 'lorem ipsum dolor sit amet', 1000 );
Before: 1.83 seconds
After: 0.64 seconds

With $text = str_repeat( 'lorem : ipsum : dolor : sit : amet', 1000 );
Before: 5.30 seconds
After: 2.95 seconds

With $text = str_repeat( 'lorem: ipsum: dolor: sit: amet', 1000 );
Before: 2.20 seconds
After: 0.69 seconds

@thiemowmde Your benchmarks are above :)

The lookbehind assertion is perfectly fine here: it is executed only when spaces are matched, and it looks only a character back. Actually, it's this lookbehind assertion that makes the code optimized :)

executed only when spaces are matched, and it looks only a character back.

Excellent analysis. Thanks a lot for benchmarking this!

Change 546178 merged by jenkins-bot:
[mediawiki/core@master] Improve efficiency of french-spacing regexp

https://gerrit.wikimedia.org/r/546178