Page MenuHomePhabricator

Be more selective in applying French Space armoring
Closed, ResolvedPublic

Description

French spacing is described by (eg) https://www.iwillteachyoualanguage.com/learn/french/french-tips/french-punctuation and https://fr.wikipedia.org/wiki/Ponctuation#En_fran%C3%A7ais

Mediawiki currently tries to ensure that the space added before a punctuation mark is a non-breaking space, but the regular expression it uses is broad and introduces errors (cf T5158, T13874).

This task is to try to both improve the rules (adding additional punctuation marks used eg in Swiss French) as well as to make it more selective so that it does not apply in situations where it is clear "French spacing" is not the intent.

Event Timeline

cscott created this task.Jun 21 2018, 8:37 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 21 2018, 8:37 PM

Change 441410 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

Vvjjkkii renamed this task from Be more selective in applying French Space armoring to 1haaaaaaaa.Jul 1 2018, 1:03 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from 1haaaaaaaa to Be more selective in applying French Space armoring.Jul 2 2018, 11:11 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: gerritbot, Aklapper.

Change 441410 merged by jenkins-bot:
[mediawiki/core@master] Don't armor french spaces before punctuation followed by word characters

https://gerrit.wikimedia.org/r/441410

cscott closed this task as Resolved.Aug 3 2018, 9:34 PM
cscott claimed this task.
Od1n added a subscriber: Od1n.EditedOct 25 2019, 12:51 AM

The code could be optimized as so:

Replace:

'/(\S) (?=[?:;!%»›](?!\w))/u' => "\\1$space"

With:

'/(?<=\S) ([?:;!%»›])(?!\w)/u' => "$space\\1"

Before: The engine matches every non-space character from the beginning, then misses a following space most of the time, then backtracks.

After: The engine matches only spaces from the beginning, which are less frequent than all the non-space characters.

Change 546178 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Improve efficiency of french-spacing regexp

https://gerrit.wikimedia.org/r/546178

Od1n added a comment.Oct 26 2019, 8:49 AM

@cscott I see you have improved it even further :)

@Od1n could you do some quick benchmarks to satisfy the reviewer on the patch above?

Od1n added a comment.Oct 30 2019, 7:58 AM
$nb = 10000;

$text = str_repeat( 'lorem ipsum dolor sit amet', 1000 );
$space = '&#160;';

$t1 = microtime( true );
for ($i = $nb; $i--; ) {
    preg_replace( '/(\S) (?=[?:;!%»›](?!\w))/u', "\\1$space", $text );
}
$t2 = microtime( true );
for ($i = $nb; $i--; ) {
    preg_replace( '/(?<=\S) (?=[?:;!%»›](?!\w))/u', "$space", $text );
}
$t3 = microtime( true );

echo $t2 - $t1;
echo "\n";
echo $t3 - $t2;

With $text = str_repeat( 'lorem ipsum dolor sit amet', 1000 );
Before: 1.83 seconds
After: 0.64 seconds

With $text = str_repeat( 'lorem : ipsum : dolor : sit : amet', 1000 );
Before: 5.30 seconds
After: 2.95 seconds

With $text = str_repeat( 'lorem: ipsum: dolor: sit: amet', 1000 );
Before: 2.20 seconds
After: 0.69 seconds

Od1n added a subscriber: thiemowmde.Nov 6 2019, 5:00 PM

@thiemowmde Your benchmarks are above :)

The lookbehind assertion is perfectly fine here: it is executed only when spaces are matched, and it looks only a character back. Actually, it's this lookbehind assertion that makes the code optimized :)

executed only when spaces are matched, and it looks only a character back.

Excellent analysis. Thanks a lot for benchmarking this!

Change 546178 merged by jenkins-bot:
[mediawiki/core@master] Improve efficiency of french-spacing regexp

https://gerrit.wikimedia.org/r/546178