Automatically convert spaces after section markers (§) into non-breaking spaces
Open, Normal, Public

Description

Currently, MediaWiki automatically converts spaces before certain punctuation marks ( ; ? ! ) into non-breaking spaces. It has been suggested that the same feature be implemented for spaces after section markers (§). For example, the following article currently includes 249 manually encoded non-breaking spaces due to its heavy use of section markers:
https://de.wikipedia.org/wiki/%C2%A7_175

Event Timeline

kaldari created this task. Nov 23 2015, 11:58 PM
kaldari updated the task description.
kaldari raised the priority of this task from to Needs Triage.
kaldari added a project: MediaWiki-Parser.
kaldari added a subscriber: kaldari.
Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 23 2015, 11:58 PM
Gnom1 added a subscriber: Gnom1.
kaldari set Security to None.

Found this part which seems related to the bug in Parser.php.
$fixtags = [
    # french spaces, last one Guillemet-left
    # only if there is something before the space
    '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&#160;',
    # french spaces, Guillemet-right
    '/(\\302\\253) /' => '\\1&#160;',
    '/&#160;(!\s*important)/' => ' \\1',
];
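To see what these rules do to a piece of text, here is a minimal sketch in Python rather than PHP. It is an approximation, not the Parser.php code: the names `FIX_TAGS` and `apply_fix_tags` are made up for illustration, the byte escapes `\302\273` and `\302\253` are written as the literal characters » and «, and the non-breaking space is emitted as U+00A0 instead of an HTML entity.

```python
import re

NBSP = "\u00a0"  # non-breaking space

# Hypothetical Python rendering of the quoted rules, applied in order.
FIX_TAGS = [
    # space before ? : ; ! % or », only if some character precedes the space
    (re.compile(r"(.) (?=\?|:|;|!|%|»)"), r"\1" + NBSP),
    # space after «
    (re.compile(r"(«) "), r"\1" + NBSP),
    # undo the first rule in front of the CSS keyword "!important"
    (re.compile(NBSP + r"(!\s*important)"), r" \1"),
]


def apply_fix_tags(text):
    """Run every (pattern, replacement) pair over the text, in order."""
    for pattern, replacement in FIX_TAGS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Show the result with a visible marker instead of the invisible U+00A0.
    print(apply_fix_tags("« Vraiment ? »").replace(NBSP, "&nbsp;"))
```

A rule for § would be one more pair in the same list, which is why the question of where to add it comes up here.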

Should the section marker be added here?

If you "found this part", where did you find it? Clear links and references are always welcome. Thanks!

@Aklapper The code is from includes/parser/Parser.php, line 1297.
Should I go ahead and add the section marker here?

Harjotsingh added a comment. Edited Mar 3 2016, 5:48 PM

Created a patch to add non-breaking space after §.

Screenshot:

I have also uploaded the change to Gerrit; the code needs review.

@Harjotsingh: Thanks for the patch! Please follow https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines and link to this task in your commit message, to automatically get a notification link here.

Change 274770 had a related patch set uploaded (by Harjotsingh):
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/274770

Change 274770 abandoned by Harjotsingh:
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/274770

Change 275203 had a related patch set uploaded (by Harjotsingh):
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/275203

Restricted Application added a subscriber: TerraCodes. May 11 2016, 7:19 PM

@cscott
You mentioned the parser tests here https://gerrit.wikimedia.org/r/#/c/275203/.
Which tests am I supposed to add, and where can I find out how to do so?

Change 275203 abandoned by Harjotsingh:
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/275203

Change 332037 had a related patch set uploaded (by Harjotsingh):
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/332037

I don't like this kind of processing in the parser at all. Non-breaking spaces should be added at edit time and saved to the database, not added at parse time when the output is generated.

seth added a subscriber: seth. Jan 15 2017, 8:03 AM

Change 332037 had a related patch set uploaded (by Harjotsingh): [...]

https://gerrit.wikimedia.org/r/332037

IMHO this patch will fail. The code

'/(§) (.)/' => '§&#160;'

would delete the character to the right of the space. It should be something like

'/§ (.)/' => '§&#160;\\1'

or

'/§\K (?=.)/' => '&#160;'

or

'/§\K \b/' => '&#160;'
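The difference between these variants can be checked with a short sketch. It uses Python's `re` module, not PHP's PCRE functions, so two things are assumptions: Python has no `\K`, so a lookbehind `(?<=§)` stands in for it, and the non-breaking space is written as U+00A0. The function names are invented for illustration.

```python
import re

NBSP = "\u00a0"  # non-breaking space


def buggy(text):
    # The captured character after the space is matched but never re-emitted.
    return re.sub(r"(§) (.)", "§" + NBSP, text)


def with_backreference(text):
    # Re-insert the captured character via \1.
    return re.sub(r"§ (.)", "§" + NBSP + r"\1", text)


def with_lookahead(text):
    # Match only the space; (?<=§) plays the role of PCRE's §\K here.
    return re.sub(r"(?<=§) (?=.)", NBSP, text)


def with_word_boundary(text):
    # Match the space only when a word character follows it.
    return re.sub(r"(?<=§) \b", NBSP, text)


if __name__ == "__main__":
    sample = "§ 175 StGB"
    for fn in (buggy, with_backreference, with_lookahead, with_word_boundary):
        print(fn.__name__, "->", fn(sample).replace(NBSP, "&nbsp;"))
```

On "§ 175 StGB" the first variant drops the "1", while the other three keep it. The last variant is narrower than the others: it leaves the space untouched when § is followed by a non-word character such as an opening parenthesis.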

@seth
Yes, it was deleting the next character, and a backreference was needed.
I've made the necessary changes.
Thanks!

Converting spaces to non-breaking spaces with special replacement rules at parse time causes additional parser errors and sometimes unwanted effects. For examples of problems with the current whitespace replacements in the parser, see T40797. These problems are syntactic and may be solved by adding further replacement rules, which makes everything more complex. There are also semantic problems, because a non-breaking space is not wanted in every situation.

Here are some real examples:

https://de.wikipedia.org/wiki/DIN_1505-2:

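Danach folgt die Kennzeichnung, zum Beispiel § und dann die Zählung, die auch die Untergliederung, wie gezählte Absätze oder ...
(In English: "Then follows the designation, for example §, and then the numbering, which also [covers] the subdivision, such as numbered paragraphs or ...")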

https://de.wikipedia.org/wiki/Codepage_437

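Dem Steuerzeichenbereich 00hex–1Fhex sind verschiedene, mit Ausnahme des Paragraphenzeichens § nicht druckbare Grafikzeichen zugeordnet, die zum einen ...
(In English: "Various graphic characters that, with the exception of the section sign §, are not printable are assigned to the control character range 00hex–1Fhex, which on the one hand ...")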

https://de.wikipedia.org/wiki/Halbeink%C3%BCnfteverfahren

Beispielsweise befreite Buchstabe d dieses § die Hälfte der Bezüge ...
(In English: "For example, letter d of this § exempted half of the income ...")
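A small sketch shows the proposed rule firing on all three sentences, although none of them uses § in front of a section number. It is written in Python with a lookbehind in place of PCRE's `\K`; the names `RULE`, `EXAMPLES` and `apply_rule` are invented for illustration, and the sentences are shortened copies of the quotes above.

```python
import re

NBSP = "\u00a0"  # non-breaking space

# Assumed Python equivalent of the rule '/§\K (?=.)/'.
RULE = re.compile(r"(?<=§) (?=.)")

EXAMPLES = [
    "Danach folgt die Kennzeichnung, zum Beispiel § und dann die Zählung",
    "mit Ausnahme des Paragraphenzeichens § nicht druckbare Grafikzeichen",
    "Beispielsweise befreite Buchstabe d dieses § die Hälfte der Bezüge",
]


def apply_rule(text):
    """Replace the space after every § with a non-breaking space."""
    return RULE.sub(NBSP, text)


if __name__ == "__main__":
    for sentence in EXAMPLES:
        # Show the inserted U+00A0 as a visible marker.
        print(apply_rule(sentence).replace(NBSP, "&nbsp;"))
```

In each case the § ends up glued to the following ordinary word ("und", "nicht", "die"), which is the unwanted effect described here.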

Of course, it is possible to add workarounds in the wikitext to avoid these parser errors. If the parser is changed, these workarounds must be inserted into the wikis before the change is deployed.

I think it is not worth it. I think the better solution is to add these automatic replacement rules to the wiki editor. If an unwanted replacement occurs, it can be fixed in the editor.

cscott edited projects, added Parsoid; removed good first bug. Aug 16 2017, 5:37 PM

VisualEditor makes it very easy to add automatic replacement rules. However, the &nbsp; might be seen as ugly by wikitext editors. T5461: Syntax extensions: special character, e.g. underscore, for non-breaking space (&nbsp;) would make the wikitext look much nicer when explicit &nbsp; are added, and give editors full control over their placement (or not).

ssastry triaged this task as Normal priority. Sep 11 2017, 7:13 PM

The current automatic replacement for French spacing in the parser causes problems in several places, for example in T5158.

Replacement rules in the editor are better. To avoid the ugly &nbsp; in the wiki editor, the Unicode character U+00A0 should be used. T181677 implements syntax highlighting for U+00A0 in the CodeMirror wikitext editor.
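As a rough illustration of the editor-side approach, the sketch below applies the replacement at edit time, stores a literal U+00A0 in the text, and reports where it made changes so that an unwanted one can be reverted by hand. This is not VisualEditor, CodeMirror or MediaWiki code: the function names, the returned offsets and the display marker are all assumptions made for the example, and it is written in Python with a lookbehind in place of PCRE's `\K`.

```python
import re

NBSP = "\u00a0"  # literal non-breaking space stored in the wikitext

# Assumed rule: a plain space directly after § that is followed by any character.
SPACE_AFTER_SECTION_SIGN = re.compile(r"(?<=§) (?=.)")


def suggest_nbsp_after_section_sign(wikitext):
    """Return (new_text, offsets) where offsets mark each replaced space.

    The replacement has the same length as the original space, so the
    offsets are valid in both the old and the new text.
    """
    offsets = [m.start() for m in SPACE_AFTER_SECTION_SIGN.finditer(wikitext)]
    new_text = SPACE_AFTER_SECTION_SIGN.sub(NBSP, wikitext)
    return new_text, offsets


def reveal_nbsp(wikitext, marker="&nbsp;"):
    """Make U+00A0 visible for display, similar in spirit to highlighting."""
    return wikitext.replace(NBSP, marker)


if __name__ == "__main__":
    text, positions = suggest_nbsp_after_section_sign("Nach § 175 und § 175a")
    print(positions)
    print(reveal_nbsp(text))
```

Because the replacement happens before saving, the stored wikitext is exactly what the editor saw and approved, and no parse-time rule is needed.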