Automatically convert spaces after section markers (§) into non-breaking spaces
Open, Medium, Public

Description

Currently, MediaWiki automatically converts spaces before certain punctuation marks ( ; ? ! ) into non-breaking spaces. It has been suggested that the same be done for spaces after section markers (§). For example, the following article currently contains 249 manually encoded non-breaking spaces due to its heavy use of section markers:
https://de.wikipedia.org/wiki/%C2%A7_175

Event Timeline

kaldari raised the priority of this task to Needs Triage.
kaldari updated the task description.
kaldari added a project: MediaWiki-Parser.
kaldari added a subscriber: kaldari.

I found this part in Parser.php, which seems related to this task:

$fixtags = [
    # french spaces, last one Guillemet-left
    # only if there is something before the space
    '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&#160;',
    # french spaces, Guillemet-right
    '/(\\302\\253) /' => '\\1&#160;',
    '/&#160;(!\s*important)/' => ' \\1',
];

Should the section marker be added here?
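To make the behaviour of these rules concrete, here is a minimal Python sketch of the same three replacements. It is an illustration only, not MediaWiki code: the names `FIXTAGS` and `apply_fixtags` are hypothetical, the patterns are transcribed for Unicode strings (so `»` and `«` replace the byte escapes `\302\273` and `\302\253`), and a literal U+00A0 character stands in for the non-breaking-space entity that the PHP code emits.

```python
import re

NBSP = "\u00a0"  # literal no-break space; stands in for the HTML entity

# (pattern, replacement) pairs, applied in order like the PHP array.
FIXTAGS = [
    # space before ? : ; ! % or », only if something precedes the space
    (r"(.) (?=\?|:|;|!|%|»)", r"\1" + NBSP),
    # space after «
    (r"(«) ", r"\1" + NBSP),
    # undo the first rule for the CSS keyword "!important"
    (NBSP + r"(!\s*important)", r" \1"),
]


def apply_fixtags(text: str) -> str:
    """Apply each rule in turn to the output of the previous one."""
    for pattern, replacement in FIXTAGS:
        text = re.sub(pattern, replacement, text)
    return text


print(repr(apply_fixtags("Quoi ?")))
print(repr(apply_fixtags("« texte »")))
print(repr(apply_fixtags("color: red !important")))
```

Under this reading, a rule for § would be one more entry in the list, which is what the question above is asking about.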

If you "found this part", where did you find it? Clear links and references are always welcome. Thanks!

@Aklapper The section of code is from includes/parser/Parser.php, line 1297.
Should I proceed to add the section marker here?

I created a patch to add a non-breaking space after §.

Screenshot:

nbsp.png (720×1 px, 326 KB)

I have also uploaded the change to Gerrit; the code needs review.

@Harjotsingh: Thanks for the patch! Please follow https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines and link to this task in your commit message, to automatically get a notification link here.

Change 274770 had a related patch set uploaded (by Harjotsingh):
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/274770

Change 274770 abandoned by Harjotsingh:
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/274770

Change 275203 had a related patch set uploaded (by Harjotsingh):
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/275203

@cscott
You mentioned the parser tests at https://gerrit.wikimedia.org/r/#/c/275203/.
Which tests am I supposed to add, and where can I find out how to do so?

Change 275203 abandoned by Harjotsingh:
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/275203

Change 332037 had a related patch set uploaded (by Harjotsingh):
Convert space after § to non-breaking spaces

https://gerrit.wikimedia.org/r/332037

I don't like this kind of processing in the parser at all. Non-breaking spaces should be added at edit time and saved to the database, not added at parse time during output.

Change 332037 had a related patch set uploaded (by Harjotsingh): [...]

https://gerrit.wikimedia.org/r/332037

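IMHO this patch will fail. The code

'/(§) (.)/' => '§&#160;'

would delete the character to the right of the space, because the pattern consumes that character but the replacement does not put it back. It should be something like

'/§ (.)/' => '§&#160;\\1'

or

'/§\K (?=.)/' => '&#160;'

or

'/§\K \b/' => '&#160;'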
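The difference between these variants can be checked with a short Python sketch. It is only an approximation of the PHP patterns: the function names are hypothetical, a literal U+00A0 is used instead of an entity, and because Python's `re` module does not support `\K`, the lookaround variant uses a lookbehind `(?<=§)` to the same effect.

```python
import re

NBSP = "\u00a0"  # literal no-break space; stands in for the HTML entity


def buggy(text: str) -> str:
    # The group (.) is consumed by the match but never re-emitted.
    return re.sub(r"(§) (.)", "§" + NBSP, text)


def with_backreference(text: str) -> str:
    # Re-emit the consumed character via the backreference \1.
    return re.sub(r"§ (.)", "§" + NBSP + r"\1", text)


def with_lookaround(text: str) -> str:
    # Match only the space itself; § and the next character are
    # asserted by lookbehind/lookahead and therefore left untouched.
    return re.sub(r"(?<=§) (?=.)", NBSP, text)


sample = "§ 175 StGB"
print(repr(buggy(sample)))
print(repr(with_backreference(sample)))
print(repr(with_lookaround(sample)))
```

The first function drops the `1` of `175`, while the other two keep the text intact and only swap the space.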

@seth
Yes, it was deleting the next character, and a backreference was needed.
I've made the necessary changes.
Thanks!

Converting spaces to non-breaking spaces at parse time, based on special replacement rules, generates additional parser errors and sometimes unwanted effects. For examples of problems with the current whitespace replacements in the parser, see T40797. These problems are syntactic and may be solved by adding further replacement rules, which makes everything more complex. There are also semantic problems, because a non-breaking space is not semantically wanted in every situation.

Here are some real examples:

https://de.wikipedia.org/wiki/DIN_1505-2:

Danach folgt die Kennzeichnung, zum Beispiel § und dann die Zählung, die auch die Untergliederung, wie gezählte Absätze oder ...

https://de.wikipedia.org/wiki/Codepage_437

Dem Steuerzeichenbereich 00hex–1Fhex sind verschiedene, mit Ausnahme des Paragraphenzeichens § nicht druckbare Grafikzeichen zugeordnet, die zum einen ...

https://de.wikipedia.org/wiki/Halbeink%C3%BCnfteverfahren

Beispielsweise befreite Buchstabe d dieses § die Hälfte der Bezüge ...
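As a rough illustration of the concern, the following Python sketch applies a naive "space after §" rule to excerpts of the three sentences above. The rule and the name `naive_section_rule` are assumptions made for this example and are not the patch under review; a literal U+00A0 stands in for the entity. In each sentence the § is followed by an ordinary word rather than a section number, yet the rule still glues the two together.

```python
import re

NBSP = "\u00a0"  # literal no-break space; stands in for the HTML entity


def naive_section_rule(text: str) -> str:
    # Hypothetical rule: any space after § becomes non-breaking,
    # as long as some character follows it.
    return re.sub(r"§ (?=.)", "§" + NBSP, text)


examples = [
    "Danach folgt die Kennzeichnung, zum Beispiel § und dann die Zählung",
    "mit Ausnahme des Paragraphenzeichens § nicht druckbare Grafikzeichen",
    "Beispielsweise befreite Buchstabe d dieses § die Hälfte der Bezüge",
]

for sentence in examples:
    converted = naive_section_rule(sentence)
    # True means the rule changed a sentence where § is not followed
    # by a section number.
    print(converted != sentence, repr(converted))
```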

Of course, it is possible to add workarounds in the wikitext to avoid these parser errors. However, these workarounds must be inserted into the wiki before the parser is changed.

I think it is not worth it. I think the better solution is to add these automatic replacement rules to the wikieditor. When there is an unwanted replacement, it can be fixed in the editor.

VisualEditor makes it very easy to add automatic replacement rules. However, the &nbsp; might be seen as ugly by wikitext editors. T5461: Syntax extensions: special character, e.g. underscore, for non-breaking space (&nbsp;) would make the wikitext look much nicer when explicit &nbsp; are added, and give editors full control over their placement (or not).

ssastry triaged this task as Medium priority. Sep 11 2017, 7:13 PM

The current automatic replacement for French spacing in the parser causes problems in several places, for example in T5158.

Replacement rules in the editor are better. To avoid the ugly &nbsp; in the wikieditor, the Unicode character U+00A0 should be used. T181677 implements syntax highlighting for U+00A0 in the CodeMirror wikitext editor.

@Harjotsingh: Hi! This task has been assigned to you a while ago. Could you maybe share an update? Do you still plan to work on this task? Thanks! :)

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Change 332037 abandoned by Thiemo Kreuz (WMDE):
[mediawiki/core@master] Convert space after § to non-breaking spaces

Reason:
4 years old, disputed and in conflict. This is easy to redo or reopen if it's still needed.

https://gerrit.wikimedia.org/r/332037