Add non-breaking spaces in additional places automatically
OpenPublic

Assigned To
None
Priority
Low
Author
bzimport
Subscribers
matmarex, Nemo_bis, waldyrious and 3 others
Projects
Reference
bz13619
Security
None
Description

Author: ui2t5v002

Description:
As an alternative solution to T5461, non-breaking spaces should be added automatically by MediaWiki on page render in appropriate places:

Don't worry too much about false positives, since an extra non-breaking space won't cause any serious problems unless many of them occur on the same line.

bzimport added a project: MediaWiki-Parser.Via ConduitNov 21 2014, 10:04 PM
bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz13619.
bzimport created this task.Via LegacyApr 5 2008, 8:06 PM
brion added a comment.Via ConduitApr 7 2008, 9:38 PM

Clarifying that this requests an addition to the existing automatic &nbsp; rules, rather than creating a new feature.

bzimport added a comment.Via ConduitApr 7 2008, 9:42 PM

ui2t5v002 wrote:

(In reply to comment #1)

Clarifying that this requests an addition to the existing automatic &nbsp;
rules, rather than creating a new feature.

Is there any documentation for the existing rules?

brion added a comment.Via ConduitApr 8 2008, 11:31 PM

Documentation? Don't be silly, this is MediaWiki! ;)

You can find the current rules in Parser::parse(), though:

# Clean up special characters, only run once, next-to-last before doBlockLevels
$fixtags = array(
	# french spaces, last one Guillemet-left
	# only if there is something before the space
	'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2',
	# french spaces, Guillemet-right
	'/(\\302\\253) /' => '\\1&nbsp;',
	'/&nbsp;(!\s*important)/' => ' \\1', # Beware of CSS magic word !important, bug #11874.
);
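
For anyone who wants to try the rules quoted above outside MediaWiki, here is a minimal Python sketch of the same three substitutions. It is a stand-in, not the parser code: it assumes the replacements insert the literal `&nbsp;` entity, writes the guillemets as characters instead of UTF-8 octal escapes, and uses a made-up function name (`apply_fixtags`).

```python
import re

# Python stand-ins for the three PHP patterns quoted above.
# Assumption: each replacement inserts the literal entity "&nbsp;".
FIXTAGS = [
    # Space before ? : ; ! % or a closing guillemet, only if a character precedes it.
    (re.compile(r"(.) (?=[?:;!%»])"), r"\1&nbsp;"),
    # Space after an opening guillemet.
    (re.compile(r"(«) "), r"\1&nbsp;"),
    # Undo the first rule in front of the CSS keyword "!important".
    (re.compile(r"&nbsp;(!\s*important)"), r" \1"),
]


def apply_fixtags(text: str) -> str:
    """Apply the rules in order, each one to the output of the previous one."""
    for pattern, replacement in FIXTAGS:
        text = pattern.sub(replacement, text)
    return text


print(apply_fixtags("« word » some : word"))
print(apply_fixtags("color: red !important"))
```

The third rule only exists to repair damage done by the first one, which hints at how fragile a plain regex pass over the whole text is.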

bzimport added a comment.Via ConduitApr 9 2008, 1:19 AM

ui2t5v002 wrote:

(In reply to comment #3)

Documentation? Don't be silly, this is MediaWiki! ;)

I wasn't expecting a book. :) just a link to mailing list or prior bug report.

You can find the current rules in Parser::parse(), though:

Ok, so currently all it does is:

  • Changes "some : word" into "some&nbsp;: word" and likewise for ? : ; ! % »
  • Changes "« " into "«&nbsp;"
  • Breaks things inside HTML tags :)

So adding one before dashes is easy enough. Just add a hyphen and the codes for en and em dashes to the ?|:|;|!|% regexp.

I'd like it to also add a &nbsp; for anything like "10 kiloohm" or "100 MW". We could either write a huge regular expression for every unit and prefix that exists (http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js), or we could just make the rule apply any time a number is followed by a space that is followed by a letter. The Manual of Style actually recommends as much:

http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#Non-breaking_spaces
http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_%28dates_and_numbers%29#Non-breaking_spaces
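
A rough Python sketch of the two additions proposed in this comment, for experimenting with what they would and would not catch. Both patterns and the function name `add_nbsp` are hypothetical; nothing here is taken from the parser.

```python
import re

# Hypothetical broad rule: a digit, one space, then a letter.
DIGIT_SPACE_LETTER = re.compile(r"(?<=[0-9]) (?=[^\W\d_])")
# Hypothetical dash rule: a space directly before a hyphen, en dash (U+2013)
# or em dash (U+2014).
SPACE_BEFORE_DASH = re.compile(r"(?<=\S) (?=[-\u2013\u2014])")


def add_nbsp(text: str) -> str:
    """Replace the matched spaces with the &nbsp; entity."""
    text = DIGIT_SPACE_LETTER.sub("&nbsp;", text)
    return SPACE_BEFORE_DASH.sub("&nbsp;", text)


print(add_nbsp("10 kiloohm or 100 MW"))
print(add_nbsp("the 1969 Mets"))
```

Note that the broad rule cannot tell a unit from any other word, so a year followed by a noun is glued together just like "100 MW".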

bzimport added a comment.Via ConduitApr 10 2008, 11:00 PM

dankindsvater wrote:

Active MoS editors generally believe that something along these lines would be great. If you want to simplify the rule to "number space letter gets replaced by no-break space", then the MoS editors believe that additional markup would be useful for the no-break space, probably a double-comma (,,) (that is, the double-comma would be typed and shown in the edit window, and would be rendered as a hard space in the text). The reason is that we don't want automatically-inserted invisible characters to start multiplying in the text, as additions and deletions are made; we want to be able to see them, and easily insert and delete them. On the other hand, if you use very specific rules to insert no-break spaces exactly where most style manuals want them inserted (and I like http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js as a good start), then perhaps the double-comma markup is not necessary; we'll be happy to take anything you can give us and try it out.

bzimport added a comment.Via ConduitApr 10 2008, 11:04 PM

dankindsvater wrote:

I should add: I'm talking about en.wikipedia.org. It's my sense that GA and FA article reviewers are included in the long list of people who have approved the idea; if it makes a difference, I'll be happy to survey their opinions.

bzimport added a comment.Via ConduitApr 10 2008, 11:34 PM

ui2t5v002 wrote:

(In reply to comment #5)

If you want to simplify the rule to "number space letter gets replaced
by no-break space", then the MoS editors believe that additional markup would
be useful for the no-break space

No. This is about adding a non-breaking space automatically when the page is rendered. Please don't add even more markup to the already cluttered and confusing syntax. Wiki markup is not like HTML, where you have to specify formatting and detail every little thing. The whole point of a wiki is that you enter semantic information, and it takes care of all the formatting and other little details for you.

The reason is that we don't want
automatically-inserted invisible characters to start multiplying in the text,
as additions and deletions are made

They won't be multiplying over time and they won't be visible in the edit box. This wouldn't affect the code in the edit box at all. It would only affect the HTML of the final rendered article.

bzimport added a comment.Via ConduitApr 11 2008, 2:48 AM

dankindsvater wrote:

Thanks for the explanation; I agree that's more elegant if the wizards can do it. Would anyone like me to survey among article reviewers and MoS editors to see if they see potential problems from a broad rule such as "number space letter never wraps"?

bzimport added a comment.Via ConduitApr 11 2008, 3:00 AM

ui2t5v002 wrote:

(In reply to comment #8)

Would anyone like me to survey among article reviewers and MoS editors to
see if they see potential problems from a broad rule such as "number space
letter never wraps"?

Absolutely. It's recommended in the Manual of Style to add a non-breaking space for this case (not just units), but there are certainly a few cases where it shouldn't be added. False positives won't cause much of a problem, though, since they will just prevent things from line wrapping, and it can't happen multiple times in a row to create a page-widening attack. ("1 a 1 a 1 a" --> "1&nbsp;a 1&nbsp;a 1&nbsp;a")

bzimport added a comment.Via ConduitApr 11 2008, 3:06 AM

ui2t5v002 wrote:

Oh wait. :) "a1 a1 a1 a1" --> "a1&nbsp;a1&nbsp;a1&nbsp;a1"

Maybe we need to worry about that in some rare case? Or make it only for numbers with no letters inside? javascript would be something like: \s[,.0-9]+
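
The "numbers with no letters inside" variant could look roughly like the Python sketch below. The pattern is an assumption built on the idea in this comment: the number has to be a whole token of digits, commas and periods, and (an extra restriction added here) it has to end in a digit so that a sentence-ending period is not swallowed.

```python
import re

# Hypothetical stricter rule: the number must start at the beginning of the
# text or after whitespace, consist of digits, commas and periods only, and
# end in a digit. It must be followed by one space and then a letter.
STANDALONE_NUMBER = re.compile(
    r"(?<!\S)([0-9](?:[0-9.,]*[0-9])?) (?=[^\W\d_])"
)


def add_nbsp_strict(text: str) -> str:
    """Glue a standalone number to the word that follows it."""
    return STANDALONE_NUMBER.sub(r"\1&nbsp;", text)


print(add_nbsp_strict("a1 a1 a1 a1"))
print(add_nbsp_strict("1 a 1 a 1 a"))
```

With this restriction "a1 a1 a1 a1" is left alone, while "1 a 1 a 1 a" still gets a non-breaking space after each number but stays breakable after each letter.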

brion added a comment.Via ConduitApr 11 2008, 6:49 PM

Why worry about spacing here? You can just write aaaaaaaaaaaaa... and widen to your heart's content. :)

bzimport added a comment.Via ConduitApr 12 2008, 3:46 AM

dankindsvater wrote:

I'm surveying the WP:MOSNUM people now and I gave them the http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js list to tweak. Not wrapping at "number space letter" is a non-starter. More than 90% of the time, that will be something we want to wrap, such as "the 1969 Mets World Series" or "9999 bottles of beer".

bzimport added a comment.Via ConduitApr 12 2008, 3:49 AM

ui2t5v002 wrote:

(In reply to comment #12)

More than
90% of the time, that will be something we want to wrap, such as "the 1969 Mets
World Series" or "9999 bottles of beer".

Why would we want those to wrap? MOSNUM currently recommends that they don't.

bzimport added a comment.Via ConduitApr 12 2008, 4:12 PM

gnygaard wrote:

Why would we want to keep them from wrapping? MOSNUM is nonsense, recommending non-breaking spaces in places where they are not needed, and not recommending them in places where they are needed. It is also vague and ambiguous, arguably recommending a nonbreaking space at the star in "Ninety-nine*bottles of beer", and in the first space but not saying anything about the second space in a paper weight of "75 g m<sup>−2</sup>"; if that breaks, it should be between the 5 and the g, NOT between the g and the m, which is not only what the MoS rule says, but it is ALSO what we would get if this bug/feature request were implemented.

bzimport added a comment.Via ConduitApr 12 2008, 4:17 PM

ui2t5v002 wrote:

(In reply to comment #14)

Why would we want to keep them from wrapping?

Why wouldn't we? See:

http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla

bzimport added a comment.Via ConduitApr 12 2008, 4:33 PM

pmanderson wrote:

There is no consensus for not wrapping "9999 bottles of beer". If the letter and number are long, it may well produce clumsy final text; the key question is whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be turned off when it does cause trouble.

bzimport added a comment.Via ConduitApr 12 2008, 4:37 PM

ui2t5v002 wrote:

(In reply to comment #16)

There is no consensus for not wrapping "9999 bottles of beer".

Please discuss at http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla

then we can come back here and tell the devs what we want

bzimport added a comment.Via ConduitApr 12 2008, 4:39 PM

ui2t5v002 wrote:

(In reply to comment #16)

the key question is
whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be
turned off when it does cause trouble.

It does. Try a long string of « word »« word »« word »« word » vs <nowiki>« word »« word »« word »« word »</nowiki>

DanielFriesen added a comment.Via ConduitApr 13 2008, 3:24 AM

Why not create a MediaWiki: message with a space, comma, or whatever you want, separated list of units.

Then take that message, quote (escape) each unit, and convert the separators into |'s to turn it into a proper regex alternation.

Then just add a &nbsp; with the regex [/(\d+) (<Quoted | list here>)/S, "\\1&nbsp;\\2"]

That way only real units have the nbsp added, and additionally wikis may localize the units, and also add any newer or custom units such as fake units which apply only to their wiki. Or instead they can just replace the message with a - and have the whole thing disabled if they don't want it.
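
A minimal Python sketch of the steps in this proposal, to make the idea concrete. The message contents, the function names and the trailing word-boundary check are all assumptions for illustration; no such message exists in MediaWiki as far as this thread shows.

```python
import re
from typing import Optional

# Hypothetical contents of the on-wiki message: a whitespace-separated unit list.
UNITS_MESSAGE = "km m cm mm kg g MW kW kiloohm mph ft"


def build_unit_rule(message: str) -> Optional[re.Pattern]:
    """Return a compiled rule, or None when the message is "-" (feature disabled)."""
    if message.strip() == "-":
        return None
    # Escape every unit so characters such as "/" or "." are taken literally,
    # and list longer units first.
    units = sorted(set(message.split()), key=len, reverse=True)
    alternation = "|".join(re.escape(unit) for unit in units)
    # The trailing \b keeps "5 mice" from matching the unit "m".
    return re.compile(r"(\d+) (" + alternation + r")\b")


def add_unit_nbsp(text: str, rule: Optional[re.Pattern]) -> str:
    """Insert &nbsp; between a number and a listed unit; no-op when disabled."""
    return text if rule is None else rule.sub(r"\1&nbsp;\2", text)


rule = build_unit_rule(UNITS_MESSAGE)
print(add_unit_nbsp("100 MW and 10 kiloohm", rule))
print(add_unit_nbsp("9999 bottles of beer", rule))
```

Because only listed units match, ordinary words after a number are left breakable, and setting the message to "-" switches the rule off entirely.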

bzimport added a comment.Via ConduitApr 13 2008, 3:01 PM

ayg wrote:

Please don't discuss the merits of various ideas here, discuss them on-wiki and report on consensus. Bugzilla is an even worse discussion forum than talk pages. :)

We've had localizable regexes before that were part of the parser, like linktrail, but AFAIK those have been disabled as too scary. They can still be localized per-language, but only in the PHP files, not in the MW-namespace messages.

APPER added a comment.Via ConduitDec 19 2008, 12:21 PM

(In reply to comment #19)

Why not create a MediaWiki: message with a space, comma, or whatever you want,
separated list of units.

Why not create a MediaWiki: message where one could add regular expressions and their replacements? Then every language (this discussion here is very en-focused) could add its rules, could test them and so on...

bzimport added a comment.Via ConduitFeb 19 2011, 12:03 PM

bugzilla.wikimedia wrote:

For German and many related languages, the "digit space letter" rule would be wrong too often, I believe.
A few examples translated to English, using "_" to represent the nonbreaking space:

# word space digit rules:
the year 1960 and ==> year_1960 and
a class 23354 consumer good ==> a class_23354 consumer good
laid down in ISO 4711 and not in ==> in ISO_4711 and
an ASA 22 film ==> an ASA_22 film
this is in paragraph 16 of the law on ==> in paragraph_16 of
but article 3 in the constitution ==> but article_3 in
king Henry 8 did ==> king Henry_8 did

# more complex:
the years 1970 and 71 ==> years 1970_and_71
is 17 and a half miles from home ==> is 17_and_a_half_miles from home
was 18 miles and three eighth until ==> was 18_miles and three_eighth until
my 22 years old sister ==> my 22_years_old sister
took 23 years until ==> took 23_years until

I doubt that this can be done in a language-independent way. We would still have quite a few false positives, such as:

found the article 19 feet behind the 
went in that year 1999 soldiers to
according to ISO 1234 people in Spain

(Note that English word order and comma rules make English much less prone to some of those.)

Currencies, and their abbreviations, can appear both in front of and after the figures they relate to, so we should have both a "currency space [+-] digit" and a "digit space currency" rule, and probably tolerate "In week 17 € 1500 were spent" unless we can make a "'week' space digit" rule consume the 17 on its own, hiding it from the currency rules.

Also, there are style rules like these:

we saw 1 young man ==> saw a young man / saw one young man
...
not even 7 sailors ==> not even seven sailors
...
when 12 candles ==> when twelve candles
with 13 grumps ==> with 13_grumps

So I suggest a language specific, or language group specific, kind of treatment.
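
One way to picture language-specific treatment is a rule table per language code, sketched here in Python. Everything in it is hypothetical: the table layout, the function name, and the keyword list, which simply reuses the English translations of the examples above and is not a vetted list for German or any other language.

```python
import re

# Hypothetical per-language rule tables. The keywords are the translated
# examples from this comment, not real German vocabulary.
RULES = {
    "de": [
        # keyword, space, number: "ISO 4711", "paragraph 16", "article 3"
        (re.compile(r"\b(ISO|ASA|paragraph|article|class|year) (?=\d)"), r"\1&nbsp;"),
    ],
    "en": [],  # no rules of this kind
}


def add_nbsp_for(lang: str, text: str) -> str:
    """Apply only the rules registered for the given language code."""
    for pattern, replacement in RULES.get(lang, []):
        text = pattern.sub(replacement, text)
    return text


print(add_nbsp_for("de", "laid down in ISO 4711 and not in"))
print(add_nbsp_for("de", "found the article 19 feet behind the"))
```

The second call shows that the false positive mentioned above ("found the article 19 feet behind the") still slips through a keyword rule, so a per-language table narrows the problem without removing it.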

Bawolff added a comment.Via ConduitJul 29 2012, 5:53 PM
*** Bug 18443 has been marked as a duplicate of this bug. ***
Nemo_bis added a comment.Via ConduitJul 29 2012, 8:28 PM

(In reply to comment #3)

Documentation? Don't be silly, this is MediaWiki! ;)

Heh. I've created https://meta.wikimedia.org/wiki/Help:Newlines_and_spaces#Non-breaking_spaces

(In reply to comment #20)

Please don't discuss the merits of various ideas here, discuss them on-wiki and
report on consensus. Bugzilla is an even worse discussion forum than talk
pages. :)

Perhaps we can summarize on that Meta page (and even discuss in its talk)?

We've had localizable regexes before that were part of the parser, like
linktrail, but AFAIK those have been disabled as too scary. They can still be
localized per-language, but only in the PHP files, not in the MW-namespace
messages.

This still holds true, so I suppose this is the way here too, and I've written it in the above page. I'm not going to summarize anything else from these two bugs because they're too long, but feel free if you find something consensual. :-)

seth added a comment.Via ConduitAug 5 2012, 8:48 AM

fyi: Because of bug #18443 I already started a discussion at w:de concerning German typography.

At https://de.wikipedia.org/wiki/WD:TYP#automatische_leerzeichen there's an unfinished table called 'regexps' which will resolve bug #18443 and this bug at least for w:de.
That table is still under construction. If it's finished I'll inform you here.

seth added a comment.Via ConduitAug 24 2012, 11:23 PM

//de.wikipedia.org/wiki/WD:TYP#automatische_leerzeichen
moved to
https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen

bzimport added a comment.Via ConduitJan 2 2013, 2:53 PM

sowerk wrote:

I’d like to point out one approach, which was discussed in w:de some years ago (the discussion fell asleep back then):

Use of underscores for thin- and non-breaking-spaces within the wiki-code:

One underscore for thin-space: _ ⇒ “ ”
Two underscores for n-b-space: __ ⇒ “ ”

Underscores are hardly ever used, except for links (there a filter can easily be implemented). In those rare remaining cases, the nowiki-tag should be used.

This would allow every user with minimal experience to use the correct typography, avoid long lists of common abbreviations as started on the German project site, and ensure that copy-paste errors of spaces are easily detectable.

Nemo_bis added a comment.Via ConduitJan 2 2013, 4:33 PM

(In reply to comment #27)

Use of underscores for thin- and non-breaking-spaces within the wiki-code:

This is bug 3461, please continue there.

bzimport added a comment.Via ConduitAug 23 2014, 12:46 PM

matthiasbecker1967 wrote:

It would be helpful to fix this bug at least for numbers and SI units and perhaps some widely used non-SI units (such as ft, kn/kt, mph, sm/nm).

bzimport added a comment.Via ConduitAug 23 2014, 3:20 PM

dankindsvater wrote:

Thanks Matthias. Would really be nice to see movement on this after all these years ... it would make VE so much prettier too if we didn't have to deal with some nbsp-equivalent in VE.

Dan

seth added a subscriber: seth.Via WebDec 4 2014, 9:49 PM

we made a few regexps for the German part of the problem:
see https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen#Regexps

Is it possible to test those regexps somehow in an easy way?

Nemo_bis added a comment.Via WebDec 6 2014, 2:04 PM
In T15619#819739, @seth wrote:

we made a few regexps for the German part of the problem:
see https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen#Regexps

Is it possible to test those regexps somehow in an easy way?

Depends on the definition of easy. You can set up a test wiki with [[MediaWiki-Vagrant]] and patch LanguageDe.php or (eek) Parser.php after the lines mentioned in https://phabricator.wikimedia.org/T20443#227957

matmarex edited the task description. (Show Details)Via WebSun, Apr 5, 5:10 PM
matmarex edited subscribers, added: matmarex; removed: wikibugs-l.
matmarex set Security to None.
matmarex added a comment.Via WebSun, Apr 5, 5:19 PM

I think this would actually be a pretty great thing to do. However, the way the &nbsp; insertion currently works is less than wonderful; implementing more rules could make T5158: Parser inserts invalid &nbsp; in the middle of style attribute worse.
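
To illustrate the concern about attributes, here is a small Python sketch that contrasts running a rule over the whole HTML string with running it only on the text between tags. It is a toy under stated assumptions, not how the parser works or should work: the tag splitter is a naive regex that would fail on attribute values containing ">", and the function name is made up.

```python
import re

TAG = re.compile(r"(<[^>]*>)")  # naive: splits markup into tags and text runs
FRENCH_SPACE = re.compile(r"(.) (?=[?:;!%»])")


def nbsp_outside_tags(html: str) -> str:
    """Apply the rule to text runs only, leaving tags and attributes untouched."""
    parts = TAG.split(html)
    # With one capturing group, re.split puts text at even indexes
    # and the captured tags at odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = FRENCH_SPACE.sub(r"\1&nbsp;", parts[i])
    return "".join(parts)


html = '<span style="color: red ; margin: 0">Why ?</span>'
print(FRENCH_SPACE.sub(r"\1&nbsp;", html))  # entity lands inside the style attribute
print(nbsp_outside_tags(html))              # attribute left alone
```

Any new rule added to a whole-string pass widens the set of attribute values that can be corrupted, which is why restricting the pass to text runs (by whatever means) matters before adding more rules.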
