Space before/after »guillemets« (&raquo;/&laquo;) converted to non-breaking space (&nbsp;) (French spaces)
Open, Medium, Public

Description

Author: x00000000

Description:
A space before "&raquo;" (» - right-pointing double angle quotation mark) or a space after "&laquo;" (« - left-pointing double angle quotation mark) will be converted to a no-break space (&nbsp;).

This may be appropriate for most French text, but it breaks line wrapping in languages where guillemets are used in the opposite order (»quote« instead of «quote» or « quote »). Compare http://en.wikipedia.org/wiki/Guillemets .

Details

Reference
bz12752

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 10:02 PM)
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz12752.
bzimport added a subscriber: Unknown Object (MLST).

Agreed; for example, the use of guillemets on the Czech Wikisource is quite problematic because of this. This should be applied only if the content language is French. Or, more generally, we should probably have per-language rules. See bug #13619.

See also bug #3158.

x00000000 wrote:

Workaround is to write something like text&#32;»quote«&#32;text.
MediaWiki doesn't recognize &#32; as a space at the point where it replaces them with &nbsp;s.

Sounds like checking for word breaks should do the job reasonably well here.

E.g.:

...quoted » outside
\s»\W -> break

outside »quoted...
\s»\w -> no break

As long as nobody uses this form:
outside » quoted...

in which case it would be much more difficult to distinguish which side the non-break space belongs on, requiring heuristics to try to see where the quote was started.
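A minimal Python sketch of the word-break check described above, for illustration only. The function name and the choice to match a literal space (rather than `\s`) are assumptions, and this is not MediaWiki's code. Following the clarification later in the thread, "break" is read here as "insert a no-break space".

```python
import re

NBSP = "\u00a0"  # no-break space

def armor_closing_guillemet(text: str) -> str:
    """Glue a space to a following » only when » looks like a closing quote.

    " »" followed by a non-word character (or end of text) is treated as the
    end of a French-style quote; " »" followed by a word character is treated
    as the start of a »quote« and left alone.
    """
    return re.sub(r" »(?=\W|$)", NBSP + "»", text)

print(repr(armor_closing_guillemet("quoted » outside")))
print(repr(armor_closing_guillemet("outside »quoted")))
```

The ambiguous form "outside » quoted" is also glued to the left by this sketch, which is exactly the case the comment above says would need extra heuristics.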

x00000000 wrote:

Would be better than now, assuming "break" means "nbsp" (i.e. "no break").

But it won't work for cases like "the sign »,« is a comma", citations starting/ending with an ellipsis or other punctuation (like »... text ...« or »[…] text!«) or Spanish-style »¿uh?« (but guillemets aren't common in Spanish).

And it doesn't work for most languages if the replacement operates on bytes instead of chars, like the code snippet in bug 13619 comment 3 suggests. The \w needs to match the appropriate Unicode classes.
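To illustrate the bytes-versus-characters point, here is a small Python sketch. It uses Python's `re` module as a stand-in, so it is an analogy rather than a reproduction of the PHP code in question; the function names are hypothetical.

```python
import re

def opens_quote_chars(text: str) -> bool:
    # str pattern: \w is Unicode-aware, so letters like "č" count as word chars.
    return re.search(r" »\w", text) is not None

def opens_quote_bytes(data: bytes) -> bool:
    # bytes pattern: \w only matches ASCII [A-Za-z0-9_], so the first byte of
    # a multi-byte UTF-8 letter is not recognised as a word character.
    return re.search(rb" \xc2\xbb\w", data) is not None

czech = "slovo »čaj« slovo"
print(opens_quote_chars(czech))                  # character-level check
print(opens_quote_bytes(czech.encode("utf-8")))  # byte-level check
```

With the character-level pattern the Czech quote is recognised as starting with a word character; with the byte-level pattern it is not, because "č" is encoded as two non-ASCII bytes.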

BTW, I don't think these simple &nbsp; heuristics are useful at all. E.g., they cause code like <code>x = flag ? 0 : 1;</code> to be unusable after copying and break valid CSS like <span style="color : red ; background : yellow"/>.

x00000000 wrote:

This should fix most occurrences in French without breaking much elsewhere:

s/((?:[\s(]|^)«) /$1&nbsp;/
s/ »(?=\.?\)|[.,]?(?:\s|<ref[\s>]|$))/&nbsp;»/

Should also work with raw UTF-8 bytes if « and » are written as \302\253 and \302\273.
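For readers who want to try the two substitutions above, here is a rough Python transcription. It is a sketch under assumptions: the function name is made up, it works on decoded text rather than raw UTF-8 bytes, and `re.sub` replaces every match, whereas the `s///` forms as written replace only the first.

```python
import re

# « preceded by whitespace, "(" or start of text, then a space.
OPEN_RE = re.compile(r"((?:[\s(]|^)«) ")
# Space before », where » is followed by ")" (optionally ".)"), or by
# optional "." / "," and then whitespace, a <ref> tag, or end of text.
CLOSE_RE = re.compile(r" »(?=\.?\)|[.,]?(?:\s|<ref[\s>]|$))")

def armor_french_quotes(text: str) -> str:
    text = OPEN_RE.sub(r"\1&nbsp;", text)
    return CLOSE_RE.sub("&nbsp;»", text)

print(armor_french_quotes("Il dit « bonjour »."))
print(armor_french_quotes("text »quote« text"))
```

The first call gets both spaces armoured; the second, with guillemets in the opposite order, is left untouched.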

BTW, the current code seems to have a bug:

'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2'

should be either

'/(.) (\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2'

or

'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;'
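The mismatch is easy to see in a Python sketch: a lookahead does not create a capture group, so the pattern has only group 1 and the replacement's second backreference points at nothing. The names below are hypothetical, and » is written as a character instead of the `\302\273` byte escapes. Python rejects the stray reference outright; how PHP's preg_replace treats it is not shown here.

```python
import re

LOOKAHEAD = r"(.) (?=\?|:|;|!|%|»)"   # one capture group only
CAPTURE = r"(.) (\?|:|;|!|%|»)"       # two capture groups

def buggy(text: str) -> str:
    # Mirrors the current pairing: lookahead pattern with a \2 reference.
    return re.sub(LOOKAHEAD, r"\1&nbsp;\2", text)

def fixed_capture(text: str) -> str:
    return re.sub(CAPTURE, r"\1&nbsp;\2", text)

def fixed_lookahead(text: str) -> str:
    return re.sub(LOOKAHEAD, r"\1&nbsp;", text)

try:
    buggy("Quoi ?")
except re.error as exc:
    print("rejected:", exc)

print(fixed_capture("Quoi ?"), fixed_lookahead("Quoi ?"))
```

The two fixes agree on simple input but differ on adjacent punctuation such as "a ! ?": the capturing form consumes the "!" and so misses the following " ?", while the lookahead form armours both spaces.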

x00000000 wrote:

I missed the common cases ''« text »'' vs »''text''«, and <ref/>s seem to be already expanded at that stage (by looking at the code; I have no MediaWiki installation to test):

s/((?:[\s(]|<[a-zA-Z]+>|^)«) /$1&nbsp;/
s/ »(?=\.?\)|[.,]?(?:\s|<(?:\/|sup[\s>])|$))/&nbsp;»/

This also handles <blockquote>« citation »</blockquote> and similar (a line break isn't likely to occur at the beginning of a block element, but it makes a difference with text-align:justify in Unicode-compliant browsers). It doesn't handle start tags with attributes like <span style="...">« text »</span> because that would be very expensive if done properly.

The better solution would be a configuration switch to apply these substitutions only for languages where they make sense. The only one of the current substitutions that makes some sense in most languages is s/ %/&nbsp;%/ (but it still destroys <code>x = y % z</code>).
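As a rough idea of what such a switch could look like, here is a Python sketch. The setting name `FRENCH_SPACE_LANGS` and the function are hypothetical, not existing MediaWiki configuration, and the rule shown is a simplified version of the punctuation substitution discussed above.

```python
import re

# Hypothetical setting: content languages that want French-style spacing.
FRENCH_SPACE_LANGS = frozenset({"fr"})

# A space preceded by any character and followed by ? : ; ! % or ».
_PUNCT_RE = re.compile(r"(?<=.) (?=[?:;!%»])")

def armor_spaces(text: str, content_lang: str) -> str:
    """Apply the no-break-space substitution only for configured languages."""
    if content_lang not in FRENCH_SPACE_LANGS:
        return text
    return _PUNCT_RE.sub("&nbsp;", text)

print(armor_spaces("Quoi ?", "fr"))
print(armor_spaces("text »quote« text", "pl"))
```

With a gate like this, wikis in other content languages would keep ordinary spaces, while the remaining false positives (for example code samples on a French wiki) would still need the finer-grained patterns above.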

matmarex renamed this task from "space before/after &raquo;/&laquo; »guillemets« converted to &nbsp;" to "Space before/after »guillemets« (&raquo;/&laquo;) converted to non-breaking space (&nbsp;) (French spaces)". (May 14 2015, 5:33 PM)
matmarex raised the priority of this task from Low to Medium.
matmarex updated the task description.
matmarex set Security to None.
matmarex removed a subscriber: Unknown Object (MLST).
matmarex added subscribers: matmarex, Zdzislaw, Aklapper, Ankry.

Quoting @Ankry from T99034:

Present MediaWiki (1.26wmf5) parser replaces ' »' with '&#160;»'. This is unintended behaviour for plwikisource, as both types of quoting, »this one« and «this one», are used in Polish-language texts, the first being even preferred. Preventing soft line breaking before the '»' sign is not correct for Polish texts. How can it be disabled for plwikisource?

Test page for this behaviour: https://pl.wikisource.org/wiki/Wikiskryba:Zdzislaw/brudnopis/test3

Is this still an issue? It definitely needs more detail to survive in modern times.

Still an issue, see the previous comment here for a more detailed summary.

Seems like all other languages suffer from this magic for French (no offense intended). As French is presumably the only language which needs that, this feature should definitely be removed by default.

Either have a config variable to turn this behavior on, or create an Extension:Guillemets that would handle it on wikis where it is installed.

matmarex assigned this task to cscott.

Indeed looks fixed; the three test cases in the example page I linked earlier (quoting @Ankry from T99034) all behave the same now.

Test page for this behaviour: https://pl.wikisource.org/wiki/Wikiskryba:Zdzislaw/brudnopis/test3

matmarex removed cscott as the assignee of this task.