
French spaces armoring not working correctly
Closed, Resolved (Public)

Description

Normal spaces (U+0020) before ! ? : ; » and after « are automatically converted by the parser to non-breaking spaces (U+00A0) on French-speaking wikis. However, for the past few hours, non-breaking spaces have been missing or misplaced when there are <span></span> tags:

For example on https://fr.wikipedia.org/w/index.php?title=Kiev&oldid=180348745

en <a href="/wiki/Ukrainien" title="Ukrainien">ukrainien</a> :&nbsp;<span class="lang-uk" lang="uk">Київ</span>, <i>Kyiv</i>

&nbsp; should be before the colon, not after.

On https://fr.wikipedia.org/w/index.php?title=Visual_novel&oldid=179960076

(abréviation de «&nbsp;<span class="lang-en" lang="en"><i>novel</i></span> »), qui consistent essentiellement en une narration et comportent très peu d'éléments interactifs et les jeux dits «&nbsp;AVG&nbsp;» ou «&nbsp;ADV&nbsp;» (respectivement «&nbsp;<span class="lang-en" lang="en"><i>adventure game</i></span> » et «&nbsp;<span class="lang-en" lang="en"><i>adventure</i></span> »)

&nbsp; is missing before the closing guillemet (»).

Event Timeline

Arlolra triaged this task as High priority.

Oh, did I break this in T255007?

Yup

However, for the past few hours, non-breaking spaces have been missing or misplaced when there are <span></span> tags:

Yeah, previously, the regexp that adds the spacing would consider the entire document as a serialized string. Now, the text nodes in each element are considered individually. The regexp will need to be adjusted to continue with this approach.
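
To illustrate the difference, here is a minimal Python sketch. The two patterns are simplified assumptions made up for this example, not the actual regexps in MediaWiki core or Parsoid, and the serialized string and text node are written by hand from the Kiev example. It shows why a pattern that expects some character before the space can work on a serialized document yet miss a space that opens a text node:

```python
import re

NBSP = "\u00a0"

# Assumed, simplified patterns (not the real MediaWiki/Parsoid regexps).
# Old style: some character must precede the space being armored.
WITH_PRECEDING = re.compile(r"(.) (?=[?:;!%»›](?!\w))")
# Adjusted style: the space may sit at the very start of the string.
WITHOUT_PRECEDING = re.compile(r" (?=[?:;!%»›](?!\w))")


def armor_with_preceding(text: str) -> str:
    """Replace the space with NBSP, keeping the character captured before it."""
    return WITH_PRECEDING.sub(lambda m: m.group(1) + NBSP, text)


def armor_without_preceding(text: str) -> str:
    """Replace the space with NBSP regardless of what comes before it."""
    return WITHOUT_PRECEDING.sub(NBSP, text)


# Whole document as one string: the ">" of "</a>" precedes the space.
serialized = '<a href="/wiki/Ukrainien">ukrainien</a> :' + NBSP + "<span>Київ</span>"
# The same text seen as an isolated text node: the space is the first character.
text_node = " :" + NBSP

whole_document = armor_with_preceding(serialized)  # space before ":" is armored
per_node_old = armor_with_preceding(text_node)     # nothing precedes the space: unchanged
per_node_new = armor_without_preceding(text_node)  # space before ":" is armored
```

Under these assumptions, dropping the requirement for a preceding character is enough for the per-node approach to armor the space again.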

Change 667304 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/core@master] [WIP] Don't worry about something before when armoring french spaces

https://gerrit.wikimedia.org/r/667304

Change 667666 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Something before armoring French spacing

https://gerrit.wikimedia.org/r/667666

Change 667304 merged by jenkins-bot:
[mediawiki/core@master] Don't worry about something before when armoring french spaces

https://gerrit.wikimedia.org/r/667304

For example on https://fr.wikipedia.org/w/index.php?title=Kiev&oldid=180348745

en <a href="/wiki/Ukrainien" title="Ukrainien">ukrainien</a> :&nbsp;<span class="lang-uk" lang="uk">Київ</span>, <i>Kyiv</i>

&nbsp; should be before the colon, not after.

After purging the page (?action=purge), this now renders as,

en <a href="/wiki/Ukrainien" title="Ukrainien">ukrainien</a>&#160;:&#160;<span class="lang-uk" lang="uk">Київ</span>, <i>Kyiv</i>

which seems like an improvement but I wasn't expecting the non-breaking space *after* the colon.

The wikitext expands to this though,

en [[ukrainien]] : <span class="lang-uk" lang="uk">Київ</span>

and then running it through hexdump -C,

00000000  65 6e 20 5b 5b 75 6b 72  61 69 6e 69 65 6e 5d 5d  |en [[ukrainien]]|
00000010  20 3a c2 a0 3c 73 70 61  6e 20 63 6c 61 73 73 3d  | :..<span class=|
00000020  22 6c 61 6e 67 2d 75 6b  22 20 6c 61 6e 67 3d 22  |"lang-uk" lang="|
00000030  75 6b 22 3e d0 9a d0 b8  d1 97 d0 b2 3c 2f 73 70  |uk">........</sp|
00000040  61 6e 3e 0a 0a                                    |an>..|
00000045

So there's a c2 a0 in the source which Remex is escaping for us,
https://github.com/wikimedia/mediawiki/blob/master/includes/tidy/RemexCompatFormatter.php#L24-L26
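
The same check can be sketched in Python instead of hexdump. The wikitext string below is retyped by hand from the dump above (including the two trailing newlines), so treat it as an assumption rather than the exact expanded output:

```python
# Expanded wikitext, with the non-breaking space written explicitly as \u00a0.
wikitext = 'en [[ukrainien]] :\u00a0<span class="lang-uk" lang="uk">Київ</span>\n\n'

data = wikitext.encode("utf-8")

# U+00A0 is encoded in UTF-8 as the two bytes c2 a0.
nbsp_bytes = "\u00a0".encode("utf-8")

# Byte offset of the first non-breaking space in the encoded wikitext.
offset = data.find(nbsp_bytes)

print(f"length: {len(data):#x}, nbsp bytes: {nbsp_bytes.hex(' ')}, offset: {offset:#x}")
```

The offset lands right after the bytes 20 3a (space, colon), which matches the second row of the dump.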

@Arlolra Thank you for the patch!

which seems like an improvement but I wasn't expecting the non-breaking space *after* the colon.

There shouldn’t be a non-breaking space after the colon, should I reopen this ticket?

There shouldn’t be a non-breaking space after the colon, should I reopen this ticket?

Nope. The analysis in T275918#6883745 says that that non-breaking space is in the source. You can edit the template {{lang-uk|Київ}} if it's undesirable, but it's not coming from the armoring in the parser.

A quick look at https://fr.wikipedia.org/w/index.php?title=Module:Langue has code like,

	-- Définition du nom de la langue en français.
	local nom = Langue.lienLangue{ code }

	if texte ~= '' then
		texte = '\194\160' .. Langue.lang{ code, dir = dir, texte = texte, trans = trans }
	end

	wikiText = nom .. ' :' .. texte

	return wikiText

where \194\160 is the UTF-8 encoding (decimal bytes 194 160, i.e. 0xC2 0xA0) of that non-breaking space.
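
As a rough Python mirror of that Lua concatenation: the function name `build_wikitext` and its arguments are hypothetical, and the sketch only models the string joining, not the real `Langue.lienLangue` or `Langue.lang` calls.

```python
# Lua's '\194\160' uses decimal byte escapes; decoding those bytes gives U+00A0.
NBSP = bytes([194, 160]).decode("utf-8")


def build_wikitext(nom: str, texte: str) -> str:
    """Hypothetical mirror of the module snippet: prefix texte with NBSP, then join."""
    if texte != "":
        texte = NBSP + texte
    # A plain space goes before the colon; the NBSP after it comes from the template.
    return nom + " :" + texte


result = build_wikitext("[[ukrainien]]", '<span class="lang-uk" lang="uk">Київ</span>')
```

The plain space before the colon is the one the parser armors, while the non-breaking space after the colon is already present in the string the module returns.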

Change 667666 merged by jenkins-bot:
[mediawiki/services/parsoid@master] French spacing: don't require non-space before French spacing

https://gerrit.wikimedia.org/r/667666

Change 674397 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a29

https://gerrit.wikimedia.org/r/674397

Change 674373 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@wmf/1.36.0-wmf.36] Bump wikimedia/parsoid to 0.13.0-a29

https://gerrit.wikimedia.org/r/674373

Change 674397 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a29

https://gerrit.wikimedia.org/r/674397

Change 674373 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.36] Bump wikimedia/parsoid to 0.13.0-a29

https://gerrit.wikimedia.org/r/674373