Page MenuHomePhabricator

Unwanted non-breaking space added before a ':' in TemplateStyles rules
Open, HighPublic

Description

Hello,

There is obviously a problem with TemplateStyles on frwiki.

As bug reported here, after investigation, unwanted content seems to be added in the TemplateStyles CSS.

The bug is visible in TemplateData blocks, like here : Exemple  : Infobox is displayed, instead of: Exemple : Infobox

The CSS TemplateStyles sheet : Modèle:Méta documentation de modèle/styles.css

In particular, this rule:

.mw-templatedata-doc-params dt:after {
	content: "\a0:\a0";

Is modified as follows in the source code in the page:

.mw-parser-output .mw-templatedata-doc-params dt:after{content:"\a0 :\a0 "}

The code:

content: "\a0:\a0";

Is incorrectly replaced by:

content:"\a0 :\a0 "

By investigating a little, it may be possible that this bug is related to the modifications made T197879

Thank you.

(sorry for my bad English)

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper. · View Herald TranscriptJun 10 2020, 11:29 AM
Tgr added a subscriber: Tgr.

This is probably caused by rMWa60dcdc2e346: Armor against French spaces detection in HTML attributes (or affected by it; it might have been broken in a different way before, as that was already an attempt to fix things like T5158 and T13874) which makes Sanitizer::safeEncodeAttribute do that kind of conversion. Why would the sanitizer be invoked for the contents of a TemplateStyles <style> tag though, which is not an attribute and has its own sanitizer?

Tgr added a comment.EditedJun 10 2020, 12:23 PM

Given that the CSS rule in the template style page says content: "\a0:\a0";. The CSS rule that actually gets embedded in the document is content:"\a0&#160;:\a0 " (the last character being a regular space) so there seem to be multiple layers of brokenness here. Somehow the \a0 (correct CSS escape for a non-breaking space) is turned into \a0 (extra space at the end) and then the space before the colon gets HTML-escaped into a non-breaking space? (armorFrenchSpaces will do that, if I read it correctly.)

Yeah, this was a result of the changes in T197879

Particularly this patch,
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/593017

Where it says,

The noteworthy things about the patch are that it moves this sanitization to being applied after nowikis are unstripped, like the other sanitization.

Why would the sanitizer be invoked for the contents of a TemplateStyles <style> tag though, which is not an attribute and has its own sanitizer?

Attribute sanitization isn't invoked and, if it was, it would replace with &#32;'

Arlolra claimed this task.Jun 10 2020, 11:54 PM

Somehow the \a0 (correct CSS escape for a non-breaking space) is turned into \a0 (extra space at the end)

That seems to be coming from,
https://github.com/wikimedia/css-sanitizer/blob/a91aa8efccd99f7ef879b4462e233839fc900b2c/src/Objects/Token.php#L389

The space is probably necessary to avoid the escaped character combining in any nefarious way.

so there seem to be multiple layers of brokenness here

Yeah, even before the above patch, the presence of regular space defeated the purpose of the non-breaking space. Given that T133408 was closed in 2017, this hasn't worked as intended for a while.

cscott added a subscriber: cscott.Jun 11 2020, 7:08 AM

The space after the CSS escape is harmless. For it's own baroque reasons, CSS doesn't have separate escape mechanisms for 2-hex-digit and 4-hex-digit unicode escapes, so instead it collects digits until it sees a non-digit, and if the non-digit is a space then it eats it. https://drafts.csswg.org/css-syntax/#consume-escaped-code-point

@Arlolra parsoid can apply french-spacing at the end because it has DOM structure still so it can avoid recursing into attributes or <style> tags. PHP probably needs a hack to do the same; either moving french-spacing back before nowiki, or explicitly adding some hacky code to split on </?style and skip processing anything inside. (LanguageConverter uses hacks of this sort.)

Actually, I take it back: Remex is pretty darn fast, I bet you could actually fold this into tidy or independently hook up Remex so that you can skip tags and their attributes and "bad" tags like <style> the "correct" way. See core/includes/parser/RemexStripTagHandler.php. (That would probably let you get rid of the &#32; hack as well.)

The space after the CSS escape is harmless. For it's own baroque reasons, CSS doesn't have separate escape mechanisms for 2-hex-digit and 4-hex-digit unicode escapes, so instead it collects digits until it sees a non-digit, and if the non-digit is a space then it eats it. https://drafts.csswg.org/css-syntax/#consume-escaped-code-point

That seems true of tokenizing but this is in the content value content: "\a0 :\a0 "; which is going to be literally inserted in the dom.

Arlolra triaged this task as High priority.Jun 15 2020, 5:11 PM
Arlolra moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.

Change 628864 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Handle mw:DisplaySpace when serializing mw:Nowiki content

https://gerrit.wikimedia.org/r/628864

Change 628864 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Handle mw:DisplaySpace when serializing mw:Nowiki content

https://gerrit.wikimedia.org/r/628864

Change 630676 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia\parsoid to 0.13.0-a11

https://gerrit.wikimedia.org/r/630676

Change 630676 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a11

https://gerrit.wikimedia.org/r/630676