Page MenuHomePhabricator

Unwanted non-breaking space added before a ':' in TemplateStyles rules
Closed, ResolvedPublic

Description

Hello,

There is obviously a problem with TemplateStyles on frwiki.

As bug reported here, after investigation, unwanted content seems to be added in the TemplateStyles CSS.

The bug is visible in TemplateData blocks, like here : Exemple  : Infobox is displayed, instead of: Exemple : Infobox

The CSS TemplateStyles sheet : Modèle:Méta documentation de modèle/styles.css

In particular, this rule:

.mw-templatedata-doc-params dt:after {
	content: "\a0:\a0";

Is modified as follows in the source code in the page:

.mw-parser-output .mw-templatedata-doc-params dt:after{content:"\a0 :\a0 "}

The code:

content: "\a0:\a0";

Is incorrectly replaced by:

content:"\a0 :\a0 "

By investigating a little, it may be possible that this bug is related to the modifications made T197879

Thank you.

(sorry for my bad English)

Event Timeline

Tgr added a subscriber: Tgr.

This is probably caused by rMWa60dcdc2e346: Armor against French spaces detection in HTML attributes (or affected by it; it might have been broken in a different way before, as that was already an attempt to fix things like T5158 and T13874) which makes Sanitizer::safeEncodeAttribute do that kind of conversion. Why would the sanitizer be invoked for the contents of a TemplateStyles <style> tag though, which is not an attribute and has its own sanitizer?

Given that the CSS rule in the template style page says content: "\a0:\a0";. The CSS rule that actually gets embedded in the document is content:"\a0&#160;:\a0 " (the last character being a regular space) so there seem to be multiple layers of brokenness here. Somehow the \a0 (correct CSS escape for a non-breaking space) is turned into \a0 (extra space at the end) and then the space before the colon gets HTML-escaped into a non-breaking space? (armorFrenchSpaces will do that, if I read it correctly.)

Yeah, this was a result of the changes in T197879

Particularly this patch,
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/593017

Where it says,

The noteworthy things about the patch are that it moves this sanitization to being applied after nowikis are unstripped, like the other sanitization.

Why would the sanitizer be invoked for the contents of a TemplateStyles <style> tag though, which is not an attribute and has its own sanitizer?

Attribute sanitization isn't invoked and, if it was, it would replace with &#32;'

Somehow the \a0 (correct CSS escape for a non-breaking space) is turned into \a0 (extra space at the end)

That seems to be coming from,
https://github.com/wikimedia/css-sanitizer/blob/a91aa8efccd99f7ef879b4462e233839fc900b2c/src/Objects/Token.php#L389

The space is probably necessary to avoid the escaped character combining in any nefarious way.

so there seem to be multiple layers of brokenness here

Yeah, even before the above patch, the presence of regular space defeated the purpose of the non-breaking space. Given that T133408 was closed in 2017, this hasn't worked as intended for a while.

The space after the CSS escape is harmless. For it's own baroque reasons, CSS doesn't have separate escape mechanisms for 2-hex-digit and 4-hex-digit unicode escapes, so instead it collects digits until it sees a non-digit, and if the non-digit is a space then it eats it. https://drafts.csswg.org/css-syntax/#consume-escaped-code-point

@Arlolra parsoid can apply french-spacing at the end because it has DOM structure still so it can avoid recursing into attributes or <style> tags. PHP probably needs a hack to do the same; either moving french-spacing back before nowiki, or explicitly adding some hacky code to split on </?style and skip processing anything inside. (LanguageConverter uses hacks of this sort.)

Actually, I take it back: Remex is pretty darn fast, I bet you could actually fold this into tidy or independently hook up Remex so that you can skip tags and their attributes and "bad" tags like <style> the "correct" way. See core/includes/parser/RemexStripTagHandler.php. (That would probably let you get rid of the &#32; hack as well.)

The space after the CSS escape is harmless. For it's own baroque reasons, CSS doesn't have separate escape mechanisms for 2-hex-digit and 4-hex-digit unicode escapes, so instead it collects digits until it sees a non-digit, and if the non-digit is a space then it eats it. https://drafts.csswg.org/css-syntax/#consume-escaped-code-point

That seems true of tokenizing but this is in the content value content: "\a0 :\a0 "; which is going to be literally inserted in the dom.

Arlolra triaged this task as High priority.Jun 15 2020, 5:11 PM
Arlolra moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.

Change 628864 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Handle mw:DisplaySpace when serializing mw:Nowiki content

https://gerrit.wikimedia.org/r/628864

Change 628864 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Handle mw:DisplaySpace when serializing mw:Nowiki content

https://gerrit.wikimedia.org/r/628864

Change 630676 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia\parsoid to 0.13.0-a11

https://gerrit.wikimedia.org/r/630676

Change 630676 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a11

https://gerrit.wikimedia.org/r/630676

Despite the above patch, the problem is still present on frwiki…

This specific issue can be recreated with the following permanent links: Template:Graph:Chart/Documentation (copying the wikitext, and using the Special:ExpandTemplates to obtain the wikitext without templates) and Template:Méta documentation de modèle/styles.css (the title of the second must be kept or changed accordingly in the <templatestyles> tag).

I proposed a workaround for frwiki, perhaps the specific example will soon disappear from frwiki (=being workarouned instead of being fixed).

Change 663090 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Don't traverse into raw text elements

https://gerrit.wikimedia.org/r/663090

Change 663226 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/core@master] [WIP] Don't apply french spacing in raw text elements

https://gerrit.wikimedia.org/r/663226

Despite the above patch, the problem is still present on frwiki…

I'm really sorry this fell off my radar for so long

No problem at all!
And the fix seems to be significantly more complex than I thought, so thank you for your work on it!

Change 663090 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Don't apply French spacing in raw text elements

https://gerrit.wikimedia.org/r/663090

@Arlolra I am not sure this is the same issue. I will download the patch and test, but looking at the code I think it is not related.

I have appplied the patches from https://gerrit.wikimedia.org/r/663090
The problem still persists. As I said the problem described here : https://phabricator.wikimedia.org/T274850 Seems different

@Sen-Sai please see the discussion around T255007#6213141

Also, the commit message in Ic8965e81882d7cf024bdced437f684064a30ac86 says,

Note that the test "Nowiki and french spacing" wonders whether this escaping should be applied to nowiki content.

The above core patch doesn't fix the issue for your particular case but this task is where the work is being done to determine what to do about it.

@Sen-Sai Can you say more about what problem the presence of the non-breaking space characters is causing? I see your extension is doing a lot of stuff with forms, so are they ending up in textarea content or something? Thanks

@Arlolra Sure. The extension is called WSForm and is a Form rendering extension, using tags similar to html5. With things like file upload, a custom JavaScript snippet comes rendered with the form. Those now break, since JavaScript cannot handle these type of spaces. It also makes no sense to me why they are used.
I could,, and was already planned,, write a dynamic resource loader. However I can imagine a lot of other extensions are depending on the nowiki markerType. Since WSForm is currently installed on a lot of wiki's, this change of behaviour is unfortunate.

The above core patch doesn't fix the issue for your particular case but this task is where the work is being done to determine what to do about it.

Ah got it!

With things like file upload, a custom JavaScript snippet comes rendered with the form.

How is it rendered? In a <script> tag? Because the above patch does avoid traversing into scripts, so the French spacing should not be applied there. There's a test for that, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/663226/10/tests/parser/parserTests.txt

Those now break, since JavaScript cannot handle these type of spaces.

Or is the JavaScript interaction with some portion of the page now broken?

Maybe you can show me a code snippet so I can better understand.

Sure here's one example.

<script>
function getFileType( ffile ) {
    var extension = ffile.substr((ffile.lastIndexOf('.') + 1));
    switch (extension) {
        case 'jpg'&#160;:
        case 'jpeg'&#160;:
            return 'image/jpeg';
            break;
        case 'png'&#160;:
            return 'image/png';
            break;
    }
}
</script>

I am currently rewriting all the snippets to be included using

/**
 * Add a self-contained script tag with the given contents
 * Internal use only. Use OutputPage::addModules() if possible.
 *
 * @param string $script JavaScript text, no script tags
 */
public function addInlineScript( $script ) {
    $this->mScripts .= Html::inlineScript( "\n$script\n", $this->CSP->getNonce() ) . "\n";
}

as : $out->addInlineScript( $script );

Did some tests and that seems to work.

Sure here's one example.

This case you provided should be fixed by the patch in,
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/663226

Are you sure you applied it correctly? Can you try again to confirm

If not, I'll install your extension locally and see if I can reproduce it

Well perhaps you can access : https://csp.wikibase.nl/index.php/SimpleFileUpload
When you open the console and reload the page, you'll see messages like : Uncaught SyntaxError: private fields are not currently supported and they point to the example result I showed you. Keep in mind I am still not using the $out->addInlineScript( $script ); only returning HTML with the textMarker nowiki

Here's what I did.

git clone https://bitbucket.org/wikibasesolutions/mw-wsform.git
git checkout 316423c

That's the commit of your extension from Special:Version of csp.wikibase.nl. The HEAD of master in your repo is throwing at L295 of field.class.php because of the issue in discussion.

Then I created a local page with the source from SimpleFileUpload,

<wsform action="addToWiki" enctype="multipart/form-data">

<wsfield id="another-test" type="file" name="wsform-test-file" target="WSForm upload Test file" pagecontent="Uploaded for the WSForm upload test" />

<wsfield type="submit" value="Upload" />

</wsform>

Then I checked out the commit before https://gerrit.wikimedia.org/r/c/mediawiki/core/+/663226 (which is essentially the HEAD of master on core's repo, 1.36.0-alpha)

And, as reported, the script is mangled like,

if(typeof enctype === 'undefined' || !enctype || enctype&#160;!== 'multipart/form-data') {

Then I checked out the patch from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/663226 and reloaded the page,

if(typeof enctype === 'undefined' || !enctype || enctype !== 'multipart/form-data') {

As expected, the script tag is no longer being traversed into and no non-breaking space is observed.

I'm not sure what commit you're running https://csp.wikibase.nl/index.php/SimpleFileUpload but it doesn't seem like the patch above is being applied.

Thanks

Well looking at the patch you mention, the one I used had only parsoid changes.
Ah I see.. I took the last one mentioned in the comments : https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/663090/ :-(
Your effort is very much appreciated. If there's anything I can help you out with one day, please let me know.

Change 663226 merged by jenkins-bot:
[mediawiki/core@master] Don't apply French spacing in raw text elements

https://gerrit.wikimedia.org/r/663226

Change 666213 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a25

https://gerrit.wikimedia.org/r/666213

Change 666213 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a25

https://gerrit.wikimedia.org/r/666213

The bug is visible in TemplateData blocks, like here : Exemple &#160;: Infobox is displayed, instead of: Exemple : Infobox

I ran ?action=purge on the page to confirm the issue is resolved https://fr.wikipedia.org/wiki/Mod%C3%A8le:Projet#TemplateData