Page MenuHomePhabricator

the Sanitizer allows only ASCII and a some punctuation in extension tag attributes
Closed, ResolvedPublic

Description

I took a stab at resolving T30980 and adding a non-ASCII tag name.

This was surprisingly easy until I tried to make the tag's parameters non-ASCII as well. Apparently, the Sanitizer only allows ASCII Latin letters, digits and a bit of punctuation, but no Unicode characters.

Would it be disastrous to add support for non-ASCII characters?


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:45 AM
bzimport set Reference to bz71386.
bzimport added a subscriber: Unknown Object (MLST).

Hmm, as far as I can tell, even XML allows them (And even if it didn't, I'm not sure that we necessarily should require our xml-like tags to be conforment to XML). For reference, the relavent code in MW land is Sanitizer::getAttribsRegex()

From http://www.w3.org/TR/REC-xml/#NT-Name :

[41] Attribute ::= Name Eq AttValue

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*

Which presumably is enough of unicode for your purposes (Although it has some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random to exclude, but we don't need them. It also excludes a whole bunch of combining accents, but is ok with precomposed forms (Which we normalize to anyways, but a couple of obscure things that don't have pre-composed forms may be excluded).

(In reply to Bawolff (Brian Wolff) from comment #1)

Hmm, as far as I can tell, even XML allows them (And even if it didn't, I'm
not sure that we necessarily should require our xml-like tags to be
conforment to XML). For reference, the relavent code in MW land is
Sanitizer::getAttribsRegex()

Yes, that's precisely where I found it while debugging core trying to understand where on Earth do my perfectly good Hebrew attribute names disappear :)

From http://www.w3.org/TR/REC-xml/#NT-Name :

[41] Attribute ::= Name Eq AttValue

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
[#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
[#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
[#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
[#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*

Which presumably is enough of unicode for your purposes (Although it has
some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random
to exclude, but we don't need them. It also excludes a whole bunch of
combining accents, but is ok with precomposed forms (Which we normalize to
anyways, but a couple of obscure things that don't have pre-composed forms
may be excluded).

Yes, sounds kinda OK, unless people do very funky things with accents :)
I don't need more than simple letters from languages. Adding all Unicode letter ranges would be OK for my purposes.

So again, does anybody think that this can get us into any troubbble?

(In reply to Amir E. Aharoni from comment #2)

So again, does anybody think that this can get us into any trouble?

Who can answer Amire80's question above?

So again, does anybody think that this can get us into any troubbble?

Is the proposal to change the behavior to just parser extension tags? Or to make the change in getAttribsRegex?

Tags and attribute names, like "ref" and "name" in <ref name="hello" />. I care more about the functionality than about the implementation details.

Change 164223 had a related patch set uploaded (by Esanders; owner: Amire80):
[mediawiki/core@master] Allow attribute names to use any Unicode "Letter" or "Number"

https://gerrit.wikimedia.org/r/164223

Updated patch to use Unicode regex groups (allowing all of Letter and Number)

Change 164223 merged by jenkins-bot:
[mediawiki/core@master] Sanitizer: Allow attribute names to use any Unicode "Letter" or "Number"

https://gerrit.wikimedia.org/r/164223

This comment was removed by cscott.

It would be much nicer for Parsoid, FWIW, to actually implement the HTML5 spec -- otherwise we need to link against libicu or some other unicode library to get the "Letters and Numbers" set, and then update it regularly to track unicode, etc. Yuck.

Change 364779 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] T73386: Allow attribute names to use any Unicode "Letter" or "Number"

https://gerrit.wikimedia.org/r/364779

After discussing this with @cscott , I'm inclined to instead lean towards the XML definition of specific ranges instead of unicode properties, for ease of the parsoid team.

Change 364818 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Use the XML definition for what's a valid attribute name

https://gerrit.wikimedia.org/r/364818

Change 364779 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T73386: Allow attribute names to use any Unicode "Letter" or "Number"

https://gerrit.wikimedia.org/r/364779

After discussing this with @cscott , I'm inclined to instead lean towards the XML definition of specific ranges instead of unicode properties, for ease of the parsoid team.

Arlo came up with a nice patch ( https://gerrit.wikimedia.org/r/364779 ) that addressed my concerns about using unicode properties, and that landed some time ago. So I'm inclined to close this as resolved. @Bawolff , do you have any objection to abandoning your https://gerrit.wikimedia.org/r/364818 patch?

Change 364818 abandoned by Brian Wolff:
Use the XML definition for what's a valid attribute name

https://gerrit.wikimedia.org/r/364818

Bawolff claimed this task.