setHook (et al) allows silly tag names
Open, LowestPublic
Actions

Assigned To

None

Authored By

	RobinHood70
	Jul 25 2016, 5:53 PM

Description

The parser function setHook and its various derivatives only restrict tag names to not including the following characters: "<>\r\n". This is excessively liberal and does not conform to any traditional specification. This means that one could register tag names like the following:

<my tag>, which becomes confusing, since spaces are not normally allowed under the HTML/XML specs, and denote attributes.
<mytag NotAnAttribute="hello">, which anybody would take to be a tag with an attribute, but could, in fact, be an entire tag unto itself, with attributes following after. This again might cause parser issues, depending on the implementation details.
<NotSelfClosed />, which anyone would assume is a self-closed tag...until they see they closing tag </NotSelfClosed />.

In all of these cases, how the tag actually performs compared to expectations would depend on the details of the parser implementation, which is obviously not ideal.

The HTML specification limits tags to 0-9, a-z, A-Z, which has the benefit of being simple, but is probably not appropriate given that extensions may be language-specific. XML provides a much more liberal spec which would probably be more appropriate for MW to use, though it's fairly complex (see https://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name). On the bright side, the complexity is not really an issue, since it's not like you're going to be using setHook thousands of times. Even if neither of those specs is appropriate, I think we should at least restrict identifying characters like space, single- and double-quotes, equals, and slash.

Event Timeline

RobinHood70 created this task.Jul 25 2016, 5:53 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 25 2016, 5:53 PM

RobinHood70 updated the task description. (Show Details)Jul 25 2016, 5:54 PM

[ Removing Parsing-Team--ARCHIVED, setting MediaWiki-Parser project]

To be clear...this is about developers writing extensions with stupidly named hooks? And that MW doesn't prevent them from doing that?

Yes. I realize it's not going to be a priority, but if the code is going to check for stupidly named hooks at all, which it already does, I think it should at least cover off basic correct syntax by excluding spaces, single-quotes, double-quotes, equals signs, and slashes. That would be the easy change, since it's just a matter of typing the extra characters into the Regex (and maybe making that a static Regex or whatever PHP supports, rather than repeating the same one in every function). Using the XML spec would probably be even better from a purely technical standpoint, but is probably overkill in this context.

Legoktm triaged this task as Low priority.Jul 26 2016, 11:55 PM

Aklapper lowered the priority of this task from Low to Lowest.Jul 27 2016, 6:24 AM

Just a note on this: based on the tests in PreprocessorTest.php, it seems MW breaks convention and deliberately allows tags with spaces.

The specific examples given are: <display map> and <display map foo>, where the second one is a "display map" tag with an empty attribute of "foo". In addition, there's </foo><//foo>.

These both seem like really bad ideas to me, so the question would be if there are any extensions that are actually using these conventions. If there are, then I suppose we can't break them, or if we do, not without significant warning.

setHook (et al) allows silly tag namesOpen, LowestPublicActions

Description

Event Timeline

setHook (et al) allows silly tag names
Open, LowestPublic
Actions