Page MenuHomePhabricator

Mechanism to find usages of raw-html messages
Open, LowestPublic

Description

From T2212#25557:

What I would like to is to have a mechanism to detect this automatically, something that can be enabled during development. PHP's taint module seems a candidate, but as of now it is not easy to install.

The remaining uses are hard to find, and extension developers usually don't pay attention to it.


Version: 1.16.x
Severity: normal

Details

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 21 2014, 10:42 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz19291.
bzimport added a subscriber: Unknown Object (MLST).

Switching blocker from bug 209 to bug 212, which is more directly relevant.

Should this be part of the testing infrastructure?

Not really. That is more a general MediaWiki issue and how we do not detect user input being passed directly to output without proper escaping.

The PHP taint extension is exactly what we could use though it is very unlikely we will ever require such an extension as a dependency. I know of facebook/pffff which is an objective caml analyzer for PHP which *might* be able to detect such issues. Anyway not an easy task with the PHP language.

Suggesting WONTFIX here, Niklas. There isn't really a way to find this out. As long as $context->msg() or wfMessage() is used, even Message::text() and Message::plain() can be escaped or parsed later on, so there's not really an indicator.

During the recent updates from wfMsg* to wfMessage, many problems have been resolved (and some new ones have been introduced, overescaping accidentally), so the issue of outputting raw HTML should be smaller now, albeit not gone.

From what I can see, proper auditing on review is the only option for now (and being warned by users).

Siebrand: there are ways as mentioned above. I just don't believe anyone will have time to work on this. I hope this wont come to bite us later.

(In reply to Niklas Laxström from comment #5)

I just don't believe anyone will have time to work on this.

This seems to be the case still. Setting priority to Lowest to reflect this fact.

Change 203299 had a related patch set uploaded (by Nikerabbit):
Message over/underescaping detection

https://gerrit.wikimedia.org/r/203299

I think it might be cool to have an extension that provides ?uselang=qqs.

In testing i often hack MessageCache to turn messages into `" onmouseover="alert('$name')" ><img onerror="alert('$name')" src=x>

The PHP taint module seems like a reasonable tool, although I don't know how waterproof it is. We wouldn't have to require it, but we could support it if installed, and install it on Wikimedia and monitor the logs.

A more comprehensive solution would be to make message-retrieving functions not return raw strings, but rather some class like UntrustedString. This wouldn't auto-convert to a string, you'd have to convert it with escaping methods, like maybe ->forHTML() or ->forSQL() or such. We might not even have to have a method to get the raw string, unless there are valid use-cases. Then the only problem would be if someone writes code that escapes for the wrong type of output, like escaping for SQL then outputting as HTML. I'm not so worried about this getting past code review, but if we wanted extra safety, we could have UntrustedSQLString and so on that get passed around instead of strings, and have HTML output methods only accept raw strings or UntrustedHTMLStrings (and perhaps auto-escape plain UntrustedStrings, if we're in favor of being less explicit).

This is basically the same idea as Nikerabbit's patch, except using classes instead of mangling the strings, and I think it should be usable for production. Once we have such a mechanism in place, we can also use it for Request.

To clarify, my suggestion is not to have a debug system that prints warnings. It's to have a run-time safety check that will either throw or escape the string automatically if there's a violation. This should eliminate the possibility of XSS if it's deployed properly.

The PECL taint module is probably a good start, although an ideal taint module would probably have different scopes (e.g. addshlashes untaints something for use as a JS variable but not for use in HTML); also the list of tainting/untainting methods is not configurable.
There is also an RfC to include it in PHP core in a way that makes it feasible to use in production.

There is also a static analysis tool that claims to be able to track taint; it does not seem very mature though.

UntrustedString would not really be different from how Message behaves now, as far as I can see (although saner function names for the various formats would certainly help).

Message currently will return raw strings for all messages if you call ->text(), and it's up to the caller to escape them (or not) before outputting. Therefore any message could find its way into raw HTML output by mistake. An UntrustedString object would not be convertible to a string without explicitly calling a method to escape it, so it could not be passed to any output method unescaped.

Thinking about it, a similar but much simpler approach would be to make text() (and FORMAT_TEXT, etc.) only work for messages declared as accepting raw HTML, and throw for anything else. Then we'd have to rewrite all the callers of text() to call a different method that does some kind of escaping. Study would be needed to figure out how practical this would be, but it could be implemented incrementally.

In T21291#4468747, @Tgr wrote:

There is also a static analysis tool that claims to be able to track taint; it does not seem very mature though.

We have https://www.mediawiki.org/wiki/Phan-taint-check-plugin which is voting across all bundled extensions (with one exception I think) and a few others.

Change 203299 abandoned by Nikerabbit:
Message over/underescaping detection

Reason:
I believe this patch has served its purpose. There is now a phan tool that does static analysis. While this approach is complementary to that, it is not good enough to get merged into core. If anyone wants to use this, feel free to take over, preferably by submitting a new patch under your ownership.

https://gerrit.wikimedia.org/r/203299

The PHP taint module seems like a reasonable tool, although I don't know how waterproof it is. We wouldn't have to require it, but we could support it if installed, and install it on Wikimedia and monitor the logs.

A more comprehensive solution would be to make message-retrieving functions not return raw strings, but rather some class like UntrustedString. This wouldn't auto-convert to a string, you'd have to convert it with escaping methods, like maybe ->forHTML() or ->forSQL() or such. We might not even have to have a method to get the raw string, unless there are valid use-cases. Then the only problem would be if someone writes code that escapes for the wrong type of output, like escaping for SQL then outputting as HTML. I'm not so worried about this getting past code review, but if we wanted extra safety, we could have UntrustedSQLString and so on that get passed around instead of strings, and have HTML output methods only accept raw strings or UntrustedHTMLStrings (and perhaps auto-escape plain UntrustedStrings, if we're in favor of being less explicit).

This is basically the same idea as Nikerabbit's patch, except using classes instead of mangling the strings, and I think it should be usable for production. Once we have such a mechanism in place, we can also use it for Request.

This is an interesting idea, and I think that dynamic taint checking forms a good complement to static taint checking.

However i think there will always be a need to get the raw string sometimes. For example, outputting stuff to the irc recent changes feed, or if you need to combine multiple messages together before escaping.