
character filter for uselang
Closed, Resolved (Public)


Bug 36938 is fixed and adds escaping of uselang for HTML.

For the JavaScript variable mw.config.get( 'wgUserLanguage' ), a lot of characters are still allowed, but some are filtered: >>> "en" >>> " " >>> "!" >>> """ >>> "en" >>> "$" >>> "%" >>> "en" >>> "&" >>> "&amp"; >>> "en"; >>> ";" >>> "en" >>> "=" >>> "=" >>> "en"" >>> """' >>> "'"

Many scripts use wgUserLanguage unescaped. Examples:

When you open the following link on dewiki with activated gadget HotCat

the page is loaded.

Of course this is a bug in the gadget, but there are lots of gadgets which maybe contain the same error.
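
To illustrate the gadget-side fix, here is a minimal sketch of escaping the language value before it is placed into markup. The names escapeHtml and buildHelpLink, and the link target, are hypothetical and not taken from HotCat or any other gadget; a real gadget would more likely use whatever escaping utility its environment already provides.

```javascript
// Hypothetical helper: escapes the characters that are special in HTML
// text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical gadget code: the language value appears once inside a URL
// and once inside an HTML attribute, so each use gets its own encoding.
function buildHelpLink(userLanguage) {
  return '<a href="/wiki/Help:Contents?uselang=' +
    encodeURIComponent(userLanguage) +
    '" lang="' + escapeHtml(userLanguage) + '">Help</a>';
}
```

In a gadget, the argument would come from mw.config.get( 'wgUserLanguage' ). With this approach a value such as en"><script> ends up as inert text in the attribute instead of closing it.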

Expected result:
wgUserLanguage should only be set when uselang contains only allowed characters.

Version: 1.19.1
Severity: normal



Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:23 AM
bzimport set Reference to bz37587.
bzimport added a subscriber: Unknown Object (MLST).

BCP 47 says:

language tags use only the characters A-Z, a-z, 0-9, and HYPHEN-MINUS

These should be the only allowed characters.
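
A character check along these lines could be sketched as follows. The function names and the fallback parameter are assumptions made for illustration, not MediaWiki code, and the check only covers the character set quoted above, not the full tag structure that BCP 47 defines.

```javascript
// Characters named in the BCP 47 quote: A-Z, a-z, 0-9 and HYPHEN-MINUS.
const BCP47_CHARS = /^[A-Za-z0-9-]+$/;

// Hypothetical check: true only for non-empty strings made of those characters.
function hasOnlyBcp47Characters(code) {
  return typeof code === 'string' && BCP47_CHARS.test(code);
}

// Hypothetical use: keep the requested language if it passes,
// otherwise fall back to a default chosen by the caller.
function sanitizeUserLanguage(uselang, fallback) {
  return hasOnlyBcp47Characters(uselang) ? uselang : fallback;
}
```

For example, sanitizeUserLanguage( 'de-formal', 'en' ) keeps the requested value, while a value containing a quote or an ampersand falls back to 'en'.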


Thanks for reporting this too! I've been out on leave for the last two weeks, so apologies for the slow response.

I see exactly what you mean and yes, that is bad. We need to figure out the best place to put in the fix for this, but we will get it addressed asap.

The uselang attribute commonly contains punctuation characters that aren't allowed by BCP-47, due to the {{int:}} hack commonly used on multilanguage wikis. Only the minimum set of characters required for security should be rejected, plus the ones rejected by Language::isValidCode().

(In reply to comment #4)

The uselang attribute commonly contains punctuation characters that aren't
allowed by BCP-47

Such as? I thought it was only used for things like en-upload-ownwork. But that is always within BCP-47, and in general even stricter (never numbers or uppercase).

Working with Tim on this yesterday, he pulled a list of all of the uselang values that hit WMF sites from the cache. There were several obvious attack strings, and some that looked like they probably were errors. Almost all the rest used only a-zA-Z0-9.-+ characters, with a few ?, =, and NCR-encoded characters where it was hard to figure out if they were errors or intentional.

From a security perspective, I think we should at least implement Nikerabbit's patch now, and if anyone was intentionally using ', ", or &, we can work with the site admins to get those cleaned up. Then we can later look at whitelisting [a-zA-Z+.-] only.
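
The difference between the two steps can be sketched as two checks. This is only an illustration of the proposal in this comment, not the contents of the actual patch; the function names are hypothetical, and the character class is copied as written above (note that it contains no digits).

```javascript
// Step 1 (proposed for now): reject values containing ', " or &.
const REJECTED_CHARS = /['"&]/;

// Step 2 (proposed for later): accept only values made of [a-zA-Z+.-].
const ALLOWED_ONLY = /^[a-zA-Z+.-]+$/;

// Hypothetical denylist check: passes unless a rejected character is present.
function passesDenylist(uselang) {
  return !REJECTED_CHARS.test(uselang);
}

// Hypothetical allowlist check: passes only if every character is allowed.
function passesAllowlist(uselang) {
  return ALLOWED_ONLY.test(uselang);
}
```

A value such as en-upload-ownwork passes both checks. A value containing < passes the denylist but fails the allowlist, which is why the allowlist is the stricter follow-up.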

With the rollout of wmf8 today, the particular issues reported by fomafix appear to be resolved. Thanks everyone!