Page MenuHomePhabricator

XSS in language converter when used with Html class's tricky escaping
Closed, ResolvedPublic

Assigned To
Authored By
Bawolff
Sep 28 2014, 7:16 PM
Referenced Files
F103786: T73394-escape_gt-REL1_19.patch
Mar 24 2015, 10:35 PM
F99618: T73394-escape_gt-wmf21.patch
Mar 17 2015, 10:43 PM
F43922: T73394-escape_gt.patch
Feb 19 2015, 11:32 PM
F16331: bug71394-REL1_19.patch
Nov 26 2014, 1:24 PM
F16330: bug71394-REL1_24_REL1_23_REL1_22.patch
Nov 26 2014, 1:24 PM
F14798: file_71394.txt
Nov 26 2014, 5:56 AM
F14799: bug71394-tokenizer.patch
Nov 22 2014, 3:46 AM

Description

Change to Html escaping method

If the language is set to a language with variants (e.g. zh, sr, etc), then its possible to create an XSS if some other code allows user input to be put in an attribute using the escaping done by Html::element (or Html::rawElement).

Example exploit code:

  1. Set $wgLanguageCode = 'zh'; $wgDefaultLanguageVariant = 'zh-cn'; (Assuming you don't have a different default variant set in prefs)

Create a page with the following wikitext:

-{H|abc123=>zh-cn:" autofocus onfocus="alert('be evil');}-
{{Special:Contributions|target=>abc123}}

And you have your XSS. I imagine there's probably less obvious ways to exploit this once you take into all the extensions using Html:: for their escaping. The $wgDefaultLanguageVariant isn't really necessary as you can set the variant from the url.

If $wgWellFormedXml is off, I believe it becomes possible to directly insert a <script> tag instead of relying on event handlers (but haven't tested).

Solutions:

The fact that language converter basically does macro expansions after the html sanitization step, is pretty sketchy. But I'm not sure if there's a good way around that without redesigning the feature.

I would suggest fixing this by getting rid of the little weird tricks that Html::expandAttributes does. Well its cute that technically one doesn't need to escape that in that context, cute things like that seem like a bad idea in security critical code. (A similar thing occurs in the escaping of Html::element, however I don't think that is exploitable).


Version: unspecified
Severity: normal

Patch:

(fix the attack vectors, we'll address the root of the issue in a public patch)

  • 1.24:
  • 1.23:
  • 1.19:

Affected Versions: Since f16d1e4ed70cd5a8fa6ae6ca8bb71bfe62f4f47e (1.17?)
Type: xss
CVE: CVE-2015-2933

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application changed the visibility from "Public (No Login Required)" to "acl*security (Project)". · View Herald TranscriptNov 22 2014, 3:46 AM
Restricted Application changed the edit policy from "All Users" to "acl*security (Project)". · View Herald Transcript

p.s. In the interest of credit where credit is due for finding this, I should add: thank you to CScott, who helped me understand how language converter actually worked, and Amir who pointed me in this direction during a conversation about a different issue with the language converter.

This is also exploitable (I believe, have not tested) on wikis with a non-variant language as content language (e.g. a normal english wiki), if one can set the page language for a page to be a language with variants. This includes:

  • If $wgPageLanguageUseDB is set to true (Its false by default, I don't think anyone is currently using it)
  • If translate extension is installed (e.g. Commons, meta).
  • People can also do the same thing with pages like MediaWiki:Foo/zh, however only admins can edit those, so not big deal.

Thanks Bawolff! Confirmed it's exploitable, although I'm still trying to figure out exactly how that's working. Adding Niklas in case we can address this on the translation side.

(In reply to Chris Steipp from comment #3)

Thanks Bawolff! Confirmed it's exploitable, although I'm still trying to
figure out exactly how that's working. Adding Niklas in case we can address
this on the translation side.

Its in the code part, not the conversion tables, so not affected by what the translators do (Of course i18n devs are probably familiar with this area of the code). AFAIK the people who know most about this area of the code base and are still active are Liangent and CScott.


I probably should have explained a little better how it works:

  • Language converter is a feature to convert different writing systems (e.g. Serbian written in Latin, vs serbian written in Cyrillic, traditional Chinese vs simplified Chinese). It more or less works by doing find and replace for various letters or strings.
  • In addition to the general pre-defined find and replace strings, users can define their own replacements for a specific page (So in essence it works much like a macro system)
  • Replacements are done after all the HTML sanitization takes place (!!) pretty much at the very end of the parsing process.

*The language converter, in LanguageConverter::autoConvert has a bunch of regexes to exclude HTML tags from being converted, to basically stop people from doing this. Specifically /<[^>]+>/ (plus some other stuff in an alternation)
*However if you have an html tag like <div class=">"> (Which is allowed in html5, > does not have to be escaped if its inside an attribute enclosed in quotes), the regex will stop two characters short and allow you to do lang converter substitutions on stuff after the first >. Thus the challenge becomes making MW insert a tag with an attribute that has a non escaped > in the attribute.
*There are multiple ways html is escaped in MediaWiki. Most stuff from wikitext goes through the sanitizer class, which uses &gt; in the attribute for >. Html::element/Html::rawElement tries to be cool about avoiding unnecessary escapes (To save a miniscule amount of bytes I suppose). Thus Html::element can be used to insert such a tag.
*Very few things controllable from wikitext (or otherwise by the user) use Html::escape as an escaping method, the first one I found was the input box on the Special:Contribution special page. I imagine there are others, probably in extension parser tags, and other transcludable special pages. Lots of transcludable special pages have <input type="hidden" ... > elements that have a controllable value from the user and escaped by Html::element, however I couldn't immediately see a way to execute javascript automatically from an extra attribute on a hidden input, so I went with the text input which has obvious event handler attributes. I imagine there's probably some ingenious way to do the same thing to the hidden form tags.

p.s. Forgot to mention in original comment, languages that are affected: 'gan', 'iu','kk','ku','shi','sr','tg', 'uz', 'zh'. And any wiki where that is the content language or the user can set the page language of a specific page (other than Special pages) to one of those languages (eg a wiki with Translate extension) is vulnerable.

Created attachment 16738
Match html elements with > in attribute

Thanks Brian! I wouldn't mind encoding > in Html::expandAttributes too, but I'm worried that if we depend on the escaping there only, this could come up in other places. I think LanguageConverter also needs to be updated to (more) correctly try and parse the text it's working on and do sane things there. So here's a patch to do that. I'd probably feel more comfortable with just updating the LanguageConverter regex as a security fix, and then do the expandAttributes change publicly as a hardening.

It looks like LanguageConverterTest.php doesn't exercise the conversions, so it's hard to know if the old regex was being abused for some functionality. But this patch stops the xss for me, and appears to allow the rest of the conversion correctly. I'll add liangent and cscott to get their opinions.

attachment bug71394-converter.patch ignored as obsolete

Liangent / CScott, it would be great if you could take a look and see if this is sane.

(In reply to Chris Steipp from comment #5)

Created attachment 16738 [details]
Match html elements with > in attribute

Thanks Brian! I wouldn't mind encoding > in Html::expandAttributes too, but
I'm worried that if we depend on the escaping there only, this could come up
in other places.

Yes definitely.

Would the couple lines above ($htmlfix variable), that catch unclosed html elements also have to be changed? I'll admit, I'm unclear under what circumstances that part of the regex would come into play.


One possible edge case that might be needed to be considered where attachment 16738 would not work, is if someone could insert:

<div title=dog's>foo</div>

into a page. Of course, one could quite convincingly argue that if anything inserted something like that, its already pretty broken (And definitely in violation of the HTML5 spec), however, firefox (or at least the version I tested on) does interpret the title attribute as actually being dog's for that html snippet. So not sure if we should care about that.

It looks like LanguageConverterTest.php doesn't exercise the conversions, so
it's hard to know if the old regex was being abused for some functionality.
But this patch stops the xss for me, and appears to allow the rest of the
conversion correctly. I'll add liangent and cscott to get their opinions.

I don't think lang converter was being abused in a way that would be stopped by this (And if anyone does, I think its an unreasonable use of the feature). I believe most of the lang converter tests are in tests/parser/parserTests.txt

attachment bug71394-converter.patch ignored as obsolete

(In reply to Bawolff (Brian Wolff) from comment #7)

Would the couple lines above ($htmlfix variable), that catch unclosed html
elements also have to be changed? I'll admit, I'm unclear under what
circumstances that part of the regex would come into play.

That comes into play when the an html element is started but not closed in the part being autoconverted. I haven't worked out exactly how the wikitext is cut up and sent to the autoconvert function, but it appears to get sent in chunks. Although I haven't been able to find any wikitext that manages to send unclosed html elements to it.

One possible edge case that might be needed to be considered where
attachment 16738 [details] would not work, is if someone could insert:

<div title=dog's>foo</div>

That is indeed a problem, although even when $wgWellFormedXml is false, we would flag the ' as a blacklisted char if it went through Html::expandAttributes, so it would be title="dog's". But again, looking at the cases where it doesn't go through expandAttributes, I probably should add an extra condition that also matches the naive pattern to catch that.

into a page. Of course, one could quite convincingly argue that if anything
inserted something like that, its already pretty broken (And definitely in
violation of the HTML5 spec), however, firefox (or at least the version I
tested on) does interpret the title attribute as actually being dog's for
that html snippet. So not sure if we should care about that.

It looks like LanguageConverterTest.php doesn't exercise the conversions, so
it's hard to know if the old regex was being abused for some functionality.
But this patch stops the xss for me, and appears to allow the rest of the
conversion correctly. I'll add liangent and cscott to get their opinions.

I don't think lang converter was being abused in a way that would be stopped
by this (And if anyone does, I think its an unreasonable use of the
feature). I believe most of the lang converter tests are in
tests/parser/parserTests.txt

Created attachment 16744
Add matching html elements with > in attribute

This patch adds back in the naive matching (<[^>]+>) at the end of the regex, which, in my testing, handles both.

<div title=foo'bar> and <div title="foo>bar">

This assumes that the |'ed conditions of the regex are evaluated left to right, but I need to verify that we can assume that.

attachment bug71394-converterb.patch ignored as obsolete

I'm suspicious of the approach in chris' patch, I think you'll need something more (haven't yet reviewed bawolff's patch). There are a number of parsing issues in the regexps, for example:
<pre foo="></pre>">
escapes the $prefix match, so:
-{H|abc123=>zh-cn:" onfocus="alert('be evil');}-
<pre foo="></pre>" abc123></pre>

would exploit that, using the same Html::escape trick to get that attribute value past the sanitizer.

A more robust regexp for matching attributes (based on the spec, at http://dev.w3.org/html5/spec-preview/tokenization.html) is:
https://github.com/fgnass/domino/blob/master/lib/HTMLParser.js#L1458

but I'm not sure I've got high confidence this approach will catch all possible cases. I need to think hard about this.

Ok, I've read bawolff's patch. I think I like that approach better than trying to fix language converter here, although I would tweak his comment in the source to say something like,

"it's important to escape < >, even if it's not required in this context, to allow other bits of code (for example Language Converter) to use naive algorithms to extract tags"

To be concrete about the <pre> problem above, the issue is that Html::escape doesn't encode the "<". So to exploit the bad <pre> regexp in language converter, we just need to sneak a < into an attribute value. The vulnerable code in Special:Contributions which bawolff used to exploit the bad attribute-parsing regexp was:

		$input = Html::input(
			'target',
			$this->opts['target'],
                        ...

So to exploit the bad <pre>-parsing regexp we'd be looking for:

$pre = Html::pre( 'anyattribute', $userinput, .... )

or

$code = Html::code( 'anyattribute', $userinput, .... )

since the <code> regexp has the same problem (as does the <script> regexp).

The HTML parsing spec is complex. Until we can move LanguageConverter higher in the parsing stack and get it in front of the sanitizer, we should probably be very conservative in how we allow > to reach the output.

csteipp: So naively, $prefix = '<pre(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])*>.*?<\/pre>|';
cscott: i'm reading the spec to double check
cscott: i'd simplify that to $prefix = '<pre(?:"[^"]*"|\'[^\']*\'|[^\'">])*>.*?<\/pre>|';
cscott: that is, you don't need the [^'"] at the end of each quoted string
cscott: but the weird case you haven't covered is <pre foo=bar's>
cscott: i can confirm that once you've entered the "Attribute value (double-quoted) state" the only thing that will bail you out is a double quote, and similarly for single-quoted/single-quote
cscott: but not every double/single quote will get you into the attribute value state
cscott: in particular, single and double quotes when you are in the "attribute value (unquoted) state" are just appended to the attribute value.

cscott: i think the right solution is to be hyper conservative and bail out on language converter entirely if you see strange stuff
cscott: i think merging bawolff's patch is a good first step, followed by auditing anything which does any other "interesting" html escape. belt and suspenders.
cscott: i think we can do better with the language converter regexp, but i think the solution is going to be something like <[a-z]+((\s+[a-z]+="[^"]")*>|.*)
cscott: that is, just match to the end of the current string and protect all the rest of the input from language converter if you see something in the tag that doesn't look "right".

cscott: i'd simplify that to $prefix =
'<pre(?:"[^"]*"|\'[^\']*\'|[^\'">])*>.*?<\/pre>|';

done.

I also realized we have a more generalized problem in that " is a valid attribute name, so <div "=">"></div> or <pre "=">">foo</pre> should be matched too.

cscott: i think merging bawolff's patch is a good first step, followed by
auditing anything which does any other "interesting" html escape. belt and
suspenders.

I definitely now think we want to include bawolff's patch as part of the security fix. There's too many things that can go wrong without it.

cscott: i think we can do better with the language converter regexp, but i
think the solution is going to be something like
<[a-z]+((\s+[a-z]+="[^"]")*>|.*)
cscott: that is, just match to the end of the current string and protect all
the rest of the input from language converter if you see something in the
tag that doesn't look "right".

I'm not sure I fully understand what you're thinking here, I'll ping you on irc and hopefully get a better idea.

Basically instead of trying to match all possible valid uses of quotes, angle brackets, etc, try to match the common not-crazy patterns -- and then include a fail-safe .* alternate at the end so if we're not certain where the close-tag is, fail-safe and protect the entire remainder of the string from language converter.

Relies on left-to-right evaluation of alternates, which is indeed how PCRE works (http://php.net/manual/en/regexp.reference.alternation.php).

Consider my simple version (assume 'is' regexp flags):

<(/?[a-z]+(\s+[a-z]+="[^"]")*\s*>|.*)

Given the input:

abc<pre>foo</pre>x<div class="foo" style="bar">bat</div>y<pre evil'ness="'">abc123</pre>

It will match as follows:

<pre> (sending 'abc' off to be language converted)
</pre> (sending 'foo' off to be language converted)
<div class="foo" style="bar"> (sending 'x' off to be language converted)
</div> (sending 'bat' off to be language converted)
<pre evil'ness="'">abc123</pre> (sending 'y' off to be language converted)

and we're done. Note that the final <pre> tag didn't match how we expected our attributes to look, and so to be safe we just matched from there to the end of the string and didn't language convert anything in that potentially-dangerous region (since we couldn't trust our simple regex to find the correct end of the tag when things get weird).

(In reply to C. Scott Ananian from comment #15)

<pre evil'ness="'">abc123</pre> (sending 'y' off to be language converted)

Just a reminder that <pre evil'ness="'">abc123</pre> will be processed further (// Translate any alt or title attributes inside the matched element) by Language Converter and that part of code also uses delimiter ">".

Created attachment 16932
Matching html elements with > in attribute, etc.

attachment bug71394-converterc.patch ignored as obsolete

(In reply to Chris Steipp from comment #14)

cscott: i'd simplify that to $prefix =
'<pre(?:"[^"]*"|\'[^\']*\'|[^\'">])*>.*?<\/pre>|';

done.

I realized this is actually wrong. We want to aggressively match anything that looks like it's in <pre> tags (ditto <code> or <script>), to prevent that from being translated.

(In reply to C. Scott Ananian from comment #15)

Consider my simple version (assume 'is' regexp flags):

<(/?[a-z]+(\s+[a-z]+="[^"]")*\s*>|.*)

I want with

<(\/?[a-zA-Z]+(\s+[a-zA-Z]+\s*=("[^"]*"|\'[^\']*\'|[^\'">]*))*\s*>|.*)

So that catch attributes with > in their value, or if the element has an attribute name that contains a ", we just match the rest of the string, since there isn't a regex that can actually parse according to the spec (I wouldn't mind seeing us implement an actual html parser and using that for cases like this, but that's a much larger project).

This should make the first part of $htmlfix redundant, but I left it in case.

I haven't found a good way to write tests for this fix-- I haven't found any simple wikitext to get a > in the attribute value for parserTests.txt.

(In reply to Liangent from comment #16)

Just a reminder that <pre evil'ness="'">abc123</pre> will be processed
further (// Translate any alt or title attributes inside the matched
element) by Language Converter and that part of code also uses delimiter ">".

The regex on line 406 can stay unchanged. If we have something like <img class=">foo" title="something-to-translate", then the title won't be converted, even though it's technically harmless for us to do that.

The regex on line 421 is in case the conversion has injected html into the title or alt attribute values, then it's supposed to be stripped out. With bawolff's patch to expandAttributes, then this is going to be run through Sanitizer::encodeAttribute again, so it shouldn't affect the layout.

Chris: I think your regexp isn't quite right for the "unquoted attribute" case -- it includes whitespace in the unquoted attribute, which isn't right. Does the sanitizer ever emit unquoted attributes? I'd just leave the unquoted attribute case out if possible, or match only [a-z]* with it to be safe. (I could probably read the spec and come up with a more precise regexp for "things that are safe to include in an unquoted attribute -- but I'd rather just skip that case entirely if we expect attributes to be quoted.)

Created attachment 16937
Matching html elements with > in attribute, etc.

It shouldn't make a difference if the subpattern match stops at the whitespace, or if it continues until just before the = in the next attribute, but it might be more efficient, so I updated it. I also made it explicit we're not capturing any of the subpatterns.

Also, http://dev.w3.org/html5/spec-preview/tokenization.html#attribute-name-state has the "Anything else" section in the parsing, so we should account for attribute names that aren't just a-z. We could use Sanitizer::getAttribsRegex'es logic ($attrib = '[:A-Z_a-z-.0-9]';), but I'd rather target the spec.

I also realized the last patch neglected boolean attributes, so tags with those would cause the whole string to be matched.

attachment bug71394-converterd.patch ignored as obsolete

I was told that feedback from LangEng team was requested. Unfortunately, the language converter is not a thing we have worked on. We do not have enough experience to comment whether the proposed patch might cause issues.

Chris: I'm still concerned that your character classes match too much, especially your attribute name match. And you're not allowing whitespace after an equals side currently.

My point in comment 15 was that we *shouldn't* try to match the spec exactly -- the full parser spec has all sorts of crazy corner cases. In order to have maintainable security here we should instead try to be *as conservative as possible* in terms of what we match. Let's only broaden character classes as little as possible to match what we expect the sanitizer to emit, instead of risking bugs (and future maintainability issues) adding extra unnecessary clauses to the regexp.

I think using Sanitizer::getAttribsRegex's value is very reasonable for an attribute name pattern.

And I'd like to see the boolean attribute handled via using a (\s=\s+...)? type of regexp, rather than repeating the attribute name regexp at the end. Basically I'm trying to nail down the hatches here and try to ensure that we don't get weird mismatches creeping in here in the future, since this code turns out to be security-critical.

(In reply to C. Scott Ananian from comment #22)

Chris: I'm still concerned that your character classes match too much,
especially your attribute name match. And you're not allowing whitespace
after an equals side currently.

{{done}}

My point in comment 15 was that we *shouldn't* try to match the spec exactly

  • the full parser spec has all sorts of crazy corner cases. In order to

have maintainable security here we should instead try to be *as conservative
as possible* in terms of what we match. Let's only broaden character
classes as little as possible to match what we expect the sanitizer to emit,
instead of risking bugs (and future maintainability issues) adding extra
unnecessary clauses to the regexp.

Yeah, I was trying to be broad in what we match in the general html part, instead of just letting the fallback .* handle it. I agree, just matching the rest of the string is probably safer.

I'll leave some comments too so if someone tries to "fix" conversions not showing up they'll hopefully do the right thing as well.

I think using Sanitizer::getAttribsRegex's value is very reasonable for an
attribute name pattern.

{{done}}

And I'd like to see the boolean attribute handled via using a (\s=\s+...)?
type of regexp, rather than repeating the attribute name regexp at the end.
Basically I'm trying to nail down the hatches here and try to ensure that we
don't get weird mismatches creeping in here in the future, since this code
turns out to be security-critical.

{{done}}

I also didn't account for self-closing tags, which the sanitizer will emit. So that has the main html matching as:

<(?:\/?[A-Za-z]+(?:\s+[:A-Z_a-z-.0-9]+(\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\'">\s]*))*)*\s*\/?>|.*)

I ran that over the source of enwiki and zhwikisource, and it seems to parse everything except comments and CDATA. I'm not sure if we should try to match those.

I also realized we don't match the content of style tags. I think that's something we want to prevent conversion in. Hopefully no one is adjusting their styles based on language variant.

Created attachment 16951
Matching html elements with > in attribute, etc.

attachment bug71394-convertere.patch ignored as obsolete

Nested subpattern repetition is a pretty common example of a regex with pathological performance characteristics, and this patch runs such a regex on the whole document. I wonder if you benchmarked it or tested it for backtrack limit exhaustion?

preg_match() silently returns false on limit exhaustion, which causes the proposed patch to pass the whole string through. So it's a security issue, not just a performance issue. Example:

ini_set('pcre.recursion_limit', 1000);
$s = '-{H|abc123=>zh-cn:" autofocus onfocus="alert(\'be evil\');}- <div ';
for ( $i = 0; $i < 10000; $i++ ) {

$s .= "data-$i='x' ";

}
$s.= '> {{Special:Contributions|target=>abc123}}';
$out = $wgParser->parse($s, Title::newMainPage(), new ParserOptions);

This test segfaults with pcre.recursion_limit set to the default, but with it set to 1000 as on WMF, it is an XSS vulnerability.

Created attachment 17067
Properly remove html from conversion text

After Tim's comments on this bug, we chatted on irc, and he recommended handling as much of the parsing in php as possible, to avoid the pcre limitation.

Since there are html5 parsers in php already available, I modified one to do just the parsing we need, and use that. It would be fairly easy to use this in some of the other places we try to parse html with a regex too.

Running Tim's test against master and this patch, this patch is about 5% faster on my dev system.

Attached:

It's quite a big patch to review in public. Could we maybe do Bawolff's patch in the security release and then put the Html5Tokenizer in Gerrit?

A big patch to review in private, rather.

I deployed bawolff's patch today:
20:19 csteipp: patched bugs 71111 and 71394 in wmf7 and wmf8

I put the Html5Tokenizer as a draft in gerrit. I didn't want to make it fully public until we either are ready to merge it, or we release bawolff's patch, in case someone digs into why I would be rewriting that. Comments there welcome.

Chris -- what's the latest version of your regular expression. I just recently fixed a performance regression in a regexp (https://gist.github.com/gruber/249502#comment-1328838 sigh), perhaps I can refactor a bit to get rid of the backtracking.

cscott:
$reg = '/' . $codefix . $scriptfix . $stylefix . $prefix .
'<(?:\/?[A-Za-z]+(?:\s+[:A-Z_a-z-.0-9]+(\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\'">\s]*))*)*\s*\/?>|.*)|' .
'&[a-zA-Z#][a-z0-9]+;' . $marker . $htmlfix . '/s';

Restricted Application changed the visibility from "acl*security (Project)" to "Custom Policy". · View Herald TranscriptNov 24 2014, 9:27 PM
Restricted Application changed the edit policy from "acl*security (Project)" to "Custom Policy". · View Herald Transcript
Restricted Application changed the visibility from "Custom Policy" to "Custom Policy". · View Herald TranscriptNov 26 2014, 5:56 AM
Restricted Application changed the edit policy from "Custom Policy" to "Custom Policy". · View Herald Transcript
csteipp changed Security from Software security bug to None.Nov 26 2014, 6:09 AM

Here are the backports for older release branches. It would be great if anyone could confirm them today before I do the release

Tested this by calling a few pages, as they all use HTML class in generating their output. On REL1_24, REL1_23, REL1_22 and REL1_19

Here are the backports for older release branches. It would be great if anyone could confirm them today before I do the release

Issue is fixed with this patch

I believe Markus is pulling this patch from the release, since it failed some of the unit tests. It would be difficult to get from that patch to an exploit, so I'm not too worried about it, although I'm leaving WMF servers patched. In practicality, the patch doesn't seem to have broken anything.

I thought I had run his patch through all of the unit tests when working on LanguageConverter, but I guess not. Apologies for that.

Of the tests that fail,

  • "External link containing a single quote" looked wrong, but chrome and firefox handle it just fine. And the html 5 spec should parse it correctly. @cscott, can you comment on this one?
  • "Running test Header with special characters" the test is specifically for what we're trying to do, but looking at the bug and https://test.wikipedia.org/wiki/Header_with_special_characters I'm not seeing the issue that the original bug (25462) specified. Was this fixed already by the move to html5? The new values that we generate should be equivalent when this is parsed.

For the rest, the new value is equivalent in html5. Anyone aware of a reason we can't update them?

The failing parser tests are:

Running test External link containing a single quote. (bug 63947)... FAILED!
--- /tmp/mwParser-2129676043-expected 2014-11-26 16:45:26.445386114 -0800
+++ /tmp/mwParser-2129676043-actual 2014-11-26 16:45:26.445386114 -0800
@@ -1,3 +1,3 @@
-<p><a rel="nofollow" class="external autonumber" href="//foo.org/bar'baz">[1]</a>
-</p><p><a rel="nofollow" class="external text" href="//foo.org/bar'baz">bang</a>
+<p><a rel="nofollow" class="external autonumber" href="//foo.org/bar&#39;baz">[1]</a>
+</p><p><a rel="nofollow" class="external text" href="//foo.org/bar&#39;baz">bang</a>
</p>


Running test Link containing double-single-quotes '' (bug 4598)... FAILED!
--- /tmp/mwParser-1070742882-expected 2014-11-26 16:45:27.336386162 -0800
+++ /tmp/mwParser-1070742882-actual 2014-11-26 16:45:27.336386162 -0800
@@ -1,2 +1,2 @@
-<p><a href="/index.php?title=Lista_d%27%27e_paise_d%27%27o_munno&amp;action=edit&amp;redlink=1" class="new" title="Lista d''e paise d''o munno (page does not exist)">Lista d''e paise d''o munno</a>
+<p><a href="/index.php?title=Lista_d%27%27e_paise_d%27%27o_munno&amp;action=edit&amp;redlink=1" class="new" title="Lista d&#39;&#39;e paise d&#39;&#39;o munno (page does not exist)">Lista d''e paise d''o munno</a>
</p>


Running test Link with double quotes in title part (literal) and alternate part (interpreted)... FAILED!
--- /tmp/mwParser-2035047627-expected 2014-11-26 16:45:27.390386165 -0800
+++ /tmp/mwParser-2035047627-actual 2014-11-26 16:45:27.390386165 -0800
@@ -1,5 +1,5 @@
<p><a href="/index.php?title=Special:Upload&amp;wpDestFile=Denys_Savchenko_%27%27Pentecoste%27%27.jpg" class="new" title="File:Denys Savchenko &#39;&#39;Pentecoste&#39;&#39;.jpg">File:Denys Savchenko <i>Pentecoste</i>.jpg</a>
-</p><p><a href="/index.php?title=%27%27Pentecoste%27%27&amp;action=edit&amp;redlink=1" class="new" title="''Pentecoste'' (page does not exist)">''Pentecoste''</a>
-</p><p><a href="/index.php?title=%27%27Pentecoste%27%27&amp;action=edit&amp;redlink=1" class="new" title="''Pentecoste'' (page does not exist)">Pentecoste</a>
-</p><p><a href="/index.php?title=%27%27Pentecoste%27%27&amp;action=edit&amp;redlink=1" class="new" title="''Pentecoste'' (page does not exist)"><i>Pentecoste</i></a>
+</p><p><a href="/index.php?title=%27%27Pentecoste%27%27&amp;action=edit&amp;redlink=1" class="new" title="&#39;&#39;Pentecoste&#39;&#39; (page does not exist)">''Pentecoste''</a>
+</p><p><a href="/index.php?title=%27%27Pentecoste%27%27&amp;action=edit&amp;redlink=1" class="new" title="&#39;&#39;Pentecoste&#39;&#39; (page does not exist)">Pentecoste</a>
+</p><p><a href="/index.php?title=%27%27Pentecoste%27%27&amp;action=edit&amp;redlink=1" class="new" title="&#39;&#39;Pentecoste&#39;&#39; (page does not exist)"><i>Pentecoste</i></a>
</p>


Running test Internal link with kaa linktrail with apostrophes (bug 27473)... FAILED!
--- /tmp/mwParser-1736818691-expected 2014-11-26 16:45:28.650386234 -0800
+++ /tmp/mwParser-1736818691-actual 2014-11-26 16:45:28.650386234 -0800
@@ -1,2 +1,2 @@
-<p><a href="/index.php?title=Something&amp;action=edit&amp;redlink=1" class="new" title="Something (bet ele jaratılmag'an)">Something'nice</a>
+<p><a href="/index.php?title=Something&amp;action=edit&amp;redlink=1" class="new" title="Something (bet ele jaratılmag&#39;an)">Something'nice</a>
</p>


Running test Link to category... FAILED!
--- /tmp/mwParser-259593883-expected 2014-11-26 16:45:38.598386773 -0800
+++ /tmp/mwParser-259593883-actual 2014-11-26 16:45:38.598386773 -0800
@@ -1,2 +1,2 @@
-<p><a href="/wiki/Category:MediaWiki_User%27s_Guide" title="Category:MediaWiki User's Guide">Category:MediaWiki User's Guide</a>
+<p><a href="/wiki/Category:MediaWiki_User%27s_Guide" title="Category:MediaWiki User&#39;s Guide">Category:MediaWiki User's Guide</a>
</p>


Running test Category with different sort key... FAILED!
--- /tmp/mwParser-1819840247-expected 2014-11-26 16:45:38.686386778 -0800
+++ /tmp/mwParser-1819840247-actual 2014-11-26 16:45:38.686386778 -0800
@@ -1 +1 @@
-<a href="/wiki/Category:MediaWiki_User%27s_Guide" title="Category:MediaWiki User's Guide">MediaWiki User's Guide</a>
+<a href="/wiki/Category:MediaWiki_User%27s_Guide" title="Category:MediaWiki User&#039;s Guide">MediaWiki User's Guide</a>


Running test Category with identical sort key... FAILED!
--- /tmp/mwParser-25929600-expected 2014-11-26 16:45:38.725386780 -0800
+++ /tmp/mwParser-25929600-actual 2014-11-26 16:45:38.725386780 -0800
@@ -1 +1 @@
-<a href="/wiki/Category:MediaWiki_User%27s_Guide" title="Category:MediaWiki User's Guide">MediaWiki User's Guide</a>
+<a href="/wiki/Category:MediaWiki_User%27s_Guide" title="Category:MediaWiki User&#039;s Guide">MediaWiki User's Guide</a>


Running test Header with special characters (bug 25462)... FAILED!
--- /tmp/mwParser-616350119-expected 2014-11-26 16:45:39.143386802 -0800
+++ /tmp/mwParser-616350119-actual 2014-11-26 16:45:39.143386802 -0800
@@ -10,7 +10,7 @@
</ul>
</div>

-<h2><span class="mw-headline" id="text_.3E_text">text &gt; text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=1" title="Edit section: text > text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
+<h2><span class="mw-headline" id="text_.3E_text">text &gt; text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=1" title="Edit section: text &gt; text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>section 1
</p>
<h2><span class="mw-headline" id="text_.3C_text">text &lt; text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=2" title="Edit section: text &lt; text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
@@ -19,7 +19,7 @@
<h2><span class="mw-headline" id="text_.26_text">text &amp; text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=3" title="Edit section: text &amp; text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>section 3
</p>
-<h2><span class="mw-headline" id="text_.27_text">text ' text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=4" title="Edit section: text ' text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
+<h2><span class="mw-headline" id="text_.27_text">text ' text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=4" title="Edit section: text &#039; text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<p>section 4
</p>
<h2><span class="mw-headline" id="text_.22_text">text " text</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/index.php?title=Parser_test&amp;action=edit&amp;section=5" title="Edit section: text &quot; text">edit</a><span class="mw-editsection-bracket">]</span></span></h2>

Of the tests that fail,

  • "External link containing a single quote" looked wrong, but chrome and firefox handle it just fine. And the html 5 spec should parse it correctly. @cscott, can you comment on this one?
  • "Running test Header with special characters" the test is specifically for what we're trying to do, but looking at the bug and https://test.wikipedia.org/wiki/Header_with_special_characters I'm not seeing the issue that the original bug (25462) specified. Was this fixed already by the move to html5? The new values that we generate should be equivalent when this is parsed.

@cscott, is it ok if I update these tests for 63947 to escape 's inside attributes?

@tstarling, do you know if the tests for bug 25462 are still correct? I can't tell if the original bug was just pre-html5, or if it was being escaped a second time at some point in the past.

It was double-escaped before the test was introduced, the input text says that explicitly, as does the relevant commit message: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/74526 . It's fine to update the expected text for that test.

I modified @Bawolff's patch to only escape > (which is the character causing problems for language converter's current regex), and updated the two parser tests that expected the old output. As I put in the comment, that means the only difference in encoding between the patches (assuming $wgWellFormedXml is set) is ' are not encoded.

I'll add a patch similar to bawolff's publicly after this goes out with the security release, but I think it's going to get resistance from @Yurik and mobile, who are working on https://gerrit.wikimedia.org/r/#/c/182889/, for example. So entity encoding all " and ' characters I don't think is going to work for them, despite being more conservative from a security perspective.

As I put in the comment, that means the only difference in encoding between the patches (assuming $wgWellFormedXml is set) is ' are not encoded.

That does appear to be the case, yes.

I'll add a patch similar to bawolff's publicly after this goes out with the security release, but I think it's going to get resistance from @Yurik and mobile, who are working on https://gerrit.wikimedia.org/r/#/c/182889/, for example. So entity encoding all " and ' characters I don't think is going to work for them, despite being more conservative from a security perspective.

IMO security trumps being "cute" with quotes, but I can't manage to find any way to exploit this with '>' being escaped.

Rebased the parser tests and deployed, replacing

.

- change in comment style, and differences in parser test output

csteipp changed the visibility from "Custom Policy" to "Public (No Login Required)".Mar 31 2015, 9:15 PM
csteipp changed the edit policy from "Custom Policy" to "All Users".

Change 201016 had a related patch set uploaded (by CSteipp):
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201016

Change 201027 had a related patch set uploaded (by CSteipp):
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201027

Change 201037 had a related patch set uploaded (by CSteipp):
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201037

Change 201016 merged by jenkins-bot:
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201016

Change 201027 merged by jenkins-bot:
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201027

Change 201037 merged by jenkins-bot:
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201037

Change 201222 had a related patch set uploaded (by CSteipp):
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201222

Change 201222 merged by jenkins-bot:
SECURITY: Escape > in Html::expandAttributes

https://gerrit.wikimedia.org/r/201222