
CVE-2025-32699: Potential javascript injection attack enabled by Unicode normalization in Action API
Closed, Resolved · Public · Security

Assigned To
Authored By
zoe
Feb 24 2025, 3:30 PM
Referenced Files
F59025694: 02-T387130-2.patch
Apr 9 2025, 2:30 PM
F59025660: 02-T387130.patch
Apr 9 2025, 2:30 PM

Description

TL;DR

The MediaWiki Action API converts output to Unicode Normalization Form C. Unfortunately, for HTML strings this is unsafe, because the sequence ‘>’ + U+0338 gets replaced by U+226F, breaking the tag end and potentially allowing injection attacks.
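The core of the problem can be demonstrated with nothing but the standard Unicode normalization built into JavaScript:

```javascript
// '>' (U+003E) followed by U+0338 COMBINING LONG SOLIDUS OVERLAY has a
// precomposed form, U+226F NOT GREATER-THAN, so NFC fuses the two -- and
// the '>' that closed the tag disappears from the markup.
const html = '<p id="mwAg">' + '\u0338 onmouseover="alert(42)" >content</p>';
const normalized = html.normalize( 'NFC' );

console.log( normalized.includes( '<p id="mwAg">' ) ); // false: tag end is gone
console.log( normalized.includes( '\u226F' ) );        // true: fused into U+226F
```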

Steps to reproduce

  1. Visit any wiki page, so that mw.Api has loaded
  2. Open the browser console
  3. Perform any mw.Api call that generates HTML, arranging for the first character inside a tag to be U+0338 COMBINING LONG SOLIDUS OVERLAY.

In other words, you want the API to send you some HTML that contains ‘>’ followed by U+0338.
For example, call { action: 'visualeditor', paction: 'parsefragment' } as follows:

const COMBINING_LONG_SOLIDUS = '\u0338';
new mw.Api().post( {
        action:  'visualeditor',
        paction: 'parsefragment',
        page: 'Test',
        wikitext: COMBINING_LONG_SOLIDUS + ' onmouseover="alert(42)" >content'
} ).done( ( data ) => {
        const content = data.visualeditor.content;
        document.body.innerHTML = content;
        console.log( 'Content:', content );
} ).fail( ( err ) => console.error( err ) );

This can also be reproduced without JavaScript – note the missing > after id="mwAg":

$ curl -s -d action=visualeditor -d paction=parsefragment -d page=Test -d wikitext=$'\u0338 onmouseover="alert(42)" >content ' -d format=json https://en.wikipedia.org/w/api.php | jq -r .visualeditor.content | hexdump -C
00000000  3c 70 20 69 64 3d 22 6d  77 41 67 22 e2 89 af 20  |<p id="mwAg"... |
00000010  6f 6e 6d 6f 75 73 65 6f  76 65 72 3d 22 61 6c 65  |onmouseover="ale|
00000020  72 74 28 34 32 29 22 20  3e 63 6f 6e 74 65 6e 74  |rt(42)" >content|
00000030  20 3c 2f 70 3e 0a                                 | </p>.|
00000036

Compare this with a normal slash in the input:

$ curl -s -d action=visualeditor -d paction=parsefragment -d page=Test -d wikitext='/ onmouseover="alert(42)" >content ' -d format=json https://en.wikipedia.org/w/api.php | jq -r .visualeditor.content | hexdump -C
00000000  3c 70 20 69 64 3d 22 6d  77 41 67 22 3e 2f 20 6f  |<p id="mwAg">/ o|
00000010  6e 6d 6f 75 73 65 6f 76  65 72 3d 22 61 6c 65 72  |nmouseover="aler|
00000020  74 28 34 32 29 22 20 3e  63 6f 6e 74 65 6e 74 20  |t(42)" >content |
00000030  3c 2f 70 3e 0a                                    |</p>.|
00000035

Actual behaviour

The sequence ‘>’ + U+0338 gets replaced with the combined character U+226F ≯ NOT GREATER-THAN. This is due to applying Unicode Normalization Form C. But (surprisingly!) that breaks the HTML tag, which potentially allows a JavaScript injection attack, for instance:

'<p id="mwAg"\u226F onmouseover="alert(42)" >content</p>'

Expected behaviour

The HTML arrives with the sequence ‘>’ + U+0338 intact.

'<p id="mwAg">\u0338 onmouseover="alert(42)" >content</p>'

Note that this is a regular '>' symbol closing the HTML tag, followed by a U+0338 ◌̸ COMBINING LONG SOLIDUS OVERLAY, such that the contents of the <p> tag are "\u0338 onmouseover="alert(42)" >content"

Debugging note

Both expected and actual output may look similar or identical when rendered in the console:

<p id="mwAg"≯ onmouseover="alert(42)" >content</p>  # actual
<p id="mwAg"≯ onmouseover="alert(42)" >content</p>  # expected

The best way to see for sure what’s there is to escape non-ASCII characters with a function:

function showUnicode( text ) {
    return text.replace( /[^\x00-\x7F]/g, ( ch ) => '\\u' + ch.charCodeAt( 0 ).toString( 16 ).padStart( 4, '0' ) );
}

text = '<p id="mwAg"≯ onmouseover="alert(42)" >content</p>';
console.log( showUnicode( text ) );
// <p id="mwAg"\u226f onmouseover="alert(42)" >content</p>

Cause

The cause was identified by @dchan in the related ticket. The function MediaWiki::Api::ApiResult::validateValue is not just catching invalid UTF-8. It is also applying Unicode Normalization Form C. Unfortunately, as we have seen, it is unsafe to do this on HTML strings if they might contain ‘>’ + U+0338.

MediaWiki::Api::ApiResult::addValue
  MediaWiki::Api::ApiResult::validateValue
    MediaWiki::Language::normalize
      UtfNormal::Validator::cleanUp
        normalizer_normalize( $string, Normalizer::FORM_C )

Event Timeline


@cscott I think you haven't attached the OutputTransform patches correctly; here I can only see their file names

Here is a patch for 1b from above (T387130#10584304). This patches wikimedia/utfnormal to prefix isolated combining characters, using the unicode database [...]

+ $canPrecedeCombining = [];
[...]
+ public function testU0338() {
+     $text = "\u{0338}<\u{0338}>\u{0338}";
+     $expect = "\u{25CC}\u{0338}\u{226E}\u{226F}";
+     $this->assertEquals(
+         bin2hex( $expect ),
+         bin2hex( Validator::cleanUp( $text ) )
+     );
+ }

Please can we remove '<' and '>' from $canPrecedeCombining? That would render HTML invulnerable, without breaking any combining sequence except ">\x{0338}" and "<\x{0338}" — both of which have precomposed alternatives and are bad things to be floating around our ecosystem.

Also that way I don't have to fix anything in ApiVisualEditor.php 😁

Then the correct test would be:

$text = "\u{0338}<\u{0338}>\u{0338}";
$expect = "\u{25CC}\u{0338}<\u{25CC}\u{0338}>\u{25CC}\u{0338}";

Yes, I agree that would be better. What I don't know is why we chose to normalize HTML (or wikitext) to NFC there. Is that a step we actually need?

I looked it up. It was to fix T17261 and T18262 (r45749)

That said, it does seem like it only really needs valid Unicode, not normalized Unicode.

Nag on the back of my mind: can we have a potential vector around source ranges?

We have things like T346197 (and related "bad UTF-8" issues) that hint that we occasionally try to access character ranges that were not the ones we actually wanted to; there might be a way to craft things around these?

I think we might need a patch on PHPUtils::safeSubstr in Parsoid to avoid returning a string starting with \u0338 (which we do right now):

$ php run.php shell

> use Wikimedia\Parsoid\Utils\PHPUtils;
> PHPUtils::safeSubstr("\u{0338}aaa", 0, 5);
= "̸aaa"

The requirements are:

  • Combining slash is sometimes valid, so we can't outright ban it.
  • '<' followed by combining slash will almost always be malicious in output, because most keyboards output the precomposed form and we also run NFC on all user input.
  • We can't generically replace combining slash with an entity at the normalization stage, as we don't know whether the output is HTML or not.

Yes, very clearly put. However, if we have Unicode data available, we could insert U+25CC '◌' inside any sequence where '>' is followed by a combining character. I think that's reasonable for a function called cleanUp.
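A minimal sketch of that idea (the function name and the exact set of "isolated" positions are assumptions for illustration, not the actual utfnormal patch):

```javascript
// Hypothetical sketch: insert U+25CC DOTTED CIRCLE between '<' / '>' (or
// start of string) and a following U+0338, so NFC can no longer compose
// them into U+226E / U+226F.
function prependIsolatedCombining( s ) {
	return s.replace( /(^|[<>])(\u0338)/g, '$1\u25CC$2' );
}

const fixed = prependIsolatedCombining( '<p id="mwAg">' + '\u0338rest</p>' );
console.log( fixed.normalize( 'NFC' ).includes( '<p id="mwAg">' ) ); // true: tag survives
```

U+25CC has canonical combining class 0 (a starter), so placing it between the '>' and the combining mark blocks composition while still displaying the mark sensibly.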

What if on any output normalization (so excluding normalization of user input in WebRequest), we first count the number of precomposed "not greater than" signs, normalize, and then count again. If the number changes we know an attack is happening since we assume at this point the decomposed not greater than is always malicious.

That's an ingenious idea. Maybe we should do it if we end up unconvinced that our coverage is good enough?

Thinking about this

The requirements are:

  • Combining slash is sometimes valid, so we can't outright ban it.
  • '<' followed by combining slash will almost always be malicious in output, because most keyboards output the precomposed form and we also run NFC on all user input.
  • We can't generically replace combining slash with an entity at the normalization stage, as we don't know whether the output is HTML or not.

What if on any output normalization (so excluding normalization of user input in WebRequest), we first count the number of precomposed "not greater than" signs, normalize, and then count again. If the number changes we know an attack is happening since we assume at this point the decomposed not greater than is always malicious. At this point we throw an exception or maybe go back to the unnormalized string, replace all combining slash with unicode replacement and try again (its ok to be a little lossy here since we assume this code path only happens during an attack). Thoughts?
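A rough sketch of this count-before/count-after check (the function name and the U+FFFD fallback are illustrative, not a proposed API):

```javascript
// Illustrative only: count precomposed U+226F before and after NFC. If the
// count grows, a '>' + U+0338 pair must have composed during normalization,
// which we treat as an attack; fall back to replacing the combining slash
// with U+FFFD REPLACEMENT CHARACTER and normalizing again.
function normalizeDetectingAttack( s ) {
	const count = ( str ) => ( str.match( /\u226F/g ) || [] ).length;
	const normalized = s.normalize( 'NFC' );
	if ( count( normalized ) > count( s ) ) {
		return s.replace( /\u0338/g, '\uFFFD' ).normalize( 'NFC' );
	}
	return normalized;
}

console.log( normalizeDetectingAttack( '<p id="mwAg">' + '\u0338x</p>' ) );
// tag end survives; the combining slash became U+FFFD
```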

If we do strip these characters, I think it is less confusing to the user to replace them with the Unicode replacement character than to just silently delete them.

Attempt at implementing this idea, although maybe it belongs as a method of UTFValidator instead of in the API. I only covered ≯, as I don't think ≮ is a security risk, but maybe it would make more sense to cover both just in case.

What if on any output normalization (so excluding normalization of user input in WebRequest), we first count the number of precomposed "not greater than" signs, normalize, and then count again. If the number changes we know an attack is happening since we assume at this point the decomposed not greater than is always malicious.

Headline: I think this would work.

We are relying on the properties of the characters, not the NFC algorithm itself. If we can find something that composes with U+003E > and which comes earlier in the canonical ordering algorithm than U+0338 ◌̸ then we could create a string which normalises with fewer copies of ≯ and bypass such a check.

I'm having trouble finding details of the normalization algorithm, but experimentally we can gain confidence:

[...new Array(0xffff)].map((_, i) => ('\u226f' + String.fromCodePoint(i)).normalize("NFC")).filter((c) => c.codePointAt(0) !== 0x226f)
//[]

We are relying on the properties of the characters, not the NFC algorithm itself. If we can find something that composes with U+003E > and which comes earlier in the canonical ordering algorithm than U+0338 ◌̸ then we could create a string which normalises with fewer copies of ≯ and bypass such a check.

I'm having trouble finding details of the normalization algorithm, but experimentally we can gain confidence:

[...new Array(0xffff)].map((_, i) => ('\u226f' + String.fromCodePoint(i)).normalize("NFC")).filter((c) => c.codePointAt(0) !== 0x226f)
//[]

Oh yes, that's an important thing to consider. You're right that we're safe: the 4th field of UnicodeData.txt shows U+0338 has Canonical Combining Class 1, which is the lowest possible for a combining character. Therefore no combining character can be moved before U+0338 by the Canonical Ordering Algorithm.

0338;COMBINING LONG SOLIDUS OVERLAY;Mn;1;NSM;;;;;N;NON-SPACING LONG SLASH OVERLAY;;;;

Here is a patch for 1b from above (T387130#10584304). This patches wikimedia/utfnormal to prefix isolated combining characters, using the unicode database [...]

+ $canPrecedeCombining = [];
[...]
+ public function testU0338() {
+     $text = "\u{0338}<\u{0338}>\u{0338}";
+     $expect = "\u{25CC}\u{0338}\u{226E}\u{226F}";
+     $this->assertEquals(
+         bin2hex( $expect ),
+         bin2hex( Validator::cleanUp( $text ) )
+     );
+ }

Please can we remove '<' and '>' from $canPrecedeCombining? That would render HTML invulnerable, without breaking any combining sequence except ">\x{0338}" and "<\x{0338}" — both of which have precomposed alternatives and are bad things to be floating around our ecosystem.

This code applies *after* NFC normalization has been done. So < and > will never appear as a preceding character. That's not the point of this code -- the point of this code is to eliminate "hanging" combining characters that are then time bombs that cause trouble when they are pasted inside an HTML tag.

I am (currently) proposing the following two patches:
To mediawiki-core: (updated to include Tidy/Html/Message/OutputTransform)


To Parsoid: (unchanged from T387130#10584604)

The Parsoid patch will have to be deployed to prod as a patch to the vendor directory.

All other patches are additional hardenings which are nice-to-have but shouldn't be on the critical path here. In particular, I think hardening UtfNormal\Validator::cleanUp() is worthwhile, but isn't necessary to stop the injection attack. I have a number of other patches in my tree which shift code to using the Html::* helper classes instead of string concatenation which is again helpful, but unnecessary given a final postprocessing pass to entity-escape all remaining U+0338, which is what the above implements.

I also considered a patch to OutputPage::output() to do a final fail-safe against U+0338, but that also doesn't seem strictly necessary, as all the attack vectors go through the Action API (which is where NFC normalization is performed), not direct HTML output (which is what OutputPage::output() does).

The Parsoid patch will have to be deployed to prod as a patch to the vendor directory.

All other patches are additional hardenings which are nice-to-have but shouldn't be on the critical path here.

Ok, once we have all of the relevant patches completed and code-reviewed, we should categorize them as "this patch needs a discrete security deployment to Wikimedia production to fix the active security issues there" and "this patch is code-hardening". The latter should all be able to go through gerrit. And of course Parsoid (and similar) patches can go through gerrit, where we don't have a defined, discrete Wikimedia production security deployment process.

I am (currently) proposing the following two patches:
To mediawiki-core: (updated to include Tidy/Html/Message/OutputTransform)

My IDE shows that there is a typo in this patch in Message .. "escapeCombiningChars" instead of "escapeCombiningChar". Is it easy to update Message tests to cover format* functions you updated in this patch?

Please can we remove '<' and '>' from $canPrecedeCombining? That would render HTML invulnerable, without breaking any combining sequence except ">\x{0338}" and "<\x{0338}" — both of which have precomposed alternatives and are bad things to be floating around our ecosystem.

This code applies *after* NFC normalization has been done. So < and > will never appear as a preceding character. That's not the point of this code -- the point of this code is to eliminate "hanging" combining characters that are then time bombs that cause trouble when they are pasted inside an HTML tag.

Oh sorry, I missed that. Then shouldn't we fix U+0338 before NFC normalization?

+ $string = self::prependIsolatedCombining( $string );
$norm = normalizer_normalize( $string, Normalizer::FORM_C );

...

- return self::prependIsolatedCombining( self::NFC( $string ) );
+ return self::NFC( self::prependIsolatedCombining( $string ) );

I think we need it, because otherwise MediaWiki::Request::WebRequest::getValues breaks valid HTML / wikitext. For example, right now (even with the above patches), if I use VisualEditor to save a source page containing

"FOO<!-- x -->\x{0338}BAR"

then MediaWiki::Request::WebRequest::getValues mangles it into

"FOO<!-- x --\x{226F}BAR"

Of course our VisualEditor patch could fix it on our side, but if this affects VisualEditor then it probably affects many other uses too. And the content we're sending isn't obviously invalid. I think if MediaWiki::Request::WebRequest is going to apply NFC normalization, then it's essential that it fix '>' + U+0338 first.

@dchan that's the input side, and T266140. You need to use the raw API flag from I2e78e660ba1867744e34eda7d00ea527ec016b71 on input. That's not a security bug, though, because the broken HTML is rejected on input. Let's keep this task for output-side issues.

@ssastry Thanks for the catch! Updated core patch, plus Message tests:

One more question about the Message.php changes: you added the escaping on line 1060 in the format(..) function. Why is it not required on other code paths? And could replaceParameters(..) re-introduce it? Would it be better instead to do the sanitization after replaceParameters on line 1065? replaceParameters replaces keys with message values .. so if someone added the combining char as the first char of a translation on translatewiki, it could recombine with a >.

I'm having trouble finding details of the normalization algorithm, but experimentally we can gain confidence:

Full details in section 3.11 of unicode spec https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G49537

@dchan wrote:

Oh yes, that's an important thing to consider. You're right that we're safe: the 4th field of UnicodeData.txt shows U+0338 has Canonical Combining Class 1, which is the lowest possible for a combining character. Therefore no combining character can be moved before U+0338 by the Canonical Ordering Algorithm.

Just keep in mind that the canonical composition algorithm (3.11.6) https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G50628 looks at characters besides the one that comes immediately after the starter character. So e.g. <\u0301\u0338 has NFC form \u226e\u0301
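This non-adjacent composition can be checked directly:

```javascript
// U+0301 COMBINING ACUTE ACCENT has canonical combining class 230, while
// U+0338 has class 1, so canonical ordering moves the U+0338 next to '<'
// and it composes to U+226E, even though it was not adjacent in the input.
const s = '<' + '\u0301' + '\u0338';
console.log( s.normalize( 'NFC' ) === '\u226E' + '\u0301' ); // true
```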

Nag on the back of my mind: can we have a potential vector around source ranges?

We have things like T346197 (and related "bad UTF-8" issues) that hint that we occasionally try to access character ranges that were not the ones we actually wanted to; there might be a way to craft things around these?

I think we might need a patch on PHPUtils::safeSubstr in Parsoid to avoid returning a string starting with \u0338 (which we do right now):

$ php run.php shell

> use Wikimedia\Parsoid\Utils\PHPUtils;
> PHPUtils::safeSubstr("\u{0338}aaa", 0, 5);
= "̸aaa"

Since all of Parsoid's serialization goes through XMLSerializer and Scott patched that to handle escaping in text nodes (plus the HardenNFC pass in core for post-cache transforms), I don't think we need to worry about this.

Since all of Parsoid's serialization goes through XMLSerializer and Scott patched that to handle escaping in text nodes (plus the HardenNFC pass in core for post-cache transforms), I don't think we need to worry about this.

It could be an issue if we stripped U+0338 on input, like we do with U+0000 and control characters, but in our latest versions we don't actually remove the U+0338 from the HTML at all, just insist on it being represented as an entity escape so that NFC normalization won't touch it. The character offsets are all on the "real" string values of DOM Nodes, ie after entity decoding, so adding extra entities doesn't affect them at all.
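The invariant being relied on here is easy to demonstrate: once the U+0338 is represented as an entity, the text around the tag is pure ASCII and NFC normalization cannot touch it.

```javascript
// Raw combining slash after the tag end: NFC consumes the '>'
const raw = '<p id="mwAg">' + '\u0338rest</p>';
console.log( raw.normalize( 'NFC' ).includes( '\u226F' ) ); // true: tag broken

// Entity-escaped form: '&#x338;' is plain ASCII, so NFC is a no-op
const escaped = '<p id="mwAg">' + '&#x338;rest</p>';
console.log( escaped.normalize( 'NFC' ) === escaped ); // true: unchanged
```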

One more question about the Message.php changes: you added the escaping on line 1060 in the format(..) function. Why is it not required on other code paths? And could replaceParameters(..) re-introduce it? Would it be better instead to do the sanitization after replaceParameters on line 1065? replaceParameters replaces keys with message values .. so if someone added the combining char as the first char of a translation on translatewiki, it could recombine with a >.

I added U+0338 escaping to every place in Message.php which had a call to htmlspecialchars, excluding only two places which were generating log warnings. ::replaceParameters is the wrong place to do escaping (in general) because that leads to double-escaping issues -- in particular, you can have a 'raw' parameter type (Message::rawParams()) which is explicitly supposed to bypass the escaping of the main message, which is how display title and some other things work (T343994).

There are also "before" and "after" replacement parameter types. So for a "before" parameter type (like num), the value is replaced *before* parsing (Message::format() line 1046), and then the entire thing (parameter and all) will get HTML escaped in Message::format() line 1049 and following, either by the full Parser or by the htmlspecialchars + escapeCombiningChars call there. For "after" parameter types (like raw), the value is replaced on line 1064 after the rest of the message is escaped, but the replaced value is formatted by passing $format through to extractParam. So (eg) if there's a recursive Message being substituted, it is replaced "after" HTML escaping is done, but the parameter value is itself HTML escaped before being inserted.

So for RawMessage('$1', [ new RawMessage("\u{0338}") ]) it is true that the parameter is inserted after the call to htmlspecialchars, as you noted, but the fact is that the parameter message is itself formatted (to &#x338;) before it is inserted.

One more question about the Message.php changes: you added the escaping on line 1060 in the format(..) function. Why is it not required on other code paths? And could replaceParameters(..) re-introduce it? Would it be better instead to do the sanitization after replaceParameters on line 1065? replaceParameters replaces keys with message values .. so if someone added the combining char as the first char of a translation on translatewiki, it could recombine with a >.

...

So for RawMessage('$1', [ new RawMessage("\u{0338}") ]) it is true that the parameter is inserted after the call to htmlspecialchars, as you noted, but the fact is that the parameter message is itself formatted (to &#x338;) before it is inserted.

Ah, this is the part I missed .. I didn't dig one level down where extractParam handles the escaping. Ok, LGTM.

Change #1123486 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to v0.21.0-a18

https://gerrit.wikimedia.org/r/1123486

Change #1123488 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Bump wikimedia/parsoid to 0.21.0-a18

https://gerrit.wikimedia.org/r/1123488

sbassett added a parent task: Restricted Task. Feb 27 2025, 10:49 PM

Backport deployments of c1123486 and c1123488 have been successfully completed. Thanks everyone. I know there are several code-hardening patches to follow, but I believe that should get us mitigated in Wikimedia production.

Backport deployments of c1123486 and c1123488 have been successfully completed. Thanks everyone. I know there are several code-hardening patches to follow, but I believe that should get us mitigated in Wikimedia production.

I think that covers the VisualEditor case, but not the other cases, like the CategoryTree one I mentioned in an earlier comment T387130#10577699 (this ticket is a bit of a mess to follow)

I think that covers the VisualEditor case, but not the other cases, like the CategoryTree one I mentioned in an earlier comment T387130#10577699 (this ticket is a bit of a mess to follow)

Ah, ok, I guess we should compile a list of the patches that still warrant discrete Wikimedia production deployments. I assume the patch from T387130#10577699 does, since CategoryTree is both production-deployed and bundled...

I looked at CategoryTree and I believe this does actually mitigate that, since the attack HTML in the CategoryTree example was being routed through the Html::* helper classes, which were patched and hardened. @Bawolff, could you double-check that?

My analysis: the CategoryTree vulnerability is because client-side JS in CategoryTree:modules/ext.categoryTree/ext.categoryTree.js does:

	new mw.Api().get( {
		action: 'categorytree',
		category: ctTitle,
		options: ctOptions,
		uselang: mw.config.get( 'wgUserLanguage' ),
		formatversion: 2
	} ).done( ( data ) => {
		data = data.categorytree.html;

		let $data;
		if ( data === '' ) {
			$data = $( '<i>' ).addClass( 'CategoryTreeNotice' )
				// eslint-disable-next-line mediawiki/msg-doc
				.text( mw.msg( {
					0: 'categorytree-no-subcategories',
					10: 'categorytree-no-pages',
					100: 'categorytree-no-parent-categories'
				}[ mode ] || 'categorytree-nothing-found' ) );
		} else {
			$data = $( $.parseHTML( data ) );
			attachHandler( $data );
		}

		$children.empty().append( $data );
	} ).fail( error );

In particular, parsing the HTML retrieved from a call to the category tree action API. That HTML comes from CategoryTree:includes/ApiCategoryTree.php in the getHTML function, which calls CategoryTree::renderChildren. This is where it gets a little hairy, but CategoryTree:include/CategoryTree.php renderChildren calls renderNodeInfo, and that method appears to use Html::rawElement and Html::element to create its HTML, which should be hardened. There are a few instances of:

$s = Html::openElement(....);
$s .= ...;
$s .= Html::closeElement(...);

however, which *could* cause a bad character to sneak in. But the key places where links are rendered appear to use the safe forms of Html::* and/or LinkRenderer::makeLink which goes through LinkRenderer::buildAElement which uses Html::rawElement('a'...) and thus should be safe.

Of course I could have missed something, so double-checking requested please!

Sorry, my bad. I got confused reading the backscroll and thought it was only the parsoid patches that got deployed.

I agree that the patch to Html:: should be sufficient for categorytree issue.

My proposal is to keep this task quiet for another week, which will give @ABreault-WMF time to apply these same two fixes to the stable releases (1.43, LTS) and make security releases for them. After that, we should be ok to push patches for this issue in public and probably fork off a bunch of specific "hardening" tasks. For example, although as mentioned above the fact that LinkRenderer::makeLink() has been hardened ought to mitigate any issue with titles containing U+0338, it still is somewhat concerning to me that https://mnw.wiktionary.org/wiki/%CC%B8 exists (that's U+0338 as a title!) and I feel like we ought to eventually deploy the utfnormal fix to prevent that (it would rename the page to U+25CC U+0338 which would display better as well). But we can handle that as a separate related task, since it would involve running the cleanupTitles script etc. (But luckily this is the only title containing U+0338 which https://global-search.toolforge.org/?q=.*&regex=1&namespaces=&title=%5C%CC%B8 turns up.)

^ Sounds good, @cscott. Just to clarify: you did clear/purge the necessary page caches for U+0338 (I believe we confirmed that in Slack). It sounds like Wikimedia production should be well-patched at this point. Normally we'd keep something like the core patch protected until the core security release later this quarter, but I'd imagine that touches fairly volatile code and that we don't want to deal with merge conflicts every week for the deployed patch.

Change #1124909 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@REL1_39] Update wikimedia/parsoid to 0.16.5

https://gerrit.wikimedia.org/r/1124909

Change #1124915 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@REL1_42] Update wikimedia/parsoid to 0.19.2

https://gerrit.wikimedia.org/r/1124915

Change #1124919 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@REL1_43] Update wikimedia/parsoid to 0.20.2

https://gerrit.wikimedia.org/r/1124919

Change #1124919 merged by Subramanya Sastry:

[mediawiki/vendor@REL1_43] Update wikimedia/parsoid to 0.20.2

https://gerrit.wikimedia.org/r/1124919

Change #1124915 merged by Subramanya Sastry:

[mediawiki/vendor@REL1_42] Update wikimedia/parsoid to 0.19.2

https://gerrit.wikimedia.org/r/1124915

Change #1124909 merged by Subramanya Sastry:

[mediawiki/vendor@REL1_39] Update wikimedia/parsoid to 0.16.5

https://gerrit.wikimedia.org/r/1124909

Here's a backport for the core patch in T387130#10588539 for REL1_43

It applied pretty cleanly, other than includes/Message/Message.php having been moved to includes/language/Message/Message.php in 1.44

And a backport for the core patch in T387130#10588539 for REL1_42

It applied cleanly except for these files,

autoload.php
includes/Html/Html.php
includes/Message/Message.php
includes/OutputTransform/DefaultOutputPipelineFactory.php

They were all trivial differences except for DefaultOutputPipelineFactory which needed accounting for the changes in T363764.

Running the test HardenNFCTest.php also required a namespace change to ParserOptions in HardenNFC.php

Finally, a backport for the core patch in T387130#10588539 for REL1_39

The patch mostly applied cleanly; a few files like Message.php and Html.php were found in different directories.

No changes to autoload.php were needed since the new files weren't added.

HtmlHelperTrait.php didn't exist in 1.39 so those changes were dropped.

The OutputTransform pipeline didn't exist in 1.39 so the changes to DefaultOutputPipelineFactory.php and HardenNFC.php were dropped. However, as the commit message says, the text transform of HardenNFC is moved into ParserOutput::getText().

The RemexCompatFormatter.php constructor was lightly commented in a patch from T354361 so those changes were retained when applying the patch to that file.

The html/php parsertests section in badCharacters.txt was updated to reflect that in 1.39 the wgParserEnableLegacyHeadingDOM config has no effect; the legacy output is always generated.

@ABreault-WMF - Thanks for the patches. I'll leave it to @Reedy whether he wants to push those up now or wait for the proper MW release (T382316).

Thanks for the backports!

Do we know offhand how long this has been around for?

Marking as resolved to make it more obvious for my tracking.

Thanks for the backports!

Do we know offhand how long this has been around for?

Since Jan 14, 2009.

The behaviour in Unicode dates from 1993, AFAICT.

Are we good to make this task publicly visible? Any objections?

Please wait for the security release to go out... It should be this week, but with other stuff going on, it may have to wait till next week...

There's still patches that only exist on this task.

Reedy renamed this task from Potential javascript injection attack enabled by Unicode normalization in Action API to CVE-2025-32699: Potential javascript injection attack enabled by Unicode normalization in Action API. Apr 9 2025, 12:58 PM
Reedy removed a project: Patch-For-Review.

@ssastry Thanks for the catch! Updated core patch, plus Message tests:

So this patch didn't apply on master, so I looked what was in deployment...

reedy@Sams-Mac-mini Downloads % md5sum 02-T387130.patch 
fb81b166a6556deca5f06906add9c309  02-T387130.patch
reedy@Sams-Mac-mini Downloads % md5sum master-0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch 
9b140d5ee71aa62fdd9b4483230a4a07  master-0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch

The current version is:

However, this also doesn't apply on master..

$ git am ~/02-T387130.patch
Applying: SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization
error: patch failed: includes/language/Message/Message.php:1019
error: includes/language/Message/Message.php: patch does not apply
Patch failed at 0001 SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.

But does with a -3, so the updated version is going to be

Change #1135770 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@REL1_39] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135770

Change #1135775 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@REL1_42] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135775

Change #1135783 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@REL1_43] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135783

Change #1135770 merged by jenkins-bot:

[mediawiki/core@REL1_39] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135770

Change #1135794 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@master] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135794

Change #1135783 merged by jenkins-bot:

[mediawiki/core@REL1_43] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135783

Change #1135794 merged by jenkins-bot:

[mediawiki/core@master] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135794

Change #1135775 merged by jenkins-bot:

[mediawiki/core@REL1_42] SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization

https://gerrit.wikimedia.org/r/1135775

Looks like now we're good to make this task publicly visible? Any objections remaining?

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett changed the edit policy from "Custom Policy" to "All Users".
sbassett removed a project: Patch-For-Review.

Change #1163476 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[utfnormal@master] Replace isolated combining characters

https://gerrit.wikimedia.org/r/1163476

Change #1163476 merged by jenkins-bot:

[utfnormal@master] Replace isolated combining characters

https://gerrit.wikimedia.org/r/1163476