Page MenuHomePhabricator

CVE-2025-61638: Sanitizer::validateAttributes data-XSS
Closed, ResolvedPublicSecurity

Description

Reproduce

{{#tag:pre|CLICK pre|data-/onclick=alert("鬼影233")}}
{{#categorytree:CLICK categorytree|data-/onclick=alert("向世界问好")}}

On the MediaWiki, Wikipedia, and most other wiki sites, preview and click on these. Expected to see the alert message.

Research

Sanitizer::validateAttributes and Sanitizer::validateTagAttributes (Sanitizer.php#L510-L520) Expected to clean up attributes. However, attributes beginning with data- will almost completely bypass the cleanup.

This simple vulnerability can be reproduced in the latest and sufficiently old versions (such as 1.38.6).

Credits

@gui-ying233 (Reporter): Discovered this vulnerability in a custom extension.

@Func & @dragon-fish: Help identify, locate, test this vulnerability, and provide more examples of its exploitation.

BTW

Sorry, I'm not good at PHP or English. And if possible, I would like to be added to the Hall of Fame.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Also note that Parsoid has its own copy of Sanitizer.php and the same patch should be applied there as well.

Hm, that’s going to be interesting. I guess that means a separate security patch to mediawiki/vendor?

Also note that Parsoid has its own copy of Sanitizer.php and the same patch should be applied there as well.

Is that an exact copy, or just a similar file? Because so far I’ve only found wikimedia/parsoid/src/Core/Sanitizer.php, and it has some similar-looking code (like isReservedDataAttribute()), but I don’t see a validateAttributes() method in there.

FWIW, so far I can’t reproduce the issue on Parsoid – my {{#tag:pre|CLICK pre|data-/onclick=alert("T401099")}} page gets parsed as <pre data-="" typeof="mw:Extension/pre mw:Transclusion" about="#mwt1" id="mwAg" data-mw="{[much json]}">CLICK pre</pre> instead.

FYI
@AmeroHan (one of our volunteers) provided a standard definition of the characters allowed in XML attribute names:
https://www.w3.org/TR/xml/#NT-Name
https://www.w3.org/TR/xml11/#NT-Name

Regarding HTML, it has a specific definition:
https://html.spec.whatwg.org/#syntax-attribute-name

@

Same here, i can't reproduce this on parsoid. This is the output for the affected article:

<pre data-="" typeof="mw:Extension/pre mw:Transclusion" about="#mwt1" id="mwAg" data-mw="{&quot;name&quot;:&quot;pre&quot;,&quot;attrs&quot;:{&quot;data-&quot;:&quot;&quot;,&quot;onclick&quot;:&quot;alert(\&quot;鬼影233\&quot;)&quot;},&quot;body&quot;:{&quot;extsrc&quot;:&quot;barfoo&quot;},&quot;parts&quot;:[{&quot;template&quot;:{&quot;target&quot;:{&quot;wt&quot;:&quot;#tag:pre&quot;,&quot;function&quot;:&quot;tag&quot;},&quot;params&quot;:{&quot;1&quot;:{&quot;wt&quot;:&quot;barfoo&quot;},&quot;data-/onclick&quot;:{&quot;wt&quot;:&quot;alert(\&quot;鬼影233\&quot;)&quot;}},&quot;i&quot;:0}}]}">barfoo</pre>
<p id="mwAw"><span about="#mwt3" typeof="mw:Transclusion" id="mwBA" data-mw="{&quot;parts&quot;:[{&quot;template&quot;:{&quot;target&quot;:{&quot;wt&quot;:&quot;#categorytree:CLICK categorytree&quot;,&quot;function&quot;:&quot;categorytree&quot;},&quot;params&quot;:{&quot;data-/onclick&quot;:{&quot;wt&quot;:&quot;alert(\&quot;向世界问好\&quot;)&quot;}},&quot;i&quot;:0}}]}">{{#categorytree:CLICK categorytree|data-/onclick=alert("向世界问好")}}</span></p>
在T401099#11057173中,@SomeRandomDeveloper写道:

Note that this issue appears to have been publicly disclosed in the code of the extension mentioned in the task description: https://github.com/moegirlwiki/mediawiki-extension-MoeImgTag/commit/daec7111d39feb8befa3b7b155fef269e47c011b

@dragon-fish please remove this commit or any comments explaining the issue from the repository

The GitHub archive is difficult to remove, and I have already submitted a ticket requesting the removal of the commit archive.
We initially thought this was just a vulnerability caused by our custom extension, but upon checking the MW source code, we found that it affects a wide range. 😰

FYI
@AmeroHan (one of our volunteers) provided a standard definition of the characters allowed in XML attribute names:
https://www.w3.org/TR/xml/#NT-Name
https://www.w3.org/TR/xml11/#NT-Name

Yes, do NOT use this defintion. See the github issue I linked above.

Regarding HTML, it has a specific definition:
https://html.spec.whatwg.org/#syntax-attribute-name

This isn't quite right either. What's relevant is the set of characters actually allowed by the HTML parser. See the github issue.

Lucas, I'd prefer *not* to do the cleanup in the security patch, if possible. Can we get a slimmed down security patch that *just* adds [^\t \r\n/>\0:]*$ to the regexp. I'd prefer to review the clean up patch on gerrit once we've patched the immediate security issue.

Okay, here’s a shorter version. Though I still want to test what happens if you have " or ' in the attribute name.

This should be a minimal patch to address the issue:

Same patch, to Parsoid:
{F65711120}

I believe Parsoid is hotpatched for security in prod the same way that core is. Once we get the security fixes deployed to prod, we can merge a gerrit patch and deploy to vendor as usual.

Perhaps we should release a shameful fix first. At least ensure it is no longer usable before this security issue becomes widely known. Since it puts many websites running MediaWiki at actual risk.

Maybe like this?

- !preg_match( '/^data-[^:]*$/i', $attribute ) &&
+ !preg_match( '/^data-[^:\s\/<>]*$/i', $attribute ) &&

Or even further:

diff
- !preg_match( '/^data-[^:]*$/i', $attribute ) &&
+ !preg_match( '/^data-[a-z0-9-]+$/i', $attribute ) &&

This should be a minimal patch to address the issue:

That fails because the / isn’t escaped (“Unknown modifier '>'”).

Though I still want to test what happens if you have " or ' in the attribute name.

Yeah, no, this doesn’t look safe to me yet.

> Sanitizer::validateTagAttributes(['data-=""onclick="alert()"' => 'foo'], 'div')
= [
    "data-=""onclick="alert()"" => "foo",
  ]

I expect we want to at least filter out = as well? Not 100% sure about ' or " in isolation.

Yes, sorry, forgot to change the delimiters:

Please don't do this:

^data-[a-z0-9-]+$/i', $

that's going to break a lot of content, in addition to breaking CI. The set of bad characters is actually quite small.

AFAICT that patch still has the same issue as my last patch, i.e. the second half of T401099#11057866.

Note that the root cause here is not actually in the Sanitizer, but in the code which "splits" attributes but doesn't treat non-attribute characters as terminators. We're making a minimal patch here, but actually fixing this properly should avoid sending non-attributes to the Sanitizer in the first place. Both category tree and {{#tag}} are assuming that any valid "transclusion named argument name" is also a valid "html attribute name" and that's the root cause here. We don't have this issue for HTML tags in general in wikitext (both core and Parsoid) because (I believe) we correctly split attribute names according to the HTML parser rules in both of those cases.

New “semi-minimal” patch, with the regex based on @cscott (pipe-delimiters are nicer than backslash-slash-escapes), but with = added to the regex to also disallow (can never be in an attribute name, since in a real parse it would terminate it), and with the basic tests still included.

AFAICT that patch still has the same issue as my last patch, i.e. the second half of T401099#11057866.

Sorry, you're right. I'm trying to do this while en-route to Wikimania. For reference, this is the spec which the patch should match:

LenientAttributeNameStartChar := anything except tab, LF, CR, FF, space, /, >, NULL. (R
LenientAttributeNameChar := LenientAttributeNameStartChar but also exclude =

And here are updated versions of the patches:

We don't have this issue for HTML tags in general in wikitext (both core and Parsoid) because (I believe) we correctly split attribute names according to the HTML parser rules in both of those cases.

That matches my understanding as well FWIW.

Ok, since I'm travelling I'm happy to hand this off to @Lucas_Werkmeister_WMDE to take across the finish line. And for the record, I approve of the "non-minimal" clean up patch too, I just tend towards conservativism and try to make the security patches as small as possible, especially since they need to get backported to all active branches etc. Happy to review and merge the non-minimal versions on gerrit once the immediate hole is plugged.

I don't think the Parsoid patch is strictly necessary, but we do try to keep the core and Parsoid versions of the Sanitizer in sync. Eventually we're going to replace the Sanitizer in core with Parsoid's copy to eliminate the redundancy, but we're not quite there yet. (Parsoid's Sanitizer works on Tokens and core's works on strings, so they are not completely compatible yet even though they are as parallel as possible). But in terms of the security backport we could probably get away with a backport only to core's Sanitizer. The "public" fix in gerrit we can apply to core/Parsoid together to keep them in sync.

Okay, here’s a core patch that we can deploy (i.e. with a commit message), src/ changes by @cscott, but with my tests still included (I think it’s valuable to include them – if there’s a merge failure, running the tests should help in figuring out whether the resolution was correct or not):

And here’s @cscott’s Parsoid patch as it will apply to mediawiki/vendor (i.e. with a barebones commit message, and wikimedia/parsoid/ prepended to the paths; no tests here):

If I don’t hear any objections, I’ll deploy this quite soon.

Though I still want to test what happens if you have " or ' in the attribute name.

By the way, the answer to this seems to be that " and ' on their own are harmless, they just become literal parts of the attribute name (and confuse some syntax highlighters). This writes “baz” to the document:

<script data-foo"bar'="baz">document.write(document.currentScript.dataset['foo"bar\''])</script>

Yeah, the set of characters accepted by the HTML parser is quite surprisingly large. In retrospect, this is an obvious place for a security bug to creep in, because the set of characters specified in the most "obvious" documentation (the XML name spec, the HTML attribute syntax spec, the set of characters accepted by DOM methods) is quite different from the (poorly-documented) set of characters actually accepted by the HTML parser. PHP also has various issues here, since it (for historical reasons) uses an XML library to parse HTML and as we've seen the character sets aren't quite the same. There is work on-going in the standards bodies to address this (https://github.com/RickAllMighty8195/dom/pull/57), and I believe the DOM methods and HTML spec will "soon" agree. Sometime after that PHP might agree as well. But we'll be cleaning up for quite a while. From a security perspective it is "lucky" that the HTML parser spec is a superset of all the various competing standards, otherwise there would be even more corners to sneak through.

FWIW, so far I can’t reproduce the issue on Parsoid – my {{#tag:pre|CLICK pre|data-/onclick=alert("T401099")}} page gets parsed as <pre data-="" typeof="mw:Extension/pre mw:Transclusion" about="#mwt1" id="mwAg" data-mw="{[much json]}">CLICK pre</pre> instead.

See transcript below.

╭─subbu@earth ~/work/wmf/parsoid  ‹master*› 
╰─➤  echo "{{#tag:pre|CLICK pre|data-/onclick=alert("T401099")}}" | php bin/parse.php --trace peg --dump tplsrc
0-[peg]        | ---->    [{"type":"SelfclosingTagTk","name":"template","attribs":[{"k":"#tag:pre","v":"","srcOffsets":[2,10,10,10]},{"k":"","v":"CLICK pre","srcOffsets":[11,11,11,20]},{"k":"data-/onclick","v":"alert(T401099)","srcOffsets":[21,34,35,49]}],"dataParsoid":{"tsr":[0,51],"src":"{{#tag:pre|CLICK pre|data-/onclick=alert(T401099)}}"}}]
[dump] ============================ template source ============================
TEMPLATE:tagTRANSCLUSION:"{{#tag:pre|CLICK pre|data-/onclick=alert(T401099)}}"
--------------------------------------------------------------------------------
<pre data-/onclick="alert(T401099)">CLICK pre</pre>
--------------------------------------------------------------------------------

1-[peg]        | ---->    [{"type":"SelfclosingTagTk","name":"extension","attribs":[{"k":"typeof","v":"mw:Extension"},{"k":"name","v":"pre"},{"k":"source","v":"<pre data-/onclick=\"alert(T401099)\">CLICK pre</pre>"},{"k":"options","v":[{"k":"data-","v":"","srcOffsets":[5,10,10,10],"_type_":"Wikimedia\\Parsoid\\Tokens\\KV"},{"k":"onclick","v":"alert(T401099)","srcOffsets":[11,18,20,34],"vsrc":"alert(T401099)","_type_":"Wikimedia\\Parsoid\\Tokens\\KV"}]}],"dataParsoid":{"tsr":[0,51],"stx":"html","src":"<pre data-/onclick=\"alert(T401099)\">CLICK pre</pre>","extTagOffsets":[0,51,36,6]}}]
0-[peg]        | ---->    [{"type":"NlTk","dataParsoid":{"tsr":[51,52]}}]

This is because the PEG tokenizer in Parsoid splits the attribute name on "/". The grammar rules in Parsoid are:

generic_attribute_name_piece [inline] =
	$[^ \t\r\n\0/=><&{}\-!|]+
	/ !inline_breaks
	// \0/=> is the html5 attribute name set we do not want.
	@(
		directive
		/ less_than
		/ $( !( space_or_newline / [\0/=><] ) . )
	)

EDIT: In an earlier version of this comment, I had pointed to $[^ \t\r\n\0/=><&{}\-!|]+. But @ABreault-WMF clarified that there is an alternative rule, and so the real set of excluded chars are [\0/=><] which is why the PEG tokenizer rejects the data-/onclick name.

So, the bad attributes won't make it to the sanitizer in Parsoid to trigger this issue there.

Note that the root cause here is not actually in the Sanitizer, but in the code which "splits" attributes but doesn't treat non-attribute characters as terminators. We're making a minimal patch here, but actually fixing this properly should avoid sending non-attributes to the Sanitizer in the first place. Both category tree and {{#tag}} are assuming that any valid "transclusion named argument name" is also a valid "html attribute name" and that's the root cause here. We don't have this issue for HTML tags in general in wikitext (both core and Parsoid) because (I believe) we correctly split attribute names according to the HTML parser rules in both of those cases.

The root cause is treating the name of a parser function/template/transclusion "named attribute" as a valid HTML attribute name. I think Parsoid's <pre> implementation is different enough, but I wouldn't swear that there's not a similar vulnerability in some place where Parsoid invokes extension code directly *without* forming a literal extension tag string and tokenizing it.

In any case, @Lucas_Werkmeister_WMDE is patching Parsoid as well so we don't need to think too hard about this.

This comment was removed by cscott.

Where Parsoid has a native implementation, the args are tokenized in Parsoid by the PEG tokenizer which is protected as above. But, where Parsoid doesn't have a native implementation and calls the legacy parser to process the extension, the content should have been sanitized before we get the HTML string back from core. Parsoid simply tunnels the HTML as a DOM fragment through without sanitizing it again. So, if core's Sanitizer is patched, Parsoid is also effectively patched. So, I think it should be safe either way.

But, never hurts to be extra careful - patching Parsoid as well sounds good to me.

I deployed both patches (SAL), hopefully successfully.

Lucas_Werkmeister_WMDE lowered the priority of this task from Unbreak Now! to High.EditedAug 4 2025, 4:34 PM

Okay, I tried the wikitext from T401099#11057765 on a sandbox page (preview only) and saw no bad HTML. (Admittedly, I didn’t test it this way before syncing, so I never saw the vulnerable version. But I don’t see why it shouldn’t have been reproducible.) I think this is patched in WMF production now.

The core patch in T401099#11057962 applies cleanly to REL1_39, REL1_43, REL1_44. I can prepare Parsoid releases when we're ready to make MediaWiki point releases, since it'll require making the patch public.

Based on Reporting security bugs, do we need a CVE?

A CVE number will be assigned by the security team, usually around the time when the task and patches are prepared for publication.


Ideally we should try to match the HTML5 rules for valid attribute characters, rather than use an adhoc regexp. That said, being *more* restrictive than HTML5 is certainly better than being *less* restrictive. We do have non-ASCII attribute names used legitimately by our projects, though -- see T349310 for an example (data-ij).

I don’t know why I didn’t realize this yesterday, but this is of course a good idea to include in the test cases. Here’s a tiny diff to add that to the test cases (input and expected output):

diff --git i/tests/phpunit/includes/parser/SanitizerTest.php w/tests/phpunit/includes/parser/SanitizerTest.php
index 68f85fc471..50587b9791 100644
--- i/tests/phpunit/includes/parser/SanitizerTest.php
+++ w/tests/phpunit/includes/parser/SanitizerTest.php
@@ -164,6 +164,7 @@ public static function provideValidateTagAttributes() {
 				[
 					'data-wikitext' => 'wikitext',
 					'DATA-WIKITEXT-2' => 'WIKITEXT-2',
+					'data-wikitext-ij' => 'wikitext-IJ',
 					'data-mw' => 'disallow impersonating parsoid',
 					'DATA-mw' => 'disallow impersonating PARSOID',
 					'data-mw-extension' => 'disallow impersonating extension',
@@ -176,6 +177,7 @@ public static function provideValidateTagAttributes() {
 				[
 					'data-wikitext' => 'wikitext',
 					'DATA-WIKITEXT-2' => 'WIKITEXT-2',
+					'data-wikitext-ij' => 'wikitext-IJ',
 					# other attributes removed
 				]
 			],

Maybe I’ll try to update the patch in production later, so this gets automatically included once we publish that on Gerrit. (The updated test still passes locally, btw.)

Maybe I’ll try to update the patch in production later, so this gets automatically included once we publish that on Gerrit.

Done (I updated the patch in /srv/patches and it got applied to wmf.12+13 core as part of this scap).

sbassett changed the task status from Open to In Progress.Aug 11 2025, 4:51 PM
sbassett assigned this task to Lucas_Werkmeister_WMDE.
sbassett edited projects, added SecTeam-Processed; removed Patch-For-Review.
sbassett moved this task from Incoming to Watching on the Security-Team board.

Thank you everyone for this, this was a huge effort. We will track this with the next quarterly releases and prepare the HOF when it's completed.

Should we also prepare an FAQ similar to the 2021-12 security release/FAQ?

I believe this vulnerability has existed for 15 years (commit 4ad22c9), it is very easy and widely exploitable, although it requires editing and viewing actions…

sbassett changed Author Affiliation from N/A to Wikimedia Communities.Aug 13 2025, 4:19 PM
sbassett changed the visibility from "Custom Policy" to "Custom Policy".
sbassett changed Risk Rating from N/A to High.

Should we also prepare an FAQ similar to the 2021-12 security release/FAQ?

I believe this vulnerability has existed for 15 years (commit 4ad22c9), it is very easy and widely exploitable, although it requires editing and viewing actions…

We need to keep this issue private until the next core mediawiki security release, due out at the end of September 2025. I've subscribed #acl_release_security_pre_announce to this task for external operators to patch early if they so choose. @Reedy, who manages the core mediawiki security releases, can decide if a special supplemental announcement regarding this issue should accompany the core mediawiki security release.

sbassett added a parent task: Restricted Task.Aug 14 2025, 3:14 PM
Reedy renamed this task from Sanitizer::validateAttributes data-XSS to CVE-2025-61638: Sanitizer::validateAttributes data-XSS.Sep 29 2025, 1:22 PM

Change #1192139 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192139

Change #1192139 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192139

Change #1192151 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/services/parsoid@REL1_44] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192151

Change #1192152 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/services/parsoid@REL1_43] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192152

Change #1192154 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/services/parsoid@REL1_39] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192154

Change #1192154 merged by jenkins-bot:

[mediawiki/services/parsoid@REL1_39] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192154

Change #1192151 merged by jenkins-bot:

[mediawiki/services/parsoid@REL1_44] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192151

Change #1192152 merged by jenkins-bot:

[mediawiki/services/parsoid@REL1_43] Sanitize data- attributes

https://gerrit.wikimedia.org/r/1192152

Change #1192167 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a25

https://gerrit.wikimedia.org/r/1192167

Change #1192167 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a25

https://gerrit.wikimedia.org/r/1192167

Change #1193148 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@REL1_39] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193148

Change #1193148 merged by jenkins-bot:

[mediawiki/core@REL1_39] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193148

Change #1193172 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@REL1_43] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193172

Change #1193172 merged by jenkins-bot:

[mediawiki/core@REL1_43] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193172

Change #1193196 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@REL1_44] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193196

Change #1193196 merged by jenkins-bot:

[mediawiki/core@REL1_44] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193196

Change #1193218 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/core@master] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193218

Change #1193218 merged by jenkins-bot:

[mediawiki/core@master] SECURITY: Sanitize data- attributes

https://gerrit.wikimedia.org/r/1193218

T407617: The data HTML attribute name can't include the underscore sign anymore might be due to the fixes here? (Commenting here rather than on that task because this one isn’t public yet.)

T407617: The data HTML attribute name can't include the underscore sign anymore might be due to the fixes here? (Commenting here rather than on that task because this one isn’t public yet.)

No, T407131

We need to keep this issue private until the next core mediawiki security release, due out at the end of September 2025. I've subscribed #acl_release_security_pre_announce to this task for external operators to patch early if they so choose. @Reedy, who manages the core mediawiki security releases, can decide if a special supplemental announcement regarding this issue should accompany the core mediawiki security release.

It's November, so I want to know why we are still keeping this ticket and CVE private?

It's November, so I want to know why we are still keeping this ticket and CVE private?

I'll make it public now. Sometimes these tasks stay private a bit longer than intended, especially when the security releases are as large and complex as they were last time around.

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett changed the edit policy from "Custom Policy" to "All Users".