
Loading a particular diff on ckbwiki causes PHP to segfault
Closed, ResolvedPublicSecurity

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:

You will get an error: "Service Temporarily Unavailable
Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes." No further information.

Details

Risk Rating
High
Author Affiliation
Wikimedia Communities

Event Timeline

Urbanecm subscribed.

Can (reliably) reproduce. This is what I can see in Logstash when I tried to reproduce (link for those with access: https://logstash.wikimedia.org/goto/9a067a6e696ce7d5b0e8b2a10ce3fdf8):

{
  "_index": "logstash-2021.09.12",
  "_type": "_doc",
  "_id": "uMzK23sBHB1njYG3BvUM",
  "_score": 1,
  "_source": {
    "host": "cp3058.esams.wmnet",
    "@timestamp": "2021-09-12T20:54:12.667Z",
    "tags": [
      "input-kafka-rsyslog-shipper",
      "rsyslog-shipper",
      "kafka",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "uri_path_decoded": "/w/index.php",
    "uri_host": "ckb.wikipedia.org",
    "response_size": 597,
    "x_analytics": "WMF-Last-Access=12-Sep-2021;WMF-Last-Access-Global=12-Sep-2021;https=1;client_port=50366",
    "normalized_message": "%{message}",
    "uri_path": "/w/index.php",
    "uri_query": "?diff=616224",
    "user_agent": "redacted",
    "@version": "1",
    "x_cache": "cp3064 miss, cp3058 miss",
    "time_firstbyte": 2.2754,
    "backend": "mw2387.codfw.wmnet",
    "http_status": "503",
    "logsource": "centrallog1001",
    "dt": "2021-09-12T20:54:04Z",
    "program": "webrequest",
    "sequence": 12108825965,
    "ip": "redacted",
    "referer": "-",
    "accept_language": "en-us,en;q=0.5",
    "timestamp": "2021-09-12T20:54:12.611771+00:00",
    "content_type": "text/html",
    "range": "-",
    "tls": "vers=TLSv1.3;keyx=X25519;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new",
    "level": "NOTICE",
    "http_request": "GET https://ckb.wikipedia.org/w/index.php?diff=616224",
    "uri_query_decoded": "?diff=616224",
    "type": "webrequest",
    "cache_status": "miss",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "facility": "user",
    "http_method": "GET"
  },
  "fields": {
    "@timestamp": [
      "2021-09-12T20:54:12.667Z"
    ]
  }
}

A quick look over the 5xx Logstash dashboard shows that the errors for this URI are not limited to a single (cache) host or a single edge DC. Requesting this URI directly from a random appserver shows the 503 comes from MediaWiki itself:

[urbanecm@mw2310 ~]$ curl -I --connect-to "::$HOSTNAME" 'https://ckb.wikipedia.org/w/index.php?diff=616224'
HTTP/1.1 503 Service Unavailable
Date: Sun, 12 Sep 2021 23:09:04 GMT
Server: mw2310.codfw.wmnet
Last-Modified: Wed, 29 Mar 2017 21:22:29 GMT
Etag: "3ec-54be52c758984"
Accept-Ranges: bytes
Content-Length: 1004
Backend-Timing: D=354086 t=1631488144657366
Content-Type: text/html
X-Envoy-Upstream-Service-Time: 354

[urbanecm@mw2310 ~]$

Then I tried to use X-Wikimedia-Debug with verbose logs. This is what caught my attention:

2021-09-12 23:14:13 [c7359ae6-f58b-47db-a5bf-6e36dadcf8fc] mwdebug2001 ckbwiki 1.37.0-wmf.21 wfDebug DEBUG: Connected to pool counter server: 10.192.16.129
2021-09-12 23:14:13 [c7359ae6-f58b-47db-a5bf-6e36dadcf8fc] mwdebug2001 ckbwiki 1.37.0-wmf.21 wfDebug DEBUG: Sending pool counter command: ACQ4ANY ckbwiki:rcache:616224:dateformat=default!userlang=en 2 100 0
2021-09-12 23:14:13 [c7359ae6-f58b-47db-a5bf-6e36dadcf8fc] mwdebug2001 ckbwiki 1.37.0-wmf.21 wfDebug DEBUG: Got pool counter response: LOCKED

It looks like PoolCounter is refusing to accept a new parse attempt?
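
For context, a small sketch of how the logged PoolCounter command breaks down. The field meanings (worker limit, max queue size, timeout) are my reading of the PoolCounter protocol, not something stated in this task:

```python
# Sketch: split the PoolCounter ACQ4ANY command from the debug log into
# its fields. Field names are assumptions based on the PoolCounter
# protocol as I understand it.
def parse_poolcounter_cmd(line):
    cmd, key, workers, maxqueue, timeout = line.split()
    return {
        "command": cmd,
        "key": key,
        "workers": int(workers),    # concurrent workers allowed per key
        "maxqueue": int(maxqueue),  # max clients allowed to queue on the key
        "timeout": int(timeout),    # seconds to wait for the lock
    }

cmd = parse_poolcounter_cmd(
    "ACQ4ANY ckbwiki:rcache:616224:dateformat=default!userlang=en 2 100 0"
)
# With a timeout of 0, a lock that is already held would come back as
# LOCKED immediately rather than queueing -- consistent with the log above.
```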

Adding some initial tags to get attention from more people, feel free to adjust as desired.

So, the revid in question belongs to this page: https://ckb.wikipedia.org/w/index.php?title=%D9%84%DB%8C%D8%A8%D8%B1%DB%95_%D8%A6%DB%86%D9%81%DB%8C%D8%B3. Looking at its history, this is because InternetArchiveBot went rogue and increased the size of the page to 2 million bytes in 2019. Edits from https://ckb.wikipedia.org/w/index.php?title=%D9%84%DB%8C%D8%A8%D8%B1%DB%95_%D8%A6%DB%86%D9%81%DB%8C%D8%B3&oldid=615740 to https://ckb.wikipedia.org/w/index.php?title=%D9%84%DB%8C%D8%A8%D8%B1%DB%95_%D8%A6%DB%86%D9%81%DB%8C%D8%B3&oldid=616696 inclusive cannot be displayed currently.

615740, the smallest revision on this page that cannot be displayed, is 183,260 bytes (per the rev_len field in the DB). Other revisions of greater size, like https://ckb.wikipedia.org/w/index.php?oldid=756522 (1,214,516 bytes), work fine.

What's special about rev 616224?

Legoktm raised the priority of this task from High to Needs Triage.Sep 13 2021, 4:38 AM
Legoktm set Security to Software security bug.
Legoktm added projects: Security, Security-Team.
Legoktm changed the visibility from "Public (No Login Required)" to "Custom Policy".
Legoktm changed the subtype of this task from "Bug Report" to "Security Issue".

Protecting as a security issue because PHP is segfaulting:

Sep 12 23:14:06 mwdebug2001 kernel: [18331232.287900] php-fpm7.2[26845]: segfault at 7ffe56adef58 ip 000055d5af1cdb8b sp 00007ffe56adef60 error 6 in php-fpm7.2[55d5af084000+28d000]
Sep 12 23:14:06 mwdebug2001 kernel: [18331232.287910] Code: 49 8d 57 1f 4c 29 e2 48 85 c0 0f 84 67 09 00 00 48 8b 48 10 4c 8d 34 0a 4c 3b 73 08 0f 83 58 09 00 00 48 8d 7c 08 18 4c 89 e6 <e8> 90 66 eb ff 48 8b 03 4c 89 70 10 48 8b 03 48 85 c0 0f 84 18 09
Sep 12 23:14:06 mwdebug2001 kernel: [18331232.288219] Core dump to |/usr/lib/systemd/systemd-coredump 26845 33 33 11 1631488446 0 php-fpm7.2 pipe failed
Sep 12 23:14:07 mwdebug2001 kernel: [18331233.129816] php-fpm7.2[26851]: segfault at 7ffe56adef58 ip 000055d5af1cdb8b sp 00007ffe56adef60 error 6 in php-fpm7.2[55d5af084000+28d000]
Sep 12 23:14:07 mwdebug2001 kernel: [18331233.129826] Code: 49 8d 57 1f 4c 29 e2 48 85 c0 0f 84 67 09 00 00 48 8b 48 10 4c 8d 34 0a 4c 3b 73 08 0f 83 58 09 00 00 48 8d 7c 08 18 4c 89 e6 <e8> 90 66 eb ff 48 8b 03 4c 89 70 10 48 8b 03 48 85 c0 0f 84 18 09
Sep 12 23:14:07 mwdebug2001 kernel: [18331233.130140] Core dump to |/usr/lib/systemd/systemd-coredump 26851 33 33 11 1631488447 0 php-fpm7.2 pipe failed
Sep 12 23:14:13 mwdebug2001 kernel: [18331238.802208] php-fpm7.2[26844]: segfault at 7ffe56adef58 ip 000055d5af1cdb8b sp 00007ffe56adef60 error 6 in php-fpm7.2[55d5af084000+28d000]
Sep 12 23:14:13 mwdebug2001 kernel: [18331238.802218] Code: 49 8d 57 1f 4c 29 e2 48 85 c0 0f 84 67 09 00 00 48 8b 48 10 4c 8d 34 0a 4c 3b 73 08 0f 83 58 09 00 00 48 8d 7c 08 18 4c 89 e6 <e8> 90 66 eb ff 48 8b 03 4c 89 70 10 48 8b 03 48 85 c0 0f 84 18 09
Sep 12 23:14:13 mwdebug2001 kernel: [18331238.802458] Core dump to |/usr/lib/systemd/systemd-coredump 26844 33 33 11 1631488453 0 php-fpm7.2 pipe failed
Sep 12 23:14:16 mwdebug2001 kernel: [18331242.097408] php-fpm7.2[8579]: segfault at 7ffe56adef58 ip 000055d5af1cdb8b sp 00007ffe56adef60 error 6 in php-fpm7.2[55d5af084000+28d000]
Sep 12 23:14:16 mwdebug2001 kernel: [18331242.097419] Code: 49 8d 57 1f 4c 29 e2 48 85 c0 0f 84 67 09 00 00 48 8b 48 10 4c 8d 34 0a 4c 3b 73 08 0f 83 58 09 00 00 48 8d 7c 08 18 4c 89 e6 <e8> 90 66 eb ff 48 8b 03 4c 89 70 10 48 8b 03 48 85 c0 0f 84 18 09
Sep 12 23:14:16 mwdebug2001 kernel: [18331242.097688] Core dump to |/usr/lib/systemd/systemd-coredump 8579 33 33 11 1631488456 0 php-fpm7.2 pipe failed

Because it's a diff, I suspect wikidiff2 is segfaulting when given these two particular revisions. cc'ing people who've worked on wikidiff2 recently.

Legoktm renamed this task from Unspecified fatal error when loading a particular diff on ckbwiki to Loading a particular diff on ckbwiki causes PHP to segfault.Sep 13 2021, 4:43 AM
Legoktm triaged this task as High priority.

@Legoktm It's not diff-specific though. https://ckb.wikipedia.org/w/index.php?oldid=616224 has the same error. Something segfault-y also happens when trying to parse the revision from the CLI:

[urbanecm@mwmaint2002 ~]$ mwscript shell.php ckbwiki
Psy Shell v0.10.5 (PHP 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63 — cli) by Justin Hileman
>>> $page = \MediaWiki\MediaWikiServices::getInstance()->getWikiPageFactory()->newFromTitle(Title::newFromText('لیبرە ئۆفیس'))
=> WikiPage {#3137
     +mTitle: Title {#5569
       +mTextform: "لیبرە ئۆفیس",
       +mUrlform: "%D9%84%DB%8C%D8%A8%D8%B1%DB%95_%D8%A6%DB%86%D9%81%DB%8C%D8%B3",
       +mDbkeyform: "لیبرە_ئۆفیس",
       +mNamespace: 0,
       +mInterwiki: "",
       +mFragment: "",
       +mArticleID: -1,
       +mRestrictions: [],
       +mCascadeRestriction: null,
       +mCascadingRestrictions: null,
       +mCascadeSources: null,
       +mRestrictionsLoaded: false,
       +prefixedText: null,
       +mTitleProtection: null,
       +mDefaultNamespace: 0,
       +mRedirect: null,
     },
     +mDataLoaded: false,
     +mLatest: false,
     +mPreparedEdit: false,
   }
>>> $page->getParserOutput($page->makeParserOptions(RequestContext::getMain()), 616224)
[urbanecm@mwmaint2002 ~]$

To me, it looks like something in the parser logic is segfaulting rather than wikidiff2 -- or maybe diff is somehow invoked when retrieving a single revision as well?

At least it cannot be caused by the recent changes to wikidiff2, since they weren't deployed yet (T285857). That said, this looks interesting, so I'm going to take a look.

For the record, this is the content of the revision.

This can be also reproduced locally via:

urbanecm@notebook  ~/tmp
$ php ~/unsynced/gerrit/mediawiki/core/maintenance/edit.php --wiki=awiki --user='Martin Urbanec' --summary='Test' 616224 < 616224.wikitext
Saving...Segmentation fault
urbanecm@notebook  ~/tmp
$

where 616224.wikitext is the file I uploaded above. My devwiki is an almost-empty MW instance, with no templates or anything similar.

Attempting to open this file even crashes my gedit. I had a closer look. Apparently the file contains about 6200 nested template calls. Yes, a nesting level of 6200. And all of this is in a single line.
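
Input with that structure can be approximated with a one-liner. This is a sketch only: the template name "x" is a placeholder, since the actual template calls in the revision aren't reproduced here:

```python
# Build wikitext with 6200 nested template calls on a single line,
# approximating the pathological structure described above.
# "x" is a hypothetical template name, not one from the real revision.
DEPTH = 6200
wikitext = "{{x|" * DEPTH + "y" + "}}" * DEPTH
# 6200 openers (4 chars each) + 1 payload char + 6200 closers (2 chars
# each) = 37,201 characters, all on one line.
```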

I'm not sure if this helps. Just wanted to leave this here.

I can also reproduce this locally, but only when viewing the page (i.e. I can save it just fine). The segfault happens while parsing the page. Using xdebug for the page view I can tell that it reaches Parser::internalParse() -> Parser::replaceVariables() -> ::preprocessToDom() -> Preprocessor_Hash::buildDomTreeArrayFromText and then it dies somewhere in there. I really didn't want to step through the whole parsing.

FTR, I'm using PHP 7.4.1.

Actually... I switched to lame debugging (adding die() calls to get an idea of where execution stops), and I'm seeing something weird... Parser::parse() calls ::internalParse(). If I add a die() after the internalParse() call, it's not reached, meaning it exits somewhere during the call. Then internalParse() has two exit points, one is after the onParserBeforeInternalParse hook is run, but you can comment this out and the segfault is still there. The other one is at the end of the method body, and it's a simple return $text. If I add an exit() before that line, it IS reached. Unless I'm doing something wrong, it seems like PHP is segfaulting while returning control to the caller. I've also tried changing the return line to

try {
	return $text;
} finally {
	if($GLOBALS['debugsegfault']??false) die('dead');
}

just to confirm that accessing $text doesn't have weird side effects, but it doesn't. It really seems to be happening while PHP is processing the return.

Other things I've tried:

  • Changing that return line to return a literal empty string. It doesn't matter.
  • Unsetting all local variables before the return (e.g. to catch RAII-style scoped closures and the like). Doesn't matter.
  • Forcing a GC run before the return / disabling the GC. Doesn't matter.
  • Opcache is already disabled.

I hope this is helpful if somebody wants to write a minimal test case.

Just to confirm that this is a parser thing:

daimona@mwmaint2002:~$ mwscript shell.php ckbwiki
Psy Shell v0.10.5 (PHP 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63 — cli) by Justin Hileman
>>> $page = \MediaWiki\MediaWikiServices::getInstance()->getWikiPageFactory()->newFromTitle(Title::newFromText('لیبرە ئۆفیس'))
=> WikiPage {#3136 [snip]
>>> $rs = \MediaWiki\MediaWikiServices::getInstance()->getRevisionStore()
=> MediaWiki\Revision\RevisionStore {#3143}
>>> $rev = $rs->getRevisionByTitle($page->getTitle(),616224)
=> MediaWiki\Revision\RevisionStoreRecord {#3182}
>>> $cnt = $rev->getContent('main')
=> WikitextContent {#5613}
>>>  $parser = \MediaWiki\MediaWikiServices::getInstance()->getParser();
=> Parser {#5632 [snip]
>>> $parser->parse($cnt->getText(),$page->getTitle(),ParserOptions::newFromAnon(),true,true,616224)

You can always count on me to find the strangest bugs. Sorry about this.


Can you try dumping backtraces throughout this function and see if it changes at some point? It sounds like the return address may have gotten overwritten at some point in that function.

Eurgh... I just upgraded my MW install to git master, and the segfault changed. It still happens, but now it's inside the replaceVariables() call (as xdebug was showing), not when returning from internalParse. I also cannot reproduce the issue by running the commands from T290831#7348176.

While debugging, the segfault started happening after the return once again, without (I think) any actual code modification. However, I just realized that internalParse() can be called from extension code. In my case, Ext:Cite was calling it, which is why I thought the segfault happened on the return line. I've disabled the extension, and can now confirm that the faulty call is the one to Parser::replaceVariables, and the segfault is inside Preprocessor_Hash::buildDomTreeArrayFromText.

sbassett changed Author Affiliation from N/A to Wikimedia Communities.Sep 13 2021, 3:42 PM
sbassett changed Risk Rating from N/A to High.
sbassett added a project: Vuln-DoS.
tstarling claimed this task.
tstarling subscribed.

Sep 12 23:14:06 mwdebug2001 kernel: [18331232.287900] php-fpm7.2[26845]: segfault at 7ffe56adef58 ip 000055d5af1cdb8b sp 00007ffe56adef60 error 6 in php-fpm7.2[55d5af084000+28d000]

Segfault at 7ffe56adef58
sp 7ffe56adef60

You can see the data address is 8 bytes below the stack pointer (sp). The stack grows down. Most likely the stack failed to grow because the stack size limit was exhausted.
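
A quick sanity check on the two addresses from the kernel log (a trivial sketch; nothing here is specific to this host):

```python
# The faulting data address sits just below the stack pointer, consistent
# with a write during a failed attempt to grow the stack downward.
fault_addr = 0x7FFE56ADEF58  # "segfault at" address from the kernel log
sp = 0x7FFE56ADEF60          # "sp" value from the same log line
gap = sp - fault_addr        # one 64-bit slot below sp
```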

Note that PHP 7+ has a separate VM stack, so plain infinite recursion is not sufficient to trigger this.
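
To illustrate the distinction, here is an analogy in Python rather than PHP (this is not the actual MediaWiki code, and the function and depth numbers are hypothetical): a recursive-descent pass over nested `{{…}}` markup consumes one stack frame per nesting level, so input nested thousands of levels deep exhausts a fixed-size stack even when ordinary user-level recursion is accounted for separately:

```python
# Analogy only: a recursive measure of {{...}} nesting depth. Each
# nesting level costs one stack frame, mirroring how a recursive
# preprocessor consumes native stack on deeply nested input. Python's
# RecursionError stands in for native stack exhaustion here.
def nesting_depth(text, pos=0):
    depth = 0
    while pos < len(text):
        if text.startswith("{{", pos):
            inner, pos = nesting_depth(text, pos + 2)  # one frame per level
            depth = max(depth, 1 + inner)
        elif text.startswith("}}", pos):
            return depth, pos + 2
        else:
            pos += 1
    return depth, pos

shallow = "{{a|{{b}}}}"
depth, _ = nesting_depth(shallow)  # 2 levels

deep = "{{x|" * 5000 + "y" + "}}" * 5000
try:
    nesting_depth(deep)
    blew_up = False
except RecursionError:  # stand-in for the native stack overflow
    blew_up = True
```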

Protecting as a security issue because PHP is segfaulting

It's hard to see any security aspect to stack exhaustion in our current infrastructure. The PHP-FPM worker will die and be restarted, which doesn't affect any other requests. It's faster to start up a new worker than to parse a megabyte of broken wikitext, so it's not a DoS vulnerability. It's faster to segfault than to keep running.

@Legoktm It's not diff-specific though. https://ckb.wikipedia.org/w/index.php?oldid=616224 has the same error.

Works for me. On preview it has the category "Pages where expansion depth is exceeded", which is the limit meant to prevent excessive recursion.

>>> $parser->parse($cnt->getText(),$page->getTitle(),ParserOptions::newFromAnon(),true,true,616224)

Works for me.

Can't reproduce even with tideways_xhprof. Probably the parser was fixed at some point.

sbassett subscribed.

Given the previous comment, is there any reason to keep this task protected at this point? I'm not seeing one.

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)".Jul 25 2023, 1:23 PM