Page MenuHomePhabricator

PHP Warning: preg_match(): Compilation failed: two named subpatterns have the same name at offset 62
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.1

message
PHP Warning: preg_match(): Compilation failed: two named subpatterns have the same name at offset 62

Impact

Notes

Details

Request ID
XaJqMwpAICoAAHe-eQwAAAAH
Request URL
/wiki/Wikimedia_Taiwan/wiki/index.php5/$1
Stack Trace
exception.trace
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /includes/PathRouter.php(311): preg_match(string, string, NULL)
#2 /includes/PathRouter.php(285): PathRouter::extractTitle(string, stdClass)
#3 /includes/PathRouter.php(260): PathRouter->internalParse(string)
#4 /includes/WebRequest.php(205): PathRouter->parse(string)
#5 /includes/WebRequest.php(361): WebRequest::getPathInfo(string)
#6 /includes/Setup.php(780): WebRequest->interpolateTitle()
#7 /includes/WebStart.php(81): require_once(string)
#8 /index.php(41): require(string)
#9 /srv/mediawiki/w/index.php(3): require(string)
#10 {main}
Related Changes in Gerrit:

Event Timeline

Quick glance:

  • The wgArticlePath is configured as /wiki/$1 which means anything after /wiki/ that isn't a query string, should be considered as the page title. And titles like "My_$1_coin" should (and generally, do) work fine.
  • The PathRouterTest cases also cover this case.
  • In PathRouter::extractTitle the code correctly uses preg_quote() to escape the configured router path, and never embeds or substitutes the user input URL into the regex pattern (Good).

Some ad-hoc debugging on mwdebug1002 from this function reveals it tries three configured router paths:

PathRouter::extractTitle
 path = /wiki/Wikimedia_Taiwan/wiki/index.php5/$1  pattern=/wiki/$1  SCRIPT_NAME=/wiki/Wikimedia_Taiwan/wiki/index.php5/$1

PathRouter::extractTitle:
 path = /wiki/Wikimedia_Taiwan/wiki/index.php5/$1  pattern=/w/index.php/$1  SCRIPT_NAME=/wiki/Wikimedia_Taiwan/wiki/index.php5/$1

PathRouter::extractTitle:
 path = /wiki/Wikimedia_Taiwan/wiki/index.php5/$1  pattern=/wiki/Wikimedia_Taiwan/wiki/index.php5/$1/$1  SCRIPT_NAME=/wiki/Wikimedia_Taiwan/wiki/index.php5/$1

This last one is the likely cause and appears to be due to something also feeding the url as router path, which seems bad. And worse, it doesn't escape it in any way.

I imagine this might also explain why requests for static.php are calling this path, and more over, trying to match for a /w/static.php/$1 path which definitely isn't meant to happen, e.g. from https://meta.wikimedia.org/w/skins/Vector/images/arrow-down.svg

PathRouter::extractTitle:
 path = /w/skins/Vector/images/arrow-down.svg  pattern=/wiki/$1  SCRIPT_NAME=/w/static.php

PathRouter::extractTitle:
 path = /w/skins/Vector/images/arrow-down.svg  pattern=/w/index.php/$1  SCRIPT_NAME=/w/static.php

PathRouter::extractTitle:
 path = /w/skins/Vector/images/arrow-down.svg  pattern=/w/static.php/$1  SCRIPT_NAME=/w/static.php
path = /wiki/Wikimedia_Taiwan/wiki/index.php5/$1  pattern=/wiki/$1  SCRIPT_NAME=/wiki/Wikimedia_Taiwan/wiki/index.php5/$1

That seems wrong. $_SERVER['SCRIPT_NAME'] is supposed to be the current script's path, without the pathinfo part.

I imagine this might also explain why requests for static.php are calling this path, and more over, trying to match for a /w/static.php/$1 path which definitely isn't meant to happen, e.g. from https://meta.wikimedia.org/w/skins/Vector/images/arrow-down.svg

It seems that's intentional in WebRequest's code.

if ( isset( $_SERVER['SCRIPT_NAME'] )
    && strpos( $_SERVER['SCRIPT_NAME'], '.php' ) !== false
) {
    // Check for SCRIPT_NAME, we handle index.php explicitly
    // But we do have some other .php files such as img_auth.php
    // Don't let root article paths clober the parsing for them
    $router->add( $_SERVER['SCRIPT_NAME'] . "/$1" );
}

That was added in rMWae1d5aefbf45: Update img_auth.php and WebRequest code to handle non index.php scripts like…, which mentions T34486: WebRequest::getPathInfo() broken in img_auth.php on DreamHost.

daniel triaged this task as High priority.Nov 11 2019, 8:06 PM

@D3r1ck01 - just for clarity, are you still proposing 565762: PathRouter: Allow duplicate names for subpatterns in RegEx as a fix for this one?

[…] this seems like it might obscure some other underlying issue, […]

I believe you're right. The regular expression seems fine. One URL to MediaWiki can only have 1 primary title, extracted from one $-placeholder in the url path-pattern. The code that consumes the result is not prepared to handle an array of multiple values, and there can't be any way to process such array, since that is fundmantally is how MediaWiki works.

The fact that we concatenate two strings and accidentally create another "$1" pattern is an accident. The solution will need to be in preventing that problem from reaching the regular expression (higher level, the input to the regex). The solution cannot be to let the problem sink deeper (lower level, the regex).

@D3r1ck01 - just for clarity, are you still proposing 565762: PathRouter: Allow duplicate names for subpatterns in RegEx as a fix for this one?

Yes! I saw a review you left, anything needed from my end ATM?

I note that something like this just happend in group1 for 1.35.0-wmf.22:

2020-03-04T17:07:19 on metawiki
url /wiki/Wikimedia_Taiwan/wiki/index.php5/$1
unique id Xl-gRwpAICEAAAVeBEIAAAEY

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.03.04/mediawiki?id=AXCmhCNgarkxubcm-g_o&_g=h@44136fa should be a direct link.

path = /wiki/Wikimedia_Taiwan/wiki/index.php5/$1  pattern=/wiki/$1  SCRIPT_NAME=/wiki/Wikimedia_Taiwan/wiki/index.php5/$1

That seems wrong. $_SERVER['SCRIPT_NAME'] is supposed to be the current script's path, without the pathinfo part.

I can reproduce this locally. PATH_INFO is not set, so SCRIPT_NAME just gets set to the same thing as REQUEST_URI. In ap_add_cgi_vars(), if r->path_info is set, the URI is split and SCRIPT_NAME receives the first part of the path, but it's not set.

mod_proxy_fcgi has some code for setting PATH_INFO, depending on the proxy-fcgi-pathinfo environment variable, but this code is not used when mod_proxy_fcgi is invoked via SetHandler. The proxy_fcgi_canon hook is never called.

Note that SCRIPT_NAME is the pre-rewrite URL, it does not even include the script name. Pretty much every usage of SCRIPT_NAME in core is broken by this.

Change 584817 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Stop using SCRIPT_NAME where possible, rely on statically configured routing

https://gerrit.wikimedia.org/r/584817

Change 584817 merged by jenkins-bot:
[mediawiki/core@master] Stop using SCRIPT_NAME where possible, rely on statically configured routing

https://gerrit.wikimedia.org/r/584817