Page MenuHomePhabricator

preg_replace regex bug in function sanitizeKeyForAPI in module FormatMetadata.php with huge performance impact
Closed, InvalidPublic

Description

Probably an undetected stupid and untested bug since release 1.23 with a huge performance impact on all systems, and extremely easy to solve.

Performance impact

  • Very large Apache error.log file (100s of GB, or worse depending on the webserver traffic)
  • 100% CPU per core
  • Very slow application

Root cause: bad placement of regex "-" code in character range

Functionally, the application continues to run, but with a major impact on system performance (CPU and error log) and the user experience. The problem occurs during the search, editing, saving, etc. of pages...

The problem seem to exist in all versions of Mediwiki since 1.23. It is very odd that this problem has not been discovered for years...

https://doc.wikimedia.org/mediawiki-core/master/php/classFormatMetadata.html#af64baf435520d2d4ee6c901e4c18196b

Errors in Apache error.log

PHP Warning: preg_replace(): Compilation failed: invalid range in character class at offset 4 in includes/media/FormatMetadata.php on line 1864

Patch

It can be easily solved by moving the - just before the closing ]

rcsdiff -u includes/media/FormatMetadata.php
RCS file: RCS/FormatMetadata.php,v
retrieving revision 1.1
diff -u -r1.1 FormatMetadata.php
--- FormatMetadata.php	2020/11/18 10:30:28	1.1
+++ FormatMetadata.php	2020/11/18 10:30:54
@@ -1877,7 +1877,7 @@
 		// be used so we take the easy way
 		$key = preg_replace( '/[^a-zA-z0-9_:.\-]/', '', $key );
 		// drop characters which are invalid at the first position
-		$key = preg_replace( '/^[\d\-.]+/', '', $key );
+		$key = preg_replace( '/^[\d\.-]+/', '', $key );
 		if ( $key == '' ) {
 			$key = '_';
 		} 
 		// special case for an internal keyword
-		if ( $key == '_element' ) {
+		elif ( $key == '_element' ) {
 			$key = 'element';
 		}

Event Timeline

I note elif is not valid PHP...

Reedy, this is not the core of the problem... the real problem is the bad regex construct (bad placement of the -)

The remark about elif was only a slight CPU-friendly optimisation.
elif is indeed a Perl construct.
Could be coded instead as a nested else ... if with {}.

Sure it's not the core problem, but you're providing a patch that won't work as it's not valid PHP. So it's added noise.

But elif is not valid PHP syntax. It does not lint on PHP 7.3

reedy@Reedys-MBP mediawiki % php -v
PHP 7.3.11 (cli) (built: Jun  5 2020 23:50:40) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.3.11, Copyright (c) 1998-2018 Zend Technologies
reedy@Reedys-MBP mediawiki % cat test.php 
<?php

$key = 'foo';

if ( $key == '' ) {
	$key = '_';
}
// special case for an internal keyword

elif ( $key == '_element' ) {
	$key = 'element';
}
reedy@Reedys-MBP mediawiki % php -l test.php

Parse error: syntax error, unexpected ';' in test.php on line 11
Errors parsing test.php

elseif would be fine, and indeed would make sense as an optimisation.

I'm confused by your patch. - is for a range (especially in a regex), and as such \- would escape it and make it literal.

What version of PHP? What version of PCRE?

If it was as prolific as you say, I'd presume we'd have seen it on Wikimedia wikis a long time ago

And if we test it, the current version is fine, no errors. If we remove that escaping of the - we get an error (as expected)... Sure, your version also works fine, but when the current one is fine... Why would we change it?

php > $key = preg_replace( '/^[\d\-.]+/', '', $key );
php > $key = preg_replace( '/^[\d-.]+/', '', $key );

Warning: preg_replace(): Compilation failed: invalid range in character class at offset 4 in php shell code on line 1
php > $key = preg_replace( '/^[\d.\-]+/', '', $key );
php > $key = preg_replace( '/^[\d.-]+/', '', $key );
php > $key = preg_replace( '/^[\d\.-]+/', '', $key );
php >

The line you're changing doesn't even seemingly to match the error.

PHP Warning: preg_replace(): Compilation failed: invalid range in character class at offset 4 in includes/media/FormatMetadata.php on line 1864

What version of MW are you running?

On MW-1.35-release 1864 is a blank line

https://github.com/wikimedia/mediawiki/blob/REL1_35/includes/media/FormatMetadata.php#L1864

It looks like you might be basically complaining about what was fixed in rMWd88e924b6e5a: Fix PHP warnings "preg_replace(): [...] invalid range in character class" which is/was line 1864

The fixed was merged into master for 1.32.0, and backported into 1.31 for 1.31.2.. Which would have been all supported versions of MediaWiki at the time

What version of MediaWiki are you using? Is it a supported version? And have you updated to the latest patch release for that version?

I am running a personal server on (a pre-release version of) MW 1.31 where indeed the . was not yet backslashed... so therefore I had to reverse the -. into .-

I indeed see that the final 1.31 release had the problem fixed as https://github.com/wikimedia/mediawiki/blob/REL1_31/includes/media/FormatMetadata.php#L1875.

I can still see the problem in https://github.com/wikimedia/mediawiki/blob/REL1_30/includes/media/FormatMetadata.php#L1859 (which does not harm since not supported any more).

I have another instance 1.31.7 running where I have seen that the \. construction was already applied - that is equally good, of course.

I plan to upgrade to 1.35 soon. Sorry for the confusion. Thanks for showing me the github link to verify the application code.

Here I repeat the correct patch for 1.31:

rcsdiff -u includes/media/FormatMetadata.php
RCS file: includes/media/RCS/FormatMetadata.php,v
retrieving revision 1.1
diff -u -r1.1 includes/media/FormatMetadata.php
--- includes/media/FormatMetadata.php	2020/11/18 09:29:23	1.1
+++ includes/media/FormatMetadata.php	2020/11/18 13:12:19
@@ -1861,14 +1861,14 @@
 		// be used so we take the easy way
 		$key = preg_replace( '/[^a-zA-z0-9_:.-]/', '', $key );
 		// drop characters which are invalid at the first position
-		$key = preg_replace( '/^[\d-.]+/', '', $key );
+		$key = preg_replace( '/^[\d.-]+/', '', $key );
 
 		if ( $key == '' ) {
 			$key = '_';
 		}
 
 		// special case for an internal keyword
-		if ( $key == '_element' ) {
+		elseif ( $key == '_element' ) {
 			$key = 'element';
 		}

I've put the swapping of the if to elseif into https://gerrit.wikimedia.org/r/c/mediawiki/core/+/641719 (but not tagged this task explicitly)

As the fix for the regex is already in 1.31, I don't think there's anything else further to do here?

Sam, thanks for your help. Everything set now...