Page MenuHomePhabricator

contains_any runs slow
Closed, InvalidPublic

Description

For some reason contains_any(S, s1, s2, … sN) tends to run much slower than the equivalent S rlike 's1|s2|…|sN'. With 20-30 words the difference in filter execution time can be 5 ms vs. 0.5 ms.
This should be further investigated and, if true in general, contains_any should internally do whatever rlike does by joining (rescape'd?) sN's.
Why I like contains_any better than rlike is because long regex patterns are too hard to read and debug, and while doing S rlike "s1|s2|s3|" + "s4|s5|" + "s6|s7" somewhat improves readability (one can group similar patterns/words), it's easy to make a mistake by forgetting one of the vertical bars, or adding one where it's not supposed to be.
Ideally, contains_any would accept an array of strings as its second argument so that the same word lists could be used several times (for removed and added lines, for example).

Event Timeline

I ran a quick and hacky benchmark:

<?php

use MediaWiki\Extension\AbuseFilter\AbuseFilterServices;
use MediaWiki\Extension\AbuseFilter\Parser\FilterEvaluator;
use MediaWiki\Extension\AbuseFilter\Variables\VariableHolder;
use Psr\Log\NullLogger;
use Wikimedia\Equivset\Equivset;

$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../..';
}
require_once "$IP/maintenance/Maintenance.php";

class TestBench extends Maintenance {
	/**
	 * @inheritDoc
	 */
	public function execute() {

		$haystack = wfRandomString( 5000 );
		$needles = array_map( fn ( $x ) => wfRandomString( 10 ), range( 1, 50 ) );

		$this->containsAnyRule = "contains_any( '$haystack', " . implode( ', ', array_map( fn ( $x ) => "'$x'", $needles ) ) . " )";
		$this->rlikeRule = "'$haystack' rlike '" . implode( '|', $needles ) . "'";
		$this->runWithCache( new EmptyBagOStuff() );
		$this->runWithCache( new HashBagOStuff() );
	}

	private function runWithCache( BagOStuff $cache ) {
		$cachedStr = $cache instanceof EmptyBagOStuff ? 'no cache' : 'cached';
		$evaluator = new FilterEvaluator(
			Language::factory( 'en' ),
			$cache,
			new NullLogger(),
			AbuseFilterServices::getKeywordsManager(),
			AbuseFilterServices::getVariablesManager(),
			new NullStatsdDataFactory(),
			new Equivset(),
			10**10,
			new VariableHolder()
		);
		$numRep = 10000;

		$start = hrtime( true );
		for ( $i = 0; $i < $numRep; $i++ ) {
			$evaluator->parse( $this->containsAnyRule );
		}
		$this->output( "contains_any ($cachedStr): " . ( ( hrtime( true ) - $start ) / 10**6 ) . "\n" );

		$start = hrtime( true );
		for ( $i = 0; $i < $numRep; $i++ ) {
			$evaluator->parse( $this->rlikeRule );
		}
		$this->output( "rlike ($cachedStr): " . ( ( hrtime( true ) - $start ) / 10**6 ) . "\n" );
	}
}

$maintClass = TestBench::class;
require_once RUN_MAINTENANCE_IF_MAIN;

The results:

contains_any (no cache): 3243.211878
rlike (no cache): 2279.308648
contains_any (cached): 794.011129
rlike (cached): 1737.911601

The rlike version seems faster without caching, but with caching, it's the opposite. I would kind of expect this, since the contains_any version has more AST nodes and the AST-construction step takes longer, but if the AST is cached, the evaluation is faster.

contains_any also needs to validate all arguments (ensure they're strings etc.), whereas rlike only does this for one argument, so contains_any being slower wouldn't be impossible, but the overhead of the regex engine seems to compensate. Changing contains_any to use regexps wouldn't help, because it would still have to validate all arguments, and then it would also need to regex-escape them, and then invoke the regex engine. It would definitely become slower.

That said, the benchmark above is very simple and does not take many factors into account (lengths of the strings, whether a match is found, etc.), so real-world results may vary. However, I don't think a normal 20-30 words scenario would show a 10x slowdown in normal conditions. The difference might be explained by other factors tied to that specific filter.

Thanks @Daimona. The execution times are what I typically see in filter statistics, though the numbers vary. All filters that I'm managing run in under 0.5 ms, but the one with a list of 30-something words is always above 2 ms, and on some days up to 5 ms. A filter with about 15 rlike invocations (short patterns - historical reasons) on a ccnorm'ed string runs in under 0.5 ms. A similar filter with a ccnorm_contains_any shows an average of nearly 3 ms today. Btw, is ccnorm_contains_any() just contains_any( ccnorm() )?
I don't know if this is much of a problem IRL. I see people trying to optimize their filters (make them 'green'?), so I though this might be something worth investigating further.

Thanks @Daimona. The execution times are what I typically see in filter statistics, though the numbers vary.

Those are a good starting point, but I doubt they'd be reliable for smaller-scale differences, due to noise.

All filters that I'm managing run in under 0.5 ms, but the one with a list of 30-something words is always above 2 ms, and on some days up to 5 ms. A filter with about 15 rlike invocations (short patterns - historical reasons) on a ccnorm'ed string runs in under 0.5 ms. A similar filter with a ccnorm_contains_any shows an average of nearly 3 ms today.

This might be caused by various factors, potentially including contains_any being slower than rlike in your specific case. It's hard to tell; I may be able to help if you tell me the ID of the slow filter.

I don't know if this is much of a problem IRL. I see people trying to optimize their filters (make them 'green'?), so I though this might be something worth investigating further.

We're talking about few milliseconds, so it's not a big deal. Faster is better, but in this case I think it's acceptable.

Btw, is ccnorm_contains_any() just contains_any( ccnorm() )?

It's equivalent to contains_any( ccnorm( $haystack ), ccnorm( $needle1 ), ..., ccnorm( $needleN ) ), but more performant (and more readable!).

I'm closing this as invalid per above. @Ponor and I have been discussing the specific filter (which is not publicly visible) in private channels. The gist of it is that the slowness is due to the usage of ccnorm_contains_any, which applies ccnorm to all its arguments, whereas the version with rlike only applied it to the haystack.