Page MenuHomePhabricator

Improve TitleValue, TitleFactory, TitleParser performance to be on-par with Title
Closed, ResolvedPublic

Description

As demonstrated in https://gerrit.wikimedia.org/r/452278, TitleValue and it's related services are significantly slower than Title. As we move towards using it and phasing out Title, we have to fix the performance of it to be on-par with Title.

Event Timeline

Some notes from IRC with @tstarling:

[22:30:25] <legoktm> TimStarling: I have some ideas on how to optimize getPrefixedText()/parseTitleString(), but I'm not really sure what to do about all the Assert:: stuff besides re-implementing the makeTitle/makeTitleSafe distinction for TitleValue, where only the latter will do all the sanity checks
[22:33:04] <TimStarling> the Assert::parameterType() calls could probably be replaced by scalar type hints
[22:33:21] <TimStarling> maybe the effect is not exactly the same but it's not really important
[22:33:55] <TimStarling> the regex thing should definitely be removed
[22:34:15] <TimStarling> the last one, $dbkey !== '', could be inlined
[22:34:33] <TimStarling> if ( $dbkey==='' ) { throw ... }
[22:34:35] <legoktm> we can't use scalar type hints until we drop HHVM support
[22:38:08] <TimStarling> well, all of them could be profitably inlined, it would be verbose but faster
...
[22:41:21] <TimStarling> it's obviously faster to use is_int() than Assert::parameterType(), look at the implementation of the latter
[22:41:57] <TimStarling> not that anyone cares if an integer is passed as a float, coercing is the right thing to do in that case

not that anyone cares if an integer is passed as a float […]

I would like to disagree with that. Let's please not lightly give away the chance to stay clean when switching over from a messy (5000+ lines) infrastructure to a clean one.

Here is a possible rewrite for the TitleValue constructor:

if ( !is_int( $namespace )
	|| !is_string( $dbkey )
	|| !is_string( $fragment )
	|| !is_string( $interwiki )
) {
	throw new InvalidArgumentException( 'Unexpected parameter type' );
}

// Sanity check, no full validation or normalization applied here!
if ( $dbkey === ''
	|| $dbkey[0] === '_'
	|| substr( $dbkey, -1 ) === '_'
	|| !ctype_graph( $dbkey )
) {
	throw new InvalidArgumentException( "Invalid DB key '$dbkey'" );
}

Change 452887 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/core@master] TitleValue: Don't use Assert for basic type checks

https://gerrit.wikimedia.org/r/452887

The above patch makes the TitleValue constructor twice as fast as Title::makeTitle in my benchmarking. I'm OK with that for now (though it seems clear there's more room for optimization).

Up next is TitleParser::parseTitle() vs Title::newFromText(). I'm leaning towards finishing all the TODOs and unifying secureAndSplit and splitTitleString (and porting Title's perf optimizations), because they're pretty close to being identical, but differ in subtle ways that don't make me surprised that the implementor didn't want to deal with them :)

Change 452887 merged by jenkins-bot:
[mediawiki/core@master] TitleValue: Don't use Assert for basic type checks

https://gerrit.wikimedia.org/r/452887

Change 453090 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/core@master] Remove slow regular expression from TitleValue constructor

https://gerrit.wikimedia.org/r/453090

Change 453555 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/core@master] Optimize TitleFormatter::getPrefixedText() to Title performance

https://gerrit.wikimedia.org/r/453555

Change 453555 merged by jenkins-bot:
[mediawiki/core@master] Optimize TitleFormatter::getPrefixedText() to Title performance

https://gerrit.wikimedia.org/r/453555

Change 453090 abandoned by Thiemo Kreuz (WMDE):
Remove slow regular expression from TitleValue constructor

Reason:
Thanks for the benchmark script, that's really nice. I played with it for a while and tried to understand better if there is even a problem with this regex. It turns out: No, this regex performs absolutely fine with all inputs. Even the /_$/ part at the end does not cause any slow backtracking, as I feared before.

The difference I can see on my (a little outdated) PHP 5 system is still in favor of the "str" approach, but so small, it's absolutely not worth it.

https://gerrit.wikimedia.org/r/453090

WDoranWMF claimed this task.
WDoranWMF subscribed.

Resolving but please feel free to reopen if still occurring.