
Timeout in testApiMethods calling mediawiki_messages
Open, Low, Public

Description

Two failures in a row, both on ar.wiktionary:

https://travis-ci.org/wikimedia/pywikibot-core/jobs/78662177#L4606
https://travis-ci.org/wikimedia/pywikibot-core/jobs/78674956#L4609

The next build passed, so the problem may have been heavy load.

Event Timeline

jayvdb raised the priority of this task to Needs Triage.
jayvdb updated the task description.
jayvdb added projects: Pywikibot-tests, Pywikibot.
jayvdb subscribed.

Another one, also on ar.wiktionary.
https://travis-ci.org/jayvdb/pywikibot-core/jobs/78687606#L4632

The test is calling mysite.mediawiki_messages('*'),

which invokes https://ar.wiktionary.org/w/api.php?action=query&meta=allmessages&ammessages=*&amlang=ar

which is not the nicest query to run: it fetches about 5.55 MB and takes around 50 seconds to a minute for me.

The slowness of this query seems to be specific to Arabic Wiktionary.

I guess it could be a site-specific problem, e.g. lots of custom messages in the MediaWiki: namespace?

IMO pywikibot-core could avoid this problem by rewriting the unit test so it doesn't test '*', and even deprecating '*' as a valid way to download all messages. A caller should always specify which messages they need.

Change 236245 had a related patch set uploaded (by John Vandenberg):
Deprecate fetching all mediawiki_messages using *

https://gerrit.wikimedia.org/r/236245

Change 236245 merged by jenkins-bot:
Deprecate fetching all mediawiki_messages using *

https://gerrit.wikimedia.org/r/236245

jayvdb triaged this task as Low priority. Sep 7 2015, 9:31 AM
jayvdb removed a project: Patch-For-Review.
jayvdb set Security to None.
Anomie added subscribers: ori, Anomie.

After digging deep into this issue, I found that the slowness for Arabic-language wikis compared to other languages is due to LanguageAr::normalize() being relatively slow, multiplied by almost 100000 calls.
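
To make the hot path concrete, here is a minimal sketch of what that amounts to (illustrative only, not the actual MediaWiki source; $allMessageKeys is a hypothetical stand-in for the roughly 100000 strings passing through LanguageAr::normalize() during this query):

$pairs = unserialize( file_get_contents( "serialized/normalize-ar.ser" ) );
foreach ( $allMessageKeys as $key ) {
    // With plain strtr, PHP reprocesses the pair array on every call,
    // so the per-call setup cost is paid ~100000 times.
    $normalized = strtr( $key, $pairs );
}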

Digging a little deeper: on PHP 5.5.9-1ubuntu4.11, using FastStringSearch instead of strtr for ReplacementArray is a fair bit faster with the replacement array from serialized/normalize-ar.ser. I can't compare the speeds under HHVM due to T101418. rMWbdb17a79a4bc, which landed shortly before this task was opened, is almost certainly the immediate cause.

@ori: This sounds like something you'd want to look into. My most-reduced test cases are:

// Load the serialized Arabic normalization replacement pairs.
$data = unserialize( file_get_contents( "/srv/mediawiki/php-1.26wmf21/serialized/normalize-ar.ser" ) );
// Prepare the FSS search structure once, outside the loop.
$fss = fss_prep_replace( $data );
for ( $i = 0; $i < 1000; $i++ ) {
    fss_exec_replace( $fss, "foo" );
}

versus

$data = unserialize( file_get_contents( "/srv/mediawiki/php-1.26wmf21/serialized/normalize-ar.ser" ) );
for ( $i = 0; $i < 1000; $i++ ) {
    // strtr gets the raw pair array each time, with no reusable prep step.
    strtr( "foo", $data );
}

The former takes about 0.034 seconds while the latter takes 0.422 seconds when run with time php5 on mw1017. Increasing the number of iterations to 100000 (roughly the number of calls the API query here makes), FSS goes to 0.074s while strtr jumps to over 20s. HHVM's behavior with strtr is in line with Zend PHP's.

The advantage for FSS seems to be mainly due to the ability to do the fss_prep_replace() once, where strtr (presumably) has to do the equivalent on every call. Moving fss_prep_replace() inside the loop brings the FSS version up to around 18s for 100000 iterations.
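
For reference, a minimal sketch of that prep-inside-the-loop variant (illustrative; the path and workload mirror the snippets above):

$data = unserialize( file_get_contents( "/srv/mediawiki/php-1.26wmf21/serialized/normalize-ar.ser" ) );
for ( $i = 0; $i < 100000; $i++ ) {
    // Rebuilding the FSS structure on every iteration discards FSS's main
    // advantage, which is why this lands near strtr's ~20s rather than 0.074s.
    $fss = fss_prep_replace( $data );
    fss_exec_replace( $fss, "foo" );
}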