Page MenuHomePhabricator

Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s
Closed, ResolvedPublic

Description

While investigating T354222: reassignMenteesJob is not able to finish in time when a mentor has too many mentees assigned, I noticed that one of the Growth team's jobs, reassignMenteesJob, is getting terminated after 200 seconds (Logstash):

Jan 2, 2024 @ 20:31:27.773 Exception executing job: reassignMenteesJob Special: mentorId=30481996 performerId=36735309 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message-removed reassignMessageAdditionalParams=["Martin Urbanec (WMF)"] requestId=0e019bc7-fe10-4702-97a3-95e530fadcb9 : Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded

from /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(59)
#0 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerRequestTimeout.php(34): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->exitCriticalSection(integer)
#1 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionProvider.php(134): Wikimedia\RequestTimeout\Detail\ExcimerRequestTimeout->exitCriticalSection(integer)
#2 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(57): Wikimedia\RequestTimeout\CriticalSectionProvider->Wikimedia\RequestTimeout\{closure}(integer)
#3 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(3150): Wikimedia\RequestTimeout\CriticalSectionScope->exit()
#4 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(725): Wikimedia\Rdbms\Database->completeCriticalSection(string, Wikimedia\RequestTimeout\CriticalSectionScope)
#5 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(640): Wikimedia\Rdbms\Database->executeQuery(Wikimedia\Rdbms\Query, string, integer)
#6 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1347): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#7 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1356): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#8 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#9 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#10 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(756): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#11 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/UserSelectQueryBuilder.php(262): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#12 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/ActorStore.php(259): MediaWiki\User\UserSelectQueryBuilder->fetchUserIdentity()
#13 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/Provider/StructuredMentorProvider.php(175): MediaWiki\User\ActorStore->getUserIdentityByUserId(integer)
#14 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/MentorPageMentorManager.php(231): GrowthExperiments\Mentorship\Provider\StructuredMentorProvider->getWeightedAutoAssignedMentors()
#15 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMentees.php(140): GrowthExperiments\Mentorship\MentorPageMentorManager->getRandomAutoAssignedMentor(MediaWiki\User\UserIdentityValue)
#16 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMenteesJob.php(88): GrowthExperiments\Mentorship\ReassignMentees->doReassignMentees(string, string)
#17 /srv/mediawiki/php-1.42.0-wmf.10/extensions/EventBus/includes/JobExecutor.php(80): GrowthExperiments\Mentorship\ReassignMenteesJob->run()
#18 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#19 {main}

The server field in Logstash is mw-jobrunner.discovery.wmnet for this job, but jobrunner.discovery.wmnet for the non-k8s jobs (see JobExecutor logs in Logstash). If I'm understanding things correctly, the following snippet from CommonSettings.php in our config will set $wgRequestTimeLimit to 1200 seconds while in the non-k8s jobrunner and to 200 seconds while in the k8s jobrunner:

switch ( $_SERVER['HTTP_HOST'] ?? '' ) {
        case 'videoscaler.svc.eqiad.wmnet':
        case 'videoscaler.svc.codfw.wmnet':
        case 'videoscaler.discovery.wmnet':
                $wgRequestTimeLimit = 86400;
                break;

        case 'jobrunner.svc.eqiad.wmnet':
        case 'jobrunner.svc.codfw.wmnet':
        case 'jobrunner.discovery.wmnet':
                $wgRequestTimeLimit = 1200;
                break;

        default:
                if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) {
                        $wgRequestTimeLimit = 200;
                } else {
                        $wgRequestTimeLimit = 60;
                }
}

For more details regarding my findings, please see T354222.

It doesn't seem logical to me to have different time limits in different jobrunners. Can someone clarify whether this is an intentional change or a side-effect of the MW-on-K8s project?

Event Timeline

Yes, your understanding is correct; I had a patch fixing this that never got merged, I should just make a new version of that.

Change 987400 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] Fix timeouts detection on mw on k8s jobrunners

https://gerrit.wikimedia.org/r/987400

Change 987400 merged by jenkins-bot:

[operations/mediawiki-config@master] Fix timeouts detection on mw on k8s jobrunners

https://gerrit.wikimedia.org/r/987400

Mentioned in SAL (#wikimedia-operations) [2024-01-03T11:36:05Z] <oblivian@deploy2002> Started scap: Backport for [[gerrit:987400|Fix timeouts detection on mw on k8s jobrunners (T354229)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-03T11:37:48Z] <oblivian@deploy2002> oblivian: Backport for [[gerrit:987400|Fix timeouts detection on mw on k8s jobrunners (T354229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-03T11:47:43Z] <oblivian@deploy2002> Finished scap: Backport for [[gerrit:987400|Fix timeouts detection on mw on k8s jobrunners (T354229)]] (duration: 11m 38s)

Urbanecm_WMF assigned this task to Joe.