While investigating T354222: reassignMenteesJob is not able to finish in time when a mentor has too many mentees assigned, I noticed that one of the Growth team's jobs, reassignMenteesJob, is getting terminated after 200 seconds (Logstash):
Jan 2, 2024 @ 20:31:27.773 Exception executing job: reassignMenteesJob Special: mentorId=30481996 performerId=36735309 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message-removed reassignMessageAdditionalParams=["Martin Urbanec (WMF)"] requestId=0e019bc7-fe10-4702-97a3-95e530fadcb9 : Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded from /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(59) #0 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerRequestTimeout.php(34): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->exitCriticalSection(integer) #1 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionProvider.php(134): Wikimedia\RequestTimeout\Detail\ExcimerRequestTimeout->exitCriticalSection(integer) #2 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(57): Wikimedia\RequestTimeout\CriticalSectionProvider->Wikimedia\RequestTimeout\{closure}(integer) #3 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(3150): Wikimedia\RequestTimeout\CriticalSectionScope->exit() #4 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(725): Wikimedia\Rdbms\Database->completeCriticalSection(string, Wikimedia\RequestTimeout\CriticalSectionScope) #5 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(640): Wikimedia\Rdbms\Database->executeQuery(Wikimedia\Rdbms\Query, string, integer) #6 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1347): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string) #7 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1356): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array) #8 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array) #9 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call(string, array) #10 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(756): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array) #11 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/UserSelectQueryBuilder.php(262): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow() #12 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/ActorStore.php(259): MediaWiki\User\UserSelectQueryBuilder->fetchUserIdentity() #13 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/Provider/StructuredMentorProvider.php(175): MediaWiki\User\ActorStore->getUserIdentityByUserId(integer) #14 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/MentorPageMentorManager.php(231): GrowthExperiments\Mentorship\Provider\StructuredMentorProvider->getWeightedAutoAssignedMentors() #15 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMentees.php(140): GrowthExperiments\Mentorship\MentorPageMentorManager->getRandomAutoAssignedMentor(MediaWiki\User\UserIdentityValue) #16 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMenteesJob.php(88): GrowthExperiments\Mentorship\ReassignMentees->doReassignMentees(string, string) #17 /srv/mediawiki/php-1.42.0-wmf.10/extensions/EventBus/includes/JobExecutor.php(80): GrowthExperiments\Mentorship\ReassignMenteesJob->run() #18 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array) #19 {main}
The server field in Logstash is mw-jobrunner.discovery.wmnet for this job, but jobrunner.discovery.wmnet for the non-k8s jobs (see JobExecutor logs in Logstash). If I'm understanding things correctly, the following snippet from CommonSettings.php in our config will set $wgRequestTimeLimit to 1200 seconds while in the non-k8s jobrunner and to 200 seconds while in the k8s jobrunner:
switch ( $_SERVER['HTTP_HOST'] ?? '' ) { case 'videoscaler.svc.eqiad.wmnet': case 'videoscaler.svc.codfw.wmnet': case 'videoscaler.discovery.wmnet': $wgRequestTimeLimit = 86400; break; case 'jobrunner.svc.eqiad.wmnet': case 'jobrunner.svc.codfw.wmnet': case 'jobrunner.discovery.wmnet': $wgRequestTimeLimit = 1200; break; default: if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) { $wgRequestTimeLimit = 200; } else { $wgRequestTimeLimit = 60; } }
For more details regarding my findings, please see T354222.
It doesn't seem logical to me to have different time limits in different jobrunners. Can someone clarify whether this is an intentional change or a side-effect of the MW-on-K8s project?