Page MenuHomePhabricator

reassignMenteesJob is not able to finish in time when a mentor has too many mentees assigned
Closed, ResolvedPublic

Description

Filling as a long-term solution counterpart to the T354220: User:FormalDude quit from mentorship, but their mentees were not reassigned issue. When a mentor has a lot of mentees assigned, the reassignMenteesJob is unable to finish in time and logs the following exception instead (Logstash):

Jan 2, 2024 @ 20:31:27.773 Exception executing job: reassignMenteesJob Special: mentorId=30481996 performerId=36735309 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message-removed reassignMessageAdditionalParams=["Martin Urbanec (WMF)"] requestId=0e019bc7-fe10-4702-97a3-95e530fadcb9 : Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded

from /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(59)
#0 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerRequestTimeout.php(34): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->exitCriticalSection(integer)
#1 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionProvider.php(134): Wikimedia\RequestTimeout\Detail\ExcimerRequestTimeout->exitCriticalSection(integer)
#2 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(57): Wikimedia\RequestTimeout\CriticalSectionProvider->Wikimedia\RequestTimeout\{closure}(integer)
#3 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(3150): Wikimedia\RequestTimeout\CriticalSectionScope->exit()
#4 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(725): Wikimedia\Rdbms\Database->completeCriticalSection(string, Wikimedia\RequestTimeout\CriticalSectionScope)
#5 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(640): Wikimedia\Rdbms\Database->executeQuery(Wikimedia\Rdbms\Query, string, integer)
#6 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1347): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#7 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1356): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#8 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#9 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#10 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(756): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#11 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/UserSelectQueryBuilder.php(262): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#12 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/ActorStore.php(259): MediaWiki\User\UserSelectQueryBuilder->fetchUserIdentity()
#13 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/Provider/StructuredMentorProvider.php(175): MediaWiki\User\ActorStore->getUserIdentityByUserId(integer)
#14 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/MentorPageMentorManager.php(231): GrowthExperiments\Mentorship\Provider\StructuredMentorProvider->getWeightedAutoAssignedMentors()
#15 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMentees.php(140): GrowthExperiments\Mentorship\MentorPageMentorManager->getRandomAutoAssignedMentor(MediaWiki\User\UserIdentityValue)
#16 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMenteesJob.php(88): GrowthExperiments\Mentorship\ReassignMentees->doReassignMentees(string, string)
#17 /srv/mediawiki/php-1.42.0-wmf.10/extensions/EventBus/includes/JobExecutor.php(80): GrowthExperiments\Mentorship\ReassignMenteesJob->run()
#18 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#19 {main}

As a result, the now-former mentor's mentees stay assigned to them, which leads to unwanted questions appearing at the former mentor's talk page.

Within this task, we should find a long-term solution to ensure reassignMenteesJob is able to process mentors with any amount of mentees assigned.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The code responsible for the reassignment is in ReassignMentees::doReassignMentees, and it is wrapped by IDatabase::lock and IDatabase::unlock. At first, I thought that might be the problem (not sure how long such locks stay on), but it doesn't appear to be the case, for a couple of reasons:

  1. manually running extensions/GrowthExperiments/maintenance/reassignMentees.php from mwmaint2002 works just fine (for more than 200 seconds), and it runs the same underlying code,
  2. refreshLinkRecommendations.php, which uses the same locking mechanism, works fine

After I ruled out the first hypothesis, I checked how the $wgRequestTimeLimit variable is configured in Wikimedia production. The request time limit is defined based on HTTP_HOST as follows:

switch ( $_SERVER['HTTP_HOST'] ?? '' ) {
        case 'videoscaler.svc.eqiad.wmnet':
        case 'videoscaler.svc.codfw.wmnet':
        case 'videoscaler.discovery.wmnet':
                $wgRequestTimeLimit = 86400;
                break;

        case 'jobrunner.svc.eqiad.wmnet':
        case 'jobrunner.svc.codfw.wmnet':
        case 'jobrunner.discovery.wmnet':
                $wgRequestTimeLimit = 1200;
                break;

        default:
                if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) {
                        $wgRequestTimeLimit = 200;
                } else {
                        $wgRequestTimeLimit = 60;
                }
}

The 200 timeout that ended up being used for this job suspiciously matches the default value for POST requests. According to the Logstash message, the job ran within the k8s cluster.

According to JobExecutor logs in Logstash, server (which I presume matches HTTP_HOST in the switch above) is jobrunner.discovery.wmnet for the non-k8s jobrunner, but mw-jobrunner.discovery.wmnet for the k8s one. If I'm interpreting this correctly, jobs running in the k8s jobrunner have $wgRequestTimeLimit = 200;, but jobs running in the old cluster have a much longer limit, 1200 seconds.

I'm still confused by the fact that the job was scheduled and started, but did not perform any write actions. Does reaching the timeout cause a rollback in some way that I'm not understanding? Or is the job progressing very slowly when in k8s, so that it is unable to get to the first writing action by the time the allocated 200 seconds pass? The debug logs for GrowthExperiments would likely give us the answer, but we do not log those in production.

That being said, since the reassignMentees.php maintenance script I invoked for FormalDude (T354220#9431083) is running for more than 1200 seconds, this (likely) difference in timeouts within the k8s world is interesting, but would not fix this issue. In any case, I filled those findings as T354229.


It's clear that we need to introduce some sort of batching to reassignMenteesJob to make it work for an arbitrary amount of mentees. We would likely need to split the job into multiple jobs (each responsible for processing one batch of mentees). However, I'm unsure how to do that with the IDatabase::lock locks. I believe the locks are still required, as we implemented them as part of T322374 to prevent reassignMenteesJob from running twice for a single user, resulting in duplicate reassignment (which made no sense at the end).

Maybe we can define a memcached key for each mentor, which would contained the number of running unfinished jobs, and use that instead as the locking mechanism? Or is there a different way to orchestrate the jobs, possibly at the changeprop level?

@Tgr Can you take a look and advise what would be the best fix here?

Does reaching the timeout cause a rollback in some way that I'm not understanding?

Yeah, with the rare exception of jobs with the JOB_NO_EXPLICIT_TRX_ROUND flag, JobRunner::doExecuteJob() will wrap the job in a transaction so unless it returns successfully and the transaction gets committed, nothing is written into the DB. This is similar to the behavior of most other MediaWiki code, maintenance scripts are the only major exception IIRC.

I'm a bit confused about locking - shouldn't job deduplication already prevent that problem?
I suppose you could use the main stash and BagOStuff::lock() or add a lock flag to the mentor table but it feels like overkill.

Does reaching the timeout cause a rollback in some way that I'm not understanding?

Yeah, with the rare exception of jobs with the JOB_NO_EXPLICIT_TRX_ROUND flag, JobRunner::doExecuteJob() will wrap the job in a transaction so unless it returns successfully and the transaction gets committed, nothing is written into the DB. This is similar to the behavior of most other MediaWiki code, maintenance scripts are the only major exception IIRC.

Aha! That makes sense, thanks for the clarification.

I'm a bit confused about locking - shouldn't job deduplication already prevent that problem?

Good question. As a response to the T322374 bug report, I added locks and enabled job deduplication, so I'm not sure whether job deduplication would be sufficient on its own. Maybe this is a good time to try and see?

I suppose you could use the main stash and BagOStuff::lock() or add a lock flag to the mentor table but it feels like overkill.

Thanks for the suggestions. Unfortunately, I'm not sure how would BagOStuff::lock() help with detection whether a job belongs to the same or a different batch.

You'd have a "has lock" flag in the job specification, the job would start without the flag, acquire the lock, do a batch, then reschedule itself with the flag set to do the next batch. The final batch would release the lock.

Urbanecm_WMF moved this task from Inbox to Backlog on the Growth-Team board.

I suspect a long amount of the 200s was "getting the set of mentors", which used to take over a minute. With T375784 fixed, this should hopefully be less prevalent.