
reassignMenteesJob is not able to finish in time when a mentor has too many mentees assigned
Open, Medium, Public

Description

Filing this as the long-term-solution counterpart to the T354220: User:FormalDude quit from mentorship, but their mentees were not reassigned issue. When a mentor has a lot of mentees assigned, the reassignMenteesJob is unable to finish in time and logs the following exception instead (Logstash):

Jan 2, 2024 @ 20:31:27.773 Exception executing job: reassignMenteesJob Special: mentorId=30481996 performerId=36735309 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message-removed reassignMessageAdditionalParams=["Martin Urbanec (WMF)"] requestId=0e019bc7-fe10-4702-97a3-95e530fadcb9 : Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 200 seconds was exceeded

from /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(59)
#0 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/Detail/ExcimerRequestTimeout.php(34): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->exitCriticalSection(integer)
#1 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionProvider.php(134): Wikimedia\RequestTimeout\Detail\ExcimerRequestTimeout->exitCriticalSection(integer)
#2 /srv/mediawiki/php-1.42.0-wmf.10/vendor/wikimedia/request-timeout/src/CriticalSectionScope.php(57): Wikimedia\RequestTimeout\CriticalSectionProvider->Wikimedia\RequestTimeout\{closure}(integer)
#3 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(3150): Wikimedia\RequestTimeout\CriticalSectionScope->exit()
#4 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(725): Wikimedia\Rdbms\Database->completeCriticalSection(string, Wikimedia\RequestTimeout\CriticalSectionScope)
#5 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(640): Wikimedia\Rdbms\Database->executeQuery(Wikimedia\Rdbms\Query, string, integer)
#6 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1347): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#7 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/Database.php(1356): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#8 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#9 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#10 /srv/mediawiki/php-1.42.0-wmf.10/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(756): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#11 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/UserSelectQueryBuilder.php(262): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#12 /srv/mediawiki/php-1.42.0-wmf.10/includes/user/ActorStore.php(259): MediaWiki\User\UserSelectQueryBuilder->fetchUserIdentity()
#13 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/Provider/StructuredMentorProvider.php(175): MediaWiki\User\ActorStore->getUserIdentityByUserId(integer)
#14 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/MentorPageMentorManager.php(231): GrowthExperiments\Mentorship\Provider\StructuredMentorProvider->getWeightedAutoAssignedMentors()
#15 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMentees.php(140): GrowthExperiments\Mentorship\MentorPageMentorManager->getRandomAutoAssignedMentor(MediaWiki\User\UserIdentityValue)
#16 /srv/mediawiki/php-1.42.0-wmf.10/extensions/GrowthExperiments/includes/Mentorship/ReassignMenteesJob.php(88): GrowthExperiments\Mentorship\ReassignMentees->doReassignMentees(string, string)
#17 /srv/mediawiki/php-1.42.0-wmf.10/extensions/EventBus/includes/JobExecutor.php(80): GrowthExperiments\Mentorship\ReassignMenteesJob->run()
#18 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#19 {main}

As a result, the now-former mentor's mentees stay assigned to them, which leads to unwanted questions appearing on the former mentor's talk page.

Within this task, we should find a long-term solution that ensures reassignMenteesJob is able to process mentors with any number of mentees assigned.

Event Timeline


The code responsible for the reassignment is in ReassignMentees::doReassignMentees, and it is wrapped in IDatabase::lock / IDatabase::unlock (see the sketch below). At first, I thought that might be the problem (not sure how long such locks persist), but it doesn't appear to be the case, for a couple of reasons:

  1. manually running extensions/GrowthExperiments/maintenance/reassignMentees.php from mwmaint2002 works just fine (for more than 200 seconds), and it runs the same underlying code,
  2. refreshLinkRecommendations.php, which uses the same locking mechanism, works fine
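
For context, the current structure is roughly the following. This is a simplified, illustrative sketch rather than the actual ReassignMentees code; $this->loadBalancer, $menteeIds, the lock name and reassignSingleMentee() are placeholders:

// Simplified sketch of the lock wrapping (illustrative, not the actual code):
// the whole reassignment runs under a single advisory DB lock, inside a single
// request time limit.
$dbw = $this->loadBalancer->getConnection( DB_PRIMARY );
$lockName = 'GrowthExperiments-ReassignMentees-' . $mentor->getId();

if ( !$dbw->lock( $lockName, __METHOD__, 10 ) ) {
	// another process is already reassigning this mentor's mentees
	return false;
}
try {
	foreach ( $menteeIds as $menteeId ) {
		// pick a new random auto-assigned mentor for each mentee; with
		// thousands of mentees, this loop alone can take several minutes
		$this->reassignSingleMentee( $menteeId );
	}
} finally {
	$dbw->unlock( $lockName, __METHOD__ );
}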

After I ruled out the first hypothesis, I checked how the $wgRequestTimeLimit variable is configured in Wikimedia production. The request time limit is defined based on HTTP_HOST as follows:

switch ( $_SERVER['HTTP_HOST'] ?? '' ) {
        case 'videoscaler.svc.eqiad.wmnet':
        case 'videoscaler.svc.codfw.wmnet':
        case 'videoscaler.discovery.wmnet':
                $wgRequestTimeLimit = 86400;
                break;

        case 'jobrunner.svc.eqiad.wmnet':
        case 'jobrunner.svc.codfw.wmnet':
        case 'jobrunner.discovery.wmnet':
                $wgRequestTimeLimit = 1200;
                break;

        default:
                if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) {
                        $wgRequestTimeLimit = 200;
                } else {
                        $wgRequestTimeLimit = 60;
                }
}

The 200-second timeout that ended up being used for this job suspiciously matches the default value for POST requests. According to the Logstash message, the job ran within the k8s cluster.

According to the JobExecutor logs in Logstash, the server field (which I presume matches HTTP_HOST in the switch above) is jobrunner.discovery.wmnet for the non-k8s jobrunner, but mw-jobrunner.discovery.wmnet for the k8s one. Since mw-jobrunner.discovery.wmnet matches none of the cases in the switch, it falls through to the default branch. If I'm interpreting this correctly, jobs running in the k8s jobrunner get $wgRequestTimeLimit = 200; (the POST default), while jobs running in the old cluster have a much longer limit of 1200 seconds.

I'm still confused by the fact that the job was scheduled and started, but did not perform any write actions. Does reaching the timeout cause a rollback in some way that I'm not understanding? Or is the job progressing so slowly in k8s that it is unable to reach the first write action before the allocated 200 seconds pass? The debug logs for GrowthExperiments would likely give us the answer, but we do not log those in production.

That being said, since the reassignMentees.php maintenance script I invoked for FormalDude (T354220#9431083) has been running for more than 1200 seconds, this (likely) difference in timeouts within the k8s world is interesting, but fixing it alone would not resolve this issue. In any case, I filed those findings as T354229.


It's clear that we need to introduce some sort of batching to reassignMenteesJob to make it work for an arbitrary number of mentees. We would likely need to split the job into multiple jobs (each responsible for processing one batch of mentees). However, I'm unsure how to do that with the IDatabase::lock locks. I believe the locks are still required, as we implemented them as part of T322374 to prevent reassignMenteesJob from running twice for a single user, which resulted in duplicate reassignments that made no sense.

Maybe we can define a memcached key for each mentor, which would contain the number of running unfinished jobs, and use that as the locking mechanism instead? Or is there a different way to orchestrate the jobs, possibly at the changeprop level?

@Tgr Can you take a look and advise what would be the best fix here?

Does reaching the timeout cause a rollback in some way that I'm not understanding?

Yeah, with the rare exception of jobs with the JOB_NO_EXPLICIT_TRX_ROUND flag, JobRunner::doExecuteJob() will wrap the job in a transaction so unless it returns successfully and the transaction gets committed, nothing is written into the DB. This is similar to the behavior of most other MediaWiki code, maintenance scripts are the only major exception IIRC.
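
In other words, the flow is roughly this (a sketch of the pattern, not the actual JobRunner code):

// Rough illustration of the transaction-round behaviour described above
// (not the actual JobRunner::doExecuteJob() code):
$lbFactory->beginPrimaryChanges( __METHOD__ );
try {
	$success = $job->run();
	if ( $success !== false ) {
		// only a successful run gets its writes committed
		$lbFactory->commitPrimaryChanges( __METHOD__ );
	} else {
		$lbFactory->rollbackPrimaryChanges( __METHOD__ );
	}
} catch ( Throwable $e ) {
	// a RequestTimeoutException lands here, so nothing is written to the DB
	$lbFactory->rollbackPrimaryChanges( __METHOD__ );
	throw $e;
}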

I'm a bit confused about locking - shouldn't job deduplication already prevent that problem?
I suppose you could use the main stash and BagOStuff::lock() or add a lock flag to the mentor table but it feels like overkill.

Does reaching the timeout cause a rollback in some way that I'm not understanding?

Yeah, with the rare exception of jobs with the JOB_NO_EXPLICIT_TRX_ROUND flag, JobRunner::doExecuteJob() will wrap the job in a transaction so unless it returns successfully and the transaction gets committed, nothing is written into the DB. This is similar to the behavior of most other MediaWiki code, maintenance scripts are the only major exception IIRC.

Aha! That makes sense, thanks for the clarification.

I'm a bit confused about locking - shouldn't job deduplication already prevent that problem?

Good question. As a response to the T322374 bug report, I added locks and enabled job deduplication, so I'm not sure whether job deduplication would be sufficient on its own. Maybe this is a good time to try and see?

I suppose you could use the main stash and BagOStuff::lock() or add a lock flag to the mentor table but it feels like overkill.

Thanks for the suggestions. Unfortunately, I'm not sure how BagOStuff::lock() would help with detecting whether a job belongs to the same or a different batch.

You'd have a "has lock" flag in the job specification, the job would start without the flag, acquire the lock, do a batch, then reschedule itself with the flag set to do the next batch. The final batch would release the lock.
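
A rough, hypothetical sketch of that flow (the class name, job type, parameters and the getNextMenteeBatch()/reassignSingleMentee() helpers are all illustrative, not existing code; the "lock" is sketched as a plain main-stash key via add()/delete(), since it has to survive across separate job executions, though BagOStuff::lock() with a suitable expiry could presumably serve the same purpose):

use MediaWiki\MediaWikiServices;

class ReassignMenteesBatchJob extends Job {
	private const BATCH_SIZE = 100;

	public function __construct( array $params ) {
		parent::__construct( 'reassignMenteesBatch', $params );
	}

	public function run() {
		$stash = MediaWikiServices::getInstance()->getMainObjectStash();
		$lockKey = $stash->makeKey( 'reassign-mentees', $this->params['mentorId'] );

		if ( empty( $this->params['hasLock'] ) ) {
			// First batch: "acquire the lock" by creating a stash key with a TTL.
			// add() fails if the key already exists, i.e. another reassignment
			// for this mentor is already in progress.
			if ( !$stash->add( $lockKey, 1, 3600 ) ) {
				return true;
			}
		}

		// getNextMenteeBatch() and reassignSingleMentee() are hypothetical helpers
		$menteeIds = $this->getNextMenteeBatch( self::BATCH_SIZE );
		foreach ( $menteeIds as $menteeId ) {
			$this->reassignSingleMentee( $menteeId );
		}

		if ( count( $menteeIds ) === self::BATCH_SIZE ) {
			// more mentees may remain: reschedule ourselves with the flag set
			MediaWikiServices::getInstance()->getJobQueueGroup()->lazyPush(
				new JobSpecification( 'reassignMenteesBatch', [
					'mentorId' => $this->params['mentorId'],
					'hasLock' => true,
				] )
			);
		} else {
			// final batch: release the lock
			$stash->delete( $lockKey );
		}
		return true;
	}
}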

Urbanecm_WMF moved this task from Inbox to Backlog on the Growth-Team board.