Page MenuHomePhabricator

Mentors still having mentees after removing themselves
Open, HighPublicBUG REPORT

Description

What happened?:
As reported here, Liu1126 attempted to quit as a mentor, but they still have mentees and are still receiving questions.

What should have happened instead?:
Mentors who quit mentorship should have all their mentees reassigned.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Previously T354222 but that was supposedly fixed 2 years ago.

Thanks @Pppery for linking the prior task! Liu1126 indeed has quite a few mentees:

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select * from actor where actor_name="Liu1126";
+-----------+------------+------------+
| actor_id  | actor_user | actor_name |
+-----------+------------+------------+
| 198393562 |   37823666 | Liu1126    |
+-----------+------------+------------+
1 row in set (0.001 sec)

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> 

mysql:research@dbstore1009.eqiad.wmnet [enwiki]> select count(*) from growthexperiments_mentor_mentee where gemm_mentor_id=37823666 and gemm_mentor_role='primary';
+----------+
| count(*) |
+----------+
|    22440 |
+----------+
1 row in set (0.025 sec)

mysql:research@dbstore1009.eqiad.wmnet [enwiki]>

So, it is definitely plausible that task is re-occuring.

Liu1126 resigned on December 16, 2025. Fortunately, we still have logs from then:

2025-12-15 23:13:01.047217 [a7c3da04-63bd-45ef-acee-61993113134c] mw-jobrunner.codfw.main-5f8596658f-qfdzr enwiki 1.46.0-wmf.5 GrowthExperiments WARNING: GrowthExperiments\Mentorship\ReassignMentees::doReassignMentees failed to acquire a lock for Liu1126 {"mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2025-12-15 23:13:01.047523 [a7c3da04-63bd-45ef-acee-61993113134c] mw-jobrunner.codfw.main-5f8596658f-qfdzr enwiki 1.46.0-wmf.5 GrowthExperiments INFO: ReassignMenteesJob finished reassignment with  status {"status":false,"context.job_type":"reassignMenteesJob"}

There is no mention of Liu1126 in the following day's logs, meaning the job apparently did not run again (disclaimer: I did not check JobExecutor logs, because those have ~30 GBs/day compressed).

I read the relevant code to see whether "job not getting rescheduled when lock cannot be acquired" is a plausible story. I found that:

  1. The log comes from ReassignMentees::doReassignMentees. That method returns false when lock acquisition fails (or when WikiConfigException is thrown when reading the mentor list)
  2. If the return value is false, ReassignMenteesJob::run reports the job successfully finished (by returning true) and does NOT schedule a follow-up
  3. The job execution framework sees a success reported and moves on to other jobs

I do not know why the lock acquisition failed. But, if it happens, we probably shouldn't just drop the job entirely – we should retry it once or twice.

On an unrelated note: ReassignMentees::doReassignMentees catches WikiConfigException, MentorManager::getRandomAutoAssignedMentor documents a WikiConfigException can be thrown, but in reality, such codepath doesn't seem to exist.

During investigating this issue, I found an (unrelated) issue that is currently tracked privately at T418222. The Growth-Team needs to fix the private issue first. I'll continue with this task as soon as I finish the other one.

Urbanecm_WMF triaged this task as High priority.

Okay, the other issue should be now resolved, so I'm continuing the investigation here. I'll post an update as a comment.

Change #1244630 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] ReassignMentees: Log more information

https://gerrit.wikimedia.org/r/1244630

Change #1244638 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.17] ReassignMentees: Log more information

https://gerrit.wikimedia.org/r/1244638

Change #1244639 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.16] ReassignMentees: Log more information

https://gerrit.wikimedia.org/r/1244639

Change #1244639 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.16] ReassignMentees: Log more information

https://gerrit.wikimedia.org/r/1244639

Change #1244638 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.17] ReassignMentees: Log more information

https://gerrit.wikimedia.org/r/1244638

Mentioned in SAL (#wikimedia-operations) [2026-02-26T13:01:07Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-26T13:05:13Z] <urbanecm@deploy2002> urbanecm: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-26T13:12:07Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1244639|ReassignMentees: Log more information (T418194)]], [[gerrit:1244638|ReassignMentees: Log more information (T418194)]] (duration: 11m 00s)

Change #1244630 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] ReassignMentees: Log more information

https://gerrit.wikimedia.org/r/1244630

Change #1244656 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] ReassignMentees: Adjust logging

https://gerrit.wikimedia.org/r/1244656

Change #1244673 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500

https://gerrit.wikimedia.org/r/1244673

Change #1244673 merged by jenkins-bot:

[operations/mediawiki-config@master] [Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500

https://gerrit.wikimedia.org/r/1244673

It seems the job is timeouting:

2026-02-26 13:42:17.125738 [021414e9e1b6f975030921de] mw-jobrunner.codfw.main-58d77d9d9-rlgqn enwiki 1.46.0-wmf.16 JobExecutor ERROR: Exception executing job: reassignMenteesJob Special: mentorId=37823666 performerId=37823666 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=["Liu1126"] jobReleaseTimestamp=1772112136 requestId=021414e9e1b6f975030921de : Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of 1200 seconds was exceeded {"job_type":"reassignMenteesJob","context.job_type":"reassignMenteesJob"}

I'll try lowering the limit to 2000 and see if it helps...

Mentioned in SAL (#wikimedia-operations) [2026-02-26T14:20:00Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-26T14:22:03Z] <urbanecm@deploy2002> urbanecm: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-26T14:26:21Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1244673|[Growth] Lower wgGEMentorshipReassignMenteesBatchSize to 2500 (T418194)]] (duration: 06m 20s)

Change #1244685 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.16] ReassignMentees: Adjust logging level

https://gerrit.wikimedia.org/r/1244685

Change #1244686 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.17] ReassignMentees: Adjust logging level

https://gerrit.wikimedia.org/r/1244686

Change #1244686 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.17] ReassignMentees: Adjust logging level

https://gerrit.wikimedia.org/r/1244686

Change #1244685 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.16] ReassignMentees: Adjust logging level

https://gerrit.wikimedia.org/r/1244685

Mentioned in SAL (#wikimedia-operations) [2026-02-26T15:34:37Z] <dreamyjazz@deploy2002> Started scap sync-world: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-26T15:36:40Z] <dreamyjazz@deploy2002> dreamyjazz, urbanecm: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Change #1244656 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] ReassignMentees: Adjust logging level

https://gerrit.wikimedia.org/r/1244656

Mentioned in SAL (#wikimedia-operations) [2026-02-26T15:41:06Z] <dreamyjazz@deploy2002> Finished scap sync-world: Backport for [[gerrit:1244688|SI: Populate siu_info in cusi_user from matched signals (T411118)]], [[gerrit:1244686|ReassignMentees: Adjust logging level (T418194)]], [[gerrit:1244685|ReassignMentees: Adjust logging level (T418194)]] (duration: 06m 29s)

Much better:

[urbanecm@mwlog2002 /srv/mw-log]$ grep 5407303fddeed2f2b7350653 GrowthExperiments.log 
2026-02-26 15:45:26.472151 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-24flv enwiki 1.46.0-wmf.16 GrowthExperiments INFO: GrowthExperiments\Mentorship\ReassignMentees::doReassignMentees processing 10159 mentees for Liu1126 with limit of 2000 {"mentees":10159,"mentor":"Liu1126","limit":2000,"context.job_type":"reassignMenteesJob"} 
2026-02-26 15:51:24.041665 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-x2zs7 enwiki 1.46.0-wmf.16 GrowthExperiments WARNING: GrowthExperiments\Mentorship\ReassignMentees::doReassignMentees failed to acquire a lock for Liu1126 {"mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:51:24.041934 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-x2zs7 enwiki 1.46.0-wmf.16 GrowthExperiments INFO: ReassignMenteesJob finished reassignment for Liu1126 with false status {"status":"false","mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:53:28.039945 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-bmpbz enwiki 1.46.0-wmf.16 GrowthExperiments WARNING: GrowthExperiments\Mentorship\ReassignMentees::doReassignMentees failed to acquire a lock for Liu1126 {"mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:53:28.040243 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-bmpbz enwiki 1.46.0-wmf.16 GrowthExperiments INFO: ReassignMenteesJob finished reassignment for Liu1126 with false status {"status":"false","mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:54:52.918047 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-24flv enwiki 1.46.0-wmf.16 GrowthExperiments INFO: ReassignMentees processed the maximum number of mentees {"limit":2000,"mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:54:52.919382 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-24flv enwiki 1.46.0-wmf.16 GrowthExperiments INFO: ReassignMenteesJob finished reassignment for Liu1126 with true status {"status":"true","mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:54:52.920564 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-24flv enwiki 1.46.0-wmf.16 GrowthExperiments INFO: ReassignMenteesJob did not reassign all mentees for Liu1126, scheduling new job {"mentor":"Liu1126","context.job_type":"reassignMenteesJob"} 
2026-02-26 15:59:03.977233 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-24flv enwiki 1.46.0-wmf.16 GrowthExperiments INFO: GrowthExperiments\Mentorship\ReassignMentees::doReassignMentees processing 8159 mentees for Liu1126 with limit of 2000 {"mentees":8159,"mentor":"Liu1126","limit":2000,"context.job_type":"reassignMenteesJob"} 
[urbanecm@mwlog2002 /srv/mw-log]$

Good news is it works, we started with 20k of mentees. What I'm not sure I understand are the lock failures, but I'm also not 100% certain I didn't schedule multiple jobs by accident (given I use shell.php, cf. T418468).

I'm also not happy about the 2k decrease of the batch size (5k used to work just fine...). At first, I was blaming T418379: ReassignMentees does not enforce batch size limit for mentees that are dropped from the mentor/mentee relationship, but then the logs would include information about the drops, which is not happening:

[urbanecm@mwlog2002 /srv/mw-log]$ grep 'ReassignMentees failed to reassign mentees' GrowthExperiments.log 
[urbanecm@mwlog2002 /srv/mw-log]$ grep 'dropping relationship' GrowthExperiments.log

Good news is it works, we started with 20k of mentees

There's still an issue though... This is where we are at now:

mysql:research@dbstore1009.eqiad.wmnet [enwiki]> select count(*) from growthexperiments_mentor_mentee where gemm_mentor_id=37823666 and gemm_mentor_role='primary';
+----------+
| count(*) |
+----------+
|      157 |
+----------+
1 row in set (0.012 sec)

mysql:research@dbstore1009.eqiad.wmnet [enwiki]>

The remaining mentees were not reassigned.

Another issue is related to the speed. From JobExecutor:

2026-02-26 16:25:59.714315 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-hcfx8 enwiki 1.46.0-wmf.16 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=37823666 performerId=37823666 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Liu1126\"] jobReleaseTimestamp=1772122236 requestId=5407303fddeed2f2b7350653","job_type":"reassignMenteesJob","job_status":true,"job_duration":731.7542641162872,"context.job_type":"reassignMenteesJob"} 
2026-02-26 16:40:40.759666 [5407303fddeed2f2b7350653] mw-jobrunner.codfw.main-59dcdb84cc-rkmlf enwiki 1.46.0-wmf.16 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=37823666 performerId=37823666 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Liu1126\"] jobReleaseTimestamp=1772122968 requestId=5407303fddeed2f2b7350653","job_type":"reassignMenteesJob","job_status":true,"job_duration":881.3475499153137,"context.job_type":"reassignMenteesJob"}

The current limit is 2000 mentees/job, so that means about half a second per mentee. That is a lot. Not sure why.

Mentioned in SAL (#wikimedia-operations) [2026-03-19T12:10:06Z] <urbanecm@deploy2002> mwscript-k8s job started: GrowthExperiments:reassignMentees --wiki=enwiki --mentor=Bilorv --performer=Bilorv --as-job # T418194

mysql:research@dbstore1009.eqiad.wmnet [enwiki]> select count(*) from growthexperiments_mentor_mentee where gemm_mentor_id=37823666 and gemm_mentor_role='primary';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.012 sec)

mysql:research@dbstore1009.eqiad.wmnet [enwiki]>

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select * from actor where actor_user=20355744;
+----------+------------+------------+
| actor_id | actor_user | actor_name |
+----------+------------+------------+
|     2765 |   20355744 | Bilorv     |
+----------+------------+------------+
1 row in set (0.001 sec)

mysql:research@dbstore1008.eqiad.wmnet [enwiki]>

So, for the originally reported user, this is now resolved. I found another mentor who has quite a few mentees assigned, and I'm going to try to reassign them, to see what happens.

Okay... This is suspicious (JobExecutor logs from today):

2026-03-19 12:11:03.017815 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-q76rq enwiki 1.46.0-wmf.20 JobExecutor DEBUG: Beginning job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922263 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob"} 
2026-03-19 12:17:03.028721 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-scxt7 enwiki 1.46.0-wmf.20 JobExecutor DEBUG: Beginning job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922263 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob"} 
2026-03-19 12:17:03.039864 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-scxt7 enwiki 1.46.0-wmf.20 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922263 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob","job_status":true,"job_duration":0.01604008674621582,"context.job_type":"reassignMenteesJob"} 
2026-03-19 12:21:11.097690 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-8kfjc enwiki 1.46.0-wmf.20 JobExecutor DEBUG: Beginning job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922827 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob"} 
2026-03-19 12:21:11.314985 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-q76rq enwiki 1.46.0-wmf.20 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922263 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob","job_status":true,"job_duration":608.3002531528473,"context.job_type":"reassignMenteesJob"} 
2026-03-19 12:27:11.101796 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-scxt7 enwiki 1.46.0-wmf.20 JobExecutor DEBUG: Beginning job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922827 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob"} 
2026-03-19 12:27:11.109884 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-scxt7 enwiki 1.46.0-wmf.20 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922827 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob","job_status":true,"job_duration":0.009999990463256836,"context.job_type":"reassignMenteesJob"} 
2026-03-19 12:34:25.471104 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-wv2x2 enwiki 1.46.0-wmf.20 JobExecutor DEBUG: Beginning job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773923615 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob"} 
2026-03-19 12:34:25.747795 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-8kfjc enwiki 1.46.0-wmf.20 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773922827 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob","job_status":true,"job_duration":794.6530587673187,"context.job_type":"reassignMenteesJob"} 
2026-03-19 12:40:25.478645 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-scxt7 enwiki 1.46.0-wmf.20 JobExecutor DEBUG: Beginning job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773923615 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob"} 
2026-03-19 12:40:25.486427 [d3f5760d475234e6e5acefde] mw-jobrunner.codfw.main-5b48559974-scxt7 enwiki 1.46.0-wmf.20 JobExecutor INFO: Finished job execution {"job":"reassignMenteesJob Special: mentorId=20355744 performerId=20355744 reassignMessageKey=growthexperiments-quit-mentorship-reassign-mentees-log-message reassignMessageAdditionalParams=[\"Bilorv\"] jobReleaseTimestamp=1773923615 requestId=d3f5760d475234e6e5acefde","job_type":"reassignMenteesJob","job_status":true,"job_duration":0.01068115234375,"context.job_type":"reassignMenteesJob"}

The only thing I did is running GrowthExperiments:reassignMentees --wiki=enwiki --mentor=Bilorv --performer=Bilorv --as-job (see the SAL entry above). Somehow, our code scheduled&started a job before the execution finished? Which terminated early (because of the lock)? How is that possible?


I decided to compare JobExecutor logs (when jobs were actually executed) with the Kafka traffic (when the jobs were _scheduled_):

[urbanecm@stat1008 ~]$ kafkacat -b kafka-main1006.eqiad.wmnet -t "codfw.mediawiki.job.reassignMenteesJob" > jobs.json
[urbanecm@stat1008 ~]$ grep Bilorv jobs.json | wc -l
5
[urbanecm@stat1008 ~]$

The result is (WMF-NDA paste):

{P89887}

So...GrowthExperiments:reassignMentees scheduled the same job _twice_, one version was delayed by 6 minutes, the other one started executing immediately? And then at 12:21:10.647Z, we scheduled a job delayed until 12:20:27Z? That...seems suspicious...

All of this happens in subsequent rounds as well, we apparently run the reassignment twice in parallel, and only one version succeeds...

Change #1279318 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] ReassignMentees: Add logging information

https://gerrit.wikimedia.org/r/1279318

Mentioned in SAL (#wikimedia-operations) [2026-04-29T12:57:45Z] <urbanecm@deploy1003> mwscript-k8s job started: GrowthExperiments:reassignMentees --wiki=enwiki --mentor=GrayStorm --performer=GrayStorm --as-job # T418194

Change #1279318 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] ReassignMentees: Add logging information

https://gerrit.wikimedia.org/r/1279318

Change #1280368 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.26] ReassignMentees: Add logging information

https://gerrit.wikimedia.org/r/1280368

Change #1280370 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.24] ReassignMentees: Add logging information

https://gerrit.wikimedia.org/r/1280370

Change #1280368 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.26] ReassignMentees: Add logging information

https://gerrit.wikimedia.org/r/1280368

Change #1280370 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.24] ReassignMentees: Add logging information

https://gerrit.wikimedia.org/r/1280370

Mentioned in SAL (#wikimedia-operations) [2026-04-30T12:19:15Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1280368|ReassignMentees: Add logging information (T418194)]], [[gerrit:1280370|ReassignMentees: Add logging information (T418194)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-30T12:21:06Z] <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1280368|ReassignMentees: Add logging information (T418194)]], [[gerrit:1280370|ReassignMentees: Add logging information (T418194)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-30T14:30:46Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1280368|ReassignMentees: Add logging information (T418194)]], [[gerrit:1280370|ReassignMentees: Add logging information (T418194)]] (duration: 131m 31s)