Error
- mwversion: 1.44.0-wmf.17
- reqId: 64e175a4-3845-472d-9299-1d4564d1f5aa
- Find reqId in Logstash
Deferred update '{deferred_type}' failed to run.Impact
wmf.17 729 errors ( logstash link).
| Etonkovidova | |
| Feb 22 2025, 1:06 AM |
| F58628324: Screen Shot 2025-03-06 at 4.11.33 PM.png | |
| Mar 7 2025, 12:15 AM |
Deferred update '{deferred_type}' failed to run.wmf.17 729 errors ( logstash link).
Looking at logstash Call to a member function getName() on null would seem to be the main cuplrit.
This is the associated stacktrace:
from /srv/mediawiki/php-1.44.0-wmf.17/extensions/GrowthExperiments/includes/Mentorship/ChangeMentor.php(185)
#0 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/MWCallableUpdate.php(52): GrowthExperiments\Mentorship\ChangeMentor->GrowthExperiments\Mentorship\{closure}(string)
#1 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/DeferredUpdates.php(459): MediaWiki\Deferred\MWCallableUpdate->doUpdate()
#2 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/DeferredUpdates.php(201): MediaWiki\Deferred\DeferredUpdates::attemptUpdate(MediaWiki\Deferred\MWCallableUpdate)
#3 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/DeferredUpdates.php(288): MediaWiki\Deferred\DeferredUpdates::run(MediaWiki\Deferred\MWCallableUpdate)
#4 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/DeferredUpdatesScope.php(243): MediaWiki\Deferred\DeferredUpdates::MediaWiki\Deferred\{closure}(MediaWiki\Deferred\MWCallableUpdate, int)
#5 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/DeferredUpdatesScope.php(172): MediaWiki\Deferred\DeferredUpdatesScope->processStageQueue(int, int, Closure)
#6 /srv/mediawiki/php-1.44.0-wmf.17/includes/deferred/DeferredUpdates.php(307): MediaWiki\Deferred\DeferredUpdatesScope->processUpdates(int, Closure)
#7 /srv/mediawiki/php-1.44.0-wmf.17/extensions/EventBus/includes/JobExecutor.php(110): MediaWiki\Deferred\DeferredUpdates::doUpdates()
#8 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#9 {main}Change #1122167 had a related patch set uploaded (by Michael Große; author: Michael Große):
[mediawiki/extensions/GrowthExperiments@master] fix(Mentorship): Don't fail notification if old mentor is null
@Michael noted on the patch: // TODO: Figure out how the old mentor could be null and what should be done in that cae. This had a short conversation on Gerrit:
I'm moving it the conversation to Phabricator, as it is getting more general than what is in scope for code review discussions:
We could reintroduce the reason field? Though the i18n message key makes it look like that the related message of growthexperiments-notification-body-mentor-change that made use of it was intended to be removed eventually?
I'm not 100% sure how the reason field would be helpful here. We explicitly decided on not exposing the reason to the mentee, see the conversation in T327493. The field has to be processed by the presentation model for B/C reasons. Unfortunately, we need to basically keep the B/C layer in indefinitely. Echo is capable of showing up to 2000 notifications per user. When was the oldest notification generated depends on how many notifications the user has. If they are not particularly active, the 2000th oldest notification might be very old (possibly including the reason field).
It might be a good idea to switch to a time-based expiry (T383948). That would allow us to assume any notification older than N days is no longer present in the database, and thus no user can possibly see it. With the current offset-based approach, we do not have the certainty, and we need to be able to render notifications with reason basically forever.
The other half of the question is how we would encounter this code-path for them?
Good question! As far as I can see, all occurrences were on a job runner. The only place where ChangeMentor is called from a job context is ReassignMentees. I agree it seems unlikely (and almost impossible), but it is the only job calling this place. Another interesting observation is that while there were hundreds of notices, all of them came from a single request (reqId 64e175a4-3845-472d-9299-1d4564d1f5aa). This means that only one job failed (rather than a job failing regularly), and that we are dealing with some kind of an edge case.
A very good question is how did ``ReassignMentees get called on someone that does not have any mentor. A possible theory for that:
However, this is a lot of conditions at once, and it all feels unlikely to happen. I also can't really explain how the race condition could have happened hundreds of times during a single job. Maybe not all of the circumstances I listed above are strictly necessary, and it is possible to achieve something similar in situations that are more likely to happen?
Hopefully this provides some extra thoughts on how this might happen, and where to go with debugging. It might be useful to take a look at all logs from that reqId (we wouldn't have debug logs, but at least something should be logged). Hope this helps!
Thank you for these additional thoughts! I've updated the change in the manner we discussed elsewhere.
Change #1122167 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] fix(Mentorship): Don't fail notification if old mentor is null
QA Note: the issue fixed here is likely to only occur in a rare race-condition. Probably not something that can be produced deliberately. The main thing to check would probably be that the notifications received by mentees when their mentor changes still look the same.
The error reports have stopped on Feb 26, 2025 @ 11:02:51.222 - https://logstash.wikimedia.org/goto/52984d3c969f34086235720678bee3de
Thx, @Michael! Yes, re-assigend mentees will get a regular notification: