
Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit')
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error
  • service.version: 1.46.0-wmf.12
  • timestamp: 2026-01-21T09:38:21.886Z
  • labels.phpversion: 8.3.29
  • trace.id: 3bef7331-6a02-4cae-a2f3-2049c3b3ac65
labels.normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit')
Frame  Location  Call
from  /srv/mediawiki/php-1.46.0-wmf.12/includes/libs/Rdbms/LBFactory/LBFactory.php(820)
#0  /srv/mediawiki/php-1.46.0-wmf.12/includes/libs/Rdbms/LBFactory/LBFactory.php(299)  Wikimedia\Rdbms\LBFactory->assertTransactionRoundStage(string)
#1  /srv/mediawiki/php-1.46.0-wmf.12/extensions/EventBus/includes/JobExecutor.php(96)  Wikimedia\Rdbms\LBFactory->commitPrimaryChanges(string, int)
#2  /srv/mediawiki/php-1.46.0-wmf.12/extensions/EventBus/maintenance/runSingleJobStdin.php(67)  MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#3  /srv/mediawiki/php-1.46.0-wmf.12/maintenance/includes/MaintenanceRunner.php(696)  MediaWiki\Extensions\EventBus\Maintenance\RunSingleJobStdin->execute()
#4  /srv/mediawiki/php-1.46.0-wmf.12/maintenance/run.php(53)  MediaWiki\Maintenance\MaintenanceRunner->run()
#5  /srv/mediawiki/multiversion/MWScript.php(219)  require_once(string)
#6  {main}
Impact

370 issues within the last ~30 minutes, all on webvideotranscode hosts.

Notes

This is new since deployment of 1.46.0-wmf.12 on Commons. So far I don't think that this deserves a rollback but it is noisy.
Tentatively setting UBN priority, would appreciate if Data-Engineering looked into this.
The codepaths above in the stacktrace don't show any recently committed changes.

Event Timeline

Aklapper triaged this task as Unbreak Now! priority.
Ottomata subscribed.

This looks like something inside of a specific job, but it is not clear what from the stack trace. The EventBus related calls in the stack are likely unrelated to the underlying problem.

I'm not exactly sure who might be able to investigate.

Untagging Data-Engineering and tagging MW-Interfaces-Team and serviceops to search for an appropriate owner.

Scott_French subscribed.

These are indeed all WebVideoTranscodeJob. From a quick spot-check of Special:NewFiles on Commons, together with the videoscaling error rates over the last 12h, I don't think transcodes are succeeding.

As far as I can tell, the only plausible change in the wmf.11 - wmf.12 range that might affect DB interactions in WebVideoTranscodeJob is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/1225597.

We should probably roll back the train until the underlying issue is fixed or reverted.

BPirkle subscribed.

No idea yet what's going on.

I see that job_type webVideoTranscodePrioritized is indicated, and codesearch finds that in TimedMediaHandler. So per the Maintainers page, I'm adding what are hopefully some related tags. Please untag if I'm incorrect.

Thanks for the investigation! I'm rolling back wmf.12 to group0 for now, so group1 will be back on wmf.11.

From a quick scan of the code, we're somehow entering the commitPrimaryChanges call right after WebVideoTranscodeJob::run returns in JobExecutor::execute while still in LBFactory::ROUND_COMMITTING. That feels like a call to commitPrimaryChanges from within WebVideoTranscodeJob threw before resetting trxRoundStage, but it was then caught and run returned normally (though possibly with an error status).

Edit: I wish assertTransactionRoundStage included $this->trxRoundFname in its DBTransactionError message if non-null, as it would at least highlight where an interrupted transaction had started.
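
To make the wish above concrete, here is a minimal sketch of an assertion message that names the round owner when one is set. It is written in Python as a toy, not as the real PHP method: the function name, its parameters, and the bracketed suffix format are all assumptions made for illustration.

```python
from typing import Optional


def round_stage_error(expected: str, actual: str, round_owner: Optional[str]) -> str:
    """Build a stage-assertion message, naming the round owner if there is one.

    Hypothetical helper: only the base wording mirrors the error in this task;
    the "[round owner: ...]" suffix is an invented format.
    """
    message = f"Transaction round stage must be '{expected}' (not '{actual}')"
    if round_owner is not None:
        # Point at the function that started the interrupted round.
        message += f" [round owner: {round_owner}]"
    return message
```

With an owner such as "WebVideoTranscodeJob::run" in the message, the log entry alone would have pointed at the job rather than at EventBus.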

Confirmed that transcodes are completing once again following the rollback.

I can reproduce this now. Interestingly, it is a scoping issue.

If I inline actuallyRun() into the try/catch block, then suddenly things work normally, which is not normal to me :)

Change #1229664 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/extensions/TimedMediaHandler@master] Transaction round stage must be 'cursory' fix

https://gerrit.wikimedia.org/r/1229664

I wonder why WebVideoTranscodeJob::run() has its own try/catch that doesn't rethrow. This will mean that JobExecutor will still call commitPrimaryChanges(), unaware that the LBFactory is in an error state and a rollback is needed.

> I wonder why WebVideoTranscodeJob::run() has its own try/catch that doesn't rethrow. This will mean that JobExecutor will still call commitPrimaryChanges(), unaware that the LBFactory is in an error state and a rollback is needed.

Hmm, it probably should indeed. However, I tested this just now and it doesn't seem to fix this problem.

> Edit: I wish assertTransactionRoundStage included $this->trxRoundFname in its DBTransactionError message if non-null, as it would at least highlight where an interrupted transaction had started.

Your wish etc.

Change #1229723 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/TimedMediaHandler@master] Revert "Fix DivisionByZeroError when calculating bitrate"

https://gerrit.wikimedia.org/r/1229723

Change #1229724 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/TimedMediaHandler@wmf/1.46.0-wmf.12] Revert "Fix DivisionByZeroError when calculating bitrate"

https://gerrit.wikimedia.org/r/1229724

I've proposed a revert of the problematic patch, which would allow us to continue for wmf.12 and debug in slower time.

I think I figured it out: the call in actuallyRun() is commitPrimaryChanges( __METHOD__ ), but that __METHOD__ is used as the name of the transaction owner, and in a subfunction it doesn't match run(), which is what the JobRunner assumes: $fnameTrxOwner = get_class( $job ) . '::run'; // give run() outer scope

Updated the 1229664 patch with those changes and it does seem to work now. It raises the question of whether we should simply manage our own transactions with beginPrimaryChanges. I'm not super familiar with this level of depth of our transaction management combined with what the job runner does.

> I think I figured it out: the call in actuallyRun() is commitPrimaryChanges( __METHOD__ ), but that __METHOD__ is used as the name of the transaction owner, and in a subfunction it doesn't match run(), which is what the JobRunner assumes: $fnameTrxOwner = get_class( $job ) . '::run'; // give run() outer scope

Yes, since actuallyRun() handles the transaction on behalf of run(), it needs to use the same name. That looks like the main systemic problem.

It will still cause commitPrimaryChanges() errors when errors arise, since the exception is not rethrown, though JobExecutor will try to catch those and call rollbackPrimaryChangesAndLog() to recover.

> Updated the 1229664 patch with those changes and it does seem to work now. It raises the question of whether we should simply manage our own transactions with beginPrimaryChanges. I'm not super familiar with this level of depth of our transaction management combined with what the job runner does.

[EDIT] I meant to reply to this part separately. I need to look at the original patch more to see what it was trying to accomplish. Is it just trying to handle DivisionByZeroError? Can we not just abort around where the $bitrate is calculated, before that exception is thrown, and return false?
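
A minimal sketch of the "abort before the exception" idea, in Python rather than the job's PHP. The thread does not show how $bitrate is actually computed, so the formula (bits divided by seconds), the function name, and the parameters are all assumptions; only the guard pattern is the point.

```python
from typing import Optional


def average_bitrate(size_bytes: int, duration_seconds: float) -> Optional[float]:
    """Return bits per second, or None when the duration cannot be divided by.

    Hypothetical stand-in for wherever $bitrate is calculated in the job.
    """
    if duration_seconds <= 0:
        # Bail out before dividing; the caller can treat None as "return false".
        return None
    return size_bytes * 8 / duration_seconds
```

A caller would check for None and fail the job early, so no exception has to travel through the transaction handling at all.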

Ah, of course ... the "inner" commitPrimaryChanges call (i.e., from within WebVideoTranscodeJob::actuallyRun) raises DBTransactionError due to the name mismatch. This is then caught in run which returns (rather than re-raises), while trxRoundStage is left in a bad state, leading to the effect seen here. Good find!
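
The failure chain described above can be replayed with a small toy model. This is a Python sketch, not MediaWiki code: ToyLBFactory, its attributes, and the wording of the mismatch error are invented for illustration, and it assumes (as the comments above conclude) that the owner check fails after the stage has already moved to 'within-commit', without resetting it.

```python
class TransactionError(Exception):
    """Stand-in for DBTransactionError in this toy model."""


class ToyLBFactory:
    """Tracks only the round stage and round owner; nothing else is modelled."""

    def __init__(self) -> None:
        self.stage = "cursory"
        self.round_owner = None

    def _assert_stage(self, expected: str) -> None:
        if self.stage != expected:
            raise TransactionError(
                f"Transaction round stage must be '{expected}' (not '{self.stage}')"
            )

    def begin_primary_changes(self, fname: str) -> None:
        self._assert_stage("cursory")
        self.round_owner = fname

    def commit_primary_changes(self, fname: str) -> None:
        self._assert_stage("cursory")
        self.stage = "within-commit"
        if self.round_owner is not None and self.round_owner != fname:
            # Assumed behaviour: raise without resetting the stage.
            # The message text is illustrative only.
            raise TransactionError(
                f"{fname}: transaction round '{self.round_owner}' still running"
            )
        self.round_owner = None
        self.stage = "cursory"


def actually_run(lbf: ToyLBFactory, fname: str) -> None:
    # The helper commits on behalf of run(), using whatever name it is given.
    lbf.commit_primary_changes(fname)


def run(lbf: ToyLBFactory, inner_fname: str) -> bool:
    try:
        actually_run(lbf, inner_fname)
    except TransactionError:
        return False  # swallowed: the caller never learns a rollback is needed
    return True


def execute_job(lbf: ToyLBFactory, inner_fname: str) -> bool:
    owner = "WebVideoTranscodeJob::run"  # the name the job runner assumes
    lbf.begin_primary_changes(owner)
    ok = run(lbf, inner_fname)
    lbf.commit_primary_changes(owner)  # fails if the stage was left mid-commit
    return ok
```

Calling execute_job( ToyLBFactory(), "WebVideoTranscodeJob::actuallyRun" ) raises the stage error from the outer commit, while passing "WebVideoTranscodeJob::run" as the inner name lets both commits succeed in this model.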

Change #1229724 merged by jenkins-bot:

[mediawiki/extensions/TimedMediaHandler@wmf/1.46.0-wmf.12] Revert "Fix DivisionByZeroError when calculating bitrate"

https://gerrit.wikimedia.org/r/1229724

Mentioned in SAL (#wikimedia-operations) [2026-01-22T09:10:57Z] <aklapper@deploy2002> Started scap sync-world: Backport for [[gerrit:1229724|Revert "Fix DivisionByZeroError when calculating bitrate" (T415169)]]

Mentioned in SAL (#wikimedia-operations) [2026-01-22T09:13:20Z] <aklapper@deploy2002> jforrester, aklapper: Backport for [[gerrit:1229724|Revert "Fix DivisionByZeroError when calculating bitrate" (T415169)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-01-22T09:18:11Z] <aklapper@deploy2002> Finished scap sync-world: Backport for [[gerrit:1229724|Revert "Fix DivisionByZeroError when calculating bitrate" (T415169)]] (duration: 07m 13s)

Thanks everyone for looking into this / creating patches. Appreciated!

To unblock the train, I merged the revert only for the 1.46.0-wmf.12 branch in https://gerrit.wikimedia.org/r/c/1229724.

Marking this ticket as blocking 1.46.0-wmf.13, as a decision needs to be made whether to also revert on master (that is, for the future 1.46.0-wmf.13 branch) via https://gerrit.wikimedia.org/r/c/1229723, or instead to apply https://gerrit.wikimedia.org/r/c/1229664, which attempts to fix the root cause.

> [EDIT] I meant to reply to this part separately. I need to look at the original patch more to see what it was trying to accomplish. Is it just trying to handle DivisionByZeroError? Can we not just abort around where the $bitrate is calculated, before that exception is thrown, and return false?

This whole job is unusual, in that it drops the connection after step 1, spends a LOT of time transcoding, and then reinstates the connection and does step 2. This in itself already implies two transaction rounds, I think, which probably means we should not leave this to an implicit round of the job runner?

aaron renamed this task from Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit') to Transcode jobs failing with Wikimedia\Rdbms\DBTransactionError: Transaction round stage must be 'cursory' (not 'within-commit'). Thu, Jan 22, 10:30 PM

Change #1229723 merged by jenkins-bot:

[mediawiki/extensions/TimedMediaHandler@master] Partial revert "Fix DivisionByZeroError when calculating bitrate"

https://gerrit.wikimedia.org/r/1229723

> [EDIT] I meant to reply to this part separately. I need to look at the original patch more to see what it was trying to accomplish. Is it just trying to handle DivisionByZeroError? Can we not just abort around where the $bitrate is calculated, before that exception is thrown, and return false?

> This whole job is unusual, in that it drops the connection after step 1, spends a LOT of time transcoding, and then reinstates the connection and does step 2. This in itself already implies two transaction rounds, I think, which probably means we should not leave this to an implicit round of the job runner?

Yeah, looking through that method again, it seems like that job class should have $executionFlags set similar to RefreshLinksJob.

This is an open UBN ticket. Could there please be a decision? This is again blocking the train later today, per T415169#11544264. Thanks.

Ah, thanks, and sorry! I assume that means this ticket can get resolved.

Change #1229664 abandoned by TheDJ:

[mediawiki/extensions/TimedMediaHandler@master] Preserve name of transaction owner

Reason:

follow up will be in T415646

https://gerrit.wikimedia.org/r/1229664