Translating a page on Meta-Wiki didn't create the translated page
Closed, Resolved | Public | 4 Estimated Story Points | BUG REPORT

Description

A while ago, I translated the upcoming Tech News (2026-17) into Polish. As can be seen on Special:PrefixIndex, the Polish messages were created properly.

However, there's no overall page with the translated content, i.e. Tech/News/2026/17/pl is missing.

When translating, the only suspicious thing I saw was this message (displayed for every translation unit):
Failed to load translation aids: Title does not correspond to a translatable message

See also: T423214: Translation page not created when <1% translated

Event Timeline

I re-marked the page for translation; "Failed to load translation aids" went away and the Polish subpage was created. The root cause is unknown, but given the email to translators-l, this seems to have been recurring recently.

I believe I've found another example of this problem. I won't repeat the "mark as ready for translation" for now, in case you want to do so to help diagnose it:

Nikerabbit triaged this task as Medium priority. (Apr 23 2026, 6:53 AM)
Nikerabbit set the point value for this task to 2.

Worth investigating given the number of reports recently.

After some loose searching, it looks like new translation units since m:Translations:Queering Wiki 2026/Organizers/Page display title/en (2026-04-14 12:13) have this problem.

You can see the split here.

(My searching method: I click through FuzzyBot's page creation logs for 'Translations:' in recent weeks, and see if "In other languages" stops showing in the toolbox.)

I am seeing some:

The maximum execution time of 1200 seconds was exceeded

With the following trace (I checked only a few, but they seem to be the same):

from /srv/mediawiki/php-1.46.0-wmf.24/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(130)
#0 /srv/mediawiki/php-1.46.0-wmf.24/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(105): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->onTimeout(int)
#1 /srv/mediawiki/php-1.46.0-wmf.24/includes/libs/Http/MultiHttpClient.php(286): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->Wikimedia\RequestTimeout\Detail\{closure}(int)
#2 /srv/mediawiki/php-1.46.0-wmf.24/includes/libs/Http/MultiHttpClient.php(219): Wikimedia\Http\MultiHttpClient->runMultiCurl(array, array, string)
#3 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/EventBus.php(400): Wikimedia\Http\MultiHttpClient->runMulti(array, array)
#4 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php(118): MediaWiki\Extension\EventBus\EventBus->send(array, int)
#5 /srv/mediawiki/php-1.46.0-wmf.24/includes/JobQueue/JobQueue.php(372): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, int)
#6 /srv/mediawiki/php-1.46.0-wmf.24/includes/JobQueue/JobQueue.php(344): MediaWiki\JobQueue\JobQueue->batchPush(array, int)
#7 /srv/mediawiki/php-1.46.0-wmf.24/includes/JobQueue/JobQueueGroup.php(150): MediaWiki\JobQueue\JobQueue->push(array)
#8 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/TtmServer/TtmServer.php(58): MediaWiki\JobQueue\JobQueueGroup->push(array)
#9 /srv/mediawiki/php-1.46.0-wmf.24/includes/HookContainer/HookContainer.php(135): MediaWiki\Extension\Translate\TtmServer\TtmServer::onGroupChange(MediaWiki\Extension\Translate\MessageLoading\MessageHandle, array, array)
#10 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/HookRunner.php(136): MediaWiki\HookContainer\HookContainer->run(string, array)
#11 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/MessageLoading/MessageIndex.php(383): MediaWiki\Extension\Translate\HookRunner->onTranslateEventMessageMembershipChange(MediaWiki\Extension\Translate\MessageLoading\MessageHandle, array, array)
#12 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/MessageLoading/MessageIndex.php(260): MediaWiki\Extension\Translate\MessageLoading\MessageIndex->clearMessageGroupStats(array)
#13 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/MessageLoading/RebuildMessageIndexJob.php(48): MediaWiki\Extension\Translate\MessageLoading\MessageIndex->rebuild(float)
#14 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/JobExecutor.php(94): MediaWiki\Extension\Translate\MessageLoading\RebuildMessageIndexJob->run()
#15 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#16 {main}

This seems to have started 2026-04-14:

image.png (1,910×707 px, 240 KB)

Nikerabbit raised the priority of this task from Medium to Unbreak Now!. (Apr 24 2026, 3:46 PM)

This is now blocking people from getting their work done.

Tagging in @BBlack who is also looking into this issue.

From the top of that stack trace it looks to me that it is timing out trying to send a message out to the event bus:

from /srv/mediawiki/php-1.46.0-wmf.24/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(130)
#0 /srv/mediawiki/php-1.46.0-wmf.24/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(105): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->onTimeout(int)
#1 /srv/mediawiki/php-1.46.0-wmf.24/includes/libs/Http/MultiHttpClient.php(286): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->Wikimedia\RequestTimeout\Detail\{closure}(int)
#2 /srv/mediawiki/php-1.46.0-wmf.24/includes/libs/Http/MultiHttpClient.php(219): Wikimedia\Http\MultiHttpClient->runMultiCurl(array, array, string)
#3 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/EventBus.php(400): Wikimedia\Http\MultiHttpClient->runMulti(array, array)
#4 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php(118): MediaWiki\Extension\EventBus\EventBus->send(array, int)

Something is triggering an EventBus message, PHP is queuing up the HTTP client to send it out, and then it times out. Maybe the message is large, or maybe the destination is down? Did some new EventBus configuration land around April 14?

Or perhaps, did the destination of some subrequests that Translate makes change to be pointing at something slower than before?

As far as I can see, this is not affecting other jobs, so there is probably something special about this one. The job itself hasn't changed recently, but it takes one array param that can possibly be large. If that were the problem, I would expect it to fail instead of hang, though.

	protected function clearMessageGroupStats( array $diff ): void {
		$job = RebuildMessageGroupStatsJob::newRefreshGroupsJob( $diff['values'] );
		$this->jobQueueGroup->push( $job );

		foreach ( $diff['keys'] as $keys ) {
			foreach ( $keys as $key => $data ) {
				[ $ns, $pageName ] = explode( ':', $key, 2 );
				$title = Title::makeTitle( (int)$ns, $pageName );
				$handle = new MessageHandle( $title );
				[ $oldGroups, $newGroups ] = $data;
				$this->hookRunner->onTranslateEventMessageMembershipChange(
					$handle, $oldGroups, $newGroups );
			}
		}
	}

Eugh, ignore the previous comment; according to the trace, it is actually this code:
TtmServer.php

	public static function onGroupChange( MessageHandle $handle, array $old ): void {
		if ( $old === [] ) {
			// Don't bother for newly added messages
			return;
		}

		$job = TtmServerMessageUpdateJob::newJob( $handle, 'rebuild' );
		MediaWikiServices::getInstance()->getJobQueueGroup()->push( $job );
	}

This can potentially create a lot of jobs and it will try to push them one by one. That might mean it is not actually hanging, but just reaching the time limit if each push is slow enough and there are too many jobs. I guess using lazyPush might help here, but that's just guessing for now.
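The difference between pushing jobs one by one and deferring them into a single push can be illustrated with a toy model. This is a language-neutral sketch in Python, not the MediaWiki code: `FakeQueue`, `push_one_by_one` and `push_deferred` are hypothetical names, and the model only counts round trips rather than performing any HTTP requests.

```python
from dataclasses import dataclass, field


@dataclass
class FakeQueue:
    """Toy stand-in for a job queue; counts round trips instead of doing HTTP."""
    round_trips: int = 0
    jobs: list = field(default_factory=list)

    def push(self, batch):
        # One call is one round trip, no matter how many jobs it carries.
        self.round_trips += 1
        self.jobs.extend(batch)


def push_one_by_one(queue, jobs):
    # Eager style: every job pays the per-call overhead.
    for job in jobs:
        queue.push([job])


def push_deferred(queue, jobs):
    # Deferred style: collect first, hand everything over in one call.
    pending = []
    for job in jobs:
        pending.append(job)
    if pending:
        queue.push(pending)


jobs = [f"update-{i}" for i in range(1000)]
eager, deferred = FakeQueue(), FakeQueue()
push_one_by_one(eager, jobs)
push_deferred(deferred, jobs)
print(eager.round_trips, deferred.round_trips)
```

Both queues end up holding the same jobs; only the number of calls differs. Whether that is what dominates the real job's runtime is exactly the guess being made above.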

I found a single special event that might give an additional clue (or just be a red herring): https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2026.04.23?id=D_A1uJ0BZNKAUxWE8T49

FormatJson::encode($events) thrown exception: No error. Aborting send.

from /srv/mediawiki/php-1.46.0-wmf.24/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(130)
#0 /srv/mediawiki/php-1.46.0-wmf.24/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(105): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->onTimeout(int)
#1 /srv/mediawiki/php-1.46.0-wmf.24/includes/Json/FormatJson.php(97): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->Wikimedia\RequestTimeout\Detail\{closure}(int)
#2 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/EventBus.php(561): MediaWiki\Json\FormatJson::encode(array, bool, int)
#3 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/EventFactory.php(475): MediaWiki\Extension\EventBus\EventBus::serializeEvents(array)
#4 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/EventFactory.php(1152): MediaWiki\Extension\EventBus\EventFactory->signEvent(array)
#5 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/Adapters/JobQueue/JobQueueEventBus.php(93): MediaWiki\Extension\EventBus\EventFactory->createJobEvent(string, string, MediaWiki\Extension\Translate\TtmServer\TtmServerMessageUpdateJob)
#6 /srv/mediawiki/php-1.46.0-wmf.24/includes/JobQueue/JobQueue.php(372): MediaWiki\Extension\EventBus\Adapters\JobQueue\JobQueueEventBus->doBatchPush(array, int)
#7 /srv/mediawiki/php-1.46.0-wmf.24/includes/JobQueue/JobQueue.php(344): MediaWiki\JobQueue\JobQueue->batchPush(array, int)
#8 /srv/mediawiki/php-1.46.0-wmf.24/includes/JobQueue/JobQueueGroup.php(150): MediaWiki\JobQueue\JobQueue->push(array)
#9 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/TtmServer/TtmServer.php(58): MediaWiki\JobQueue\JobQueueGroup->push(array)
#10 /srv/mediawiki/php-1.46.0-wmf.24/includes/HookContainer/HookContainer.php(135): MediaWiki\Extension\Translate\TtmServer\TtmServer::onGroupChange(MediaWiki\Extension\Translate\MessageLoading\MessageHandle, array, array)
#11 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/HookRunner.php(136): MediaWiki\HookContainer\HookContainer->run(string, array)
#12 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/MessageLoading/MessageIndex.php(383): MediaWiki\Extension\Translate\HookRunner->onTranslateEventMessageMembershipChange(MediaWiki\Extension\Translate\MessageLoading\MessageHandle, array, array)
#13 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/MessageLoading/MessageIndex.php(260): MediaWiki\Extension\Translate\MessageLoading\MessageIndex->clearMessageGroupStats(array)
#14 /srv/mediawiki/php-1.46.0-wmf.24/extensions/Translate/src/MessageLoading/RebuildMessageIndexJob.php(48): MediaWiki\Extension\Translate\MessageLoading\MessageIndex->rebuild(float)
#15 /srv/mediawiki/php-1.46.0-wmf.24/extensions/EventBus/includes/JobExecutor.php(94): MediaWiki\Extension\Translate\MessageLoading\RebuildMessageIndexJob->run()
#16 /srv/mediawiki/rpc/RunSingleJob.php(60): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#17 {main}

Change #1277135 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] TtmServer: Use lazyPush for job queue

https://gerrit.wikimedia.org/r/1277135

This might help?

T382970: Ad-hoc EventBus event submissions should be batched
Batch together all jobs destined to the same eventgate (724828)

I think it was just never deployed because the author left the foundation.

Needs review and merge from JobQueue folks, cc MW-Interfaces-Team

Has the lazyPush thing from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1277135 been tried yet? (I don't think so). Is it worth trying?

Not merged yet that I see. I don't know the job queue code well enough to say if that would help, but I don't think it would hurt anything at least.

MediaWiki-Platform-Team was asked to look into this. (Slack thread with the investigation: https://wikimedia.slack.com/archives/C0AQK0XDPFU/p1777053596503259)

My best guess is that the RebuildMessageIndexJob jobs consistently time out after 20 minutes, because they simply have more work to do than can be done in 20 minutes. This timeout usually happens while it's trying to queue another job, but that's because this job primarily queues other jobs.

The problem started right after midnight UTC on 14 April: https://logstash.wikimedia.org/goto/6743e9822a33bdd0d3c0bcc17856bb7b, corresponding to 1000+ translation configuration changes in about 2 hours, which can be found in the log here: https://meta.wikimedia.org/w/index.php?title=Special:Log/pagetranslation&wpdate=2026-04-14&limit=1250. I'm not familiar with the message index, but it seems that ever since then none of these jobs completed, and presumably the index has not been fully rebuilt.

It may be enough to just increase this 20 minute timeout, allowing the job to complete. However, it's not clear how much time it needs (the job doesn't report progress, and although it logs every message it processes, I don't know how many messages are there), and we can only control the timeout globally (for every kind of job all at once). While that would probably be safe, it's not clear how it would affect site stability, particularly on a Friday evening (there are some other jobs failing with timeouts). The configuration is here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/CommonSettings.php#482

If you can find a way to refactor that job so that it does its work in smaller chunks that can definitely complete in 20 minutes, that would be a better long-term solution, but I don't know how much work that would be.

My suggestion, unless someone from the Language team has a better idea, is to try raising the timeout to a very large value (let's say 12 hours) on Monday morning and see if that allows the job to complete.

The job doesn't do very much. The so-called critical section, which is protected by the database lock, takes about 30 seconds on metawiki°. The rest of the job is just creating TtmServerMessageUpdateJobs, one by one. Even if we assume it takes 2 seconds per job, it should still be able to push a charitable 36k jobs without hitting the timeout.

° Although, looking at logstash, one problem seems to be that the lock is not actually released when unlock is called, but later, on transaction idle, which means it is held for the whole duration of the job. This prevents running the message index clearing code from a maintenance script (which could have a longer timeout), which is one workaround I tried.

As a temporary workaround, translation admins can re-mark the page for translation, which updates the translation pages and fixes translation aids loading for a few minutes.
index.php?title=Special:PageTranslation&do=mark&target=SOURCE_PAGE_NAME

Or the Mark for translation link in the toolbox.

The job doesn't do very much. The so-called critical section, which is protected by the database lock, takes about 30 seconds on metawiki°. The rest of the job is just creating TtmServerMessageUpdateJobs, one by one. Even if we assume it takes 2 seconds per job, it should still be able to push a charitable 36k jobs without hitting the timeout.

There are seemingly more jobs than that. Let me know if I'm misunderstanding the logs, but this: https://logstash.wikimedia.org/goto/c6cee2a7b6cc086d565bc96b5545e47b looks like one (randomly picked) instance processed 101,904 pages / sub-jobs before timing out after 20 minutes. (You can invert the normalized_message filter in the query to see the other logs about it starting and then crashing.)

Also, I can't do math. With those numbers we see that pushing one job takes about 12 ms, which sounds reasonable. The issue is the large number of jobs, just like you are saying.
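The arithmetic behind the 12 ms figure can be checked quickly. A small Python sketch, using only numbers quoted in this thread (the 1200-second limit, the 101,904 pushes seen in the logs, and the earlier 2-seconds-per-job assumption); the variable names are made up for illustration:

```python
TIMEOUT_S = 1200          # maximum execution time from the error message
PUSHED_JOBS = 101_904     # pushes logged by one instance before it timed out

# Average time per push if the whole 20 minutes went into pushing.
per_push_ms = TIMEOUT_S / PUSHED_JOBS * 1000
print(round(per_push_ms, 1))  # 11.8

# For comparison: how many pushes fit in the limit at 2 seconds each.
max_jobs_at_2s = TIMEOUT_S // 2
print(max_jobs_at_2s)  # 600
```

So roughly 12 ms per push is an upper bound on the per-job cost, and the earlier 36k estimate does not follow from a 2-second push time.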

FWIW, doBatchPush() in JobQueueEventBus does not chunk internally, it sends all events in a single EventBus::send() call. So with 101K jobs, even after batching via lazyPush, the serialization and HTTP POST might still be heavy.

Change #1277135 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] TtmServer: Use lazyPush for job queue

https://gerrit.wikimedia.org/r/1277135

Change #1277290 had a related patch set uploaded (by Abijeet Patro; author: Nikerabbit):

[mediawiki/extensions/Translate@wmf/1.46.0-wmf.24] TtmServer: Use lazyPush for job queue

https://gerrit.wikimedia.org/r/1277290

Change #1277290 merged by jenkins-bot:

[mediawiki/extensions/Translate@wmf/1.46.0-wmf.24] TtmServer: Use lazyPush for job queue

https://gerrit.wikimedia.org/r/1277290

Mentioned in SAL (#wikimedia-operations) [2026-04-27T07:05:56Z] <kartik@deploy1003> Started scap sync-world: Backport for [[gerrit:1277290|TtmServer: Use lazyPush for job queue (T423779)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-27T07:22:11Z] <kartik@deploy1003> abi, kartik: Backport for [[gerrit:1277290|TtmServer: Use lazyPush for job queue (T423779)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-27T07:34:49Z] <kartik@deploy1003> Finished scap sync-world: Backport for [[gerrit:1277290|TtmServer: Use lazyPush for job queue (T423779)]] (duration: 28m 53s)

That seems to have resolved the issue, though the job that succeeded only created 38,574 ttmserver update jobs. This either means that some of them were silently truncated, or that the number of jobs was not that large. One possible explanation is that the rebuild jobs were re-attempted using the same reqId, which could inflate that number 2-3 times (I don't remember how many times we retry).

I am going to run a script to update all translatable pages that were not updated during this time.

Nikerabbit changed the point value for this task from 2 to 4. (Apr 27 2026, 10:26 AM)
Nikerabbit moved this task from In Progress to Done on the LPL Essential (FY2025-26 Q3&4) board.

Looking at the link @matmarex shared, it now displays 163,297 items. But I checked that each page was attempted twice, and there are two messages per job (one for each DC). That brings it quite close to the number I saw (38,574 vs. 40,824), so it does not look like a lot of updates were missed, if any.
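The reconciliation above is a plain division. A short Python check, using only the figures from this comment (variable names are illustrative):

```python
LOG_ITEMS = 163_297     # items shown by the logstash query
ATTEMPTS = 2            # each page was attempted twice
MESSAGES_PER_JOB = 2    # one log message per data centre
CREATED_JOBS = 38_574   # jobs created by the run that succeeded

expected_jobs = LOG_ITEMS // (ATTEMPTS * MESSAGES_PER_JOB)
print(expected_jobs)                  # 40824
print(expected_jobs - CREATED_JOBS)   # 2250
```

The remaining gap of 2,250 is about 5.5% of the expected figure.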

The script has also finished updating all translatable pages.

Thanks everyone for the help!

JobQueueEventBus does not chunk internally, it sends all events in a single EventBus::send()

FWIW, EventBus::send() does chunk if it is given a batch of events to send, and the serialized size of the batch is too large. (It does this a bit inefficiently, in that it has to re-serialize again after chunking, but it should only do this when the payload is too large in the first place).
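To make the "serialize, and only chunk when too large" idea concrete, here is a minimal Python sketch. It is not the EventBus implementation: `chunk_events` and `max_bytes` are hypothetical, and the greedy splitting strategy is an assumption chosen for brevity. It does share the property described above, in that splitting requires serializing again.

```python
import json


def chunk_events(events, max_bytes):
    """Return JSON request bodies, splitting only when the batch is too large."""
    body = json.dumps(events)
    if len(body.encode("utf-8")) <= max_bytes:
        return [body]  # common case: the whole batch fits in one request

    # Too large: rebuild smaller batches, re-serializing as we go.
    chunks, current = [], []
    for event in events:
        candidate = current + [event]
        if current and len(json.dumps(candidate).encode("utf-8")) > max_bytes:
            chunks.append(json.dumps(current))
            current = [event]
        else:
            current = candidate
    if current:
        chunks.append(json.dumps(current))
    return chunks


events = [{"id": i, "payload": "x" * 50} for i in range(10)]
print(len(chunk_events(events, 10_000)), len(chunk_events(events, 200)))
```

A single event larger than `max_bytes` still gets a chunk of its own in this sketch; how the real code treats that case is not covered in this thread.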

This might help
T382970: Ad-hoc EventBus event submissions should be batched

BTW, on a closer read of this problem, I don't think it would have helped here. EventBus was already batching by Job; that patch is just an optimization that allows batching multiple Jobs in the same send() call.

Feeling confident enough to close this. What likely happened:

  • A lot of pages were (re-)marked for translation in a short time frame
  • The message index rebuild job was creating translation memory updates for the affected translation units, and pushing them into the job queue one by one
  • This led the job to time out after 1200 seconds, as it could not finish pushing all those jobs in time, and the message index update did not succeed at all
  • By batching job pushes, we avoided the overhead of pushing them individually, and with it the timeout

The symptoms appeared because, some time after a new page or an existing page with new units was marked for translation, an interim cache expired. This cache holds a list of known translation units from pages recently marked for translation. If a translation unit is not known, translation aids fail to load, and translatable pages are not updated.

The vast majority of the log entries on https://meta.wikimedia.org/w/index.php?title=Special:Log/pagetranslation&wpdate=2026-04-14&limit=1250 are translation discouragements (1069 out of 1250). Did all of them trigger message index rebuilds? If yes, why? How can translation discouragement affect the message index? Not updating the message index on discouragement/encouragement would also have avoided the incident, and would also mean fewer jobs under usual load.