Translation pages are updated after a long delay on Wikimedia sites
Closed, Resolved (Public)


There have been reports of translation page updates taking a long time, sometimes over 30 minutes. We developers knew that updates would be delayed when we switched to using a JobQueue, but this is a huge regression from the couple of seconds of delay it used to be.


Suggested plan of action

  1. Confirm that this is just caused by Job Queue slowness and not any errors in Translate
  2. Check if these types of jobs can be prioritized
  3. Consider bringing up these delays with the Job Queue infrastructure maintainers if we see that there are constantly long delays

Event Timeline

Why I think this is high priority: We (Translate developers) have been working on making translation page updates robust. The code and functionality are still fresh in our minds, so we should look at and address this issue before we declare ourselves done with this area.

Can you please list the names of your jobs? has a lot of data about delays for the particular jobs, I can help look into it.

I can no longer find any settings for priority jobs, so I guess such a thing no longer exists or has been moved somewhere codesearch does not find it. In any case it would be problematic, as the refresh-translatable-pages.php maintenance script would spawn thousands of TranslateRenderJobs that would overwhelm the queue if they had a priority.

In other words, the only option available seems to be to investigate whether the jobs are at fault (errors or being excessively slow) or just taking time to be run through the job queue.

Looking at the recent history of channel:Translate.Jobs in Logstash, I don't see any obvious issues. Job execution durations are low, and I do not see big delays between TranslationsUpdateJob and TranslateRenderJobs. There are some errors with the codes "edit-conflict" and "edit-already-exists": sometimes a few, sometimes a bunch for the same page within a short period. These jobs are the only actors editing the affected pages, so they can only conflict with themselves (i.e. the same job being executed multiple times, or more than one job for the same page, which becomes more likely if there are long delays and the first job hasn't had a chance to run before new ones are inserted).
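As a rough illustration of how two jobs for the same page can conflict with each other, here is a simplified model. It is not MediaWiki's actual conflict detection; the `Page` class and its `save` method are hypothetical and only assume that a save is rejected when the page has changed since the job read it.

```python
class Page:
    """Toy page that rejects saves based on a stale revision (hypothetical model)."""

    def __init__(self):
        self.revision = 0
        self.text = ""

    def save(self, text, base_revision):
        # Reject the save if someone else saved since this job read the page.
        if base_revision != self.revision:
            return "edit-conflict"
        self.text = text
        self.revision += 1
        return "ok"


page = Page()

# Two jobs for the same page both read the page before either one saves.
base_a = page.revision
base_b = page.revision

result_a = page.save("rendered by job A", base_a)
result_b = page.save("rendered by job B", base_b)  # base is stale by now
```

In this model the first save succeeds and the second is rejected, which matches the pattern of a job conflicting only with another copy of itself.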

Ok, this was an easy one.

The translate jobs are rather low traffic, so they share a queue with other low traffic jobs. Apparently LocalGlobalUserPageCacheUpdateJob is prone to large spikes, which block the low traffic job queue.

We are going to move the LocalGlobalUserPageCacheUpdateJob out of the low traffic queue and give it its own queue. This should resolve this problem. If there's ever again a slowdown of translate jobs, we will give them a designated queue.
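The effect of sharing a queue with a spiky job type can be sketched with a toy simulation. This is not the actual JobQueue configuration: the queue names, the job counts, and the processing rate are made-up numbers, and it assumes each queue is drained in FIFO order by its own consumer at the same rate.

```python
from collections import deque


def simulate(queues, rate_per_tick):
    """Drain each queue independently, FIFO, at rate_per_tick jobs per tick.

    queues maps a queue name to a list of job type names in arrival order.
    Returns a dict mapping each job type to the tick its last job finished.
    """
    finished = {}
    for jobs in queues.values():
        pending = deque(jobs)
        tick = 0
        while pending:
            tick += 1
            for _ in range(min(rate_per_tick, len(pending))):
                finished[pending.popleft()] = tick
    return finished


# Hypothetical load: a spike of 1000 cache update jobs arrives just before
# 5 translate jobs.
spike = ["LocalGlobalUserPageCacheUpdateJob"] * 1000
translate = ["TranslateRenderJob"] * 5

# Shared queue: translate jobs wait behind the whole spike.
shared = simulate({"low_traffic": spike + translate}, rate_per_tick=10)

# Separate queues: the spike no longer delays the translate jobs.
split = simulate(
    {"cache_update": spike, "low_traffic": translate}, rate_per_tick=10
)
```

With these numbers the translate jobs finish at tick 101 in the shared queue and at tick 1 once the spiky job type has its own queue.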

Change 640446 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue.

Change 640446 merged by jenkins-bot:
[operations/deployment-charts@master] JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue.

This issue should stop occurring now. Please reopen if it comes back.

Ciencia_Al_Poder added a subscriber: Ciencia_Al_Poder.

Looks like this is happening again. It was reported today at Topic:Vyupiut0evae00hd.

Looking at Wladek92's contributions, the translation unit Translations:Manual:GenerateJsonI18n.php/Page display title/fr (the first one for the Manual:GenerateJsonI18n.php/fr page) was created at 15:12 UTC, but the corresponding page that contains all translations, namely Manual:GenerateJsonI18n.php/fr, was created at 15:25 UTC with an edit summary corresponding to the last translation unit edited.

That page was created at the same time as Manual:$wgMessagesDirs/fr (15:25 UTC), whose first translation unit was created at 15:08 UTC (17 minutes of lag).
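For reference, the lag for both pages can be computed from the timestamps quoted above. The helper name `lag_minutes` is only for illustration, and it assumes both times fall on the same day.

```python
from datetime import datetime


def lag_minutes(first_unit_utc, page_created_utc):
    """Minutes between the first translation unit edit and page creation."""
    fmt = "%H:%M"
    delta = datetime.strptime(page_created_utc, fmt) - datetime.strptime(
        first_unit_utc, fmt
    )
    return int(delta.total_seconds() // 60)


# Timestamps as reported in the comment above (all UTC, same day).
lags = {
    "Manual:GenerateJsonI18n.php/fr": lag_minutes("15:12", "15:25"),
    "Manual:$wgMessagesDirs/fr": lag_minutes("15:08", "15:25"),
}
```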

Yes, same situation 2 hours later.

A magic trick I used earlier: I reopen the first message to translate, add a blank at the end (which changes nothing), and validate. The history is unchanged, but now 'français' appears in the list!

Normal behaviour is back today. The item is shown immediately after translating a random message from within the page. No specific action was taken.

It seems there was another backlog in the low traffic jobs that caused this delay on Dec 2nd.

We'll go ahead and move the translate jobs to their own queue.

Change 647307 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[operations/deployment-charts@master] JobQueue: Move translation jobs to its own queue

Change 647307 merged by jenkins-bot:
[operations/deployment-charts@master] JobQueue: Move translation jobs to its own queue

The change has been deployed, and the delay shouldn't be happening anymore. Please feel free to reopen if the issue continues.