Translation pages are updated after a long delay on Wikimedia sites
Closed, Resolved · Public

Description

There have been reports of translation page updates taking a long time, sometimes over 30 minutes. We developers knew that updates would be delayed when we switched to using a JobQueue, but this is a huge regression from the couple of seconds of delay it used to be.

Reports:

Suggested plan of action

  1. Confirm that this is just caused by Job Queue slowness and not any errors in Translate
  2. Check if these types of jobs can be prioritized
  3. Consider bringing up these delays with the Job Queue infrastructure maintainers if we see that there are constantly long delays

Event Timeline

Restricted Application added a subscriber: Aklapper. Mon, Nov 9, 8:38 AM
Nikerabbit triaged this task as High priority. Mon, Nov 9, 8:41 AM

Why I think this is high priority: We (Translate developers) have been working on making translation page updates robust. The code and functionality are still fresh in our minds, so we should look at and address this issue before we declare ourselves done with this area.

Nikerabbit updated the task description. Mon, Nov 9, 8:45 AM

Can you please list the names of your jobs?

https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 has a lot of data about delays for the particular jobs, I can help look into it.

The relevant jobs would be TranslationsUpdateJob and TranslateRenderJob. More info at https://www.mediawiki.org/wiki/Help:Extension:Translate/Process_flow_in_MediaWiki_jobs

I can no longer find any settings for priority jobs, so I guess such a thing no longer exists or has moved somewhere codesearch does not find it. In any case it would be problematic, as the refresh-translatable-pages.php maintenance script would spawn thousands of TranslateRenderJobs that would overwhelm the queue if they had priority.

In other words, the only option available seems to be to investigate whether the jobs are at fault (errors or being excessively slow) or just taking time to be run through the job queue.

Looking at the recent history of channel:Translate.Jobs in Logstash, I don't see any obvious issues. Job execution durations are low, and I do not see big delays between TranslationsUpdateJob and TranslateRenderJobs. There are some errors with the codes "edit-conflict" and "edit-already-exists": sometimes a few, sometimes a bunch for the same page within a short period. These jobs are the only actors editing the affected pages, so they can only conflict with themselves, i.e. the same job being executed multiple times or more than one job existing for the same page. The latter becomes more likely if there are long delays and the first job hasn't had a chance to run before new ones are inserted.
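To illustrate how jobs for the same page can conflict only with themselves, here is a toy model in Python. It is a sketch under assumptions, not MediaWiki or Translate code: the `Page` class, the `EditConflict` exception and the `run_overlapping` helper are hypothetical names, and the model simply assumes that a save is rejected when the page has changed since the job read it.

```python
from dataclasses import dataclass


class EditConflict(Exception):
    """Raised when a save is based on an outdated revision (hypothetical)."""


@dataclass
class Page:
    revision: int = 0
    text: str = ""

    def save(self, base_revision: int, text: str) -> None:
        # Assumption: a save only succeeds if nothing changed since the read.
        if base_revision != self.revision:
            raise EditConflict(f"based on r{base_revision}, now r{self.revision}")
        self.revision += 1
        self.text = text


def run_overlapping(page: Page, texts: list[str]) -> list[str]:
    """Simulate several jobs that all read the page before any of them saves."""
    bases = [page.revision for _ in texts]  # every job sees the same revision
    outcomes = []
    for base, text in zip(bases, texts):
        try:
            page.save(base, text)
            outcomes.append("saved")
        except EditConflict:
            outcomes.append("edit-conflict")
    return outcomes


page = Page()
outcomes = run_overlapping(page, ["update 1", "update 2", "update 3"])
print(outcomes)
```

In this model only the first of the overlapping jobs saves and the rest fail, which matches the pattern of several errors for the same page in a short period. Jobs that run one after another, each reading the current revision, would all succeed.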

Ok, this was an easy one.

The translate jobs are rather low traffic, so they share a queue with other low-traffic jobs. Apparently LocalGlobalUserPageCacheUpdateJob is prone to large spikes, which block the low-traffic job queue.

We are going to move the LocalGlobalUserPageCacheUpdateJob out of the low-traffic queue and give it its own queue. This should resolve the problem. If translate jobs ever slow down again, we will give them a dedicated queue.
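The effect of a spike on a shared queue can be sketched with a small Python simulation. This is only an illustration under stated assumptions: a single worker processes each queue strictly in arrival order, every job takes one second, and the spike size of 1800 jobs is made up to produce a 30-minute wait. The `start_delay` helper is hypothetical and does not reflect the actual job queue implementation.

```python
def start_delay(queue: list[str], job_type: str, seconds_per_job: float = 1.0) -> float:
    """Seconds until the first job of `job_type` starts on one FIFO worker."""
    elapsed = 0.0
    for queued_type in queue:
        if queued_type == job_type:
            return elapsed
        elapsed += seconds_per_job  # this job must finish before the next starts
    raise ValueError(f"{job_type} is not in the queue")


# Assumed spike: 1800 cache update jobs arrive just before one translate job.
spike = ["LocalGlobalUserPageCacheUpdateJob"] * 1800

# Shared low-traffic queue: the translate job waits behind the whole spike.
shared_queue = spike + ["TranslationsUpdateJob"]
shared_delay = start_delay(shared_queue, "TranslationsUpdateJob")

# After the move: the spike sits in its own queue and no longer blocks others.
low_traffic_queue = ["TranslationsUpdateJob"]
separate_delay = start_delay(low_traffic_queue, "TranslationsUpdateJob")

print(f"shared queue: {shared_delay / 60:.0f} min, separate queues: {separate_delay:.0f} s")
```

The translate job itself is unchanged in both cases; only the work queued ahead of it differs, which is why moving the spiky job out is enough.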

Change 640446 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue.

https://gerrit.wikimedia.org/r/640446

Xiplus added a subscriber: Xiplus. Sun, Nov 15, 2:48 PM

Change 640446 merged by jenkins-bot:
[operations/deployment-charts@master] JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue.

https://gerrit.wikimedia.org/r/640446

Pchelolo closed this task as Resolved. Tue, Nov 17, 4:40 PM

This issue should stop occurring now. Please reopen if it comes back.