
Translate job execution time seems to have significantly increased
Closed, Resolved · Public · BUG REPORT

Description

For several weeks, I have noticed a huge delay with Translate jobs on Meta-Wiki.

Reproduction steps:

  1. Edit a translatable page, mark the new version for translation, and observe the time it takes for FuzzyBot to update the translation pages.
  2. Edit a translation unit (e.g. using the Tux interface) and observe the time it takes for the edit to be reflected on the translation page.

What happens?:

  1. It can take more than 120 seconds (e.g. between the mark action in the log and the FuzzyBot edit).
  2. It can take more than 60 seconds (e.g. between the unit edit and the final translation page edit).

I did not immediately report this issue because I thought it was due to a temporary server overload, but since it has persisted for several weeks, it is not “temporary”.

I think this issue appeared in August, but I am not sure. I am only sure I did not notice it in April.

Is there a way to get a job performance graph for the last 6 months, in order to check whether this is actually a regression?

Event Timeline

The increase in execution time for MessageUpdateJob does look suspicious, but it shouldn't affect case #2. https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=RenderTranslationPageJob&from=now-90d&to=now looks okay, with the average being ~1.5 seconds (though there are many outliers, judging by the p99 graph).

The changes for MessageUpdateJob seem to have started around August 15-16th, which could correspond to 1.39.0-wmf.25. We did some small maintenance for Translate jobs in that branch (https://www.mediawiki.org/wiki/MediaWiki_1.39/wmf.25#Translate), but nothing there should cause a performance regression.

The graph was showing times for TTMServerMessageUpdateJob in addition to MessageUpdateJob. The increase there makes sense, is known, and doesn't affect FuzzyBot. The graph shows no change for MessageUpdateJob.

Looking at the first example in Logstash (reqId: 9b010468-b15c-4128-80fd-b24b9bf83305), I see that the page was marked around Sep 23, 2022 @ 19:37:55.685 and the first log entry from `UpdateTranslatablePageJob` is at Sep 23, 2022 @ 19:39:08.748. That job spawned RenderTranslationPageJob, which finished at Sep 23, 2022 @ 19:40:20.998.

This seems to indicate the delay was in the JobQueue. This graph seems to confirm it (the average delay is about a minute): https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=UpdateTranslatablePageJob&from=now-90d&to=now&viewPanel=5
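As a rough illustration, here is a minimal sketch that just subtracts the timestamps quoted above (treating them as wall-clock times in the same timezone is an assumption). It shows that roughly half of the ~145 s end-to-end time passes before the first job even starts, which points at queueing rather than execution:

```python
from datetime import datetime

# Timestamps copied from the Logstash entries quoted above.
fmt = "%Y-%m-%d %H:%M:%S.%f"
marked = datetime.strptime("2022-09-23 19:37:55.685", fmt)        # page marked for translation
update_start = datetime.strptime("2022-09-23 19:39:08.748", fmt)  # first UpdateTranslatablePageJob log entry
render_done = datetime.strptime("2022-09-23 19:40:20.998", fmt)   # RenderTranslationPageJob finished

# Gap before the first job ran: ~73 s spent waiting in the queue.
print((update_start - marked).total_seconds())  # 73.063
# Total time until the translation pages were re-rendered: ~145 s.
print((render_done - marked).total_seconds())   # 145.313
```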

We renamed these jobs recently, and the old names are still used in https://gerrit.wikimedia.org/g/operations/deployment-charts/+/d5052e6d8bb1b89393b993e9b59762fe5db8bcb1/helmfile.d/services/changeprop-jobqueue/values.yaml#107, so this looks like a regression from that rename. I believe that once we update those names, we should be back to the previous delays.
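For anyone checking this in a local checkout of operations/deployment-charts, a quick (and admittedly naive) sketch is to look for the current job names in the linked values file; the local path below is an assumption, and a plain substring check is only a first approximation of whether a dedicated rule exists:

```python
from pathlib import Path

# Path relative to a checkout of operations/deployment-charts (assumed layout).
values = Path("helmfile.d/services/changeprop-jobqueue/values.yaml").read_text()

for job in ("UpdateTranslatablePageJob", "RenderTranslationPageJob"):
    if job in values:
        print(f"{job}: mentioned, presumably keeps its dedicated rules")
    else:
        print(f"{job}: not mentioned, so it falls back to the shared low-traffic handling")
```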

Change 835589 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[operations/deployment-charts@master] Update Translate job names

https://gerrit.wikimedia.org/r/835589


You've already got the right fix in that review, good stuff! Just to confirm: when the job names changed, they no longer had the concurrency rules or a dedicated worker applied to them, so they fell into the low_traffic jobs bucket, which has much more contested concurrency and fewer guarantees, hence the slower times.
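In other words, the rules are matched by exact job name, so a renamed job silently loses its dedicated treatment. A toy sketch of that fallback behaviour (the rule names and numbers here are invented for illustration; the real changeprop-jobqueue routing lives in the helmfile values and is more involved than a dict lookup):

```python
# Illustrative only: keys and values are placeholders, not the real configuration.
DEDICATED_RULES = {
    "OldTranslateJobName": {"concurrency": 10, "worker": "dedicated"},  # placeholder for a pre-rename entry
}
LOW_TRAFFIC_RULE = {"concurrency": "shared", "worker": "low_traffic"}

def rule_for(job_name: str) -> dict:
    # Exact-name lookup: after the rename, the new names no longer match their
    # dedicated rules and fall back to the shared low_traffic bucket.
    return DEDICATED_RULES.get(job_name, LOW_TRAFFIC_RULE)

print(rule_for("UpdateTranslatablePageJob"))   # -> low_traffic until values.yaml is updated
print(rule_for("RenderTranslationPageJob"))    # -> low_traffic: contested concurrency, slower dequeue
```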

Change 835589 merged by jenkins-bot:

[operations/deployment-charts@master] Update Translate job names

https://gerrit.wikimedia.org/r/835589

I have been monitoring Logstash and Grafana since the deployment, and after things stabilized, everything started looking as expected.

Many thanks for this quick, deep and effective investigation! 🙂