Jobs are not getting executed or executed really slowly
Open, High, Public

Description

Judging by the recent bug reports (see subtasks), various jobs are either not being executed or are being executed slowly.

https://grafana-rw.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 does not seem to have any meaningful metrics.

This is a catch-all task to track all the various issues related to the poorly performing job queue.

Also the message processing backlog graph does not look good.

Event Timeline

Urbanecm_WMF triaged this task as Unbreak Now! priority. Feb 22 2021, 9:04 PM

Preliminarily prioritising this as UBN, unless we figure out that it is not as serious as it looks.

We need to delete the dashboard you've referenced; it's outdated. https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 is the correct one.

The backlog of various jobs indeed is growing.

So, the processMediaModeration job should be excluded: its backlog is expected to grow this way. It runs in its own queue and does not affect others, so even though it looks scary, it's normal.

Same with cirrusSearchCheckerJob.

The recent drastic increase in LocalGlobalUserPageCacheUpdateJob is interesting, but we've already been through this process with it and separated it into its own queue. It seems the queue size is insufficient; we can bump it.

cirrusSearchIncomingLinkCount is lagging a bit, but that's normal for it.

Other than that, everything seems to be OK.

There are two reports of jobs not being processed above, one being massmessage and the other one being growth jobs. Maybe we are running out of capacity because of depooling job runners for the buster upgrade?

Job runners are already 100% on buster.

Looking at the dashboard, the job insert rate is much higher than the processing rate. Is it just backlogged/out of capacity, or is it not executing jobs/job types at all?
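
To make that distinction concrete, here is a minimal sketch of how the two cases could be told apart per job type, assuming per-type insert and processing rates can be read off the dashboard. The function names, the placeholder job types, and all numbers are hypothetical and for illustration only; they are not taken from the real metrics.

```python
def classify(insert_rate: float, process_rate: float) -> str:
    """Classify one job type from its insert and processing rates (jobs/s)."""
    if insert_rate > 0 and process_rate == 0:
        # Jobs keep arriving but none complete: the type is not executing at all.
        return "not executing"
    if insert_rate > process_rate:
        # Jobs do complete, but more slowly than they arrive.
        return "backlogged"
    return "keeping up"


def backlog_growth(insert_rate: float, process_rate: float, seconds: float) -> float:
    """Jobs added to the backlog over `seconds`, assuming constant rates."""
    return max(0.0, (insert_rate - process_rate) * seconds)


# Made-up rates in jobs per second (insert, process); not real measurements.
rates = {
    "jobTypeA": (50.0, 50.0),
    "jobTypeB": (20.0, 5.0),
    "jobTypeC": (3.0, 0.0),
}

for job_type, (inserted, processed) in rates.items():
    status = classify(inserted, processed)
    growth = backlog_growth(inserted, processed, 3600)
    print(f"{job_type}: {status}, backlog grows by {growth:.0f} jobs per hour")
```

A type that is merely backlogged still shows a non-zero processing rate, while a type that is not executing shows inserts with no completions, which is why the check for a zero processing rate comes first.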

The recent drastic increase in LocalGlobalUserPageCacheUpdateJob is interesting, but we've already been through this process with it and separated it into its own queue. It seems the queue size is insufficient; we can bump it.

This is mostly dependent upon editing activity on Meta's User namespace.

This ticket is not a UBN. We've had issues with mass message before, and there's no indication of widespread job failures.

Feel free to deprioritize it. The graphs I linked earlier (and that are outdated, as you noted) scared me, so I preliminarily set the priority to UBN. If you think this is not UBN, I have no objections to decreasing the priority (or resetting it to Needs Triage).

Pchelolo lowered the priority of this task from Unbreak Now! to Needs Triage. Feb 22 2021, 9:36 PM

FYI: T218511 is on a 3-hour (sic!) delay on Meta now.

OK, restarting the jobqueue change propagation service might have resolved the problem. I will investigate the root cause.

FYI: T218511 is on a 3-hour (sic!) delay on Meta now.

Stopped right now.