Jobs are not getting executed or executed really slowly
Closed, Resolved, Public

Assigned To

Authored By

	Urbanecm_WMF
	Feb 22 2021, 9:01 PM

Description

Judging by the recent bug reports (see subtasks), various jobs are either not being executed or are being executed slowly.

https://grafana-rw.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 does not seem to have any meaningful metrics.

This is a catch-all task to track the various issues related to the poorly performing job queue.

Also, the message processing backlog graph does not look good.

Related Objects

Status	Assigned	Task
Resolved	• Pchelolo	T275437 Jobs are not getting executed or executed really slowly
Resolved	None	T275429 Homepage mentor is not stored persistently at Romanian Wikipedia
Resolved	Urbanecm_WMF	T275480 MentorPageMentorManager::setMentorForUser should not use job queue for POSTs
Resolved	Urbanecm_WMF	T275481 MentorPageMentorManager::setMentorForUser must invalidate MentorPageMentorManager's in-process cache
Open	None	T275432 MassMessage not delivering

Event Timeline

Urbanecm_WMF created this task. Feb 22 2021, 9:01 PM

Restricted Application added a subscriber: Aklapper. Feb 22 2021, 9:01 PM

Urbanecm_WMF updated the task description. Feb 22 2021, 9:01 PM

Urbanecm_WMF added a subtask: T275429: Homepage mentor is not stored persistently at Romanian Wikipedia.

Urbanecm_WMF added a subtask: T275432: MassMessage not delivering.

Urbanecm_WMF updated the task description.

Some scary graphs:

Preliminarily prioritising this as UBN, unless we figure out that this is not as serious as it looks.

We need to delete the dashboard you've referenced; it's outdated. https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 is the correct one.

The backlog of various jobs is indeed growing.

taavi subscribed. Feb 22 2021, 9:05 PM

kostajh subscribed. Feb 22 2021, 9:06 PM

So, the processMediaModeration job should be excluded: its backlog is expected to grow this way. It runs in its own queue and does not affect the others, so even though it looks scary, it's normal.

Same with cirrusSearchCheckerJob.

The recent drastic increase in LocalGlobalUserPageCacheUpdateJob is interesting, but we've been through this process with it before and separated it into its own queue. It seems the queue size is insufficient; we can bump it.

cirrusSearchIncomingLinkCount is lagging a bit, but that's normal for it.

Other than that, everything seems to be OK.

Urbanecm mentioned this in T275438: Clean graphana graphs about job queue. Feb 22 2021, 9:11 PM

There are two reports of jobs not being processed above, one about MassMessage and the other about Growth jobs. Maybe we are running out of capacity because of depooling job runners for the buster upgrade?

In T275437#6850546, @Ladsgroup wrote:

Maybe we are running out of capacity because of depooling job runners for the buster upgrade?

Job runners are already 100% buster.

That rules that out. Thanks.

Looking at the dashboard, the job insert rate is much higher than the processing rate. Is it just backlogged/out of capacity, or is it not executing some jobs/job types at all?
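To make the distinction in that question concrete, here is a minimal sketch of how per-job-type rates could be triaged. Everything in it is hypothetical: the `JobRates` type, the `classify` and `backlog_growth` helpers, and the idea of feeding in rates read off the dashboard are illustrative assumptions, not part of the job queue or of Grafana.

```python
from dataclasses import dataclass


@dataclass
class JobRates:
    """Rates for one job type, e.g. read manually off a dashboard (hypothetical)."""
    job_type: str
    insert_rate: float   # jobs inserted per second
    process_rate: float  # jobs processed per second


def classify(rates: JobRates) -> str:
    """Separate 'not executing at all' from 'executing, but falling behind'."""
    if rates.insert_rate > 0 and rates.process_rate == 0:
        return "stalled"      # jobs arrive but none are processed
    if rates.insert_rate > rates.process_rate:
        return "backlogged"   # processed, but slower than they arrive
    return "keeping up"


def backlog_growth(rates: JobRates, seconds: float) -> float:
    """Estimated number of jobs added to the backlog over the given window."""
    return max(0.0, (rates.insert_rate - rates.process_rate) * seconds)
```

A job type classified as "stalled" would point to an execution problem, while "backlogged" would point to insufficient capacity or queue size. Job types whose backlog is expected to grow would need to be excluded before drawing conclusions.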

IKhitron subscribed. Feb 22 2021, 9:21 PM

In T275437#6850519, @Pchelolo wrote:

The recent drastic increase in LocalGlobalUserPageCacheUpdateJob is interesting, but we've been through this process with it before and separated it into its own queue. It seems the queue size is insufficient; we can bump it.

This is mostly dependent upon editing activity on Meta's User namespace.

RhinosF1 subscribed. Feb 22 2021, 9:26 PM

This ticket is not a UBN. We've had issues with MassMessage before, and there's no indication of widespread job failures.

In T275437#6850596, @Pchelolo wrote:

This ticket is not a UBN. We've had issues with MassMessage before, and there's no indication of widespread job failures.

Feel free to deprioritize it. The graphs I linked earlier (which are outdated, as you noted) scared me, so I preliminarily set the priority to UBN. If you think this is not UBN, I have no objections to decreasing the priority (or resetting it to Needs Triage).

• Pchelolo lowered the priority of this task from Unbreak Now! to Needs Triage. Feb 22 2021, 9:36 PM

In T275437#6850610, @Urbanecm_WMF wrote:

Feel free to deprioritize it. The graphs I linked earlier (which are outdated, as you noted) scared me, so I preliminarily set the priority to UBN. If you think this is not UBN, I have no objections to decreasing the priority (or resetting it to Needs Triage).

FYI: T218511 has a 3-hour (sic!) delay on Meta now.

OK, restarting the jobqueue change propagation service might have resolved the problem. I will investigate the root cause.

In T275437#6850970, @IKhitron wrote:

FYI: T218511 has a 3-hour (sic!) delay on Meta now.

Stopped right now.

Legoktm triaged this task as High priority. Feb 23 2021, 12:30 AM

Addshore subscribed. Feb 23 2021, 8:31 AM

• Elitre subscribed. Feb 23 2021, 10:25 AM

kostajh mentioned this in T275429: Homepage mentor is not stored persistently at Romanian Wikipedia. Feb 23 2021, 12:50 PM

Quiddity subscribed. Feb 23 2021, 8:52 PM

Etonkovidova closed subtask T275429: Homepage mentor is not stored persistently at Romanian Wikipedia as Resolved. Feb 24 2021, 4:48 PM

R4356th subscribed. Feb 25 2021, 2:24 PM

In T275437#6850975, @Pchelolo wrote:

OK, restarting the jobqueue change propagation service might have resolved the problem. I will investigate the root cause.

Was there time for this? Is this still high priority nowadays?

It hasn't happened again in half a year.

	F34119171: image.png
	Feb 22 2021, 9:03 PM

	F34119169: image.png
	Feb 22 2021, 9:03 PM
