
enwiki's job is about 28m atm and increasing
Closed, Resolved · Public

Event Timeline

Glaisher raised the priority of this task from to Needs Triage.
Glaisher updated the task description.
Glaisher subscribed.
Glaisher renamed this task from enwiki's job is about 22m and increasing to enwiki's job is about 22m atm and increasing. (May 8 2015, 5:24 PM)
Glaisher triaged this task as High priority.
Glaisher updated the task description.
Glaisher added a project: WMF-JobQueue.
Glaisher set Security to None.
Glaisher removed a subscriber: Betacommand.
Apr 30 13:50:29 <Krenair>	Someone just pointed out in tech that enwiki has a ridiculously large job queue at the moment
Apr 30 16:44:53 <legoPanda>	Betacommand: I see 10 million refreshlinks jobs???
Apr 30 21:49:58 <legoPanda>	AaronSchulz: do you know why enwiki has 11m refreshLinks jobs queued?
May 07 16:25:20 <T13|mobile>	[16:24:33] There's concerns that the enwp job queue is stuck since it's growing so much and pushing 20 million. Can someone peek and poke at it as needed?
May 07 16:25:20 <T13|mobile>	[16:25:03] <MatmaRex> T13|mobile: i'd wager this is fallout from last saturday, when someone accidentally disabled the job queue
May 07 16:32:03 <T13|mobile>	"jobs": 19977207
May 07 16:32:58 <MatmaRex>	T13|mobile: just to reassure you, the job queue is (probably) working again, it was broken only for a short while
May 07 16:35:14 <legoktm>	T13|mobile: looks like they're all refreshLinks jobs
May 07 16:37:00 <MatmaRex>	T13|mobile: some jobs actually generate more jobs when executed :D
May 07 16:38:40 <legoktm>	well, it's executing jobs
May 07 16:40:36 <MatmaRex>	T13|mobile: for example: (simplifying, since i don't know exactly how it works) say you edit a template used on 200 000 pages. rather than generate 200 000 jobs to update the pages immediately, which itself would take a long time, MediaWiki instead generates (say) 100 jobs, each of which generates 2000 jobs, each of which actually updates a page.
May 07 16:45:22 <legoktm>	T13|mobile: don't complain about job queue length when you're the one who made it so long! :P
May 07 17:01:23 <manybubbles>	oh my that is a lot of jobs
May 07 18:15:52 <T13|away>	legoktm: would my guess that part of the reason the jobqueue is still ever expanding might be related to SULF?
May 07 18:39:18 <T13|away>	[18:15:52] legoktm: would my guess that part of the reason the jobqueue is still ever expanding might be related to SULF?
May 08 18:16:16 <Betacommand>	Are ops aware of the enwiki job queue issue?
May 08 18:19:45 <Glaisher>	 "jobs": 21894746,
May 08 18:20:38 <Krenair>	the other job types seem relatively low
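To illustrate the fan-out MatmaRex describes above: when a heavily-transcluded template is edited, MediaWiki enqueues a small number of jobs that partition the template's backlinks into ranges and enqueue further batched jobs, rather than queueing one job per affected page up front. A minimal sketch of that idea, assuming the standard BacklinkCache/JobQueueGroup APIs; batch sizes and parameter names are illustrative, not the actual RefreshLinksJob code:

```
// Sketch only: how one template edit fans out into many refreshLinks jobs.
function enqueueRefreshLinksForTemplate( Title $template ) {
	$backlinkCache = $template->getBacklinkCache();
	// Split the (possibly huge) set of pages embedding the template into
	// ranges; each range becomes one batched job instead of queueing one
	// job per page immediately.
	$batches = $backlinkCache->partition( 'templatelinks', 100 ); // 100 titles per batch (illustrative)

	$jobs = array();
	foreach ( $batches as $batch ) {
		list( $start, $end ) = $batch;
		$jobs[] = new JobSpecification(
			'refreshLinks',
			array( 'table' => 'templatelinks', 'range' => array( 'start' => $start, 'end' => $end ) ),
			array( 'removeDuplicates' => true ),
			$template
		);
	}
	JobQueueGroup::singleton()->push( $jobs );
}
```

Each batched job then reparses its share of pages (or splits further), which is why the queue can keep growing for a while after the original edits.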

<Krenair> This has been going on since April 30th at least

I took a quick look at the job runners, and they seem to be running fine, without being starved of resources. Notably, it appears to be only the refreshLinks jobs that are piling up.

Krenair renamed this task from enwiki's job is about 22m atm and increasing to enwiki's job is about 23m atm and increasing. (May 9 2015, 4:09 PM)
Krenair raised the priority of this task from High to Unbreak Now!.

Change 209719 had a related patch set uploaded (by Aaron Schulz):
Increase jobrunner::runners_basic

https://gerrit.wikimedia.org/r/209719

Change 209852 had a related patch set uploaded (by Aaron Schulz):
Made triggerOpportunisticLinksUpdate() jobs make use of parser cache

https://gerrit.wikimedia.org/r/209852
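Roughly what "make use of parser cache" means here, as a hedged sketch rather than the actual patch: before doing a full reparse, a links-refresh job can check the parser cache and reuse a still-valid ParserOutput, so the expensive parse is skipped when a cached rendering already exists. The class and method names below are stock MediaWiki; the surrounding structure is assumed for illustration:

```
// Sketch: reuse the parser cache when refreshing link tables for $title.
$page = WikiPage::factory( $title );
$parserOptions = $page->makeParserOptions( 'canonical' );

// Try the parser cache first; only reparse on a miss.
$output = ParserCache::singleton()->get( $page, $parserOptions );
if ( !$output ) {
	$content = $page->getContent( Revision::RAW );
	$output = $content->getParserOutput( $title, $page->getLatest(), $parserOptions );
}

// Update the link tables from whichever ParserOutput we ended up with.
$update = new LinksUpdate( $title, $output );
$update->doUpdate();
```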

Change 209877 had a related patch set uploaded (by Aaron Schulz):
Removed duplicated jobs in triggerOpportunisticLinksUpdate()

https://gerrit.wikimedia.org/r/209877
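For context on the deduplication above (again a sketch under assumptions, not the actual change): jobs pushed with the removeDuplicates option allow the queue to drop an identical job that is already pending instead of enqueuing it a second time.

```
// Sketch: push a refreshLinks job that the queue may deduplicate against
// an identical pending job. Job parameters are omitted/illustrative.
$job = new JobSpecification(
	'refreshLinks',
	array(),                              // job parameters (omitted here)
	array( 'removeDuplicates' => true ),  // let the queue discard duplicates
	$title
);
JobQueueGroup::singleton()->push( $job );
```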

EoRdE6 renamed this task from enwiki's job is about 23m atm and increasing to enwiki's job is about 24m atm and increasing. (May 9 2015, 9:13 PM)
EoRdE6 subscribed.

Now past 25.6m. I made edits to templates as far back as April 19th that haven't filtered through to the articles yet.

Change 209719 merged by Ori.livneh:
Increase jobrunner::runners_basic

https://gerrit.wikimedia.org/r/209719

3gg5amp1e renamed this task from enwiki's job is about 24m atm and increasing to enwiki's job is about 28m atm and increasing. (May 11 2015, 6:47 PM)
3gg5amp1e subscribed.

"jobs": 27803968

Change 209852 merged by jenkins-bot:
Made triggerOpportunisticLinksUpdate() jobs make use of parser cache

https://gerrit.wikimedia.org/r/209852

Change 210243 had a related patch set uploaded (by Aaron Schulz):
Made triggerOpportunisticLinksUpdate() jobs make use of parser cache

https://gerrit.wikimedia.org/r/210243

Change 210244 had a related patch set uploaded (by Aaron Schulz):
Made triggerOpportunisticLinksUpdate() jobs make use of parser cache

https://gerrit.wikimedia.org/r/210244

Change 209877 merged by jenkins-bot:
Removed duplicated jobs in triggerOpportunisticLinksUpdate()

https://gerrit.wikimedia.org/r/209877

Change 210244 merged by jenkins-bot:
Made triggerOpportunisticLinksUpdate() jobs make use of parser cache

https://gerrit.wikimedia.org/r/210244

Change 210243 merged by jenkins-bot:
Made triggerOpportunisticLinksUpdate() jobs make use of parser cache

https://gerrit.wikimedia.org/r/210243

Change 210246 had a related patch set uploaded (by Aaron Schulz):
Bumped the $wgJobBackoffThrottling refreshLinks limit

https://gerrit.wikimedia.org/r/210246
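For reference, $wgJobBackoffThrottling maps a job type to how many jobs of that type each runner may execute per second, so bumping the refreshLinks entry lets the runners drain the backlog faster. A hedged example; the value shown is illustrative, not the actual production setting:

```
// Illustrative value only; the real production limit is not quoted in this task.
$wgJobBackoffThrottling = array(
	'refreshLinks' => 20, // max refreshLinks jobs per second, per runner
);
```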

Well, for sure job runners are working harder now:

graph.php.png (257×397 px, 21 KB)

Queues on most wikis approach 0 or are in the thousands. s1 databases don't seem to have suffered from the bump, or at least there's no visible change in the graphs other than a brief (unrelated?) jump in mysql_innodb_buffer_pool_pages_dirty https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&tab=ch&vn=&hide-hf=false&hreg[]=db10%2852|51|55|57|65|66|72|73%29

The job queue seems to have begun dropping slowly, though it's still near 29 million jobs.

"mwscript showJobs.php enwiki --group" shows it as still going up
Edit: And not long after I said that, I looked again and it had gone down. Okay then...

Change 210611 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210611

Change 210610 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210610

Change 210246 merged by Chad:
Bumped the $wgJobBackoffThrottling refreshLinks limit

https://gerrit.wikimedia.org/r/210246

Change 210610 merged by jenkins-bot:
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210610

Change 210611 merged by jenkins-bot:
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210611

Now en.wiki is down to a mere 21 million. According to https://wikiapiary.com/wiki/Wikipedia_%28en%29, it started dropping this morning at 8 UTC; at this rate, the queue should be drained in a matter of hours.

now down to 78 jobs! do we consider this done?

ori claimed this task.

@ArielGlenn how are you coming up with 78 jobs? I haven't seen it go below 1.15 million. I still consider it done, but it's still a little higher than normal according to the graphs.

And, for what it's worth, there are still template edits from as far back as April 19th that haven't filtered through. I don't know in what order it works through the jobs, but I would have assumed oldest to newest.

@Mlaffs You would have assumed very wrong... From my understanding, the job queue is not a linear, easy-to-follow thing. Changing a template with 50K transclusions does not mean 50K jobs get added in any particular order. It actually creates a job that, based on a bunch of different factors and variables, creates further jobs; those in turn decide what jobs are needed and in what order to run them. Once one of those deciding jobs finishes, it creates still more jobs to check whether the work is actually done or needs to be run again, and it makes more jobs based on that, including a job that reorders all the jobs... Or something like that...

@Technical, I was watching the estimate provided at the en.wp link I mentioned above. Granted, it's only an estimate, but still.

I thought that was no longer an estimate when using the Redis job queue.

> And, for what it's worth, there are still template edits from as far back as April 19th that haven't filtered through. I don't know in what order it works through the jobs

If you check the graph at https://wikiapiary.com/wiki/Wikipedia_%28en%29 for exact timings, you'll see that 27 million jobs were consumed in just 19 hours; that's probably the effect of rMWba91f0a2d339, which skipped certain "redundant" jobs. This bug was resolved once the abnormal mass of jobs had been removed.

Over the next 24 hours the decrease was about 600k, so we're back to business as usual, even though there is still some backlog to work through.
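A plausible sketch of how "redundant" jobs can be skipped, assuming the usual rootJobTimestamp mechanism; this is an inference, not the content of rMWba91f0a2d339: if the page was already touched/reparsed after the root job that spawned this one was enqueued, the job can return early without doing the work.

```
// Sketch (assumption): skip a refreshLinks job whose work is already done.
$page = WikiPage::factory( $title );
$rootTimestamp = isset( $params['rootJobTimestamp'] )
	? $params['rootJobTimestamp']
	: null;

if ( $rootTimestamp !== null && $page->getTouched() >= $rootTimestamp ) {
	// The page was (re)parsed after the change that queued this job,
	// so redoing the links update would be redundant.
	return true; // report success without reparsing
}
```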

Yes, click the edit task button in the top right hand corner, @Wbm1058.

Current large job queue types, as of a few minutes ago:

ParsoidCacheUpdateJobOnDependencyChange: 10688
refreshLinks: 65786
cirrusSearchLinksUpdate: 113788
RestbaseUpdateJobOnDependencyChange: 91488