JobQueueDB::claimRandom very slow with lots of jobs, takes half of the total time for queries
Closed, ResolvedPublic

Description

mysql> EXPLAIN SELECT /* JobQueueDB::claimRandom 127.0.0.1 */ * FROM job WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = '' AND (job_random <= '2108540693') ORDER BY job_random DESC LIMIT 1;
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
| id | select_type | table | type | possible_keys                          | key           | key_len | ref         | rows  | Extra       |
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | job   | ref  | job_cmd,job_cmd_token,job_cmd_token_id | job_cmd_token | 96      | const,const | 43440 | Using where |
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
1 row in set (0.00 sec)

mysql> SELECT /* JobQueueDB::claimRandom 127.0.0.1 */ * FROM job WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = '' AND (job_random <= '2108540693') ORDER BY job_random DESC LIMIT 1;
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
| job_id | job_cmd                      | job_namespace | job_title                                              | job_timestamp  | job_params | job_random | job_token | job_token_timestamp | job_sha1                        | job_attempts |
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
| 447774 | MessageGroupStatesUpdaterJob | 1202          | Konsolekalendar-e2ca93-Add_location_to_incidence_(f/en | 20121201204852 | a:0:{}     | 2108426131 |           | NULL                | f78vrjeo5e8f17ackthos9uhjzw686y | 0            |
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
1 row in set (3.62 sec)

It's going to take ages to clear all the jobs if picking each one takes 3 seconds.
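The EXPLAIN above shows why: the job_cmd_token index covers (job_cmd, job_token) but not job_random, so MySQL can jump to the matching rows but must then read and sort all 43440 of them to satisfy ORDER BY ... LIMIT 1. The effect can be sketched with SQLite's EXPLAIN QUERY PLAN (SQLite stands in for MySQL purely for illustration; the schema is a simplified, hypothetical version of the job table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (
    job_id     INTEGER PRIMARY KEY,
    job_cmd    TEXT NOT NULL,
    job_token  TEXT NOT NULL DEFAULT '',
    job_random INTEGER NOT NULL
);
CREATE INDEX job_cmd_token ON job (job_cmd, job_token);
""")

QUERY = ("SELECT * FROM job WHERE job_cmd = ? AND job_token = ? "
         "AND job_random <= ? ORDER BY job_random DESC LIMIT 1")

def plan():
    # Concatenate the "detail" column of each EXPLAIN QUERY PLAN row.
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY,
                        ("MessageGroupStatesUpdaterJob", "", 2108540693))
    return " | ".join(r[-1] for r in rows)

# With only (job_cmd, job_token) indexed, the ORDER BY forces a sort step.
p1 = plan()

# Extending the index with job_random lets the index itself deliver the order.
conn.executescript("""
DROP INDEX job_cmd_token;
CREATE INDEX job_cmd_token_random ON job (job_cmd, job_token, job_random);
""")
p2 = plan()

print(p1)  # includes "USE TEMP B-TREE FOR ORDER BY"
print(p2)  # searches job_cmd_token_random, no sort step
```

The analogous MySQL-side option would be an index ending in job_random, so the range condition and the ORDER BY are both served by an index scan instead of a filesort.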

php maintenance/showJobs.php
86324


Version: 1.21.x
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=42715

bzimport set Reference to bz42614.
Nikerabbit created this task. Via Legacy, Dec 1 2012, 10:08 PM
Nikerabbit added a comment. Via Conduit, Dec 2 2012, 8:25 AM

It has now been running overnight:
php maintenance/showJobs.php
67285

Not good.

Nemo_bis added a comment. Via Conduit, Dec 3 2012, 11:09 AM

I don't know if it's the same bug (Niklas said "possibly"), but it seems to also affect Wikimedia projects, as reported by Seddon (I guess his translation notifications on Meta are stuck again).
Since the 29th very few new jobs have run: on en.wiki alone the queue rose from 40k to 150k items, and the load on the job runners decreased from about 100 to about 50.
https://gdash.wikimedia.org/dashboards/jobq/deploys
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=506&m=enwiki_JobQueue_length
http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Jobrunners+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

(I suppose this deserves highest priority, and it now needs an assignee?)

aaron added a comment. Via Conduit, Dec 3 2012, 8:22 PM

According to https://ishmael.wikimedia.org for enwiki, that query does not stand out in terms of average (or even p95) query time. If there were a problem with it being slow, I would expect gdash to show the same queue rate but a lower run rate.

Nemo_bis added a comment. Via Conduit, Dec 3 2012, 8:41 PM

(In reply to comment #3)

According to https://ishmael.wikimedia.org for enwiki, that query does not stand out in terms of average (or even p95) query time. If there were a problem with it being slow, I would expect gdash to show the same queue rate but a lower run rate.

Are you saying that the problem on Wikimedia wikis is another bug? What number?

aaron added a comment. Via Conduit, Dec 3 2012, 10:03 PM

I am getting:
PHP Notice: Trying to get property of non-object in /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php on line 115

trying to run jobs manually.

aaron added a comment. Via Conduit, Dec 3 2012, 10:08 PM

I added exception details to cli mode and added an exception there:

No row object provided.
Backtrace:
#0 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(59): CentralAuthUser::newFromRow(false)
#1 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(78): CentralAuthUserArrayFromResult->setCurrent(Object(stdClass))
#2 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(21): UserArrayFromResult->__construct(Object(ResultWrapper))
#3 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(10): CentralAuthUserArrayFromResult->__construct(Object(ResultWrapper))
#4 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthHooks.php(400): CentralAuthUserArray::newFromResult(Object(ResultWrapper))
#5 [internal function]: CentralAuthHooks::onUserArrayFromResult(NULL, Object(ResultWrapper))
#6 /home/wikipedia/common/php-1.21wmf5/includes/Hooks.php(253): call_user_func_array('CentralAuthHook...', Array)
#7 /home/wikipedia/common/php-1.21wmf5/includes/GlobalFunctions.php(3832): Hooks::run('UserArrayFromRe...', Array)
#8 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(30): wfRunHooks('UserArrayFromRe...', Array)
#9 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(52): UserArray::newFromResult(Object(ResultWrapper))
#10 /home/wikipedia/common/php-1.21wmf5/includes/UserMailer.php(552): UserArray::newFromIDs(Array)
#11 /home/wikipedia/common/php-1.21wmf5/includes/job/jobs/EnotifNotifyJob.php(53): EmailNotification->actuallyNotifyOnPageChange(Object(User), Object(Title), '20121203220703', '::I have to say...', 0, 526235288, Array)
#12 /home/wikipedia/common/php-1.21wmf5/maintenance/runJobs.php(83): EnotifNotifyJob->run()
#13 /home/wikipedia/common/php-1.21wmf5/maintenance/doMaintenance.php(110): RunJobs->execute()
#14 /home/wikipedia/common/php-1.21wmf5/maintenance/runJobs.php(116): require_once('/home/wikipedia...')
#15 /home/wikipedia/common/multiversion/MWScript.php(69): require_once('/home/wikipedia...')
#16 {main}

tstarling added a comment. Via Conduit, Dec 4 2012, 12:16 AM

(In reply to comment #5)

I am getting:
PHP Notice: Trying to get property of non-object in /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php on line 115

trying to run jobs manually.

https://gerrit.wikimedia.org/r/#/c/36683/

hoo added a comment. Via Conduit, Dec 4 2012, 12:36 AM

Merged the CentralAuth fix... leaving this one open, as it seems to cover two issues.

aaron added a comment. Via Conduit, Dec 4 2012, 12:40 AM

In any case, the CA thing I noticed is a red herring. More importantly, after tailing the job running log I can see that only 3 of 16 boxes are active, and some processes on those three may be stuck. I think they got gummed up by the itwikisource slowness bug with LST (runners can be stuck on old code even if we fix bugs).

aaron added a comment. Via Conduit, Dec 4 2012, 12:46 AM

(In reply to comment #9)

In any case, the CA thing I noticed is a red herring. More importantly, after tailing the job running log I can see that only 3 of 16 boxes are active, and some processes on those three may be stuck. I think they got gummed up by the itwikisource slowness bug with LST (runners can be stuck on old code even if we fix bugs).

Tim killed those processes and kicked the runners, which seems to have worked per https://gdash.wikimedia.org/dashboards/jobq/.

aaron added a comment. Via Conduit, Dec 4 2012, 12:49 AM

(In reply to comment #3)

If there were a problem with it being slow, I would expect gdash to show the same queue rate but a lower run rate.

Ignore that, for the record: refreshlinks2 spawns refreshlinks jobs, which confounds those rates.

Nikerabbit added a comment. Via Conduit, Dec 4 2012, 7:19 AM

Any comments on the behavior I observe?

Nemo_bis added a comment. Via Conduit, Dec 4 2012, 8:27 AM

Now the job runners are as active as they were on average before the 29th, but the job queue is still increasing steadily, so this is apparently not enough: they're not consuming the backlog as they usually do.
Could this match the behaviour reported by Niklas?

bzimport added a comment. Via Conduit, Dec 5 2012, 10:33 AM

jseddon wrote:

I am wondering if this bug is being caused by the same issues we are seeing in the queue:

https://bugzilla.wikimedia.org/show_bug.cgi?id=42715

Is it possible that the queue is a result of jobs being duplicated?

Nikerabbit added a comment. Via Conduit, Dec 5 2012, 10:40 AM

I'm pretty sure nobody can make heads or tails of this bug anymore. I reported a very specific case where I observe a very slow job run rate because job selection is slow due to table scanning.

This issue might contribute to the general job queue problems we are currently seeing, but all the other issues discussed in this bug are different from this particular one.

Jarry1250 added a comment. Via Conduit, Dec 5 2012, 11:24 AM

Niklas: Could we test the issue you report by temporarily making the queue non-random? Why does it even need to be random in the first place?

Nemo_bis added a comment. Via Conduit, Dec 7 2012, 12:12 PM

(In reply to comment #1)

It has now been running overnight:
php maintenance/showJobs.php
67285

It's still at 67k after 5 more days, by the way.

bzimport added a comment. Via Conduit, Dec 9 2012, 4:08 PM

richardg_uk wrote:

Is this related to [[mw:Extension:GeoData]] affecting the 819,000 enwiki articles transcluding {{Coord}}, or is the timing coincidental?

The new {{#coordinates:}} parser function was added to the template at 2012-12-06 07:30:18 UTC:
http://en.wikipedia.org/w/index.php?title=Template:Coord&diff=526665541&oldid=368854805

Platonides added a comment. Via Conduit, Dec 9 2012, 4:13 PM

Those queries _should_ be fast, just an index traversal. :/

bzimport added a comment. Via Conduit, Dec 9 2012, 4:21 PM

richardg_uk wrote:

(In reply to comment #19)

Those queries _should_ be fast, just an index traversal. :/

But would still cause a one-off reparse of the affected pages to be sent to the job queue?

Nemo_bis added a comment. Via Conduit, Dec 9 2012, 4:22 PM

(In reply to comment #18)

Is this related to [[mw:Extension:GeoData]] affecting the 819,000 enwiki articles transcluding {{Coord}}, or is the timing coincidental?

There isn't any correlation in timing: the jobqueue graph has the same gradient for a whole day after that edit.

hashar added a comment. Via Conduit, Dec 11 2012, 9:28 AM

db39 in production showed JobQueueDB::claimRandom using 21% of query time:

/* JobQueueDB::claimRandom */ SELECT * FROM job WHERE job_cmd = ? AND job_token = ? AND (job_random <= ?) ORDER BY job_random DESC limit ?

And 8% for:
/* JobQueueDB::claimRandom */ SELECT * FROM job WHERE job_cmd = ? AND job_token = ? AND (job_random >= ?) ORDER BY job_random limit ?

So that sums to 29%. Roughly 30% were admin commands, so claimRandom took about half the time of the queries :-/

Nemo_bis added a comment. Via Conduit, Dec 12 2012, 11:53 PM

(In reply to comment #22)

So that sums to 29%. Roughly 30% were admin commands, so claimRandom took about half the time of the queries :-/

Gerrit change 38454.

From #wikimedia-operations:

23.26 < binasher> AaronSchulz: look at the explain on JobQueueDB::claimRandom
23.26 < AaronSchulz> I recall doing that before and not seeing anything interesting
23.26 < binasher> | 1 | SIMPLE | job | ref | job_cmd,job_cmd_token,job_cmd_token_id | job_cmd_token | 291 | const,const | 110700 | Using where |

23.27 < binasher> 110700 = the rows read column
23.29 < apergos> o_O
23.31 < binasher> no wonder the claimRandom calls were killing commons
23.31 < binasher> AaronSchulz: can you drop the order by from that query?
23.32 < MaxSem> mmm, ORDER BY kinda gives us the randomness:)
23.32 < binasher> its not needed for some randomness since the job_random constraint is itself picked randomly
23.32 < binasher> nor is totally random claiming of jobs a requirement
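Asher's point is that the claim can stay effectively random without the ORDER BY, because the job_random threshold is itself drawn at random on each attempt. A hypothetical sketch of such a claim loop (not MediaWiki's actual implementation; SQLite is used here for brevity, and table and runner names are illustrative):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT,
                job_token TEXT NOT NULL DEFAULT '', job_random INTEGER)""")
conn.executemany("INSERT INTO job (job_cmd, job_random) VALUES (?, ?)",
                 [("refreshLinks", random.getrandbits(31)) for _ in range(1000)])

def claim_random(cmd, token):
    """Claim one unclaimed job without ORDER BY: the randomness of the
    threshold itself spreads claims across the table."""
    r = random.getrandbits(31)
    # Try rows above the threshold first, then wrap around below it.
    for cond in ("job_random >= ?", "job_random < ?"):
        row = conn.execute(
            f"SELECT job_id FROM job WHERE job_cmd = ? AND job_token = '' "
            f"AND {cond} LIMIT 1", (cmd, r)).fetchone()
        if row:
            # A real implementation would claim atomically (a conditional
            # UPDATE ... WHERE job_token = '' plus an affected-row check)
            # to cope with concurrent runners.
            conn.execute("UPDATE job SET job_token = ? WHERE job_id = ?",
                         (token, row[0]))
            return row[0]
    return None  # no unclaimed job of this type

job_id = claim_random("refreshLinks", "runner1")
```

Without the sort, the index range scan can stop at the first matching row instead of reading every candidate.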

aaron added a comment. Via Conduit, Dec 13 2012, 12:04 AM

I've always found the rows count from EXPLAIN to be useless since it never accounts for LIMIT (unless something changed recently). A lot of fast queries (deeply paged history views) look slow going by that number.

Technically the ORDER BY is needed to make sure the DB is not always just getting the top/bottom row; testing around shows that this is the case. Dropping it will lead to more races/retries for claiming rows. I'll see if there is some compromise here.

aaron added a comment. Via Conduit, Dec 13 2012, 12:11 AM

(In reply to comment #24)

I've always found the rows count from EXPLAIN to be useless since it never accounts for LIMIT (unless something changed recently). A lot of fast queries (deeply paged history views) look slow going by that number.

Technically the ORDER BY is needed to make sure the DB is not always just getting the top/bottom row; testing around shows that this is the case. Dropping it will lead to more races/retries for claiming rows. I'll see if there is some compromise here.

More work can be done to deal with that but this issue can be closed now.

Platonides added a comment. Via Conduit, Dec 13 2012, 1:30 AM

It makes no sense that MySQL needs to read so many rows with an index, unless the table is really huge or the random numbers aren't really random.

In comment 23 (the expensive query) I see a key_len of 291. That suggests to me that job_cmd is not varbinary(60) on that server but varchar(255).
That change was done in 2007! (f38344aba)

Still, although less efficient, I don't see why that should make the index not be used "correctly". I wonder if we may have slaves with different job_cmd definitions.
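The key_len arithmetic is consistent with this. Assuming both index parts are NOT NULL variable-length columns in a single-byte charset (an assumption worth checking against the actual server), MySQL reports the maximum byte length plus a 2-byte length prefix per part:

```python
def key_len(*varlen_max_bytes):
    """MySQL key_len for an index prefix of NOT NULL VARCHAR/VARBINARY parts:
    max byte length plus a 2-byte length prefix per part (nullable parts
    would add 1 more byte each, not modelled here)."""
    return sum(n + 2 for n in varlen_max_bytes)

# varbinary(60) + varbinary(32), as in the schema: matches the key_len of 96
# in the EXPLAIN from this task's description.
print(key_len(60, 32))   # 96

# A 255-byte job_cmd + varbinary(32): matches the 291 seen in comment 23.
print(key_len(255, 32))  # 291
```

So the 291 fits a 255-byte job_cmd exactly, while the reporter's wiki (key_len 96) has the current varbinary(60) definition.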

Nikerabbit added a comment. Via Conduit, Dec 13 2012, 7:51 AM

Works for me now too, but did we really have to wait 10 days for binasher to raise the issue, while this bug was marked as highest priority (and critical for a couple of days), before it got fixed? Does not compute.

hashar added a comment. Via Conduit, Dec 13 2012, 12:48 PM

The fix was to remove the ORDER BY clause, as stated by Asher (comment 23):

https://gerrit.wikimedia.org/r/#/c/38454/

It has been deployed in 1.21wmf5 and 1.21wmf6 https://gerrit.wikimedia.org/r/#/q/I6ba6cc65d0023de3b9d9df0cfcd6a403289c9848,n,z

Nemo_bis added a comment. Via Conduit, Dec 13 2012, 12:56 PM

Whether related or not, the job queue on en.wiki rose from about 50k to 1.08 million since this bug was reported, but it is now slowly catching up since the query was fixed (queues elsewhere, such as fr.wiki, were truncated).

Nemo_bis added a comment. Via Conduit, Mar 4 2013, 7:31 PM

Still has to be backported to 1.20 (for 1.20.4; it didn't make it to 1.20.3).

Nemo_bis added a comment. Via Conduit, Jun 2 2013, 6:37 PM

(In reply to comment #30)

Still has to be backported to 1.20 (for 1.20.4; it didn't make it to 1.20.3).

I didn't manage to get answers on what would need to be done here, so I'm marking the backport request as rejected.
