
JobQueueDB::claimRandom very slow with lots of jobs, takes half of the total time for queries
Closed, ResolvedPublic

Description

mysql> EXPLAIN SELECT /* JobQueueDB::claimRandom 127.0.0.1 */ * FROM `job` WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = '' AND (job_random <= '2108540693') ORDER BY job_random DESC LIMIT 1;
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
| id | select_type | table | type | possible_keys                          | key           | key_len | ref         | rows  | Extra       |
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | job   | ref  | job_cmd,job_cmd_token,job_cmd_token_id | job_cmd_token | 96      | const,const | 43440 | Using where |
+----+-------------+-------+------+----------------------------------------+---------------+---------+-------------+-------+-------------+
1 row in set (0.00 sec)

mysql> SELECT /* JobQueueDB::claimRandom 127.0.0.1 */ * FROM `job` WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = '' AND (job_random <= '2108540693') ORDER BY job_random DESC LIMIT 1;
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
| job_id | job_cmd                      | job_namespace | job_title                                              | job_timestamp  | job_params | job_random | job_token | job_token_timestamp | job_sha1                        | job_attempts |
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
| 447774 | MessageGroupStatesUpdaterJob |          1202 | Konsolekalendar-e2ca93-Add_location_to_incidence_(f/en | 20121201204852 | a:0:{}     | 2108426131 |           | NULL                | f78vrjeo5e8f17ackthos9uhjzw686y |            0 |
+--------+------------------------------+---------------+--------------------------------------------------------+----------------+------------+------------+-----------+---------------------+---------------------------------+--------------+
1 row in set (3.62 sec)

It's going to take ages to clear all the jobs if picking each one takes 3 seconds.

php maintenance/showJobs.php
86324
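
For reference, a minimal sketch of the indexes the EXPLAIN mentions, assuming the stock MediaWiki 1.21 schema (column order quoted from memory, so verify against maintenance/tables.sql):

CREATE INDEX job_cmd_token    ON job (job_cmd, job_token, job_random);
CREATE INDEX job_cmd_token_id ON job (job_cmd, job_token, job_id);

-- With equality on (job_cmd, job_token) and a range on job_random, the SELECT
-- above should be able to walk job_cmd_token backwards and stop after one row,
-- rather than examining the ~43k rows the EXPLAIN estimates.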


Version: 1.21.x
Severity: critical
See Also:

Details

Reference
bz42614

Event Timeline

bzimport raised the priority of this task from to Unbreak Now!. Nov 22 2014, 1:10 AM
bzimport set Reference to bz42614.

It has now been running overnight:
php maintenance/showJobs.php
67285

Not good.

I don't know if it's the same bug (Niklas said "possibly"), but it seems to affect Wikimedia projects as well, as reported by Seddon (I guess his translation notifications on Meta are stuck again).
Since the 29th very few new jobs have been run: on en.wiki alone the queue rose from 40k to 150k items, and the load on the job runners decreased from about 100 to about 50.
https://gdash.wikimedia.org/dashboards/jobq/deploys
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=506&m=enwiki_JobQueue_length
http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&s=by+name&c=Jobrunners+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4

(I suppose this deserves the highest priority, and it now requires an assignee?)

According to https://ishmael.wikimedia.org for enwiki, that query does not stand out in terms of average (or even p95) query time. If there was a problem with it being slow, I would expect gdash to show the same queue rate but lower run rate.

(In reply to comment #3)

According to https://ishmael.wikimedia.org for enwiki, that query does not
stand out in terms of average (or even p95) query time. If there was a problem
with it being slow, I would expect gdash to show the same queue rate but lower
run rate.

Are you saying that the problem on Wikimedia wikis is another bug? What number?

I am getting:
PHP Notice: Trying to get property of non-object in /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php on line 115

trying to run jobs manually.

I added exception details to cli mode and added an exception there:

No row object provided.
Backtrace:
#0 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(59): CentralAuthUser::newFromRow(false)
#1 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(78): CentralAuthUserArrayFromResult->setCurrent(Object(stdClass))
#2 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(21): UserArrayFromResult->__construct(Object(ResultWrapper))
#3 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUserArray.php(10): CentralAuthUserArrayFromResult->__construct(Object(ResultWrapper))
#4 /home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthHooks.php(400): CentralAuthUserArray::newFromResult(Object(ResultWrapper))
#5 [internal function]: CentralAuthHooks::onUserArrayFromResult(NULL, Object(ResultWrapper))
#6 /home/wikipedia/common/php-1.21wmf5/includes/Hooks.php(253): call_user_func_array('CentralAuthHook...', Array)
#7 /home/wikipedia/common/php-1.21wmf5/includes/GlobalFunctions.php(3832): Hooks::run('UserArrayFromRe...', Array)
#8 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(30): wfRunHooks('UserArrayFromRe...', Array)
#9 /home/wikipedia/common/php-1.21wmf5/includes/UserArray.php(52): UserArray::newFromResult(Object(ResultWrapper))
#10 /home/wikipedia/common/php-1.21wmf5/includes/UserMailer.php(552): UserArray::newFromIDs(Array)
#11 /home/wikipedia/common/php-1.21wmf5/includes/job/jobs/EnotifNotifyJob.php(53): EmailNotification->actuallyNotifyOnPageChange(Object(User), Object(Title), '20121203220703', '::I have to say...', 0, 526235288, Array)
#12 /home/wikipedia/common/php-1.21wmf5/maintenance/runJobs.php(83): EnotifNotifyJob->run()
#13 /home/wikipedia/common/php-1.21wmf5/maintenance/doMaintenance.php(110): RunJobs->execute()
#14 /home/wikipedia/common/php-1.21wmf5/maintenance/runJobs.php(116): require_once('/home/wikipedia...')
#15 /home/wikipedia/common/multiversion/MWScript.php(69): require_once('/home/wikipedia...')
#16 {main}

(In reply to comment #5)

I am getting:

PHP Notice: Trying to get property of non-object in
/home/wikipedia/common/php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php
on line 115

trying to run jobs manually.

https://gerrit.wikimedia.org/r/#/c/36683/

Merged the CentralAuth fix... leaving this one open, as it seems to cover two issues.

In any case, the CA thing is a red herring I noticed. More important is that after tailing the job running log, I can see that only 3 of 16 boxes are active. And some processes on those three may be stuck. I think they got gummed up by the itwikisource slowness bug with LST (runners can be stuck on old code even if we fix bugs).

(In reply to comment #9)

In any case, the CA thing is a red herring I noticed. More important is that
after tailing the job running log, I can see that only 3 of 16 boxes are
active. And some processes on those three may be stuck. I think they got gummed
up by the itwikisource slowness bug with LST (runners can be stuck on old code
even if we fix bugs).

Tim killed those processes and kicked the runners, which seems to work per https://gdash.wikimedia.org/dashboards/jobq/.

(In reply to comment #3)

If there was a problem with it being
slow, I would expect gdash to show the same queue rate but lower run rate.

Ignore that, for the record: refreshLinks2 spawns refreshLinks jobs, which confounds this.

Any comments on the behavior I observe?

Now the jobrunners are as active as they were on average before the 29th, but the job queue is still increasing steadily, so this is apparently not enough: they're not consuming the backlog as they usually do.
Could this match the behaviour reported by Niklas?

jseddon wrote:

I am wondering if this bug is being caused by the same issues we are seeing in the queue:

https://bugzilla.wikimedia.org/show_bug.cgi?id=42715

Is it possible that the queue growth is a result of jobs being duplicated?

I'm pretty sure nobody can make heads or tails of this bug anymore. I reported a very specific case where I observe a very slow job run rate because job selection is slow due to table scanning.

This issue might contribute to the general issues we are currently seeing with the job queue, but all the other issues discussed in this bug are different from this particular issue.

Niklas: could we test the issue you report by temporarily making the queue non-random? Why does it even need to be random in the first place?
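
For illustration, a non-random claim could look something like the sketch below (a hypothetical query shape only, not what JobQueueDB actually runs), relying on the job_cmd_token_id index:

-- Hypothetical FIFO-style selection: oldest unclaimed job for this command
SELECT * FROM job
WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = ''
ORDER BY job_id ASC
LIMIT 1;

As I understand it, the random column mainly lets concurrent runners start their index scans at different points, so they are less likely to contend for the same head row.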

(In reply to comment #1)

It has now been running overnight:
php maintenance/showJobs.php
67285

It's still at 67k after 5 more days, by the way.

richardg_uk wrote:

Is this related to [[mw:Extension:GeoData]] affecting the 819,000 enwiki articles transcluding {{Coord}}, or is the timing coincidental?

Those queries _should_ be fast, just an index traversal. :/

richardg_uk wrote:

(In reply to comment #19)

Those queries _should_ be fast, just an index traversal. :/

But would still cause a one-off reparse of the affected pages to be sent to the job queue?

(In reply to comment #18)

Is this related to [[mw:Extension:GeoData]] affecting the 819,000 enwiki
articles transcluding {{Coord}}, or is the timing coincidental?

I don't see any correlation in the timing: the jobqueue graph has the same gradient for a whole day after that edit.

db39 in production showed JobQueueDB::claimRandom using 21% of query time:

/* JobQueueDB::claimRandom */ SELECT * FROM job WHERE job_cmd = ? AND job_token = ? AND (job_random <= ?) ORDER BY job_random DESC limit ?

And 8% for:
/* JobQueueDB::claimRandom */ SELECT * FROM job WHERE job_cmd = ? AND job_token = ? AND (job_random >= ?) ORDER BY job_random limit ?

So that sums up to 29%. Roughly 30% were admin commands, so claimRandom took about half the time of the queries :-/
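
Spelling out that estimate (my reading of the numbers above):

  21% + 8% = 29% of total query time spent in claimRandom
  ~30% of total time went to admin commands, leaving ~70% for ordinary queries
  29 / 70 ≈ 0.41, i.e. roughly half of the non-admin query time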

(In reply to comment #22)

So that sums up to 29%. Roughly 30% were admin commands, so claimRandom took
about half the time of the queries :-/

Gerrit change 38454.

From #wikimedia-operations:

23.26 < binasher> AaronSchulz: look at the explain on JobQueueDB::claimRandom
23.26 < AaronSchulz> I recall doing that before and not seeing anything interesting
23.26 < binasher> | 1 | SIMPLE | job | ref | job_cmd,job_cmd_token,job_cmd_token_id | job_cmd_token | 291 | const,const | 110700 | Using where |

23.27 < binasher> 110700 = the rows read column
23.29 < apergos> o_O
23.31 < binasher> no wonder the claimRandom calls were killing commons
23.31 < binasher> AaronSchulz: can you drop the order by from that query?
23.32 < MaxSem> mmm, ORDER BY kinda gives us the randomness:)
23.32 < binasher> its not needed for some randomness since the job_random constraint is itself picked randomly
23.32 < binasher> nor is totally random claiming of jobs a requirement

I've always found the rows count for EXPLAIN to be useless since it never accounts for LIMIT (unless something changed recently). A lot of fast queries (deeply paged history views) look slow going by that number.

Technically the ORDER BY is needed to make sure the DB is not always just getting the top/bottom row; testing around shows that this is the case. Removing it will lead to more races/retries for claiming rows. I'll see if there is some compromise here.

(In reply to comment #24)

I've always found the rows count for EXPLAIN to be useless since it never
accounts for LIMIT (unless something changed recently). A lot of fast queries
(deeply paged history views) look slow going by that number.

Technically the ORDER BY is needed to make sure the DB is not always just
getting the top/bottom row; testing around shows that this is the case.
Removing it will lead to more races/retries for claiming rows. I'll see if
there is some compromise here.

More work can be done to deal with that but this issue can be closed now.
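
For reference, a sketch of the two query shapes being weighed above, plus the kind of conditional UPDATE that resolves races when several runners pick the same row (illustrative only, with made-up values; not the exact statements MediaWiki issues):

-- With ORDER BY (the shape in the description): random within the range,
-- but some servers end up examining a large slice of the index
SELECT * FROM job
WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = ''
  AND job_random <= 2108540693
ORDER BY job_random DESC
LIMIT 1;

-- Without ORDER BY (the Gerrit change 38454 approach, per comment 23): the
-- randomly picked job_random threshold still supplies the randomness, and
-- MySQL can stop at the first matching index entry
SELECT * FROM job
WHERE job_cmd = 'MessageGroupStatesUpdaterJob' AND job_token = ''
  AND job_random <= 2108540693
LIMIT 1;

-- Claiming acts as the tie-breaker: only the runner whose UPDATE still sees
-- an empty token wins the row
UPDATE job
SET job_token = 'runner-uuid-1234', job_token_timestamp = '20121203220703'
WHERE job_id = 447774 AND job_token = '';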

It makes no sense that MySQL needs to read so many rows with an index, unless the table is really huge or the random numbers aren't really random.

In comment 23 (the expensive query) I see a key_len of 291. That suggests to me that job_cmd is not varbinary(60) on that server but varchar(255).
That change was done in 2007! (f38344aba)

Still, although less efficient, I don't see why that should make the index not be used "correctly". I wonder if we may have slaves with different job_cmds.
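
The key_len arithmetic behind that conclusion, assuming MySQL adds 2 length bytes per variable-length key part:

  job_cmd varbinary(60) + job_token varbinary(32): (60 + 2) + (32 + 2) = 96   (the EXPLAIN in the description)
  job_cmd varchar(255) binary + job_token varbinary(32): (255 + 2) + (32 + 2) = 291   (the EXPLAIN quoted in comment 23)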

Works for me now too, but did we really have to wait 10 days for binasher to raise the issue, while this bug was marked as highest priority (and critical for a couple of days), to get it fixed? Does not compute.

The fix was to remove the ORDER BY clause, as stated by Asher (comment 23):

https://gerrit.wikimedia.org/r/#/c/38454/

It has been deployed in 1.21wmf5 and 1.21wmf6: https://gerrit.wikimedia.org/r/#/q/I6ba6cc65d0023de3b9d9df0cfcd6a403289c9848,n,z

Whether related or not, the job queue on en.wiki rose from about 50k to 1.08 million since this bug was reported, but it is now slowly catching up a bit since the query was fixed (queues elsewhere, such as fr.wiki, were truncated).

Still has to be backported to 1.20 (for 1.20.4; it didn't make it to 1.20.3).

(In reply to comment #30)

Still has to be backported to 1.20 (for 1.20.4; it didn't make it to 1.20.3).

I didn't manage to get answers on what would need to be done here, so I'm marking the backport request as rejected.