Editing is slow due to CPU saturation on the job queue redis instances
Closed, Resolved (Public)

Assigned To

Authored By

	ori
	May 18 2015, 12:40 AM

Description

In the course of analyzing performance data for the quarterly report (T97378), I noticed that the time it takes to save a page increased by about 150ms on March 12 and has continued to increase since then. All in all, edits are almost 300ms slower now than they were two months ago.

The cause appears to be CPU saturation on the job queue redises. When an edit is made, MediaWiki and extensions enqueue jobs, which requires invoking a Lua script (using EVAL / EVALSHA) on the job queue redises. Redis executes Lua scripts atomically: while a script is running, the server is busy and no other client can execute commands.
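To illustrate why one slow script delays every other client, here is a toy model of a server that handles commands strictly one at a time. It is a sketch under assumptions, not Redis itself: the function name, the command labels, and the timings (a 50 ms script, 0.1 ms reads) are all hypothetical and only meant to show the queueing effect.

```python
def simulate_single_threaded(commands):
    """Model a server that runs one command at a time, in arrival order.

    commands: list of (arrival_us, service_us, label), sorted by arrival time.
    Returns a dict mapping label -> latency in microseconds
    (completion time minus arrival time).
    """
    clock = 0
    latencies = {}
    for arrival, service, label in commands:
        start = max(clock, arrival)   # wait until the server is free
        clock = start + service       # the server is busy for the whole command
        latencies[label] = clock - arrival
    return latencies


# Hypothetical workload: one long script followed by two cheap reads.
commands = [
    (0, 50_000, "lua_script"),  # assumed 50 ms script
    (1_000, 100, "get_a"),      # assumed 0.1 ms read arriving 1 ms later
    (2_000, 100, "get_b"),      # assumed 0.1 ms read arriving 2 ms later
]
latencies = simulate_single_threaded(commands)
```

In this model the two reads need only 0.1 ms of work each, yet almost all of their latency is time spent waiting for the script to finish.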

I don't know exactly why Redis started overloading on the 12th. It coincides roughly with RESTBase getting enabled for more wikis. Currently the RESTBase and Parsoid extensions each enqueue a job on each edit. More jobs means the Lua code gets evaluated more often, and may have to do more work if the data structures it uses to represent the queue grow bigger.

Redis runs on multi-core machines but is itself bound to a single core, which has made this issue harder to spot than it ought to have been. rdb1001 and rdb1002 have twelve cores each, so a completely saturated core represents only 8.3% (1/12) of overall CPU utilization. What first drew my attention were the network graphs, which spiked in an obvious way.
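The arithmetic behind that figure can be checked with a short sketch. The function name is hypothetical; the only input taken from this task is the twelve-core count.

```python
def aggregate_cpu_percent(saturated_cores, total_cores):
    """Share of host-wide CPU that fully busy cores represent, in percent."""
    return 100.0 * saturated_cores / total_cores


# One pegged core on a twelve-core host barely moves the aggregate graph.
one_core_of_twelve = round(aggregate_cpu_percent(1, 12), 1)
```

An aggregate CPU graph therefore tops out near 8.3% for a single-threaded process on such a host, so per-core or per-process utilization is the more telling signal.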

Related Objects

Mentioned Here: T97378: Provide read/write latency numbers for January-March 2015 WMF quarterly report

Event Timeline

ori created this task. May 18 2015, 12:40 AM

ori raised the priority of this task from to Unbreak Now!.

ori updated the task description.

ori added a project: MediaWiki-Core-JobQueue.

ori subscribed.

Restricted Application added a subscriber: Aklapper. May 18 2015, 12:40 AM

• MZMcBride subscribed. May 18 2015, 1:48 AM

Nemo_bis subscribed. May 18 2015, 9:58 AM

Krinkle added a project: Performance Issue. May 18 2015, 5:50 PM

Krinkle set Security to None.

Krinkle moved this task from Tag to Doing on the Performance Issue board.

Krinkle added a subscriber: tstarling.

Krinkle subscribed.

Redis also seems to prioritize queries from existing connections over new ones, which makes sense, but the clients with persistent connections are jobrunner/jobchron, while web requests open a new connection each time. Under high CPU use, the prioritization is basically backwards.

Would also be nice to explain the bumps in CPU over the year.

Xenon showed that only a small amount of edit time was spent in push() during the last round of performance changes. On top of that, the latest jobchron deploy reversed the CPU increase over the year (http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Redis%20eqiad&h=rdb1001.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1433364415&v=2.4&m=cpu_user&vl=%25&ti=CPU%20User&z=large).

A few more places can use lazyPush(), but this will only make a small difference in normal conditions.
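For readers unfamiliar with the idea, a lazy push can be sketched roughly as follows. This is a Python toy, not MediaWiki's implementation: the class `LazyJobBuffer`, its method names, and the job names are hypothetical, and it assumes that a lazy push means buffering jobs during the request and enqueuing them in one batch after the response has been sent.

```python
class LazyJobBuffer:
    """Hypothetical stand-in for a lazy-push job buffer (not MediaWiki code)."""

    def __init__(self, push_batch):
        # push_batch: callable that performs the real, possibly slow, enqueue.
        self._push_batch = push_batch
        self._pending = []

    def lazy_push(self, job):
        # Only record the job; no queue round trip on the request's critical path.
        self._pending.append(job)

    def flush(self):
        # Intended to be called once, after the response has been sent.
        if self._pending:
            batch, self._pending = self._pending, []
            self._push_batch(batch)


# Usage: collect the batches that would have been sent to the queue.
pushed_batches = []
buffer = LazyJobBuffer(pushed_batches.append)
buffer.lazy_push({"type": "hypotheticalJobA", "page": "Example"})
buffer.lazy_push({"type": "hypotheticalJobB", "page": "Example"})
calls_before_flush = len(pushed_batches)  # nothing has been enqueued yet
buffer.flush()
```

The total work on the queue server is unchanged; only the moment at which the web request pays for it moves, which fits the expectation of a small difference under normal conditions.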

	F168320: graph.php.png
	May 23 2015, 8:50 PM

	F165790: graph (1).png
	May 18 2015, 12:40 AM

	F165787: graph (2).png
	May 18 2015, 12:40 AM
