
Editing is slow due to CPU saturation on the job queue redis instances
Closed, Resolved (Public)

Assigned To
Authored By
ori
May 18 2015, 12:40 AM
Referenced Files
F168320: graph.php.png
May 23 2015, 8:50 PM
F165790: graph (1).png
May 18 2015, 12:40 AM
F165784: NhnbjBS.png
May 18 2015, 12:40 AM
F165787: graph (2).png
May 18 2015, 12:40 AM

Description

In the course of analyzing performance data for the quarterly report (T97378), I noticed that the time it takes to save a page increased by about 150ms on March 12 and has continued to increase since then. All in all, edits are almost 300ms slower now than they were two months ago.

The cause of this appears to be CPU saturation on the job queue redises. When an edit is made, MediaWiki and extensions enqueue jobs, which requires invoking a Lua script (using EVAL / EVALSHA) on the job queue redises. Redis executes Lua atomically: while Redis is executing a script, no other client can execute commands, since the server is busy.
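To illustrate why atomic script execution hurts once the server is saturated, here is a minimal sketch of a single-threaded server that handles one request at a time. It is a toy model, not Redis: the arrival rate and per-script service times are made-up numbers, and `queue_waits` is a hypothetical helper.

```python
def queue_waits(arrivals, service_time):
    """Return how long each request waits before a single-threaded server starts it.

    arrivals: sorted arrival times in ms (hypothetical workload).
    service_time: ms the server is busy per request (e.g. one script run).
    """
    waits = []
    free_at = 0.0  # time at which the server finishes its current request
    for t in arrivals:
        start = max(t, free_at)  # must wait while another script is running
        waits.append(start - t)
        free_at = start + service_time
    return waits


# One request every 1 ms for 1000 ms (assumed, not measured).
arrivals = [float(i) for i in range(1000)]

light = queue_waits(arrivals, service_time=0.5)  # server keeps up
heavy = queue_waits(arrivals, service_time=1.2)  # server is saturated

print(max(light))            # worst-case wait when service is faster than arrivals
print(round(max(heavy), 1))  # worst-case wait when each script takes too long
```

In this model, as long as each script finishes before the next request arrives, nobody waits. Once a script takes slightly longer than the gap between arrivals, every request queues behind the previous one and the wait keeps growing.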

I don't know exactly why Redis started overloading on the 12th. It coincides roughly with RESTBase getting enabled for more wikis. Currently the RESTBase and Parsoid extensions each enqueue a job on each edit. More jobs means the Lua code gets evaluated more often, and may have to do more work if the data structures it uses to represent the queue grow bigger.

Redis runs on a multi-core machine but is itself bound to a single core, which has made this issue harder to spot than it ought to have been. rdb1001 and rdb1002 have twelve cores each, so a completely saturated core represents only 8.3% of overall CPU utilization. What first drew my attention were the network graphs, which spiked in an obvious way.
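The 8.3% figure is just one core out of twelve. A small sketch of the arithmetic (the function name is hypothetical):

```python
def overall_cpu_percent(saturated_cores, total_cores):
    """Percent of whole-machine CPU consumed when some cores are fully busy."""
    return 100.0 * saturated_cores / total_cores


# One pegged core on a twelve-core host such as rdb1001.
print(round(overall_cpu_percent(1, 12), 1))  # 8.3
```

So a host-level CPU graph that looks nearly idle can still hide a single-threaded process that has no headroom left.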

NhnbjBS.png (371×600 px, 11 KB)

graph (2).png (387×747 px, 52 KB)

graph (1).png (415×747 px, 25 KB)

Event Timeline

ori raised the priority of this task to Unbreak Now!.
ori updated the task description.
ori added a project: MediaWiki-Core-JobQueue.
ori subscribed.
Krinkle set Security to None.
Krinkle moved this task from Tag to Doing on the Performance Issue board.
Krinkle added a subscriber: tstarling.
Krinkle subscribed.

Redis also seems to prioritize queries from existing connections over new ones, which makes sense in general. But the clients with persistent connections are jobrunner/jobchron, while web requests open a new connection each time. Under high CPU use, the prioritization is basically backwards.

Would also be nice to explain the bumps in CPU over the year.

graph.php.png (248×577 px, 18 KB)

aaron claimed this task.

Xenon showed that edits spent only a small amount of time in push() during the last round of performance changes. On top of that, the latest jobchron deploy reversed the CPU increase over the year (http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Redis%20eqiad&h=rdb1001.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1433364415&v=2.4&m=cpu_user&vl=%25&ti=CPU%20User&z=large).

A few more places can use lazyPush(), but this will only make a small difference in normal conditions.
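For readers unfamiliar with the idea, here is a rough sketch of deferred pushing. It assumes lazyPush() works by buffering jobs and sending them together later in the request; this thread does not describe its internals. The sketch is in Python rather than MediaWiki's PHP, and `LazyJobBuffer`, `lazy_push`, and `flush` are hypothetical names, not MediaWiki APIs.

```python
class LazyJobBuffer:
    """Hypothetical buffer that batches job pushes; not MediaWiki's code."""

    def __init__(self, push_batch):
        self._push_batch = push_batch  # callable that sends a list of jobs in one call
        self._pending = []

    def lazy_push(self, job):
        # Only record the job; no round trip to the queue server yet.
        self._pending.append(job)

    def flush(self):
        # Send everything in one call, e.g. after the response has been sent.
        if self._pending:
            jobs, self._pending = self._pending, []
            self._push_batch(jobs)


calls = []  # stands in for the queue server; records each batched push
buf = LazyJobBuffer(calls.append)
buf.lazy_push("job-a")  # made-up job names
buf.lazy_push("job-b")
buf.flush()
print(len(calls))  # number of round trips actually made
```

The benefit of this pattern is that the enqueue cost moves off the critical path of the edit, which fits the comment above that it only makes a small difference when push() is already cheap.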