In the course of analyzing performance data for the quarterly report (T97378), I noticed that the time it takes to save a page increased by about 150ms on March 12 and has continued to increase since then. All in all, edits are almost 300ms slower now than they were two months ago.
The cause of this appears to be CPU saturation on the job queue redises. When an edit is made, MediaWiki and its extensions enqueue jobs, and each enqueue invokes a Lua script (using EVAL / EVALSHA) on the job queue redises. Redis executes Lua scripts atomically: while a script is running, the server is busy and no other client's commands get executed, so every other request has to wait for it to finish.
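To make the blocking effect concrete, here is a toy model of a single-threaded server that handles requests strictly one at a time. It is only a sketch, not a description of Redis internals: the function name `request_latencies` and the timings are made up for illustration, and it assumes requests are served in arrival order.

```python
def request_latencies(arrivals_ms, service_ms):
    """Latency (wait + service) of each request on a single-threaded server.

    arrivals_ms: arrival time of each request, sorted ascending
    service_ms:  how long each request keeps the server busy
    Hypothetical model: requests are served one at a time, in arrival order.
    """
    latencies = []
    free_at = 0  # time at which the server finishes its current request
    for arrived, service in zip(arrivals_ms, service_ms):
        start = max(arrived, free_at)  # wait while the server is still busy
        free_at = start + service
        latencies.append(free_at - arrived)
    return latencies


# Three clients arrive at the same moment. The first runs a 5 ms script,
# the other two issue 1 ms commands but have to queue behind it.
print(request_latencies([0, 0, 0], [5, 1, 1]))
```

In this model the two cheap commands take 6 ms and 7 ms instead of 1 ms, purely because they waited behind the script. The slower or more frequent the scripts, the more of that wait ends up in the save time of unrelated edits.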
I don't know exactly why Redis started overloading on the 12th. It coincides roughly with RESTBase getting enabled for more wikis. Currently the RESTBase and Parsoid extensions each enqueue a job on each edit. More jobs means the Lua code gets evaluated more often, and may have to do more work if the data structures it uses to represent the queue grow bigger.
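A rough way to reason about this: the share of the single Redis core spent in enqueue scripts is approximately (edits per second) × (jobs per edit) × (time per script). The sketch below only illustrates that relationship. The function name and all of the numbers are hypothetical, not measurements from production.

```python
def script_core_utilization(edits_per_sec, jobs_per_edit, script_ms):
    """Approximate fraction of one core spent running enqueue scripts.

    All inputs are illustrative assumptions, not measured values.
    A result of 1.0 or more means the core cannot keep up.
    """
    return edits_per_sec * jobs_per_edit * script_ms / 1000.0


# Hypothetical "before": 100 edits/s, 2 jobs per edit, 2 ms per script.
before = script_core_utilization(100, 2, 2.0)

# Hypothetical "after": twice as many jobs per edit, and each script is a
# little slower because the queue's data structures have grown.
after = script_core_utilization(100, 4, 2.5)

print(f"before: {before:.0%} of one core, after: {after:.0%} of one core")
```

The point is that the two effects multiply: doubling the number of jobs while making each script slightly slower can take a comfortably loaded core to saturation.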
Redis runs on multi-core machines but is itself bound to a single core, which has made this issue harder to spot than it ought to have been. rdb1001 and rdb1002 have twelve cores each, so a completely saturated core represents only 8.3% of overall CPU utilization. What first drew my attention were the network graphs, which spiked in an obvious way.
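The 8.3% figure is just one core out of twelve. A minimal sketch of the arithmetic (the function name is made up for illustration):

```python
def overall_cpu_percent(saturated_cores, total_cores):
    """Host-wide CPU utilization when only some cores are fully busy."""
    return 100.0 * saturated_cores / total_cores


# One pegged core on a twelve-core host such as rdb1001 or rdb1002.
print(round(overall_cpu_percent(1, 12), 1))  # 8.3
```

So a host-level CPU graph for a box whose single Redis core is pegged at 100% still reads below 10%, which is why per-core graphs are the ones to watch here.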