
Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all types of issues
Closed, ResolvedPublic

Description

The job queue has exploded in size, starting on April 25th at around 17:00. On top of that, the job consume rate has spiked from ~150 jobs/s to 1600 jobs/s, and you can see how this is endangering quite a few parts of the infrastructure, including breaking the Redis replica for the job queue: a single database has exploded to over 3.8 GB in size.

At the same time, the rate of failed/timed-out jobs has gone from ~0 to 220 per minute.

All of the jobs queued, as far as I can see, are cirrusSearchElasticaWrite; looking at logstash I see:

Dropping delayed ElasticaWrite job for DataSender::sendData in cluster codfw after waiting 14210s

Unless I am misinterpreting this, it seems the jobs fail to execute. The only slightly related item I can find in the SAL is:

12:55 gehel: starting elasticsearch codfw rolling restart for plugin update and NUMA config - T191543 / T191236

Event Timeline

Joe created this task.Apr 26 2018, 4:33 AM
Restricted Application added projects: Discovery, Discovery-Search. · View Herald TranscriptApr 26 2018, 4:33 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Joe triaged this task as Unbreak Now! priority.Apr 26 2018, 4:33 AM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptApr 26 2018, 4:33 AM

It looks like writes to the codfw cluster were frozen and never thawed. Moving forward, we need a timestamp recorded along with the freeze. We should then alert on any freeze that has lasted more than N (60?) minutes, so someone can unfreeze before we start dropping jobs on the ground.
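The proposal above can be sketched roughly as follows. This is a minimal illustration of the idea (record a timestamp when writes are frozen, then flag any freeze older than the threshold), not the actual CirrusSearch implementation; the function names, the state dict, and the 60-minute threshold are all assumptions.

```python
import time

# Alert once a freeze has lasted more than 60 minutes, per the suggestion above.
FREEZE_ALERT_AFTER = 60 * 60  # seconds

def freeze_writes(state, now=None):
    """Freeze writes and remember when we did it."""
    state["frozen"] = True
    state["frozen_at"] = now if now is not None else time.time()

def freeze_needs_alert(state, now=None):
    """True if writes have been frozen longer than the threshold."""
    if not state.get("frozen"):
        return False
    now = now if now is not None else time.time()
    return now - state["frozen_at"] > FREEZE_ALERT_AFTER

state = {}
freeze_writes(state, now=0)
print(freeze_needs_alert(state, now=30 * 60))  # frozen 30 min -> False
print(freeze_needs_alert(state, now=90 * 60))  # frozen 90 min -> True
```

A monitoring check built this way would have caught the incident well before the 3-hour drop timeout started discarding jobs.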

We will probably need to reindex codfw from Apr 25 14:00 to Apr 26 05:00 once the job queue clears out.

I suppose we should lower the drop timeout, $wgCirrusSearchDropDelayedJobsAfter, to something more reasonable as well. This was added to cap the amount of time we back up into the job queue and keep its size limited, but it seems the current value of three hours created more load than the system can handle.
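The drop behavior described above, and seen in the "after waiting 14210s" log line, comes down to a simple check: a delayed job is dropped once it has waited longer than the configured cap. A hedged sketch of that decision; the function name and structure are illustrative, not the actual CirrusSearch code.

```python
# The three-hour cap discussed above, expressed in seconds.
DROP_DELAYED_JOBS_AFTER = 3 * 3600  # 10800 s

def should_drop(job_created_at, now):
    """Drop a delayed job once it has waited longer than the cap."""
    waited = now - job_created_at
    return waited > DROP_DELAYED_JOBS_AFTER

# The log line above reported a job dropped "after waiting 14210s",
# which is indeed over the 10800 s cap:
print(should_drop(0, 14210))  # -> True
print(should_drop(0, 3600))   # waited only an hour -> False
```

Lowering the cap shrinks the worst-case backlog the queue can accumulate, at the cost of dropping (and later having to reindex) more writes during long freezes.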

Joe added a comment.Apr 26 2018, 5:32 AM

The queue is getting back to normal sizes, and the job production is almost what I'd expect. Unfreezing writes solved the issue; I'll restore the correct state of Redis replication once the queue has shrunk enough.

Mentioned in SAL (#wikimedia-operations) [2018-04-26T07:16:52Z] <gehel> re-enabling puppet on rdb2* - T193112

Mentioned in SAL (#wikimedia-operations) [2018-04-26T07:21:38Z] <gehel> restarting redis masters in codfw - T193112

Mentioned in SAL (#wikimedia-operations) [2018-04-26T12:51:09Z] <gehel> reindexing lost updates on elasticsearch - T193112

debt lowered the priority of this task from Unbreak Now! to High.Apr 26 2018, 5:06 PM
Gehel claimed this task.May 1 2018, 5:28 PM
debt added a subscriber: debt.May 1 2018, 5:29 PM

debt added a comment.May 1 2018, 5:29 PM

@Gehel will write up some more information on this ticket.

Gehel added a comment.May 29 2018, 5:37 PM

The last action based on this issue is tracked in T193605; we can close this.

Gehel closed this task as Resolved.May 29 2018, 5:37 PM