
Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all types of issues
Closed, ResolvedPublic

Description

The job queue has exploded in size, starting on April 25th at around 17:00. On top of that, the job consume rate has spiked from ~150 jobs/s to 1600 jobs/s, and you can see how this is endangering quite a few parts of the infrastructure, including breaking the Redis replica for the job queue: a single database has exploded to over 3.8 GB in size.

At the same time, the rate of failed/timed-out jobs has gone from ~0 to 220 per minute.

All of the jobs queued, as far as I can see, are cirrusSearchElasticaWrite; looking at logstash I see:

Dropping delayed ElasticaWrite job for DataSender::sendData in cluster codfw after waiting 14210s

Unless I am misinterpreting this, it seems the jobs fail to execute. The only slightly related item I can find in the SAL is:

12:55 gehel: starting elasticsearch codfw rolling restart for plugin update and NUMA config - T191543 / T191236

Event Timeline

Joe created this task.Apr 26 2018, 4:33 AM
Restricted Application added projects: Discovery, Discovery-Search. · View Herald TranscriptApr 26 2018, 4:33 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Joe triaged this task as Unbreak Now! priority.Apr 26 2018, 4:33 AM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptApr 26 2018, 4:33 AM

It looks like writes to the codfw cluster were frozen and never thawed. Moving forward, we need a timestamp recorded along with the freeze. We should then alert on any freeze that has lasted more than N (60?) minutes, so someone can unfreeze before we start dropping jobs on the ground.
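The proposal above can be sketched roughly as follows. This is a minimal illustration of the idea (record a timestamp when writes are frozen, then flag any freeze older than the threshold), not the actual CirrusSearch implementation; the function names, the state dict, and the 60-minute threshold are all assumptions.

```python
import time

# Alert once a freeze has lasted more than 60 minutes, per the suggestion above.
FREEZE_ALERT_AFTER = 60 * 60  # seconds

def freeze_writes(state, now=None):
    """Freeze writes and remember when we did it."""
    state["frozen"] = True
    state["frozen_at"] = now if now is not None else time.time()

def freeze_needs_alert(state, now=None):
    """True if writes have been frozen longer than the threshold."""
    if not state.get("frozen"):
        return False
    now = now if now is not None else time.time()
    return now - state["frozen_at"] > FREEZE_ALERT_AFTER

state = {}
freeze_writes(state, now=0)
print(freeze_needs_alert(state, now=30 * 60))  # frozen 30 min -> False
print(freeze_needs_alert(state, now=90 * 60))  # frozen 90 min -> True
```

A monitoring check built this way would have caught the incident well before the 3-hour drop timeout started discarding jobs.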

We will probably need to reindex codfw from Apr 25 14:00 to Apr 26 05:00 once the job queue clears out.

I suppose we should lower the drop timeout, $wgCirrusSearchDropDelayedJobsAfter, to something more reasonable as well. This was added to cap the amount of time we back up into the job queue and keep its size limited, but it seems the current value of three hours created more load than the system can handle.
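The drop behavior described above, and seen in the "after waiting 14210s" log line, comes down to a simple check: a delayed job is dropped once it has waited longer than the configured cap. A hedged sketch of that decision; the function name and structure are illustrative, not the actual CirrusSearch code.

```python
# The three-hour cap discussed above, expressed in seconds.
DROP_DELAYED_JOBS_AFTER = 3 * 3600  # 10800 s

def should_drop(job_created_at, now):
    """Drop a delayed job once it has waited longer than the cap."""
    waited = now - job_created_at
    return waited > DROP_DELAYED_JOBS_AFTER

# The log line above reported a job dropped "after waiting 14210s",
# which is indeed over the 10800 s cap:
print(should_drop(0, 14210))  # -> True
print(should_drop(0, 3600))   # waited only an hour -> False
```

Lowering the cap shrinks the worst-case backlog the queue can accumulate, at the cost of dropping (and later having to reindex) more writes during long freezes.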

Joe added a comment.Apr 26 2018, 5:32 AM

The queue is getting back to normal sizes, and the job production is almost what I'd expect. Unfreezing writes solved the issue; I'll restore the correct state of Redis replication once the queue has shrunk enough.

Mentioned in SAL (#wikimedia-operations) [2018-04-26T07:16:52Z] <gehel> re-enabling puppet on rdb2* - T193112

Mentioned in SAL (#wikimedia-operations) [2018-04-26T07:21:38Z] <gehel> restarting redis masters in codfw - T193112

Mentioned in SAL (#wikimedia-operations) [2018-04-26T12:51:09Z] <gehel> reindexing lost updates on elasticsearch - T193112

debt lowered the priority of this task from Unbreak Now! to High.Apr 26 2018, 5:06 PM
Gehel claimed this task.May 1 2018, 5:28 PM
debt added a subscriber: debt.May 1 2018, 5:29 PM

debt added a comment.May 1 2018, 5:29 PM

@Gehel will write up some more information on this ticket.

Gehel added a comment.May 29 2018, 5:37 PM

The last action based on this issue is tracked in T193605; we can close this.

Gehel closed this task as Resolved.May 29 2018, 5:37 PM