
EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
Closed, Resolved · Public

Description

While resolving T106165: Upgrade production to elasticsearch 1.7.1, we came up with some questions about cluster health and robustness.

Generally, this relates to the long-standing issue of ES upgrades taking a long time and the failure of fast-restart. The fast-restart failure seems possibly related to some general cluster cohesion issues, as noted in https://phabricator.wikimedia.org/T108180#1517896, and looks like a symptom of one or more deeper issues. In digging into this over the past week I have been bothering @dcausse daily :) and many of the questions raised are not new; I see them in historical postmortems. These concerns seem to have fallen through the cracks a bit, so I am hoping to make tasks and link them here.

Related issues:

T76090
T102594
T90889
T89845

It's worth noting that this is most likely an outcome of the success of the ES deployment here and the increased load on the cluster over time :) Still, search had the worst uptime of any service we reported on last quarter using our external monitoring tool. We may be at a point where our failure modes are increasingly catastrophic, since the last few outages were resolved by restarting the entire cluster.

Related Objects

Status     Subtype           Assigned      Task
Resolved                     Gehel
Resolved                     dcausse
Declined                     None
Resolved                     Gehel
Declined                     None
Resolved                     Gehel
Resolved                     Gehel
Declined                     None
Duplicate                    None
Declined                     None
Resolved                     Gehel
Resolved                     EBernhardson
Resolved                     Gehel
Resolved                     Gehel
Resolved                     Gehel
Resolved   PRODUCTION ERROR  Gehel
Resolved                     EBernhardson
Resolved                     Gehel
Resolved                     Gehel
Resolved                     None

Event Timeline

chasemp raised the priority of this task to Medium.
chasemp updated the task description.
chasemp added subscribers: chasemp, dcausse.
Restricted Application added a subscriber: Aklapper. · Aug 14 2015, 4:41 PM
Krenair added a subscriber: Krenair.
Restricted Application added a project: Discovery. · Aug 14 2015, 5:09 PM
chasemp updated the task description. · Aug 14 2015, 6:31 PM
chasemp set Security to None.
chasemp updated the task description. · Aug 14 2015, 6:40 PM
chasemp updated the task description. · Aug 14 2015, 7:03 PM
Restricted Application added a subscriber: Matanya. · Aug 14 2015, 7:15 PM
chasemp renamed this task from Cultivating the Elasticsearch garden to Cultivating the Elasticsearch garden (Lessons from 1.7.1 upgrade). · Aug 14 2015, 7:55 PM
Deskana renamed this task from Cultivating the Elasticsearch garden (Lessons from 1.7.1 upgrade) to EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade). · Aug 27 2015, 5:20 PM
Deskana added a project: Epic.

This would be a wonderful list of tasks for the Operations Engineer that Discovery is thinking of hiring. :-)

Deskana moved this task from Needs triage to Ops on the Discovery board. · Dec 3 2015, 4:59 AM
Gehel added a subscriber: Gehel. · Feb 1 2016, 1:57 PM

In the absence of anything else more pressing, this would be a good one for @Gehel to take a look at. @EBernhardson and @dcausse can provide context on the tasks, as they were very involved in the upgrading.

Restricted Application added a project: Discovery-Search. · Jun 8 2016, 12:32 AM

This task will truly live forever. Yay, scope creep! :-p

debt added a subscriber: debt. · Oct 26 2017, 5:15 PM

This is so old... @Gehel, can you go through the subtasks and see if any of them are still useful? Thanks!

@debt: I finally went through all the subtasks and closed a bunch of them. A few still make sense, but at low priority.

debt added a comment. · Nov 7 2017, 3:12 PM

Perfect, thanks so much for spending the time on this minor tech debt cleanup, @Gehel! 👍

Gehel closed this task as Resolved. · Sep 9 2020, 2:44 PM
Gehel claimed this task.