EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
Closed, Resolved (Public)

Assigned To

Authored By

	• chasemp
	Aug 14 2015, 4:41 PM

Description

While resolving T106165 (Upgrade production to elasticsearch 1.7.1), we came up with some cluster health and robustness questions.

Generally, this relates to the long-standing issue of ES upgrades taking a long time and the failure of fast-restart. The fast-restart failure seems possibly related to general cluster cohesion issues, as noted in https://phabricator.wikimedia.org/T108180#1517896, and looks like a symptom of one or more deeper issues. While digging into this over the past week I have been bothering @dcausse daily :) and many of the questions raised are not new; I see them in historical postmortems. These concerns seem to have fallen through the cracks a bit, so I am hoping to make tasks and link them here.
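For readers unfamiliar with the fast-restart procedure discussed here, the following is a minimal sketch of the usual per-node sequence on Elasticsearch 1.6+: restrict shard allocation, issue a synced flush, restart the node, then re-enable allocation and wait for green. It only builds the request list and makes no network calls. The function name `rolling_restart_steps` and the tuple layout are illustrative assumptions, not the tooling actually used for this upgrade.

```python
import json

# Values accepted by cluster.routing.allocation.enable.
ALLOCATION_MODES = {"all", "primaries", "new_primaries", "none"}


def rolling_restart_steps(allocation_during_restart="none"):
    """Return the ordered (method, path, body) calls around one node restart.

    Hypothetical helper: it only describes the requests, it does not send them.
    """
    if allocation_during_restart not in ALLOCATION_MODES:
        raise ValueError("unknown allocation mode: %r" % allocation_during_restart)

    def set_allocation(mode):
        body = {"transient": {"cluster.routing.allocation.enable": mode}}
        return ("PUT", "/_cluster/settings", body)

    return [
        # 1. Stop the cluster from reshuffling shards while the node is down.
        set_allocation(allocation_during_restart),
        # 2. Synced flush, so idle shard copies can be recovered without a full copy.
        ("POST", "/_flush/synced", None),
        # (restart the node here and wait for it to rejoin the cluster)
        # 3. Let shards be assigned again.
        set_allocation("all"),
        # 4. Block until the cluster reports green before moving to the next node.
        ("GET", "/_cluster/health?wait_for_status=green", None),
    ]


if __name__ == "__main__":
    for method, path, body in rolling_restart_steps("primaries"):
        print(method, path, json.dumps(body) if body is not None else "")
```

Passing `"primaries"` instead of `"none"` corresponds to the "primary vs none" allocation question tracked in T109104.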

Related issues:

T76090
T102594
T90889
T89845

It's worth noting that this is all most likely an outcome of the success of the ES deployment here and the increased load on the cluster over time :) But search did have the worst uptime of any service we reported on last quarter using our external monitoring tool. We may be in a place where our failure modes are increasingly catastrophic, since the last few outages were resolved by restarting the entire cluster.

Related Objects

Status	Subtype	Assigned	Task
Resolved		Gehel	T109089 EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
Resolved		dcausse	T108180 Investigate why synced-flush did not help to do fast rolling-upgrade on the production cluster (elasticsearch cluster in eqiad)
Declined		None	T109090 Investigate the need for master only (non data nodes) in our ES cluster
Resolved		Gehel	T109091 Investigate tweaking of the "wait for me" parameter for upgrades / restarts
Declined		None	T109093 Investigate and remove stale search test targets in trebuchet
Resolved		Gehel	T109097 investigate raising of indices.recovery parameters to enable faster convergence and recovery
Resolved		Gehel	T109101 Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally.
Declined		None	T109104 Investigate primary vs none allocation models for fast-restart
Duplicate		None	T109117 Make icinga monitoring more relevant
Declined		None	T133844 Improve Elasticsearch icinga alerting
Resolved		Gehel	T109120 Evaluate index.merge.scheduler.max_thread_count setting for our SSDs
Resolved		EBernhardson	T109122 CirrusSearch should send instances of Search backend error to graphite
Resolved		Gehel	T109126 logstash insertions to ElasticSearch cross-DC functionality needs figuring out
Resolved		Gehel	T109127 Investigate mysterious write load during general read-only maintenance
Resolved		Gehel	T110236 Use unicast instead of multicast for node communication
Resolved	PRODUCTION ERROR	Gehel	T133784 ElasticSearch Not enough active copies to meet write consistency
Resolved		EBernhardson	T107348 Timeouts when trying to create mappings.
Resolved		Gehel	T130209 Collect threaddumps from elasticsearch at regular intervals
Resolved		Gehel	T110171 Alert when ES indexes are freezed for more than 30 minutes
Resolved		None	T143195 align elasticsearch.yml template with the default configuration for elasticsearch 2.x
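As a rough illustration of the kind of tuning T109097 refers to, the sketch below builds a cluster-settings request body for two recovery settings that exist in the Elasticsearch 1.x line, `indices.recovery.max_bytes_per_sec` and `indices.recovery.concurrent_streams`. The helper name and the example values are assumptions chosen for illustration; they are not the values evaluated or deployed in that task.

```python
import json


def recovery_settings_body(max_bytes_per_sec, concurrent_streams, persistent=False):
    """Build the body for PUT /_cluster/settings (hypothetical helper).

    Transient settings are lost on a full cluster restart; persistent ones survive it.
    """
    if concurrent_streams < 1:
        raise ValueError("concurrent_streams must be at least 1")
    scope = "persistent" if persistent else "transient"
    return {
        scope: {
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "indices.recovery.concurrent_streams": concurrent_streams,
        }
    }


if __name__ == "__main__":
    # Illustrative values only.
    body = recovery_settings_body("100mb", 4)
    print("PUT /_cluster/settings")
    print(json.dumps(body, indent=2))
```

Using the transient scope keeps an experiment easy to undo, which may suit a one-off change made only for the duration of an upgrade.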

Event Timeline

• chasemp created this task. Aug 14 2015, 4:41 PM

• chasemp raised the priority of this task from to Medium.

• chasemp updated the task description.

• chasemp added subscribers: chasemp, dcausse.

Restricted Application added a subscriber: Aklapper. Aug 14 2015, 4:41 PM

• chasemp added a subtask: T108180: Investigate why synced-flush did not help to do fast rolling-upgrade on the production cluster (elasticsearch cluster in eqiad). Aug 14 2015, 4:41 PM

Krenair added a project: Elasticsearch. Aug 14 2015, 5:09 PM

Krenair subscribed.

Restricted Application added a project: Discovery-ARCHIVED. Aug 14 2015, 5:09 PM

• chasemp updated the task description. Aug 14 2015, 6:31 PM

• chasemp set Security to None.

• chasemp updated the task description. Aug 14 2015, 6:40 PM

EBernhardson subscribed. Aug 14 2015, 6:50 PM

• chasemp updated the task description. Aug 14 2015, 7:03 PM

• chasemp added a project: acl*sre-team. Aug 14 2015, 7:15 PM

Restricted Application added a subscriber: Matanya. Aug 14 2015, 7:15 PM

• chasemp renamed this task from Cultivating the Elasticsearch garden to Cultivating the Elasticsearch garden (Lessons from 1.7.1 upgrade). Aug 14 2015, 7:55 PM

• Deskana renamed this task from Cultivating the Elasticsearch garden (Lessons from 1.7.1 upgrade) to EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade). Aug 27 2015, 5:20 PM

• Deskana added a project: Epic.

This would be a wonderful list of tasks for the Operations Engineer that Discovery is thinking of hiring. :-)

EBernhardson added a subtask: T107348: Timeouts when trying to create mappings. Sep 1 2015, 12:28 AM

• Deskana closed subtask T109122: CirrusSearch should send instances of Search backend error to graphite as Resolved. Sep 9 2015, 2:36 AM

• Deskana closed subtask T108180: Investigate why synced-flush did not help to do fast rolling-upgrade on the production cluster (elasticsearch cluster in eqiad) as Resolved. Sep 17 2015, 4:52 PM

• chasemp mentioned this in T89845: unattended elasticsearch restarts. Oct 21 2015, 11:29 PM

• Deskana moved this task from Needs triage to Ops on the Discovery-ARCHIVED board. Dec 3 2015, 4:59 AM

• Deskana closed subtask T107348: Timeouts when trying to create mappings. as Resolved. Dec 17 2015, 11:50 PM

Gehel subscribed. Feb 1 2016, 1:57 PM

In the absence of anything else more pressing, this would be a good one for @Gehel to take a look at. @EBernhardson and @dcausse can provide context on the tasks, as they were very involved in the upgrading.

• Deskana closed subtask T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. as Resolved. Feb 26 2016, 7:01 PM

Gehel created subtask T130209: Collect threaddumps from elasticsearch at regular intervals. Mar 17 2016, 2:17 PM

debt closed subtask T110236: Use unicast instead of multicast for node communication as Resolved. Jun 8 2016, 12:32 AM

Restricted Application added a project: Discovery-Search. Jun 8 2016, 12:32 AM

Gehel added a subtask: T110171: Alert when ES indexes are freezed for more than 30 minutes. Jul 19 2016, 2:31 PM

Gehel added a subtask: T143195: align elasticsearch.yml template with the default configuration for elasticsearch 2.x. Aug 17 2016, 10:02 AM

This task will truly live forever. Yay, scope creep! :-p

• Deskana moved this task from needs triage to search-icebox on the Discovery-Search board. Oct 20 2016, 10:22 PM

• demon closed subtask T109093: Investigate and remove stale search test targets in trebuchet as Declined. Sep 18 2017, 4:24 PM

This is so old... @Gehel, can you go through the subtasks and see if any of them are still useful? Thanks!

Gehel closed subtask T109091: Investigate tweaking of the "wait for me" parameter for upgrades / restarts as Resolved. Nov 1 2017, 1:55 PM

Gehel closed subtask T109097: investigate raising of indices.recovery parameters to enable faster convergence and recovery as Resolved. Nov 2 2017, 10:57 AM

Gehel closed subtask T109120: Evaluate index.merge.scheduler.max_thread_count setting for our SSDs as Resolved. Nov 2 2017, 11:23 AM

Gehel closed subtask T109126: logstash insertions to ElasticSearch cross-DC functionality needs figuring out as Resolved. Nov 2 2017, 11:26 AM

Gehel closed subtask T109127: Investigate mysterious write load during general read-only maintenance as Resolved.

@debt: I finally went through all the subtasks and closed a bunch of them. There are still a few which make sense, but on low priority.

Perfect, thanks so much for spending the time on this minor tech debt cleanup, @Gehel! 👍

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:14 PM

TJones closed subtask T143195: align elasticsearch.yml template with the default configuration for elasticsearch 2.x as Resolved. Jan 29 2019, 6:38 PM

EBernhardson moved this task from search-icebox to [epic] on the Discovery-Search board. Feb 14 2019, 9:48 PM

EBernhardson closed subtask T109090: Investigate the need for master only (non data nodes) in our ES cluster as Declined. Feb 14 2019, 10:15 PM

Gehel closed subtask T110171: Alert when ES indexes are freezed for more than 30 minutes as Resolved. Feb 21 2019, 4:45 PM

Gehel closed this task as Resolved. Sep 9 2020, 2:44 PM

Gehel claimed this task.

Gehel closed subtask T109104: Investigate primary vs none allocation models for fast-restart as Declined.
