Alert when ES indexes are frozen for more than 30 minutes
Closed, Resolved · Public

Authored By

	Joe
	Aug 25 2015, 9:19 AM

Description

It went undetected for 12 hours last time; we ought to do slightly better, I guess.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Gehel	T109089 EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
		Resolved		Gehel	T110171 Alert when ES indexes are frozen for more than 30 minutes

Event Timeline

Joe created this task. Aug 25 2015, 9:19 AM

Joe raised the priority of this task from to Needs Triage.

Joe updated the task description.

Joe added projects: acl*sre-team, observability.

Joe subscribed.

Restricted Application added subscribers: Matanya, Aklapper. Aug 25 2015, 9:19 AM

akosiaris triaged this task as High priority. Aug 25 2015, 12:44 PM

akosiaris subscribed.

• chasemp added a project: Incident-20150825-Redis. Aug 25 2015, 12:56 PM

• chasemp set Security to None.

Gehel subscribed. Jul 19 2016, 1:52 PM

Gehel added a parent task: T109089: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade). Jul 19 2016, 2:31 PM

greg added a project: Wikimedia-Incident. Jul 28 2016, 10:17 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Jul 28 2016, 10:18 PM

How would you manually check whether they are frozen and for how long?

fgiunchedi added projects: Discovery-ARCHIVED, Discovery-Search. Dec 1 2016, 8:26 PM

I guess that by frozen indices, we refer to freezing the jobs that write to Elasticsearch, not closing the indices in Elasticsearch itself. I'm not actually sure how that freezing works; I'll dig into the code to see if I can understand.
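For illustration only, here is a minimal sketch of what a "frozen for more than 30 minutes" check could look like. It assumes the freeze is recorded somewhere as a start timestamp that a monitoring check can read; that assumption, the function name `check_freeze`, and the Nagios-style exit codes are all hypothetical and are not taken from the actual freezing code or from any existing alert.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Threshold from the task title: alert once writes have been frozen > 30 minutes.
FREEZE_THRESHOLD = timedelta(minutes=30)

# Nagios-style status codes (assumed convention for this sketch).
OK, CRITICAL = 0, 2


def check_freeze(
    frozen_since: Optional[datetime],
    now: datetime,
    threshold: timedelta = FREEZE_THRESHOLD,
) -> Tuple[int, str]:
    """Return (status, message) for a hypothetical freeze-age check.

    frozen_since is the time the write freeze started, or None if writes
    are not frozen. How that timestamp is obtained is left open here.
    """
    if frozen_since is None:
        return OK, "OK: index writes are not frozen"
    age = now - frozen_since
    minutes = int(age.total_seconds() // 60)
    if age > threshold:
        return CRITICAL, f"CRITICAL: index writes frozen for {minutes} minutes"
    return OK, f"OK: index writes frozen for {minutes} minutes"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Example: a freeze that started 12 hours ago, as in the original incident.
    status, message = check_freeze(now - timedelta(hours=12), now)
    print(message)
    raise SystemExit(status)
```

Passing `now` in explicitly keeps the comparison easy to test; where the freeze timestamp actually lives would still need to be confirmed from the code.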

This hasn't been touched in quite a while, so lowering priority and putting in the "Later" column. If this is important somehow, please feel free to let me know and we can shuffle it around.

• Deskana moved this task from needs triage to search-icebox on the Discovery-Search board. Dec 8 2016, 11:08 PM

It's an explicit follow-up from an incident. These should be prioritized alongside other "fun/new" work appropriately (iow: not dropped).

In T110171#2858787, @greg wrote:

It's an explicit follow-up from an incident. These should be prioritized alongside other "fun/new" work appropriately (iow: not dropped).

@greg Good to know. I chatted to @EBernhardson about it before reprioritising and he said it's unclear how relevant this is now given how our rolling restarts work now. Hopefully @Gehel should know more. :-)

This hasn't been touched in quite a while, so lowering priority

I know this is a general Phabricator workflow thing, but I never understood this logic. In other ticket systems, priority would be raised when things had not been touched in a long time, not the other way around.

In T110171#2870147, @Dzahn wrote:

I know this is a general Phabricator workflow thing, but I never understood this logic. In other ticket systems, priority would be raised when things had not been touched in a long time, not the other way around.

It's a fair point. I generally use task priority as descriptive; using that lens, if something hasn't been touched for over a year, then it's not really high priority, and keeping it marked as such is misleading. If everything is high priority, then nothing is. :-)

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:16 PM

EBernhardson moved this task from search-icebox to Ops / SRE on the Discovery-Search board. Feb 14 2019, 9:34 PM

Already implemented in https://gerrit.wikimedia.org/r/c/operations/puppet/+/431754 as part of T193605

Gehel closed this task as Resolved. Feb 21 2019, 4:45 PM

Gehel claimed this task.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:51 PM