Maniphest T193605

Alert when elasticsearch writes are frozen for too long
Closed, Resolved · Public

Assigned To

Authored By

	Gehel
	May 2 2018, 8:29 AM

Description

When we freeze writes to Elasticsearch, jobs pile up in the job queue. During the recent T193112 issue, writes were not thawed in time, which caused "all type of issues". Monitoring of the job queue raised an alert, but we should also raise an alert on the Elasticsearch side, before the freeze becomes a problem for the job queue.

@dcausse proposed adding a timestamp to the frozen index entry. This timestamp could be monitored, and an alert raised if the freeze lasts longer than 1h.

Details

Subject	Repo	Branch	Lines +/-
elasticsearch: check frozen writes improvements	operations/puppet	production	+6 -4
elasticsearch: send frozen writes check over HTTPS	operations/puppet	production	+1 -1
elasticsearch: alert when cirrus writes are frozen for too long	operations/puppet	production	+130 -5

Related Objects

Mentioned In: T110171: Alert when ES indexes are freezed for more than 30 minutes
T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues
Mentioned Here: T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues

Event Timeline

Gehel created this task. May 2 2018, 8:29 AM

Restricted Application edited projects, added Discovery-ARCHIVED, Discovery-Search; removed Discovery-Search (Current work). May 2 2018, 8:29 AM

Restricted Application added a subscriber: Aklapper.

After https://gerrit.wikimedia.org/r/430441 this will be fairly simple. The following request can be issued against each cluster:

GET https://search.svc.<cluster>.wmnet:9243/mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything

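This will return one of the following two responses, depending on whether the freeze-everything entry exists (that is, whether writes are currently frozen):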

{
  "_index": "mw_cirrus_metastore_1525208216",
  "_type": "mw_cirrus_metastore",
  "_id": "freeze-everything",
  "found": false
}

{
  "_index": "mw_cirrus_metastore_1525314062",
  "_type": "mw_cirrus_metastore",
  "_id": "freeze-everything",
  "_version": 1,
  "found": true,
  "_source": {
    "host": "mediawiki-vagrant",
    "timestamp": 1525314130
  }
}
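A minimal sketch of how a monitoring check could evaluate the two responses above, assuming Nagios-style exit codes and the 1h threshold suggested in the description. The names `evaluate_freeze` and `check_url` are hypothetical and this is not the check that was merged in operations/puppet. The URL is passed in by the caller, since the `<cluster>` part is left unspecified here.

```python
import json
import time
import urllib.error
import urllib.request

# Nagios-style exit codes (an assumption about the monitoring system).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_freeze(doc, now, max_age_s=3600):
    """Map a parsed freeze-everything response to (exit_code, message)."""
    if not doc.get("found", False):
        return OK, "writes are not frozen"
    timestamp = doc.get("_source", {}).get("timestamp")
    if timestamp is None:
        # Entry written before the timestamp field existed, or malformed.
        return UNKNOWN, "writes are frozen, but the entry has no timestamp"
    age = int(now - timestamp)
    message = f"writes frozen for {age}s (limit {max_age_s}s)"
    return (CRITICAL if age > max_age_s else OK), message


def check_url(url, max_age_s=3600, timeout=10):
    """Fetch the freeze-everything entry from `url` and evaluate it."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        # Elasticsearch answers a GET for a missing document with HTTP 404
        # and a JSON body containing "found": false, so 404 is not an error.
        if err.code != 404:
            return UNKNOWN, f"unexpected HTTP status {err.code}"
        body = err.read()
    except OSError as err:
        return UNKNOWN, f"request failed: {err}"
    return evaluate_freeze(json.loads(body), time.time(), max_age_s)
```

Keeping the decision logic in `evaluate_freeze`, separate from the HTTP call, lets the threshold handling be tested against the sample responses without a running cluster.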

As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in SRE and attempting to review if any are critical, or if they are normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct.

Thanks!

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board. May 3 2018, 8:50 PM

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Gehel claimed this task. May 8 2018, 1:18 PM

Gehel moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Change 431754 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: alert when cirrus writes are frozen for too long

https://gerrit.wikimedia.org/r/431754

gerritbot added a project: Patch-For-Review. May 8 2018, 2:07 PM

Gehel moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. May 8 2018, 2:13 PM

Gehel mentioned this in T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues. May 29 2018, 5:37 PM

Change 431754 merged by Gehel:
[operations/puppet@production] elasticsearch: alert when cirrus writes are frozen for too long

https://gerrit.wikimedia.org/r/431754

Change 437679 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: send frozen writes check over HTTPS

https://gerrit.wikimedia.org/r/437679

Change 437679 merged by Gehel:
[operations/puppet@production] elasticsearch: send frozen writes check over HTTPS

https://gerrit.wikimedia.org/r/437679

Change 437683 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: check frozen writes improvements

https://gerrit.wikimedia.org/r/437683

Change 437683 merged by Gehel:
[operations/puppet@production] elasticsearch: check frozen writes improvements

https://gerrit.wikimedia.org/r/437683