Alert when elasticsearch writes are frozen for too long
Closed, Resolved · Public

Description

When we freeze writes to elasticsearch, jobs pile up in the job queue. During the recent T193112 issue, writes were not thawed in time, which caused "all type of issues". Monitoring of the job queue raised an alert, but we should also raise an alert on the elasticsearch side, before the freeze becomes a problem for the job queue.

@dcausse proposed adding a timestamp to the frozen index entry. This timestamp could be monitored, and an alert raised if the freeze lasts longer than 1h.

Event Timeline

Gehel created this task. May 2 2018, 8:29 AM
Restricted Application added a subscriber: Aklapper.

After https://gerrit.wikimedia.org/r/430441 it will work fairly simply. Each cluster can have the following request issued:

GET https://search.svc.<cluster>.wmnet:9243/mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything

This will return either something like:

{
  "_index": "mw_cirrus_metastore_1525208216",
  "_type": "mw_cirrus_metastore",
  "_id": "freeze-everything",
  "found": false
}

Or

{
  "_index": "mw_cirrus_metastore_1525314062",
  "_type": "mw_cirrus_metastore",
  "_id": "freeze-everything",
  "_version": 1,
  "found": true,
  "_source": {
    "host": "mediawiki-vagrant",
    "timestamp": 1525314130
  }
}
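
The two responses above suggest how a check could decide between OK and alert. The following is only a minimal sketch, not the check merged in operations/puppet: the function names `evaluate_freeze` and `fetch_freeze_doc`, the Nagios-style exit codes, and the assumption that the timestamp is in epoch seconds and that a missing document comes back as an HTTP 404 with a JSON body are all illustrative assumptions. The 1h threshold comes from the task description.

```python
import json
import urllib.error
import urllib.request

MAX_FREEZE_SECONDS = 3600  # 1h threshold proposed in the task description


def evaluate_freeze(doc, now, max_age=MAX_FREEZE_SECONDS):
    """Map a freeze-everything response to a (exit_code, message) pair.

    Exit codes follow the Nagios convention (0 OK, 2 CRITICAL, 3 UNKNOWN);
    that convention is an assumption, not taken from the task.
    """
    if not doc.get("found", False):
        return 0, "OK: writes are not frozen"
    source = doc.get("_source", {})
    timestamp = source.get("timestamp")
    if timestamp is None:
        return 3, "UNKNOWN: freeze entry has no timestamp"
    age = int(now - timestamp)  # assumes timestamp is in epoch seconds
    host = source.get("host", "unknown host")
    if age > max_age:
        return 2, f"CRITICAL: writes frozen for {age}s (by {host})"
    return 0, f"OK: writes frozen for {age}s (by {host})"


def fetch_freeze_doc(base_url, timeout=10):
    """Fetch the freeze-everything document from one cluster (hypothetical helper)."""
    url = f"{base_url}/mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumed: a missing document is answered with 404 plus a JSON body
        # containing "found": false, so the body is still worth parsing.
        return json.load(err)
```

A check would then call something like `evaluate_freeze(fetch_freeze_doc(base_url), time.time())` once per cluster, with `base_url` set to that cluster's `https://search.svc.<cluster>.wmnet:9243` endpoint. Keeping the decision logic separate from the HTTP call makes the threshold easy to test without a live cluster.
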
RobH triaged this task as Normal priority. May 3 2018, 4:07 PM
RobH added a subscriber: RobH.

As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in Operations to determine whether any are critical or whether they are normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct.

Thanks!

Change 431754 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: alert when cirrus writes are frozen for too long

https://gerrit.wikimedia.org/r/431754

Change 431754 merged by Gehel:
[operations/puppet@production] elasticsearch: alert when cirrus writes are frozen for too long

https://gerrit.wikimedia.org/r/431754

Change 437679 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: send frozen writes check over HTTPS

https://gerrit.wikimedia.org/r/437679

Change 437679 merged by Gehel:
[operations/puppet@production] elasticsearch: send frozen writes check over HTTPS

https://gerrit.wikimedia.org/r/437679

Change 437683 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: check frozen writes improvements

https://gerrit.wikimedia.org/r/437683

Change 437683 merged by Gehel:
[operations/puppet@production] elasticsearch: check frozen writes improvements

https://gerrit.wikimedia.org/r/437683

Deployed and seems to be working

debt closed this task as Resolved. Jun 14 2018, 8:09 PM
Vvjjkkii renamed this task from Alert when elasticsearch writes are frozen for too long to etdaaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Gehel as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from etdaaaaaaa to Alert when elasticsearch writes are frozen for too long. Jul 2 2018, 2:12 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Gehel.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: gerritbot, Aklapper.