Publish "pending_tasks" count from Elastic search cluster to graphite
Closed, Resolved · Public

Assigned To

Authored By

	Gehel
	May 3 2016, 9:22 AM

Description

During recent cluster instabilities, we realized that a rising number of pending tasks is a good indication that the cluster is in trouble. Since we don't collect this metric yet, we can't rule out that it also peaks during normal operation (unlikely), and the only way to know is to start collecting it. Once we have graphs, we can think about setting up alerting or adding the metric to our dashboards.

gehel@elastic2001:~$ curl -s https://search.svc.codfw.wmnet:9243/_cluster/health?pretty
{
  "cluster_name" : "production-search-codfw",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 24,
  "number_of_data_nodes" : 24,
  "active_primary_shards" : 2978,
  "active_shards" : 8848,
  "relocating_shards" : 0,
  "initializing_shards" : 5,
  "unassigned_shards" : 144,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
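
As a rough illustration of the idea (not the merged puppet change, which is a one-line addition), the field of interest can be pulled out of a `_cluster/health` response and formatted as one line of Graphite's plaintext protocol (`path value timestamp`). The metric prefix, the helper names, and the Graphite host in the comment are assumptions made for this sketch.

```python
import json
import socket
import time


def pending_tasks_metric(health_json, prefix="elasticsearch", now=None):
    """Build one Graphite plaintext line from a _cluster/health JSON response."""
    health = json.loads(health_json)
    timestamp = int(time.time() if now is None else now)
    # Dots separate path components in Graphite, so keep them out of the cluster name.
    cluster = health["cluster_name"].replace(".", "_")
    value = int(health["number_of_pending_tasks"])
    return f"{prefix}.{cluster}.pending_tasks {value} {timestamp}\n"


def send_to_graphite(line, host, port=2003, timeout=5.0):
    """Send one plaintext-protocol line over TCP (2003 is Graphite's default port)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


# Trimmed-down version of the response shown above.
sample = (
    '{"cluster_name": "production-search-codfw", '
    '"status": "yellow", "number_of_pending_tasks": 0}'
)
line = pending_tasks_metric(sample, now=1462267320)
# send_to_graphite(line, "graphite.example.org")  # hypothetical host, not called here
```

Passing `now` explicitly keeps the formatting step deterministic, which makes it easy to check without a running cluster or Graphite instance.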

Details

	Subject	Repo	Branch	Lines +/-
	Collect pending_tasks count metric from elasticsearch	operations/puppet	production	+1 -0


Related Objects

Mentioned In: rOPUP3049e2634b3d: Collect pending_tasks count metric from elasticsearch

Event Timeline

Gehel created this task. May 3 2016, 9:22 AM

Restricted Application added subscribers: Zppix, Aklapper. May 3 2016, 9:22 AM

EBernhardson subscribed. May 3 2016, 4:28 PM

Change 286756 had a related patch set uploaded (by Gehel):
Collect pending_tasks count metric from elasticsearch

https://gerrit.wikimedia.org/r/286756

gerritbot added a project: Patch-For-Review. May 3 2016, 8:53 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search. May 3 2016, 8:54 PM

Gehel moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Gehel claimed this task. May 3 2016, 9:01 PM

Change 286756 merged by Gehel:
Collect pending_tasks count metric from elasticsearch

https://gerrit.wikimedia.org/r/286756

Gehel moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. May 4 2016, 12:36 PM

Data is now available in the Elasticsearch Grafana dashboard (at the bottom).

Should we also add an Icinga alert for this? (Maybe as a separate task, though.)

I wanted to have a look at the graph before thinking about alerts. Looking at it now that we have a few days of data, it seems that we do have daily peaks. That's a bit unexpected, so I'll take some time to see if I can understand those peaks, and maybe smooth them. If the graphs keep looking like they do at the moment, it seems unlikely that we can find a way to use them for alerting.

The spikes mostly (but not completely) coincide with the daily rebuild of the completion suggester indices (2am to 10am UTC). During that time we peak at ~10k index operations per second (each very lightweight, though). I'm surprised to see that eqiad also climbs outside of those hours. But you might be right that this would be difficult to alert on.
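
One possible way to explore the smoothing idea mentioned above is a trailing median, which suppresses isolated one-sample spikes while keeping sustained rises. This is only a sketch: the sample values and the window size are made up, and a multi-hour peak such as the rebuild window would still show through, which is consistent with alerting being hard here.

```python
from statistics import median


def rolling_median(samples, window):
    """Median over a trailing window; early points use whatever history exists."""
    return [
        median(samples[max(0, i - window + 1): i + 1])
        for i in range(len(samples))
    ]


# Hypothetical pending-task counts: two isolated spikes, then a sustained rise.
samples = [0, 0, 40, 0, 0, 35, 30, 45, 50, 0]
smoothed = rolling_median(samples, window=3)
```

With a window of 3, the lone spike of 40 disappears from `smoothed`, while the sustained run starting at 35 still shows up after a one-sample delay.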

Gehel moved this task from Needs triage to Ops on the Discovery-ARCHIVED board. May 30 2016, 3:57 PM

Gehel mentioned this in rOPUP3049e2634b3d: Collect pending_tasks count metric from elasticsearch. Jun 17 2016, 6:10 PM

Metrics are published and added to the Grafana dashboard. Alerting on them does not seem to make sense at the moment, so I'm closing this task as there is nothing more to do.
