
Publish "pending_tasks" count from Elasticsearch cluster to Graphite
Closed, Resolved (Public)

Description

During recent cluster instabilities, we realized that a rising number of pending tasks is a good indication that the cluster is in trouble. We don't collect this metric yet, so it is possible (though unlikely) that peaks also occur during normal operation; the only way to know is to start collecting it. Once we have graphs, we can think about setting up alerting or adding this metric to our dashboards.

gehel@elastic2001:~$ curl -s https://search.svc.codfw.wmnet:9243/_cluster/health?pretty
{
  "cluster_name" : "production-search-codfw",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 24,
  "number_of_data_nodes" : 24,
  "active_primary_shards" : 2978,
  "active_shards" : 8848,
  "relocating_shards" : 0,
  "initializing_shards" : 5,
  "unassigned_shards" : 144,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
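As a rough illustration of what publishing this value could look like, the sketch below parses a `_cluster/health` response and formats `number_of_pending_tasks` as one line of Graphite's plaintext protocol (`<metric path> <value> <unix timestamp>`). The function name, the `elasticsearch.cluster` prefix, and the resulting metric path are assumptions made for this example; they are not taken from the merged patch.

```python
import json
import time


def pending_tasks_to_graphite(health_json, prefix="elasticsearch.cluster", now=None):
    """Build one Graphite plaintext-protocol line from a _cluster/health response.

    The prefix and metric path layout are hypothetical, chosen for illustration.
    """
    health = json.loads(health_json)
    # Both field names appear in the _cluster/health output shown above.
    cluster = health["cluster_name"]
    pending = int(health["number_of_pending_tasks"])
    timestamp = int(time.time() if now is None else now)
    # Plaintext protocol: "<metric path> <value> <unix timestamp>\n"
    return f"{prefix}.{cluster}.pending_tasks {pending} {timestamp}\n"


# Trimmed-down copy of the response above; the timestamp is an arbitrary example.
sample = '{"cluster_name": "production-search-codfw", "status": "yellow", "number_of_pending_tasks": 0}'
print(pending_tasks_to_graphite(sample, now=1462000000), end="")
```

The resulting line could then be written to a Graphite carbon listener, or the same value could be emitted through whatever metrics collector is already in place.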

Event Timeline

Change 286756 had a related patch set uploaded (by Gehel):
Collect pending_tasks count metric from elasticsearch

https://gerrit.wikimedia.org/r/286756

Change 286756 merged by Gehel:
Collect pending_tasks count metric from elasticsearch

https://gerrit.wikimedia.org/r/286756

Should we also add an Icinga alert for this? (Maybe in a different task, though.)

I wanted to have a look at the graph before thinking about alerts. Looking at it now that we have a few days of data, it seems that we do have daily peaks. That's a bit unexpected, so I'll take some time to see if I can understand those peaks, and maybe smooth them. If the graphs keep looking like they do at the moment, it seems unlikely that we can find a way to use them for alerting.
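One possible way to smooth such peaks, sketched here purely as an illustration, is a rolling median: isolated spikes are suppressed while a sustained rise still shows through. The function name, the window size, and the sample values are all made up for this example and do not come from the actual graphs.

```python
from collections import deque
from statistics import median


def rolling_median(samples, window=5):
    """Return the median of the last `window` samples at each point in the series."""
    recent = deque(maxlen=window)  # automatically drops the oldest sample
    smoothed = []
    for value in samples:
        recent.append(value)
        smoothed.append(median(recent))
    return smoothed


# Made-up pending_tasks samples: one isolated spike, then a sustained rise.
raw = [0, 0, 40, 0, 0, 30, 35, 40, 45, 0]
print(rolling_median(raw, window=3))
```

With a window of 3, the single spike of 40 disappears from the smoothed series, while the run from 30 to 45 remains visible. Whether any window would make the real data usable for alerting is exactly what the graphs would have to show.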

The spikes mostly (but not completely) coincide with the daily rebuild of the completion suggester indices (2am to 10am UTC). During that time we peak at ~10k indexing operations per second (each very lightweight, though). I'm surprised to see that eqiad also climbs outside of those hours. But you might be right that this would be difficult to alert on.

The metrics are now published and have been added to the Grafana dashboard. Alerting on them does not seem to make sense at the moment, so I'm closing this task as there is nothing more to do.