Collect per node latency percentiles on our elasticsearch cirrus clusters
Closed, Resolved, Public

Description

Our elasticsearch clusters expose some metrics specific to us, including per-node latency percentiles. Those are not collected by the standard elasticsearch_exporter. We want to create a new custom exporter for those metrics. The prometheus-blazegraph-exporter can be used as an example / starting point. This exporter has no reason to be reused outside of our deployment, so deploying it directly with puppet is probably fine.
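
For illustration only, here is a minimal sketch of the formatting half of such an exporter: turning per-node percentiles into the Prometheus text exposition format. The input shape (node name mapped to percentile mapped to latency in milliseconds), the metric name `elasticsearch_node_latency_ms`, and the node name `elastic1001` are assumptions made for this example, not the actual output of our clusters or the exporter that was deployed.

```python
from typing import Dict


def render_percentiles(
    nodes: Dict[str, Dict[str, float]],
    metric: str = "elasticsearch_node_latency_ms",  # hypothetical metric name
) -> str:
    """Render {node: {percentile: latency_ms}} as Prometheus text format.

    The input shape is an assumption for this sketch; the real cluster
    response would need to be mapped into it first.
    """
    lines = [
        f"# HELP {metric} Per-node latency percentile in milliseconds.",
        f"# TYPE {metric} gauge",
    ]
    for node in sorted(nodes):
        # Sort percentiles numerically so that "50" comes before "95".
        for pct, value in sorted(nodes[node].items(), key=lambda kv: float(kv[0])):
            lines.append(f'{metric}{{node="{node}",percentile="{pct}"}} {value}')
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data.
    print(render_percentiles({"elastic1001": {"95": 120.0, "50": 30.5}}), end="")
```

A real exporter would also need to fetch the metrics from each cluster and serve the rendered text over HTTP for Prometheus to scrape; that part is left out here.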

Event Timeline

Metrics are now collected. @EBernhardson, if you could have a look and validate that this is what you expected...

I put together a very basic attempt at a first dashboard: https://grafana.wikimedia.org/dashboard/db/elasticsearch-per-node-percentiles?orgId=1
The overall numbers look sane and roughly what is expected.

I suppose the :9109 in the instance names is a bit annoying, in that it makes the list of instances much longer. Really though there are too many instances to list and it needs to be further filtered (top-N?) anyways.
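
To make the top-N idea concrete, here is a small sketch of the logic: strip the port from each instance name, then keep only the N instances with the highest latency. The function name `top_n_instances` and the sample data are hypothetical. In practice this would more likely be done in the dashboard query itself (Prometheus has `label_replace()` and `topk()` for this) rather than in Python.

```python
from typing import Dict, List, Tuple


def top_n_instances(
    latency_by_instance: Dict[str, float], n: int = 5
) -> List[Tuple[str, float]]:
    """Drop the ':port' suffix from instance names and keep the N slowest."""
    # rsplit on the last ':' so 'host:9109' becomes 'host';
    # names without a port are left unchanged.
    stripped = {
        inst.rsplit(":", 1)[0]: value
        for inst, value in latency_by_instance.items()
    }
    # Highest latency first, truncated to n entries.
    return sorted(stripped.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = {"a:9109": 10.0, "b:9109": 30.0, "c:9109": 20.0}
    print(top_n_instances(sample, n=2))
```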

Looks nice!