As a maintainer of the search infrastructure, I want more precise metrics on the errors that occur between CirrusSearch and Elasticsearch so that I can better understand problems on the cluster.
CirrusSearch failures are currently categorized into 3 buckets: rejected, failed and unknown. The unknown bucket currently sees 1 error/minute, so it would be useful to know what these errors are, especially whether they relate to indexing documents.
Graph: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=9
AC:
- errors in the unknown bucket should be exceptional (close to 0/day)
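
To illustrate the kind of breakdown this task asks for, here is a minimal Python sketch (not CirrusSearch's actual code, which is PHP). It assumes each failure carries an Elasticsearch error `type` string; the `BUCKETS` mapping and the `classify` and `summarize` helpers are hypothetical, and the bucket assigned to each type is only an example. The point is that anything falling into unknown is also counted per type, so the most frequent unclassified errors become visible.

```python
from collections import Counter

# Hypothetical mapping from Elasticsearch error "type" strings to buckets.
# The assignments below are illustrative assumptions, not CirrusSearch's rules.
BUCKETS = {
    "es_rejected_execution_exception": "rejected",
    "search_phase_execution_exception": "failed",
    "mapper_parsing_exception": "failed",
}


def classify(error_type):
    """Return the bucket for an error type, or "unknown" if it is not mapped."""
    return BUCKETS.get(error_type, "unknown")


def summarize(error_types):
    """Count errors per bucket and, separately, per type within "unknown"."""
    buckets = Counter()
    unknown_types = Counter()
    for error_type in error_types:
        bucket = classify(error_type)
        buckets[bucket] += 1
        if bucket == "unknown":
            # Keep the raw type so the unknown bucket can be broken down later.
            unknown_types[error_type] += 1
    return buckets, unknown_types


if __name__ == "__main__":
    sample = [
        "es_rejected_execution_exception",
        "version_conflict_engine_exception",
        "version_conflict_engine_exception",
        "mapper_parsing_exception",
    ]
    buckets, unknown_types = summarize(sample)
    print(dict(buckets))
    print(unknown_types.most_common())
```

Each type that shows up often in the per-type breakdown can then be given its own bucket, which should shrink unknown toward the 0/day target.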