Icinga reports read time out error for some checks on cloudelastic cluster
Closed, Resolved (Public)

Description

Looking at the Icinga UNKNOWN errors, the shard size check and the unassigned shards check are failing with:
UNKNOWN - HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=4)

However, running the check manually with an increased timeout on one of the nodes that reported the failure does not reproduce the error.
We should figure out why this is happening, as the cloudelastic hosts do not show a very high load.

https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cloudelastic&var-instance=All&from=now-7d&to=now

Event Timeline

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). Aug 12 2019, 4:36 PM
Restricted Application added a subscriber: Aklapper.
Mathew.onipe triaged this task as Normal priority. Aug 14 2019, 2:12 PM
Mathew.onipe added a project: Operations.
Mathew.onipe added a comment. Edited Aug 14 2019, 2:27 PM

After some conversation with @EBernhardson, we found that dumps are currently being loaded into the cloudelastic cluster (https://phabricator.wikimedia.org/T220625), and this may be related to the slow response time. Heavy indexing is going on in this cluster (port 9200), which causes the Icinga check requests to time out.
We also think this slow response time should not impact users.

We will increase the timeout for now and revert it to the default (4s) once the indexing/dump loading is complete.

PS: the cloudelastic clusters run at a far lower capacity than our main clusters, which contributes to the response time as well.
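
For illustration, here is a minimal sketch of what a check with a configurable timeout could look like. This is not the actual check script or the puppet patch: the function names, the `--url` and `--timeout` option names, and the use of the standard library's `urllib` (the real check's error message suggests it uses `requests`) are all assumptions. It queries the Elasticsearch `_cluster/health` endpoint and maps a timeout to the Icinga UNKNOWN state, as seen in the alerts above.

```python
import argparse
import json
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_args(argv=None):
    """Parse options; --url and --timeout are hypothetical option names."""
    parser = argparse.ArgumentParser(description="Unassigned shards check (sketch)")
    parser.add_argument("--url", default="http://localhost:9200",
                        help="base URL of the Elasticsearch node")
    parser.add_argument("--timeout", type=float, default=4.0,
                        help="HTTP timeout in seconds (default: 4)")
    return parser.parse_args(argv)


def check_unassigned_shards(base_url, timeout, opener=urllib.request.urlopen):
    """Return (exit_code, message) based on the cluster health response.

    `opener` is injectable so the function can be exercised without a
    running cluster; by default it is urllib.request.urlopen.
    """
    url = base_url.rstrip("/") + "/_cluster/health"
    try:
        with opener(url, timeout=timeout) as response:
            health = json.load(response)
    except (OSError, ValueError) as exc:
        # Timeouts and connection errors are OSError subclasses;
        # ValueError covers a response body that is not valid JSON.
        return UNKNOWN, "UNKNOWN - {}".format(exc)

    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        return CRITICAL, "CRITICAL - {} unassigned shards".format(unassigned)
    return OK, "OK - no unassigned shards"


def main(argv=None):
    """Run the check, print the status line, and return the exit code."""
    args = parse_args(argv)
    code, message = check_unassigned_shards(args.url, args.timeout)
    print(message)
    return code
```

A real plugin would finish with `sys.exit(main())` so that Icinga sees the exit code. With this shape, the temporary workaround amounts to passing a larger value, e.g. `--timeout 10`, in the check command and dropping the option again once the dump loading is done.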

Change 529806 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: add timeout option to elastic checks

https://gerrit.wikimedia.org/r/529806

Change 529806 merged by Gehel:
[operations/puppet@production] icinga: add timeout option to elastic checks

https://gerrit.wikimedia.org/r/529806

Change 530256 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: add the option separator for elastic shard size alerts

https://gerrit.wikimedia.org/r/530256

Change 530256 merged by Gehel:
[operations/puppet@production] icinga: add the option separator for elastic shard size alerts

https://gerrit.wikimedia.org/r/530256

debt closed this task as Resolved. Sep 5 2019, 6:32 PM
debt claimed this task.