Improve robustness of es-tool
Closed, Resolved (Public)

Description

While restarting the Elasticsearch cluster, I hit a few timeouts when stopping and starting replication, even though the settings change was actually applied. es-tool should catch the timeout, check whether the setting change was applied, and then either exit successfully or fail gracefully.
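A minimal sketch of that catch-and-verify flow, not the actual es-tool code. The names `RequestTimeout`, `apply_setting_checked`, `apply` and `read_current` are hypothetical: `apply` stands for whatever call pushes the setting to the cluster, and `read_current` for whatever call reads it back.

```python
class RequestTimeout(Exception):
    """Hypothetical stand-in for the timeout error raised by the HTTP client."""


def apply_setting_checked(apply, read_current, desired):
    """Apply a setting; if the request times out, check whether it took effect.

    Returns True when the setting is known to be applied, False otherwise.
    """
    try:
        apply(desired)
        return True
    except RequestTimeout:
        # The cluster may have applied the change even though the
        # response never reached us, so read the value back.
        return read_current() == desired
```

The caller can then exit with success when this returns True and print a clear error otherwise, instead of dying with a traceback on the timeout.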

Event Timeline

I think we should increase the timeout rather than trying to catch the failure (although perhaps both?). Elasticsearch uses a default 30s master timeout when none is specified, but it looks like the HTTP library es-tool was using timed out after 10s. Ideally we should keep these two timeouts in lockstep.

The master timeout for Elasticsearch is not configurable as a cluster-wide setting. Instead, individual actions need to provide a master_timeout=2m query string parameter. Within CirrusSearch we have adjusted a few of the calls that commonly time out to provide a 2m (2 minute) timeout.
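As a rough illustration of keeping the two timeouts in lockstep, the sketch below derives both the master_timeout query parameter and the HTTP client timeout from a single value. The helper name `settings_request` and the `margin_s` padding are assumptions made for this example, not something taken from es-tool or CirrusSearch; 120s is the same duration as the 2m mentioned above.

```python
from urllib.parse import urlencode


def settings_request(base_url, master_timeout_s=120, margin_s=5):
    """Return (url, client_timeout_s) for a cluster settings request.

    The client timeout is the master timeout plus a small margin
    (an assumption of this sketch), so Elasticsearch gets the chance
    to answer before the HTTP client gives up.
    """
    query = urlencode({"master_timeout": f"{master_timeout_s}s"})
    url = f"{base_url.rstrip('/')}/_cluster/settings?{query}"
    return url, master_timeout_s + margin_s


url, client_timeout = settings_request("http://localhost:9200")
```

The returned `client_timeout` would then be passed as the timeout of whichever HTTP library sends the request, so that changing one number changes both sides.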

Change 282472 had a related patch set uploaded (by Adedommelin):
Improve robustness of es-tool

https://gerrit.wikimedia.org/r/282472

Change 282472 merged by Gehel:
Improve robustness of es-tool

https://gerrit.wikimedia.org/r/282472

removing from discovery backlog as it is already implemented

Deskana triaged this task as Medium priority.