While restarting the Elasticsearch cluster, I hit a few timeouts when stopping and starting replication, even though the settings change had actually been applied. It should be possible to catch the timeout, check whether the setting change was applied, and then either fail gracefully or exit successfully.
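A minimal sketch of the catch-and-verify idea, not the actual es-tool code. `apply_change` and `read_setting` are hypothetical callables standing in for whatever es-tool uses to talk to the cluster, and the default exception types are an assumption, since the HTTP library in use may raise its own timeout class:

```python
import socket


def apply_setting_with_verification(
    apply_change,
    read_setting,
    expected,
    timeout_errors=(socket.timeout, TimeoutError),
):
    """Apply a setting change; on timeout, check whether it took effect anyway.

    apply_change   -- hypothetical callable that sends the settings update
    read_setting   -- hypothetical callable that returns the current value
    expected       -- value the setting should have after the change
    timeout_errors -- exception types treated as a timeout (assumption)

    Returns True if the change is known to be applied, False otherwise.
    """
    try:
        apply_change()
        return True
    except timeout_errors:
        # The request timed out, but the cluster may still have applied the
        # change, so read the setting back before deciding to fail.
        return read_setting() == expected
```

A caller could then exit successfully on `True` and print a clear error and exit non-zero on `False`, instead of dying with a traceback.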
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Improve robustness of es-tool | operations/puppet | production | +28 -20
Event Timeline
I think we should increase the timeout rather than trying to catch the failure (although perhaps both?). Elasticsearch uses a default 30s master timeout when none is specified, but it looks like the HTTP library es-tool was using timed out after 10s. Ideally we should keep these two timeouts in lockstep.
The master timeout for Elasticsearch is not configurable as a cluster-wide setting. Instead, individual actions need to provide a master_timeout=2m query string parameter. Within CirrusSearch we have adjusted a few of the calls that commonly time out to provide a 2m (2 minute) timeout.
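As a rough illustration of keeping the two timeouts in lockstep, the sketch below derives both the `master_timeout` query parameter and the client-side HTTP timeout from one value. The function name, the `margin_s` parameter, and the choice of the `_cluster/settings` endpoint are assumptions made for the example, not taken from es-tool:

```python
from urllib.parse import urlencode

MASTER_TIMEOUT_SECONDS = 120  # the 2m value mentioned above


def settings_request(base_url, master_timeout_s=MASTER_TIMEOUT_SECONDS, margin_s=5):
    """Return (url, client_timeout_s) for a cluster settings update.

    The client timeout is the master timeout plus a small margin (an
    assumption), so the HTTP client does not give up before Elasticsearch
    itself reports a master timeout.
    """
    query = urlencode({"master_timeout": f"{master_timeout_s}s"})
    url = f"{base_url}/_cluster/settings?{query}"
    return url, master_timeout_s + margin_s
```

The returned timeout would be passed to whatever HTTP client sends the request, so that changing one number adjusts both sides.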
Change 282472 had a related patch set uploaded (by Adedommelin):
Improve robustness of es-tool