Event Timeline
bking@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation relforge "relforge cluster restart" \
--upgrade --nodes-per-run 1 --without-lvs --start-datetime 2022-03-15T21:54:40 --task-id T301955
START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
waiting for clusters to be green
Fetch 1 node(s) from relforge to perform rolling restart on
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: relforge1004
Disabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1004.eqiad.wmnet
- OUTPUT of 'disable-puppet "...n1001 - T301955"' -----
PASS |███████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.76s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'disable-puppet "...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef3151c0>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef315130>]
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
Stopping elasticsearch replication in a safe way on relforge
stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef3151c0>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef315130>]
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
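The "stop replication" step above works by flipping the transient `cluster.routing.allocation.enable` setting to `primaries` (and later back to `all`). A minimal sketch of the request body involved — the helper name is hypothetical, but the setting keys match the log lines above:

```python
# Hedged sketch: build the PUT /_cluster/settings body the cookbook applies to
# pause shard replication ("primaries") and restore it ("all").
# routing_allocation_body is an illustrative helper, not spicerack's API.
def routing_allocation_body(enable: str) -> dict:
    """Return the transient settings body for a given allocation mode."""
    valid = ("all", "primaries", "new_primaries", "none")
    if enable not in valid:
        raise ValueError(f"unsupported allocation mode: {enable}")
    return {"transient": {"cluster.routing.allocation.enable": enable}}

# With an elasticsearch-py client, the call would be roughly:
#   es.cluster.put_settings(body=routing_allocation_body("primaries"))
```

Restricting allocation to primaries keeps the cluster writable while preventing shard copies from being rebuilt onto other nodes during the restart window.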
stop all elasticsearch instances on relforge1004.eqiad.wmnet
- OUTPUT of 'cat /etc/elastic...s systemctl stop' -----
PASS |███████████████████████████████████████████████| 100% (1/1) [00:03<00:00, 3.21s/hosts]
FAIL | | 0% (0/1) [00:03<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cat /etc/elastic...s systemctl stop'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Starting rolling_operation Operation.UPGRADE on relforge1004.eqiad.wmnet at time 2022-03-15 21:56:45.212698
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
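The "Could not unfreeze writes, trying to freeze and unfreeze again" lines show the cookbook's one-shot recovery path: if deleting the freeze marker document fails, it re-freezes and retries the unfreeze. A minimal sketch of that control flow — `freeze` and `unfreeze` are hypothetical callables standing in for spicerack's internals, and the error handling is simplified:

```python
# Hedged sketch of the retry visible in the log: on a failed unfreeze,
# re-freeze and attempt the unfreeze exactly once more. If the second
# attempt also fails, the exception propagates (as it did in this incident).
def unfreeze_with_retry(freeze, unfreeze) -> None:
    try:
        unfreeze()
    except Exception:
        # "Could not unfreeze writes, trying to freeze and unfreeze again"
        freeze()
        unfreeze()
```

In this run both clusters hit the retry path, and the second freeze attempt is what ultimately raised `ElasticsearchClusterError` below.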
Enabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1004.eqiad.wmnet
- OUTPUT of 'enable-puppet "r...n1001 - T301955"' -----
PASS |███████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.75s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
  File "/usr/lib/python3/dist-packages/curator/utils.py", line 175, in report_failure
    raise exceptions.FailedExecution(
curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.9</center>\r\n</body>\r\n</html>\r\n')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/curator/actions.py", line 496, in do_action
    self.client.cluster.put_settings(body=self.body)
  File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/elasticsearch/client/cluster.py", line 184, in put_settings
    return self.transport.perform_request(
  File "/usr/lib/python3/dist-packages/elasticsear
    self._elasticsearch.delete(
  File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/elasticsearch/client/__init__.py", line 1375, in delete
    return self.transport.perform_request(
  File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 257, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/base.py", line 181, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.TransportError: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.9</center>\r\n</body>\r\n</html>\r\n')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
    yield [stack.enter_context(cluster.frozen_writes(reason)) for cluster in self._clusters]
  File "/usr/lib/python3.9/contextlib.py", line 513, in __exit__
    raise exc_details[1]
  File "/usr/lib/python3.9/contextlib.py", line 498, in __exit__
    if cb(*exc_details):
  File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 514, in frozen_writes
    self._freeze_writes(reason)
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 540, in _freeze_writes
    raise ElasticsearchClusterError("Encountered error while creating document to freeze cluster writes") from e
spicerack.elasticsearch_cluster.ElasticsearchClusterError: Encountered error while creating document to freeze cluster writes
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
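The root cause above is a transient 502 from the nginx TLS proxy in front of Elasticsearch while the node was restarting: the marker-document write/delete raced the restart window. A bounded retry with exponential backoff around such calls would ride out a brief proxy outage. A minimal sketch — the function names and retry policy are illustrative, not spicerack's actual API, and real code would match `TransportError(502, ...)` specifically rather than any `Exception`:

```python
import time

# Hedged sketch: retry a callable a bounded number of times with exponential
# backoff, re-raising the last exception if every attempt fails. Suitable for
# riding out a transient 502 while a proxied Elasticsearch node restarts.
def retry_transient(fn, attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real code: only retry TransportError 502/503
            last_exc = exc
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_exc
```

Injecting `sleep` as a parameter keeps the helper testable and lets callers cap or randomize the backoff; the exit_code=99 failure here would have been avoided if the marker-document delete had been retried past the restart window.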