Paste P22359

ES restart cookbook error T301955

Authored by bking on Mar 10 2022, 10:16 PM.
bking@cumin1001:/srv/deployment/spicerack$ sudo cookbook sre.elasticsearch.rolling-operation relforge "relforge cluster restart" --upgrade --nodes-per-run 1 --without-lvs --start-datetime 2022-03-10T22:03:38 --task-id T301955
START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
waiting for clusters to be green
Fetch 1 node(s) from relforge to perform rolling restart on
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: relforge1003
Disabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1003.eqiad.wmnet
----- OUTPUT of 'disable-puppet "...n1001 - T301955"' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.76s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'disable-puppet "...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5190>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5100>]
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
Stopping elasticsearch replication in a safe way on relforge
stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5190>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5100>]
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
stop all elasticsearch instances on relforge1003.eqiad.wmnet
----- OUTPUT of 'cat /etc/elastic...s systemctl stop' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:03<00:00, 3.13s/hosts]
FAIL | | 0% (0/1) [00:03<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cat /etc/elastic...s systemctl stop'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Starting rolling_operation Operation.UPGRADE on relforge1003.eqiad.wmnet at time 2022-03-10 22:07:00.949941
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Enabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1003.eqiad.wmnet
----- OUTPUT of 'enable-puppet "r...n1001 - T301955"' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.78s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Exception raised while executing cookbook sre.elasticsearch.rolling-operation:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
response.begin()
File "/usr/lib/python3.9/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 237, in perform_request
response = self.pool.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='relforge1004.eqiad.wmnet', port=9243): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 533, in _freeze_writes
self._elasticsearch.index(
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/__init__.py", line 369, in index
return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
status, headers_response, data = connection.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 249, in perform_request
raise ConnectionTimeout("TIMEOUT", str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='relforge1004.eqiad.wmnet', port=9243): Read timed out. (read timeout=10))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-operation.py", line 174, in run
logger.info('Cluster not yet green, thawing writes and resume waiting for green')
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 213, in frozen_writes
yield [stack.enter_context(cluster.frozen_writes(reason)) for cluster in self._clusters]
File "/usr/lib/python3.9/contextlib.py", line 513, in __exit__
raise exc_details[1]
File "/usr/lib/python3.9/contextlib.py", line 498, in __exit__
if cb(*exc_details):
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 514, in frozen_writes
self._freeze_writes(reason)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 540, in _freeze_writes
raise ElasticsearchClusterError("Encountered error while creating document to freeze cluster writes") from e
spicerack.elasticsearch_cluster.ElasticsearchClusterError: Encountered error while creating document to freeze cluster writes
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
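
Note on the failure above: the cookbook dies in spicerack's frozen-writes cleanup. Deleting the frozen-writes marker document failed first ("error while deleting document to unfreeze cluster writes"), the retry then re-indexed the marker and hit the 10s read timeout against relforge1004:9243, so the final freeze state is unclear; shard allocation, however, had already been set back to 'all' (the "start replication" lines). A minimal sketch, not what spicerack itself runs, for checking cluster state by hand after such a failure, using the same elasticsearch-py client and endpoints seen in the log; the 30-second timeout is an arbitrary assumption chosen to get past the 10s read timeout seen above:

```python
# Sketch only: check the transient allocation setting that the cookbook
# flips ('primaries' while a node restarts, 'all' afterwards) plus cluster
# health after a failed run. Host/ports come from the log above; timeout=30
# is an assumption, not what spicerack uses.
from elasticsearch import Elasticsearch

for port in (9243, 9443):
    es = Elasticsearch(
        [{"host": "relforge1004.eqiad.wmnet", "port": port, "use_ssl": True}],
        timeout=30,
    )
    settings = es.cluster.get_settings(flat_settings=True)
    print(port, settings.get("transient", {}).get("cluster.routing.allocation.enable"))
    print(port, es.cluster.health()["status"])
```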

Event Timeline

bking@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation relforge "relforge cluster restart" \
--upgrade --nodes-per-run 1 --without-lvs --start-datetime 2022-03-15T21:54:40 --task-id T301955
START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
waiting for clusters to be green
Fetch 1 node(s) from relforge to perform rolling restart on
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: relforge1004
Disabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1004.eqiad.wmnet

----- OUTPUT of 'disable-puppet "...n1001 - T301955"' -----

PASS |███████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.76s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'disable-puppet "...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef3151c0>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef315130>]
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
Stopping elasticsearch replication in a safe way on relforge
stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef3151c0>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef315130>]
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
stop all elasticsearch instances on relforge1004.eqiad.wmnet

----- OUTPUT of 'cat /etc/elastic...s systemctl stop' -----

PASS |███████████████████████████████████████████████| 100% (1/1) [00:03<00:00, 3.21s/hosts]
FAIL | | 0% (0/1) [00:03<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cat /etc/elastic...s systemctl stop'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Starting rolling_operation Operation.UPGRADE on relforge1004.eqiad.wmnet at time 2022-03-15 21:56:45.212698
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Enabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1004.eqiad.wmnet

----- OUTPUT of 'enable-puppet "r...n1001 - T301955"' -----

PASS |███████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.75s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

File "/usr/lib/python3/dist-packages/curator/utils.py", line 175, in report_failure
  raise exceptions.FailedExecution(

curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.9</center>\r\n</body>\r\n</html>\r\n')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/curator/actions.py", line 496, in do_action
  self.client.cluster.put_settings(body=self.body)
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
  return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/cluster.py", line 184, in put_settings
  return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsear
  self._elasticsearch.delete(
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
  return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/__init__.py", line 1375, in delete
  return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
  status, headers_response, data = connection.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 257, in perform_request
  self._raise_error(response.status, raw_data)
File "/usr/lib/python3/dist-packages/elasticsearch/connection/base.py", line 181, in _raise_error
  raise HTTP_EXCEPTIONS.get(status_code, TransportError)(

elasticsearch.exceptions.TransportError: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>

  return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
  status, headers_response, data = connection.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 257, in perform_request
  self._raise_error(response.status, raw_data)
File "/usr/lib/python3/dist-packages/elasticsearch/connection/base.py", line 181, in _raise_error
  raise HTTP_EXCEPTIONS.get(status_code, TransportError)(

elasticsearch.exceptions.TransportError: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.9</center>\r\n</body>\r\n</html>\r\n')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 213, in frozen_writes
  yield [stack.enter_context(cluster.frozen_writes(reason)) for cluster in self._clusters]
File "/usr/lib/python3.9/contextlib.py", line 513, in __exit__
  raise exc_details[1]
File "/usr/lib/python3.9/contextlib.py", line 498, in __exit__
  if cb(*exc_details):
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
  self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 514, in frozen_writes
  self._freeze_writes(reason)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 540, in _freeze_writes
  raise ElasticsearchClusterError("Encountered error while creating document to freeze cluster writes") from e

spicerack.elasticsearch_cluster.ElasticsearchClusterError: Encountered error while creating document to freeze cluster writes
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
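
In this second run the same cleanup fails with 502s instead of timeouts: the HTML error pages are produced by the nginx (nginx/1.13.9) fronting the Elasticsearch instances, which is what you see when the proxy answers but its backend does not, and this time the node being restarted (relforge1004) is also the host every cluster API call goes through. A hedged sketch for telling the two failure modes apart from a client host (endpoints as above; the exception classes are elasticsearch-py's own):

```python
# Sketch: distinguish "proxy up, elasticsearch down" (TransportError 502,
# as in the second run) from "nothing answering / timeout" (ConnectionError
# and its ConnectionTimeout subclass, as in the first run).
from elasticsearch import Elasticsearch, TransportError, ConnectionError

for port in (9243, 9443):
    es = Elasticsearch(
        [{"host": "relforge1004.eqiad.wmnet", "port": port, "use_ssl": True}],
        timeout=10,
    )
    try:
        print(port, es.cluster.health()["status"])
    except ConnectionError as exc:  # includes ConnectionTimeout
        print(port, "no usable connection:", exc)
    except TransportError as exc:
        # HTTP error from whatever answered; 502 here means the nginx proxy
        # replied but the elasticsearch instance behind it did not.
        print(port, "HTTP error:", exc.status_code)
```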