Paste P22359

ES restart cookbook error T301955

Authored by bking on Mar 10 2022, 10:16 PM.
bking@cumin1001:/srv/deployment/spicerack$ sudo cookbook sre.elasticsearch.rolling-operation relforge "relforge cluster restart" --upgrade --nodes-per-run 1 --without-lvs --start-datetime 2022-03-10T22:03:38 --task-id T301955
START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
waiting for clusters to be green
Fetch 1 node(s) from relforge to perform rolling restart on
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: relforge1003
Disabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1003.eqiad.wmnet
----- OUTPUT of 'disable-puppet "...n1001 - T301955"' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.76s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'disable-puppet "...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5190>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5100>]
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
Stopping elasticsearch replication in a safe way on relforge
stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5190>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7fb061ce5100>]
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
stop all elasticsearch instances on relforge1003.eqiad.wmnet
----- OUTPUT of 'cat /etc/elastic...s systemctl stop' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:03<00:00, 3.13s/hosts]
FAIL | | 0% (0/1) [00:03<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cat /etc/elastic...s systemctl stop'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Starting rolling_operation Operation.UPGRADE on relforge1003.eqiad.wmnet at time 2022-03-10 22:07:00.949941
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Enabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1003.eqiad.wmnet
----- OUTPUT of 'enable-puppet "r...n1001 - T301955"' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.78s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Exception raised while executing cookbook sre.elasticsearch.rolling-operation:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
response.begin()
File "/usr/lib/python3.9/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 237, in perform_request
response = self.pool.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 507, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 447, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='relforge1004.eqiad.wmnet', port=9243): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 533, in _freeze_writes
self._elasticsearch.index(
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/__init__.py", line 369, in index
return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
status, headers_response, data = connection.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 249, in perform_request
raise ConnectionTimeout("TIMEOUT", str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='relforge1004.eqiad.wmnet', port=9243): Read timed out. (read timeout=10))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-operation.py", line 174, in run
logger.info('Cluster not yet green, thawing writes and resume waiting for green')
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 213, in frozen_writes
yield [stack.enter_context(cluster.frozen_writes(reason)) for cluster in self._clusters]
File "/usr/lib/python3.9/contextlib.py", line 513, in __exit__
raise exc_details[1]
File "/usr/lib/python3.9/contextlib.py", line 498, in __exit__
if cb(*exc_details):
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 514, in frozen_writes
self._freeze_writes(reason)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 540, in _freeze_writes
raise ElasticsearchClusterError("Encountered error while creating document to freeze cluster writes") from e
spicerack.elasticsearch_cluster.ElasticsearchClusterError: Encountered error while creating document to freeze cluster writes
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
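
Note on the failure above: the cookbook dies in spicerack's frozen-writes cleanup. Deleting the frozen-writes marker document failed first ("error while deleting document to unfreeze cluster writes"), the retry then re-indexed the marker and hit the 10s read timeout against relforge1004:9243, so the final freeze state is unclear; shard allocation, however, had already been set back to 'all' (the "start replication" lines). A minimal sketch, not what spicerack itself runs, for checking cluster state by hand after such a failure, using the same elasticsearch-py client and endpoints seen in the log; the 30-second timeout is an arbitrary assumption chosen to get past the 10s read timeout seen above:

```python
# Sketch only: check the transient allocation setting that the cookbook
# flips ('primaries' while a node restarts, 'all' afterwards) plus cluster
# health after a failed run. Host/ports come from the log above; timeout=30
# is an assumption, not what spicerack uses.
from elasticsearch import Elasticsearch

for port in (9243, 9443):
    es = Elasticsearch(
        [{"host": "relforge1004.eqiad.wmnet", "port": port, "use_ssl": True}],
        timeout=30,
    )
    settings = es.cluster.get_settings(flat_settings=True)
    print(port, settings.get("transient", {}).get("cluster.routing.allocation.enable"))
    print(port, es.cluster.health()["status"])
```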

Event Timeline

bking@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation relforge "relforge cluster restart" \
--upgrade --nodes-per-run 1 --without-lvs --start-datetime 2022-03-15T21:54:40 --task-id T301955
START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
waiting for clusters to be green
Fetch 1 node(s) from relforge to perform rolling restart on
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: relforge1004
Disabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1004.eqiad.wmnet

----- OUTPUT of 'disable-puppet "...n1001 - T301955"' -----

PASS |███████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.76s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'disable-puppet "...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef3151c0>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef315130>]
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
Stopping elasticsearch replication in a safe way on relforge
stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef3151c0>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7ff7ef315130>]
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
flush markers on <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
stop all elasticsearch instances on relforge1004.eqiad.wmnet

----- OUTPUT of 'cat /etc/elastic...s systemctl stop' -----

PASS |███████████████████████████████████████████████| 100% (1/1) [00:03<00:00, 3.21s/hosts]
FAIL | | 0% (0/1) [00:03<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cat /etc/elastic...s systemctl stop'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Starting rolling_operation Operation.UPGRADE on relforge1004.eqiad.wmnet at time 2022-03-15 21:56:45.212698
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9443, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Could not unfreeze writes, trying to freeze and unfreeze again: Encountered error while deleting document to unfreeze cluster writes
Freezing all indices in <Elasticsearch([{'host': 'relforge1004.eqiad.wmnet', 'port': 9243, 'use_ssl': True}])>
Enabling Puppet with reason "relforge cluster restart - bking@cumin1001 - T301955" on 1 hosts: relforge1004.eqiad.wmnet

----- OUTPUT of 'enable-puppet "r...n1001 - T301955"' -----

PASS |███████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.75s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T301955"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

File "/usr/lib/python3/dist-packages/curator/utils.py", line 175, in report_failure
  raise exceptions.FailedExecution(

curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.9</center>\r\n</body>\r\n</html>\r\n')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/curator/actions.py", line 496, in do_action
  self.client.cluster.put_settings(body=self.body)
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
  return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/cluster.py", line 184, in put_settings
  return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsear
  self._elasticsearch.delete(
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 84, in _wrapped
  return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/__init__.py", line 1375, in delete
  return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
  status, headers_response, data = connection.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 257, in perform_request
  self._raise_error(response.status, raw_data)
File "/usr/lib/python3/dist-packages/elasticsearch/connection/base.py", line 181, in _raise_error
  raise HTTP_EXCEPTIONS.get(status_code, TransportError)(

elasticsearch.exceptions.TransportError: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>

  return self.transport.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 351, in perform_request
  status, headers_response, data = connection.perform_request(
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 257, in perform_request
  self._raise_error(response.status, raw_data)
File "/usr/lib/python3/dist-packages/elasticsearch/connection/base.py", line 181, in _raise_error
  raise HTTP_EXCEPTIONS.get(status_code, TransportError)(

elasticsearch.exceptions.TransportError: TransportError(502, '<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor="white">\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.9</center>\r\n</body>\r\n</html>\r\n')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 213, in frozen_writes
  yield [stack.enter_context(cluster.frozen_writes(reason)) for cluster in self._clusters]
File "/usr/lib/python3.9/contextlib.py", line 513, in __exit__
  raise exc_details[1]
File "/usr/lib/python3.9/contextlib.py", line 498, in __exit__
  if cb(*exc_details):
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
  self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 514, in frozen_writes
  self._freeze_writes(reason)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 540, in _freeze_writes
  raise ElasticsearchClusterError("Encountered error while creating document to freeze cluster writes") from e

spicerack.elasticsearch_cluster.ElasticsearchClusterError: Encountered error while creating document to freeze cluster writes
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
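
In this second run the same cleanup fails with 502s instead of timeouts: the HTML error pages are produced by the nginx (nginx/1.13.9) fronting the Elasticsearch instances, which is what you see when the proxy answers but its backend does not, and this time the node being restarted (relforge1004) is also the host every cluster API call goes through. A hedged sketch for telling the two failure modes apart from a client host (endpoints as above; the exception classes are elasticsearch-py's own):

```python
# Sketch: distinguish "proxy up, elasticsearch down" (TransportError 502,
# as in the second run) from "nothing answering / timeout" (ConnectionError
# and its ConnectionTimeout subclass, as in the first run).
from elasticsearch import Elasticsearch, TransportError, ConnectionError

for port in (9243, 9443):
    es = Elasticsearch(
        [{"host": "relforge1004.eqiad.wmnet", "port": port, "use_ssl": True}],
        timeout=10,
    )
    try:
        print(port, es.cluster.health()["status"])
    except ConnectionError as exc:  # includes ConnectionTimeout
        print(port, "no usable connection:", exc)
    except TransportError as exc:
        # HTTP error from whatever answered; 502 here means the nginx proxy
        # replied but the elasticsearch instance behind it did not.
        print(port, "HTTP error:", exc.status_code)
```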