(An Untitled Masterwork)
Active, Public

Authored by Gehel on Apr 3 2019, 3:04 PM.

Tags

None

Referenced Files

	F28570781: raw.txt
	Apr 3 2019, 3:04 PM

Subscribers

None

	gehel@cumin2001:~$ sudo -i cookbook sre.elasticsearch.rolling-reboot search_codfw "reboot for JVM + kernel upgrade" --start-datetime 2019-04-03T12:00:00 --nodes-per-run 3
	START - Cookbook sre.elasticsearch.rolling-reboot
	waiting for clusters to be green
	Fetch 3 node(s) from search_codfw to perform rolling restart on
	Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: elastic[2030,2041,2043].codfw.wmnet
	Disabling Puppet with reason "reboot for JVM + kernel upgrade - gehel@cumin2001" on 3 hosts: elastic[2030,2041,2043].codfw.wmnet
	Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31898>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31710>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31780>]
	Freezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
	Freezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
	Freezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
	Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
	Stopping elasticsearch replication in a safe way on search_codfw
	stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31898>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31710>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31780>]
	stop replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
	Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
	stop replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
	Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
	stop replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
	Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
	flush markers on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
	Not all shards were synced flushed on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>.
	flush markers on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
	flush markers on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
	Depooling elastic[2030,2041,2043].codfw.wmnet
	Rebooting 3 hosts in batches of 3 with 0.0s of sleep in between: elastic[2030,2041,2043].codfw.wmnet
	start replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
	Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
	start replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
	Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
	start replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
	Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
	Unfreezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
	Unfreezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
	Unfreezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
	Enabling Puppet with reason "reboot for JVM + kernel upgrade - gehel@cumin2001" on 3 hosts: elastic[2030,2041,2043].codfw.wmnet
	Exception raised while executing cookbook sre.elasticsearch.rolling-reboot:
	Traceback (most recent call last):
	File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 391, in _make_request
	six.raise_from(e, None)
	File "<string>", line 2, in raise_from
	File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
	httplib_response = conn.getresponse()
	File "/usr/lib/python3.5/http/client.py", line 1198, in getresponse
	response.begin()
	File "/usr/lib/python3.5/http/client.py", line 297, in begin
	version, status, reason = self._read_status()
	File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
	line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
	File "/usr/lib/python3.5/socket.py", line 576, in readinto
	return self._sock.recv_into(b)
	File "/usr/lib/python3.5/ssl.py", line 937, in recv_into
	return self.read(nbytes, buffer)
	File "/usr/lib/python3.5/ssl.py", line 799, in read
	return self._sslobj.read(len, buffer)
	File "/usr/lib/python3.5/ssl.py", line 583, in read
	v = self._sslobj.read(len, buffer)
	socket.timeout: The read operation timed out

	During handling of the above exception, another exception occurred:

	Traceback (most recent call last):
	File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 114, in perform_request
	response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
	File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 643, in urlopen
	_stacktrace=sys.exc_info()[2])
	File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 311, in increment
	raise six.reraise(type(error), error, _stacktrace)
	File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
	raise value
	File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 594, in urlopen
	chunked=chunked)
	File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 393, in _make_request
	self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
	File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 313, in _raise_timeout
	raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
	urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9643): Read timed out. (read timeout=10)

	During handling of the above exception, another exception occurred:

	Traceback (most recent call last):
	File "/usr/lib/python3/dist-packages/curator/actions.py", line 384, in do_action
	self.client.cluster.put_settings(body=self.body)
	File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
	return func(*args, params=params, **kwargs)
	File "/usr/lib/python3/dist-packages/elasticsearch/client/cluster.py", line 155, in put_settings
	params=params, body=body)
	File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 312, in perform_request
	status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
	File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 122, in perform_request
	raise ConnectionTimeout('TIMEOUT', str(e), e)
	elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9643): Read timed out. (read timeout=10))

	During handling of the above exception, another exception occurred:

	Traceback (most recent call last):
	File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 414, in _run
	ret = self.module.run(args, self.spicerack)
	File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-reboot.py", line 31, in run
	reboot
	File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/__init__.py", line 99, in execute_on_clusters
	nodes.pool_nodes()
	File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
	self.gen.throw(type, value, traceback)
	File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 212, in stopped_replication
	yield [stack.enter_context(cluster.stopped_replication()) for cluster in self._clusters]
	File "/usr/lib/python3.5/contextlib.py", line 360, in __exit__
	raise exc_details[1]
	File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
	self.gen.throw(type, value, traceback)
	File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 362, in stopped_replication
	yield
	File "/usr/lib/python3.5/contextlib.py", line 345, in __exit__
	if cb(*exc_details):
	File "/usr/lib/python3.5/contextlib.py", line 261, in _exit_wrapper
	return cm_exit(cm, *exc_details)
	File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
	self.gen.throw(type, value, traceback)
	File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 364, in stopped_replication
	self._start_replication()
	File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 379, in _start_replication
	value='all', wait_for_completion=False)
	File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 391, in _do_cluster_routing
	cluster_routing.do_action()
	File "/usr/lib/python3/dist-packages/curator/actions.py", line 394, in do_action
	report_failure(e)
	File "/usr/lib/python3/dist-packages/curator/utils.py", line 173, in report_failure
	'Exception: {0}'.format(exception)
	curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9643): Read timed out. (read timeout=10))
	END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
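The traceback shows spicerack entering one `stopped_replication()` context manager per cluster through a `contextlib.ExitStack` (`stack.enter_context(cluster.stopped_replication())`). That accounts for the ordering in the log: replication was stopped on ports 9243, 9443 and 9643 and restarted in reverse order, and the timeout on 9643 did not prevent the restart on 9443 and 9243, because `ExitStack` keeps running the remaining exit callbacks and re-raises the error only at the end (`raise exc_details[1]`). The log also shows the indices being unfrozen and Puppet being re-enabled before the cookbook ended with exit code 99. Below is a minimal standard-library sketch of that behaviour. `stopped_replication`, `run`, the `events` list and the simulated `TimeoutError` are hypothetical stand-ins and do not talk to spicerack or Elasticsearch.

```python
from contextlib import ExitStack, contextmanager

# Hypothetical event log, used only to show the ordering of operations.
events = []


@contextmanager
def stopped_replication(port, fail_on_restart=False):
    """Hypothetical stand-in for a per-cluster 'stop replication' context manager."""
    events.append(("stop", port))
    try:
        yield
    finally:
        # Runs on normal exit and when an exception is thrown into the generator.
        events.append(("start", port))
        if fail_on_restart:
            raise TimeoutError("read timeout on port {}".format(port))


def run(ports, failing_port=None):
    """Stop replication on every cluster, 'reboot', then restart in LIFO order."""
    with ExitStack() as stack:
        for port in ports:
            stack.enter_context(stopped_replication(port, port == failing_port))
        events.append(("reboot",))


if __name__ == "__main__":
    try:
        run([9243, 9443, 9643], failing_port=9643)
    except TimeoutError as exc:
        print("cookbook would fail with:", exc)
    for event in events:
        print(event)
```

Running it prints the stops in entry order, the simulated reboot, then the restarts from 9643 back to 9243, and only after all three restarts does the simulated timeout surface.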

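The call that failed was the transient settings update that re-enables shard allocation on the 9643 cluster (`{'transient': {'cluster.routing.allocation.enable': 'all'}}`), which hit the client-side read timeout of 10 seconds. A read timeout only means that no response arrived in time; it does not show whether the cluster applied the setting. Since writing the same transient value twice has the same effect, retrying with a longer timeout is one possible mitigation. The helper below is a hypothetical sketch, not part of spicerack or curator: `put_settings_with_retry`, the timeout schedule, the pause and the `sleep` hook are assumptions. It expects a callable shaped like elasticsearch-py's `client.cluster.put_settings`, which accepts a per-request `request_timeout` keyword.

```python
import time

# Transient setting that re-enables shard allocation, as logged by the cookbook.
START_REPLICATION_BODY = {"transient": {"cluster.routing.allocation.enable": "all"}}


def put_settings_with_retry(put_settings, body, timeouts=(10, 30, 60),
                            retry_on=(TimeoutError,), pause=5, sleep=time.sleep):
    """Call put_settings(body=..., request_timeout=...) with growing timeouts.

    Hypothetical helper: put_settings is any callable shaped like
    elasticsearch-py's client.cluster.put_settings. Retrying is only safe
    because setting the same transient value twice has the same effect.
    """
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for attempt, timeout in enumerate(timeouts, start=1):
        try:
            return put_settings(body=body, request_timeout=timeout)
        except retry_on as exc:
            last_error = exc
            if attempt < len(timeouts):
                sleep(pause)  # give the cluster a moment before the next attempt
    raise last_error


if __name__ == "__main__":
    calls = []

    def flaky_put_settings(body, request_timeout):
        # Fake client call: times out twice, then acknowledges.
        calls.append(request_timeout)
        if len(calls) < 3:
            raise TimeoutError("Read timed out. (read timeout={})".format(request_timeout))
        return {"acknowledged": True}

    print(put_settings_with_retry(flaky_put_settings, START_REPLICATION_BODY, sleep=lambda s: None))
    print(calls)
```

With the real client, `retry_on` would be `elasticsearch.exceptions.ConnectionTimeout`. In this cookbook the call goes through curator, which wraps that error in `curator.exceptions.FailedExecution`, so a retry at that level would need to catch that type instead. Reading the settings back afterwards (`client.cluster.get_settings()`) would confirm that allocation is really set to `all`.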
Event Timeline

Gehel created this paste. Apr 3 2019, 3:04 PM
