Paste P8339

(An Untitled Masterwork)

Authored by Gehel on Apr 3 2019, 3:04 PM.
gehel@cumin2001:~$ sudo -i cookbook sre.elasticsearch.rolling-reboot search_codfw "reboot for JVM + kernel upgrade" --start-datetime 2019-04-03T12:00:00 --nodes-per-run 3
START - Cookbook sre.elasticsearch.rolling-reboot
waiting for clusters to be green
Fetch 3 node(s) from search_codfw to perform rolling restart on
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: elastic[2030,2041,2043].codfw.wmnet
Disabling Puppet with reason "reboot for JVM + kernel upgrade - gehel@cumin2001" on 3 hosts: elastic[2030,2041,2043].codfw.wmnet
Freezing writes on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31898>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31710>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31780>]
Freezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
Freezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
Wait for a minimum time of 60sec to make sure all CirrusSearch writes are terminated
Stopping elasticsearch replication in a safe way on search_codfw
stopping replication on [<spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31898>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31710>, <spicerack.elasticsearch_cluster.ElasticsearchCluster object at 0x7f1836f31780>]
stop replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
stop replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'primaries'}}
flush markers on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
Not all shards were synced flushed on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>.
flush markers on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
flush markers on <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
Depooling elastic[2030,2041,2043].codfw.wmnet
Rebooting 3 hosts in batches of 3 with 0.0s of sleep in between: elastic[2030,2041,2043].codfw.wmnet
start replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
start replication - <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
Updating cluster settings: {'transient': {'cluster.routing.allocation.enable': 'all'}}
Unfreezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9643, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9443, 'use_ssl': True}])>
Unfreezing all indices in <Elasticsearch([{'host': 'search.svc.codfw.wmnet', 'port': 9243, 'use_ssl': True}])>
Enabling Puppet with reason "reboot for JVM + kernel upgrade - gehel@cumin2001" on 3 hosts: elastic[2030,2041,2043].codfw.wmnet
Exception raised while executing cookbook sre.elasticsearch.rolling-reboot:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 391, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.5/http/client.py", line 1198, in getresponse
response.begin()
File "/usr/lib/python3.5/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.5/socket.py", line 576, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.5/ssl.py", line 937, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.5/ssl.py", line 799, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.5/ssl.py", line 583, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 114, in perform_request
response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 643, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 311, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
raise value
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 594, in urlopen
chunked=chunked)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 393, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 313, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9643): Read timed out. (read timeout=10)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/curator/actions.py", line 384, in do_action
self.client.cluster.put_settings(body=self.body)
File "/usr/lib/python3/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/lib/python3/dist-packages/elasticsearch/client/cluster.py", line 155, in put_settings
params=params, body=body)
File "/usr/lib/python3/dist-packages/elasticsearch/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 122, in perform_request
raise ConnectionTimeout('TIMEOUT', str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9643): Read timed out. (read timeout=10))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 414, in _run
ret = self.module.run(args, self.spicerack)
File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-reboot.py", line 31, in run
reboot
File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/__init__.py", line 99, in execute_on_clusters
nodes.pool_nodes()
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 212, in stopped_replication
yield [stack.enter_context(cluster.stopped_replication()) for cluster in self._clusters]
File "/usr/lib/python3.5/contextlib.py", line 360, in __exit__
raise exc_details[1]
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 362, in stopped_replication
yield
File "/usr/lib/python3.5/contextlib.py", line 345, in __exit__
if cb(*exc_details):
File "/usr/lib/python3.5/contextlib.py", line 261, in _exit_wrapper
return cm_exit(cm, *exc_details)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 364, in stopped_replication
self._start_replication()
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 379, in _start_replication
value='all', wait_for_completion=False)
File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 391, in _do_cluster_routing
cluster_routing.do_action()
File "/usr/lib/python3/dist-packages/curator/actions.py", line 394, in do_action
report_failure(e)
File "/usr/lib/python3/dist-packages/curator/utils.py", line 173, in report_failure
'Exception: {0}'.format(exception)
curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9643): Read timed out. (read timeout=10))
END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
