
Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail
Closed, ResolvedPublic

Description

[18:45]  <icinga-wm>	PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 499 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes


Original report from @JeanFred

Trying to restart either of my tools (via `webservice restart`) results in a `ConnectionError`.

Example with rTHER (a PHP app):

tools.heritage@tools-sgebastion-07:~$ webservice restart
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 180, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 474, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 272, in _find_obj
    o for o in objs
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 125, in __iter__
    return iter(self.query_cache["objects"])
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 115, in query_cache
    cache["response"] = self.execute().json()
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 99, in execute
    r = self.api.get(**kwargs)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 125, in get
    return self.session.get(*args, **self.get_kwargs(**kwargs))
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 501, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='k8s-master.tools.wmflabs.org', port=6443): Max retries exceeded with url: /api/v1/namespaces/heritage/pods?labelSelector=tools.wmflabs.org%2Fwebservice-version%3D1%2Cname%3Dheritage%2Ctools.wmflabs.org%2Fwebservice%3Dtrue (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f1adccc5c50>: Failed to establish a new connection: [Errno 111] Connection refused',))

and with R1969 (a Python 2.7 app):

tools.wikiloves@tools-sgebastion-07:~$ webservice restart
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 180, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 474, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 272, in _find_obj
    o for o in objs
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 125, in __iter__
    return iter(self.query_cache["objects"])
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 115, in query_cache
    cache["response"] = self.execute().json()
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 99, in execute
    r = self.api.get(**kwargs)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 125, in get
    return self.session.get(*args, **self.get_kwargs(**kwargs))
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 501, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='k8s-master.tools.wmflabs.org', port=6443): Max retries exceeded with url: /api/v1/namespaces/wikiloves/pods?labelSelector=tools.wmflabs.org%2Fwebservice-version%3D1%2Cname%3Dwikiloves%2Ctools.wmflabs.org%2Fwebservice%3Dtrue (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fef82277c50>: Failed to establish a new connection: [Errno 111] Connection refused',))
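
For reference, the failure can be reproduced without going through `webservice` at all by probing the same endpoint that pykube is trying to reach. The short sketch below is a hypothetical diagnostic, not part of the Toolforge tooling; it only assumes the host and port shown in the tracebacks above.

#!/usr/bin/env python
"""Hypothetical probe of the Kubernetes API endpoint taken from the tracebacks
above. Reproduces the connection failure without pykube or webservice."""
import socket

HOST = 'k8s-master.tools.wmflabs.org'  # host from the tracebacks
PORT = 6443                            # kube-apiserver port from the tracebacks

try:
    sock = socket.create_connection((HOST, PORT), timeout=5)
    sock.close()
    print('TCP connection to %s:%d succeeded' % (HOST, PORT))
except socket.error as exc:
    # During the outage this reported "[Errno 111] Connection refused".
    print('TCP connection to %s:%d failed: %s' % (HOST, PORT, exc))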

Event Timeline

bd808 triaged this task as High priority.
bd808 added subscribers: Phamhi, Bstorm, bd808.

We have been working on this since around 2019-09-10T18:54. The first signal was an email and IRC alert from our monitoring system for a failure of our "k8s/nodes/ready" check.

Caught exception: HTTPSConnectionPool(host='k8s-master.tools.wmflabs.org', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))
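
The check that fired boils down to listing the cluster's nodes through the API and asserting that every one reports a Ready condition. The sketch below only illustrates that idea; the actual toolschecker implementation may differ, and the credential path and TLS handling shown here are assumptions.

#!/usr/bin/env python
"""Illustrative nodes-ready check: list nodes via the Kubernetes API and flag
any that are not Ready. Credential path and TLS handling are hypothetical."""
import requests

API = 'https://k8s-master.tools.wmflabs.org:6443'  # endpoint from this task
TOKEN_PATH = '/path/to/serviceaccount/token'       # hypothetical credential file

with open(TOKEN_PATH) as f:
    token = f.read().strip()

resp = requests.get(API + '/api/v1/nodes',
                    headers={'Authorization': 'Bearer ' + token},
                    verify=False,  # a real check should verify against the cluster CA
                    timeout=10)
resp.raise_for_status()

not_ready = []
for node in resp.json().get('items', []):
    conditions = {c['type']: c['status'] for c in node['status']['conditions']}
    if conditions.get('Ready') != 'True':
        not_ready.append(node['metadata']['name'])

print('OK' if not not_ready else 'CRITICAL: not ready: ' + ', '.join(not_ready))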

Within a few minutes we were able to determine that at least part of the problem was the etcd cluster which tracks state for the Toolforge Kubernetes cluster. @Bstorm and @Phamhi are currently attempting to diagnose and correct the problems with etcd.
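
While that diagnosis was in progress, the quickest signal on the etcd side is each member's /health endpoint. A minimal sketch follows, assuming placeholder member hostnames and the default client port; the real member list and any client-certificate requirements are not shown.

#!/usr/bin/env python
"""Illustrative etcd health poll. Member hostnames are placeholders, not the
actual Toolforge etcd hosts; client TLS options are omitted for brevity."""
import requests

MEMBERS = ['etcd-01.example', 'etcd-02.example', 'etcd-03.example']  # hypothetical
CLIENT_PORT = 2379  # default etcd client port

for member in MEMBERS:
    url = 'https://%s:%d/health' % (member, CLIENT_PORT)
    try:
        # A real check would pass client certs via cert= and the CA via verify=.
        resp = requests.get(url, timeout=5, verify=False)
        print('%s: %s' % (member, resp.text.strip()))
    except requests.exceptions.RequestException as exc:
        print('%s: unreachable (%s)' % (member, exc))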

bd808 raised the priority of this task from High to Unbreak Now!. Sep 10 2019, 9:22 PM
bd808 renamed this task from webservice restart fails with ConnectionError to Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail. Sep 10 2019, 9:23 PM
[00:28]  <  bstorm_>	!log tools broke etcd trying to fix it and then restored it

The etcd cluster is back to a functional state. Now investigation moves back to kube-apiserver and its connection to etcd.

bd808 lowered the priority of this task from Unbreak Now! to Medium. Sep 11 2019, 2:00 AM
[01:30]  <icinga-wm>	RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 4.115 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

kube-apiserver is working as expected again. The TL;DR is that some change, likely part of T171188: Move the main WMCS puppetmaster into the Labs realm, tricked Puppet into installing an old version of the x509 signing cert used to secure communication between the etcd cluster and kube-apiserver. It is currently unclear if this misconfiguration also caused the etcd cluster failure, or if that was an unrelated and unfortunate coincidence.
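
One way to spot the stale-certificate condition described above is to compare fingerprints of the certificate files on the affected hosts against the copy they are supposed to match. The sketch below uses only the standard library; the default file path is a placeholder rather than the real Puppet-managed location.

#!/usr/bin/env python
"""Illustrative helper: print SHA-256 fingerprints of PEM certificate files so
copies on different hosts (or in the Puppet source) can be diffed quickly.
The default path is a hypothetical placeholder."""
import hashlib
import ssl
import sys

def fingerprint(path):
    with open(path) as f:
        der = ssl.PEM_cert_to_DER_cert(f.read())
    return hashlib.sha256(der).hexdigest()

for path in sys.argv[1:] or ['/path/to/etcd-ca.pem']:
    print('%s  %s' % (fingerprint(path), path))

Run against the signing cert on the kube-apiserver host and against the copy the etcd members trust, a mismatch between the two fingerprints would point directly at this kind of misconfiguration.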

I'm downgrading this from UBN! to Normal, with the remaining work being documenting the outage and creating follow-up tasks (probably mostly documentation tasks).

18:46:22 <revi> job 6568688 (tools.stewardbots) is not being deleted for few minutes, looks like something abnormal? (in the past it has been deleted within minutes)

(the Q I left on #wikimedia-cloud, crossposting here, assuming it's related)

The SGE exec node hosting your job was locked up. I've restarted the node and verified that your job was deleted. Thanks for reporting the issue.