
Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail
Open, NormalPublic

Description

[18:45]  <icinga-wm>	PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 499 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes


Original report from @JeanFred

Trying to restart either of my tools (via `webservice restart`) results in a ConnectionError.

Example with rTHER (a PHP app):

tools.heritage@tools-sgebastion-07:~$ webservice restart
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 180, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 474, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 272, in _find_obj
    o for o in objs
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 125, in __iter__
    return iter(self.query_cache["objects"])
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 115, in query_cache
    cache["response"] = self.execute().json()
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 99, in execute
    r = self.api.get(**kwargs)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 125, in get
    return self.session.get(*args, **self.get_kwargs(**kwargs))
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 501, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='k8s-master.tools.wmflabs.org', port=6443): Max retries exceeded with url: /api/v1/namespaces/heritage/pods?labelSelector=tools.wmflabs.org%2Fwebservice-version%3D1%2Cname%3Dheritage%2Ctools.wmflabs.org%2Fwebservice%3Dtrue (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f1adccc5c50>: Failed to establish a new connection: [Errno 111] Connection refused',))

and with R1969 (a Python 2.7 app):

tools.wikiloves@tools-sgebastion-07:~$ webservice restart
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 180, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 474, in get_state
    pod = self._find_obj(pykube.Pod, self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 272, in _find_obj
    o for o in objs
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 125, in __iter__
    return iter(self.query_cache["objects"])
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 115, in query_cache
    cache["response"] = self.execute().json()
  File "/usr/lib/python2.7/dist-packages/pykube/query.py", line 99, in execute
    r = self.api.get(**kwargs)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 125, in get
    return self.session.get(*args, **self.get_kwargs(**kwargs))
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 501, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='k8s-master.tools.wmflabs.org', port=6443): Max retries exceeded with url: /api/v1/namespaces/wikiloves/pods?labelSelector=tools.wmflabs.org%2Fwebservice-version%3D1%2Cname%3Dwikiloves%2Ctools.wmflabs.org%2Fwebservice%3Dtrue (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fef82277c50>: Failed to establish a new connection: [Errno 111] Connection refused',))
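Both tracebacks fail in the same place: `webservice` uses pykube to look up the tool's pods by label selector, and the underlying HTTPS request to the Kubernetes API at k8s-master.tools.wmflabs.org:6443 is refused. A minimal sketch of that lookup, not the actual webservice code (the kubeconfig path is an assumption; the labels are taken from the URL in the traceback):

import os
import pykube

# Build a client from the tool's kubeconfig (path assumed; Toolforge tools
# normally have one in $HOME/.kube/config).
api = pykube.HTTPClient(
    pykube.KubeConfig.from_file(os.path.expanduser("~/.kube/config"))
)

# Same query the traceback shows: pods in the tool's namespace, filtered by
# the webservice labels. Iterating the query triggers the GET that fails.
pods = pykube.Pod.objects(api).filter(
    namespace="heritage",
    selector={
        "tools.wmflabs.org/webservice": "true",
        "tools.wmflabs.org/webservice-version": "1",
        "name": "heritage",
    },
)

for pod in pods:
    print(pod.name, pod.obj["status"]["phase"])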


Event Timeline

JeanFred created this task. · Sep 10 2019, 8:53 PM
Restricted Application added a subscriber: Aklapper. · Sep 10 2019, 8:53 PM
bd808 claimed this task. · Sep 10 2019, 9:22 PM
bd808 triaged this task as High priority.
bd808 added subscribers: Phamhi, Bstorm, bd808.

We have been working on this since around 2019-09-10T18:54. The first signal was an email and IRC alert from our monitoring system for a failure of our "k8s/nodes/ready" check.

Caught exception: HTTPSConnectionPool(host='k8s-master.tools.wmflabs.org', port=6443): Max retries exceeded with url: /api/v1/nodes (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))

Within a few minutes we were able to determine that at least part of the problem was the etcd cluster, which tracks state for the Toolforge Kubernetes cluster. @Bstorm and @Phamhi are currently attempting to diagnose and correct the problems with etcd.
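For reference, the "k8s/nodes/ready" check amounts to listing nodes through the API server and verifying that every node reports a Ready condition, which is why a refused connection to k8s-master.tools.wmflabs.org:6443 turns it CRITICAL immediately. A rough sketch of that logic (not the actual toolschecker code; authentication and TLS verification are left out):

import requests

API = "https://k8s-master.tools.wmflabs.org:6443"

def all_nodes_ready(session):
    """Return True only if every node in the cluster reports Ready."""
    resp = session.get(API + "/api/v1/nodes", timeout=10)
    resp.raise_for_status()
    for node in resp.json()["items"]:
        conditions = {c["type"]: c["status"] for c in node["status"]["conditions"]}
        if conditions.get("Ready") != "True":
            return False
    return True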

bd808 raised the priority of this task from High to Unbreak Now!. · Sep 10 2019, 9:22 PM
Restricted Application added a subscriber: Liuxinyu970226. · Sep 10 2019, 9:22 PM
bd808 renamed this task from "webservice restart fails with ConnectionError" to "Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail". · Sep 10 2019, 9:23 PM
bd808 updated the task description. · Sep 10 2019, 9:28 PM
bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
[00:28]  <  bstorm_>	!log tools broke etcd trying to fix it and then restored it

The etcd cluster is back to a functional state. Now investigation moves back to kube-apiserver and its connection to etcd.
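While that continues, each member's health can be polled directly over etcd's HTTP /health endpoint. A quick sketch (the member hostnames are placeholders, and any client-certificate authentication the cluster requires is not handled here):

import requests

# Placeholder hostnames -- substitute the real etcd members.
MEMBERS = [
    "https://etcd-01.example.wmflabs:2379",
    "https://etcd-02.example.wmflabs:2379",
    "https://etcd-03.example.wmflabs:2379",
]

for member in MEMBERS:
    try:
        health = requests.get(member + "/health", timeout=5).json()
        print(member, "health =", health.get("health"))
    except requests.RequestException as exc:
        print(member, "unreachable:", exc)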

bd808 lowered the priority of this task from Unbreak Now! to Normal. · Sep 11 2019, 2:00 AM
[01:30]  <icinga-wm>	RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 4.115 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

kube-apiserver is working as expected again. The TL;DR is that some change, likely part of T171188: Move the main WMCS puppetmaster into the Labs realm, tricked Puppet into installing an old version of the x509 signing cert used to secure communication between the etcd cluster and kube-apiserver. It is currently unclear if this misconfiguration also caused the etcd cluster failure, or if that was an unrelated and unfortunate coincidence.
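A quick way to spot that kind of drift is to compare SHA-256 fingerprints of the certificate files involved (for example, the copy Puppet installed versus a known-good one); a stale version shows up as a differing fingerprint. A minimal, stdlib-only sketch of the fingerprint step (the paths passed on the command line are whatever certs you want to compare, not specific Toolforge paths):

import hashlib
import ssl
import sys

def pem_fingerprint(path):
    """SHA-256 fingerprint of a single-certificate PEM file (bundles not handled)."""
    with open(path) as f:
        der = ssl.PEM_cert_to_DER_cert(f.read())
    return hashlib.sha256(der).hexdigest()

# Usage: python fingerprint.py /path/to/cert-a.pem /path/to/cert-b.pem
for path in sys.argv[1:]:
    print(path, pem_fingerprint(path))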

I'm downgrading this from UBN! to Normal; the remaining work is documenting the outage and creating follow-up tasks (probably mostly documentation tasks).

revi added a subscriber: revi. · Sep 11 2019, 9:56 AM
18:46:22 <revi> job 6568688 (tools.stewardbots) is not being deleted for few minutes, looks like something abnormal? (in the past it has been deleted within minutes)

(the Q I left on #wikimedia-cloud, crossposting here, assuming it's related)

> 18:46:22 <revi> job 6568688 (tools.stewardbots) is not being deleted for few minutes, looks like something abnormal? (in the past it has been deleted within minutes)
>
> (the Q I left on #wikimedia-cloud, crossposting here, assuming it's related)

The SGE exec node hosting your job was locked up. I've restarted the node and verified that your job was deleted. Thanks for reporting the issue.

bd808 updated the task description. · Sep 12 2019, 8:36 PM
bd808 updated the task description. · Sep 12 2019, 8:40 PM