[2022-10-04 21:04:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:09:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:29:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:34:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:35:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:40:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:13:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:33:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:37:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:47:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:55:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:00:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:15:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:20:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:25:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:30:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:32:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:37:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:46:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:56:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:01:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:06:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:21:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:26:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:43:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:48:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:59:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 01:19:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 01:35:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 01:40:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:21:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:26:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:32:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:37:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 04:34:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 04:39:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 04:54:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
Description
Related Objects
- Mentioned In
  - T320222: Simplify network stack
  - T319350: nbserve high load
- Mentioned Here
  - T308873: Investigate moving PAWS to magnum
Event Timeline
The hub pod is failing its health check:

Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  16s (x5 over 66m)  kubelet  Readiness probe failed: Get "http://192.168.45.252:8081/hub/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The health check failure was due to startup fuzziness. 'Client.Timeout exceeded while awaiting headers' suggests an issue with the scraper getting the data that it wants, maybe a problem between the PAWS Prometheus instance and the PAWS cluster.
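To separate "hub is actually down" from "the probe can't reach it in time", the readiness probe can be reproduced by hand. A minimal sketch of what the kubelet does (an HTTP GET against /hub/health with a hard deadline); the pod IP is the one from the events above, so this has to run from a node with access to the pod network:

```shell
# Mimic the kubelet readiness probe: GET /hub/health with a 3s deadline.
probe() {
  if curl -fsS --max-time 3 "$1" >/dev/null 2>&1; then
    echo "probe: OK"
  else
    echo "probe: FAIL"   # timeout, refused, or non-2xx -- same as the kubelet sees
  fi
}

# Pod IP taken from the Unhealthy event above.
probe "http://192.168.45.252:8081/hub/health"
```

Running it repeatedly during a flap would show whether the endpoint is intermittently slow or hard down.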
I found this in the logs:
aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod logs deployment/hub
[..]
[E 2022-10-05 09:57:03.153 JupyterHub reflector:350] Error when watching resources, retrying in 0.2s
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/reflector.py", line 286, in _watch_and_update
        resource_version = self._list_and_update()
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/reflector.py", line 227, in _list_and_update
        initial_resources = getattr(self.api, self.list_method_name)(**kwargs)
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 14825, in list_namespaced_event
        return self.list_namespaced_event_with_http_info(namespace, **kwargs)  # noqa: E501
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 14936, in list_namespaced_event_with_http_info
        return self.api_client.call_api(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
        return self.__call_api(resource_path, method,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
        response_data = self.request(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 373, in request
        return self.rest_client.GET(url,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 239, in GET
        return self.request("GET", url,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 233, in request
        raise ApiException(http_resp=r)
    kubernetes.client.exceptions.ApiException: (500)
    Reason: Internal Server Error
    HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'bd21b1bc-3979-402f-864c-32c11eecc5f2', 'X-Kubernetes-Pf-Prioritylevel-Uid': '3b13b59b-d963-4c3e-b458-e88aca1d8028', 'Date': 'Wed, 05 Oct 2022 09:57:03 GMT', 'Content-Length': '122'})
    HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: request timed out","code":500}\n'
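Note that the 500 body is a standard Kubernetes Status object, and the actual root cause ("etcdserver: request timed out", i.e. the control plane, not JupyterHub itself) is buried in its "message" field. A quick way to pull that field out of captured logs, as a sed sketch that assumes the message contains no escaped quotes:

```shell
# Status body captured from the hub logs above.
body='{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: request timed out","code":500}'

# Extract the "message" field (simple sed; fine for this shape of JSON).
echo "$body" | sed -n 's/.*"message":"\([^"]*\)".*/\1/p'
```

This prints `etcdserver: request timed out`, which points the investigation at etcd/apiserver health rather than the hub pod.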
Found this in the logs:
aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod logs deployment/proxy
[..]
10:02:59.420 [ConfigProxy] error: 503 GET /hub/metrics read ECONNRESET
10:02:59.427 [ConfigProxy] error: Uncaught Exception: Cannot set headers after they are sent to the client
10:02:59.428 [ConfigProxy] error: Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at ServerResponse.setHeader (_http_outgoing.js:561:11)
    at /srv/configurable-http-proxy/lib/configproxy.js:475:34
    at Array.map (<anonymous>)
    at ClientRequest.<anonymous> (/srv/configurable-http-proxy/lib/configproxy.js:473:46)
    at Object.onceWrapper (events.js:482:26)
    at ClientRequest.emit (events.js:375:28)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:647:27)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:126:17)
    at HTTPParser.execute (<anonymous>)
    at Socket.socketOnData (_http_client.js:515:22)
10:03:00.092 [ConfigProxy] error: 503 GET /user/Arturoborrero/static/lab/main.cbec7eacf35e133ecd7f.js.map read ECONNRESET
10:03:00.101 [ConfigProxy] error: Uncaught Exception: Cannot set headers after they are sent to the client
10:03:00.101 [ConfigProxy] error: Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at ServerResponse.setHeader (_http_outgoing.js:561:11)
    at /srv/configurable-http-proxy/lib/configproxy.js:475:34
    at Array.map (<anonymous>)
    at ClientRequest.<anonymous> (/srv/configurable-http-proxy/lib/configproxy.js:473:46)
    at Object.onceWrapper (events.js:482:26)
    at ClientRequest.emit (events.js:375:28)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:647:27)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:126:17)
    at HTTPParser.execute (<anonymous>)
There are thousands of similar entries.
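To gauge the scale without scrolling, the captured proxy logs can be counted and grouped by error type. A sketch over an inline sample of the lines above; in practice the same filters would be fed from `kubectl -n prod logs deployment/proxy`:

```shell
# Inline sample of the ConfigProxy error lines from above.
log='10:02:59.420 [ConfigProxy] error: 503 GET /hub/metrics read ECONNRESET
10:02:59.427 [ConfigProxy] error: Uncaught Exception: Cannot set headers after they are sent to the client
10:03:00.092 [ConfigProxy] error: 503 GET /user/Arturoborrero/static/lab/main.cbec7eacf35e133ecd7f.js.map read ECONNRESET'

# How many connection resets, and how many error lines overall.
printf '%s\n' "$log" | grep -c 'ECONNRESET'
printf '%s\n' "$log" | grep -c 'error:'
```

On the sample this prints 2 and 3; against the real logs the first number is what shows the "thousands of similar entries".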
Mentioned in SAL (#wikimedia-cloud) [2022-10-05T10:18:21Z] <arturo> aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod rollout restart deployment/proxy (T319366)
The restart seems to have solved the problem.
aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod logs deployment/proxy
[...]
10:18:23.044 [ConfigProxy] info: 201 POST /api/routes/user/Ainali
10:18:23.045 [ConfigProxy] info: Adding route /user/Arturoborrero -> http://192.168.45.229:8888
10:18:23.045 [ConfigProxy] info: Route added /user/Arturoborrero -> http://192.168.45.229:8888
10:18:23.045 [ConfigProxy] info: 201 POST /api/routes/user/Arturoborrero
The errors are gone.
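A quick sanity check that the proxy actually recovered: fresh logs should show routes being (re)added and no new connection resets. A sketch over an inline sample of the post-restart log above; in practice it would be fed the live deployment logs:

```shell
# Inline sample of the post-restart ConfigProxy log.
recent='10:18:23.044 [ConfigProxy] info: 201 POST /api/routes/user/Ainali
10:18:23.045 [ConfigProxy] info: Route added /user/Arturoborrero -> http://192.168.45.229:8888'

# Healthy if routes are being added and no resets are showing up.
if printf '%s\n' "$recent" | grep -q 'Route added' \
   && ! printf '%s\n' "$recent" | grep -q 'ECONNRESET'; then
  echo "proxy healthy"
else
  echo "proxy still erroring"
fi
```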
This was likely an issue with the NFS servers being removed by Puppet while the NFS mounts remained. The flapping was probably due to one of the HAproxy nodes being wedged by a stale NFS mount while the other worked fine. In this state Prometheus could get its data about half the time; the services weren't down, but were only reachable about half the time.
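A stale NFS mount wedges a node precisely because stat(2) on the mount point hangs indefinitely; wrapping the check in `timeout` turns the hang into a detectable failure. A watchdog along these lines could have flagged the wedged HAproxy node (sketch; `/tmp` stands in for the real NFS mount point, which is not named in this task):

```shell
# check_mount: report whether a mount point answers stat(2) within a deadline.
# A stale NFS mount hangs forever; `timeout` converts that into a failure.
check_mount() {
  if timeout 5 stat "$1" >/dev/null 2>&1; then
    echo "$1: responsive"
  else
    echo "$1: stale or unreachable"
  fi
}

check_mount /tmp
```

Run periodically on each HAproxy node, this would distinguish "node wedged on NFS" from "service actually down".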
Much of this, though not all of it, would be solved as a side effect of T308873. The main limitation of T308873 is that we mount NFS straight into the nodes and use a hostPath mount from there. We don't have to do that, and had we not done so, the control and ingress nodes would not have had the above issue. HAproxy, however, would not be fixed by this. It may be wise to investigate removing the HAproxy nodes in favor of connections going directly to the k8s cluster.