
PawsJupyterHubDown alert flapping
Closed, Resolved · Public

Description

[2022-10-04 21:04:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:09:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:29:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:34:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:35:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 21:40:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:13:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:33:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:37:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:47:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 22:55:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:00:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:15:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:20:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:25:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:30:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:32:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:37:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:46:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-04 23:56:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:01:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:06:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:21:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:26:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:43:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:48:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 00:59:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 01:19:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 01:35:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 01:40:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:21:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:26:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:32:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 02:37:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 04:34:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 04:39:50] <wmcs-alerts> (PawsJupyterHubDown) resolved: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[2022-10-05 04:54:50] <wmcs-alerts> (PawsJupyterHubDown) firing: PAWS JupyterHub is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown

Event Timeline

The hub pod is failing its health check:

Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  16s (x5 over 66m)  kubelet  Readiness probe failed: Get "http://192.168.45.252:8081/hub/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The failed health check was due to startup fuzziness. 'Client.Timeout exceeded while awaiting headers' suggests an issue with the scraper getting the data it wants, perhaps a problem between the PAWS Prometheus instance and the PAWS cluster.
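
To tell a slow hub apart from a dead one, it can help to repeat by hand the same request the kubelet makes, from a host that can reach the pod network. A minimal sketch, assuming the pod IP/port from the event above; the 1-second timeout is an assumption, not the actual probe configuration:

import requests

# Hypothetical manual re-run of the readiness probe; pod IP and port are taken
# from the kubelet event above. The timeout value is assumed, not the real probe setting.
HUB_HEALTH_URL = "http://192.168.45.252:8081/hub/health"

try:
    resp = requests.get(HUB_HEALTH_URL, timeout=1)
    print(resp.status_code, resp.text)
except requests.exceptions.Timeout:
    # Matches the kubelet's "context deadline exceeded": the hub accepts the
    # connection but is too slow to return headers.
    print("hub health endpoint timed out")
except requests.exceptions.ConnectionError as err:
    # A refused or reset connection would instead point at the hub process being down.
    print("connection failed:", err)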

I found this in the logs:

aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod logs deployment/hub
[..]
[E 2022-10-05 09:57:03.153 JupyterHub reflector:350] Error when watching resources, retrying in 0.2s
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/reflector.py", line 286, in _watch_and_update
        resource_version = self._list_and_update()
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/reflector.py", line 227, in _list_and_update
        initial_resources = getattr(self.api, self.list_method_name)(**kwargs)
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 14825, in list_namespaced_event
        return self.list_namespaced_event_with_http_info(namespace, **kwargs)  # noqa: E501
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 14936, in list_namespaced_event_with_http_info
        return self.api_client.call_api(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
        return self.__call_api(resource_path, method,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
        response_data = self.request(
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 373, in request
        return self.rest_client.GET(url,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 239, in GET
        return self.request("GET", url,
      File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 233, in request
        raise ApiException(http_resp=r)
    kubernetes.client.exceptions.ApiException: (500)
    Reason: Internal Server Error
    HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'bd21b1bc-3979-402f-864c-32c11eecc5f2', 'X-Kubernetes-Pf-Prioritylevel-Uid': '3b13b59b-d963-4c3e-b458-e88aca1d8028', 'Date': 'Wed, 05 Oct 2022 09:57:03 GMT', 'Content-Length': '122'})
    HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: request timed out","code":500}\n'
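
The call that fails in the traceback is CoreV1Api.list_namespaced_event, and the 500 body says etcd timed out, which suggests the hub is a victim of a control-plane problem rather than the cause. A sketch of repeating the same list call from outside the pod to check whether the apiserver/etcd path is still timing out, assuming a kubeconfig with read access to the prod namespace:

from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

# Repeat the list call from the traceback above (list_namespaced_event in the
# "prod" namespace). Sketch only: assumes a working kubeconfig on the control node.
config.load_kube_config()
v1 = client.CoreV1Api()

try:
    events = v1.list_namespaced_event(namespace="prod", limit=5)
    print("ok,", len(events.items), "events returned")
except ApiException as err:
    # A 500 with "etcdserver: request timed out" here would confirm the problem
    # is in the control plane rather than in the hub pod itself.
    print("apiserver error:", err.status, err.reason)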

Found this in the logs:

aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod logs deployment/proxy 
[..]
10:02:59.420 [ConfigProxy] error: 503 GET /hub/metrics read ECONNRESET
10:02:59.427 [ConfigProxy] error: Uncaught Exception: Cannot set headers after they are sent to the client
10:02:59.428 [ConfigProxy] error: Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at ServerResponse.setHeader (_http_outgoing.js:561:11)
    at /srv/configurable-http-proxy/lib/configproxy.js:475:34
    at Array.map (<anonymous>)
    at ClientRequest.<anonymous> (/srv/configurable-http-proxy/lib/configproxy.js:473:46)
    at Object.onceWrapper (events.js:482:26)
    at ClientRequest.emit (events.js:375:28)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:647:27)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:126:17)
    at HTTPParser.execute (<anonymous>)
    at Socket.socketOnData (_http_client.js:515:22)
10:03:00.092 [ConfigProxy] error: 503 GET /user/Arturoborrero/static/lab/main.cbec7eacf35e133ecd7f.js.map read ECONNRESET
10:03:00.101 [ConfigProxy] error: Uncaught Exception: Cannot set headers after they are sent to the client
10:03:00.101 [ConfigProxy] error: Error [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at ServerResponse.setHeader (_http_outgoing.js:561:11)
    at /srv/configurable-http-proxy/lib/configproxy.js:475:34
    at Array.map (<anonymous>)
    at ClientRequest.<anonymous> (/srv/configurable-http-proxy/lib/configproxy.js:473:46)
    at Object.onceWrapper (events.js:482:26)
    at ClientRequest.emit (events.js:375:28)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:647:27)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:126:17)
    at HTTPParser.execute (<anonymous>)

There are thousands of similar entries.

Mentioned in SAL (#wikimedia-cloud) [2022-10-05T10:18:21Z] <arturo> aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod rollout restart deployment/proxy (T319366)

The restart seems to have solved the problem.

aborrero@paws-k8s-control-1:~$ sudo -i kubectl -n prod logs deployment/proxy
[...]
10:18:23.044 [ConfigProxy] info: 201 POST /api/routes/user/Ainali 
10:18:23.045 [ConfigProxy] info: Adding route /user/Arturoborrero -> http://192.168.45.229:8888
10:18:23.045 [ConfigProxy] info: Route added /user/Arturoborrero -> http://192.168.45.229:8888
10:18:23.045 [ConfigProxy] info: 201 POST /api/routes/user/Arturoborrero 

The errors are gone.

aborrero triaged this task as Medium priority. Oct 5 2022, 10:27 AM

This was likely caused by the NFS servers being removed by Puppet while the NFS mounts remained in place. The flapping was probably due to one of the HAproxy nodes being wedged by a stale NFS mount while the other kept working. In that state Prometheus could get its data about half the time: the services weren't down, but were only reachable about half the time.
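
An illustration (not the real PawsJupyterHubDown rule) of why roughly 50% availability produces exactly this firing/resolved churn rather than a steady alert; the thresholds and one-check-per-minute cadence are assumptions:

import random

random.seed(42)

FAIL_THRESHOLD = 3   # consecutive failed checks before "firing" (assumed, not the real rule)
OK_THRESHOLD = 3     # consecutive good checks before "resolved" (assumed)

state = "ok"
bad = good = 0
transitions = 0
for minute in range(240):             # four hours, one check per minute
    check_ok = random.random() < 0.5  # ~half the requests land on the healthy HAproxy node
    bad = 0 if check_ok else bad + 1
    good = good + 1 if check_ok else 0
    if state == "ok" and bad >= FAIL_THRESHOLD:
        state, transitions = "firing", transitions + 1
    elif state == "firing" and good >= OK_THRESHOLD:
        state, transitions = "ok", transitions + 1

print("alert state changed", transitions, "times in 4h")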

Much of this, though not all of it, would be solved as a side effect of T308873. The main limitation of T308873 is that we mount NFS straight onto the nodes and use a host mount from there. We don't have to do that, and had we not done so, the control and ingress nodes would not have had the above issue. HAproxy would, however, not be fixed by that change. It may be wise to investigate removing the HAproxy nodes in favor of connections going directly to the k8s cluster.