
User reports getting 500 errors on https://k8s-status.toolforge.org
Closed, ResolvedPublic

Description

Taavi Väänänen reports getting 500 errors on https://k8s-status.toolforge.org/

Event Timeline

Phamhi triaged this task as Medium priority.Feb 5 2021, 7:34 PM
Phamhi updated the task description. (Show Details)
  File "./k8s/cache.py", line 60, in wrapper
    r = f(*args, **kwargs)
  File "./k8s/client.py", line 242, in get_pod
    "pod": v1.read_namespaced_pod(name=pod, namespace=namespace),
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19078, in read_namespaced_pod
    (data) = self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19169, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 335, in call_api
    _preload_content, _request_timeout)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 166, in __call_api
    _request_timeout=_request_timeout)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 356, in request
    headers=headers)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 241, in GET
    query_params=query_params)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Sun, 07 Feb 2021 20:24:45 GMT', 'Content-Length': '256'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"citationhunt-update-zh-hans-1609905600-kpdn8\" not found","reason":"NotFound","details":{"name":"citationhunt-update-zh-hans-1609905600-kpdn8","kind":"pods"},"code":404}

It hit a pod that had already died, and all of the web workers crashed as a result, it seems. I restarted the service and am pinging @bd808: this could use a liveness probe so it restarts itself, or better, resilience to pods that 404, maybe?

That partial stack trace suggests to me that the tool caches a pod name somewhere and then tries to use the cached pod name somewhere else without checking whether the pod still exists.

Just a side note while trying to look into this: https://k8s-status.toolforge.org/namespaces/tool-vpsalertmanager/ seems to consistently return a 500, while other paths eventually work.

K8s-status and openstack-browser have almost no internal error handling and recovery code. This leads to asking "why?", and my answer is that both tools were written quickly and never given any long-term resources for hardening the code outside of the "happy path". Both tools have shown their utility to the Toolforge and Cloud VPS communities, so it seems reasonable that they could be revisited by folks with time (paid or volunteer) to fix up some of the rough edges.

As an admin in Toolforge and Cloud VPS, I always hated hearing from a maintainer that they had no time to fix a thing they had built because they had moved on to other work. I am now, however, personally declaring exactly that for k8s-status and openstack-browser. I love these tools and want them to work, but I do not have time in the foreseeable future to make more than small drive-by fixes for either.

taavi claimed this task.