Taavi Väänänen reports getting 500s on https://k8s-status.toolforge.org/
Description
Event Timeline
    collection_formats=collection_formats)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 335, in call_api
    _preload_content, _request_timeout)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 166, in __call_api
    _request_timeout=_request_timeout)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 356, in request
    headers=headers)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 241, in GET
    query_params=query_params)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found

  File "./k8s/cache.py", line 60, in wrapper
    r = f(*args, **kwargs)
  File "./k8s/client.py", line 242, in get_pod
    "pod": v1.read_namespaced_pod(name=pod, namespace=namespace),
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19078, in read_namespaced_pod
    (data) = self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 19169, in read_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 335, in call_api
    _preload_content, _request_timeout)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 166, in __call_api
    _request_timeout=_request_timeout)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 356, in request
    headers=headers)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 241, in GET
    query_params=query_params)
  File "/data/project/k8s-status/www/python/venv/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Sun, 07 Feb 2021 20:24:45 GMT', 'Content-Length': '256'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"citationhunt-update-zh-hans-1609905600-kpdn8\" not found","reason":"NotFound","details":{"name":"citationhunt-update-zh-hans-1609905600-kpdn8","kind":"pods"},"code":404}
It hit a pod that had died, and it seems all of the workers then committed suicide. Restarting the service and pinging @bd808: this could use a liveness probe to restart itself, or resilience to pods that 404, maybe? A sketch of the latter follows below.
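For the "resilience to 404 pods" part, a minimal sketch of what that could look like at the Flask layer, not the tool's actual code (the `app` object and `error.html` template here are hypothetical): register an error handler so that an `ApiException` escaping from the Kubernetes client becomes a normal HTTP response instead of an unhandled exception that 500s the page.

```python
from flask import Flask, render_template
from kubernetes.client.rest import ApiException

app = Flask(__name__)


@app.errorhandler(ApiException)
def handle_k8s_api_error(err):
    """Turn upstream Kubernetes API errors into HTTP responses.

    Without a handler like this, an ApiException raised while rendering
    a page propagates as an unhandled exception and the request 500s.
    """
    # Pass an upstream 404 through; treat anything else as a bad gateway.
    status = 404 if err.status == 404 else 502
    return render_template("error.html", reason=err.reason), status
```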
Mentioned in SAL (#wikimedia-cloud) [2021-02-08T15:24:00Z] <bstorm> restarted service T274002
That partial stack trace suggests to me that the tool caches a pod name somewhere and then uses that cached pod name somewhere else without checking whether the pod still exists.
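If that diagnosis is right, one small hardening fix would be to treat a 404 on the cached name as "pod gone" rather than as an error. This is a sketch, not the tool's actual code; only the function's rough shape is borrowed from the `k8s/client.py` frame in the traceback.

```python
from kubernetes import client
from kubernetes.client.rest import ApiException


def get_pod(v1: client.CoreV1Api, namespace: str, name: str):
    """Fetch one pod, treating a 404 as "this pod no longer exists".

    The pod name may come from a cache, so the pod disappearing between
    the cached listing and this lookup is an expected condition.
    """
    try:
        return v1.read_namespaced_pod(name=name, namespace=namespace)
    except ApiException as err:
        if err.status == 404:
            return None  # caller should render a "pod not found" page
        raise
```

The view calling this would then render a friendly "pod not found" page (or redirect back to the namespace listing) when it gets `None` back, instead of 500ing.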
Just a side note while trying to look into this: https://k8s-status.toolforge.org/namespaces/tool-vpsalertmanager/ does seem to consistently return a 500, while other paths eventually work.
K8s-status and openstack-browser have almost no internal error handling and recovery code. That leads to asking "why?", and my answer is that both tools were written quickly and never given any long-term resources for hardening the code outside of the "happy path". It feels like both tools have shown their utility to the Toolforge and Cloud VPS communities, so it also seems reasonable that they could be revisited by folks who have time (paid or volunteer) to fix up some of the rough edges.
As an admin in Toolforge and Cloud VPS I always hated hearing from a maintainer that they had no time to fix a thing they had built because they had moved on to other work. Yet that is exactly what I am now personally declaring for k8s-status and openstack-browser. I love these tools and want them to work, but I also do not have time in the foreseeable future to make more than small drive-by fixes for either.