Since 16th December 2017 the PAWS system has been misbehaving. First, @Andrew detected a "500 Internal Server Error, Redirect loop detected" issue at https://paws.wmflabs.org/.
On Monday 18th December I detected a 502 Bad Gateway error.
I started looking at the deployment and reading the docs at https://wikitech.wikimedia.org/wiki/PAWS/Tools/Admin, but a fix is not obvious.
I checked several hosts and nginx (both in the paws project and in the tools project) and ended up reading the logs of the hub-deployment pod:
2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
[I 2017-12-17 04:08:17.581 JupyterHub app:1228] noxski still running
2017-12-17 04:08:17,585 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,586 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,586 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,605 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:20,601 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
Also:
Traceback (most recent call last):
  File "/usr/local/bin/cull_idle_servers.py", line 88, in <module>
    loop.run_sync(cull)
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 55, in cull_idle
    resp = yield client.fetch(req)
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 316, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/simple_httpclient.py", line 307, in _on_timeout
    raise HTTPError(599, error_message)
tornado.httpclient.HTTPError: HTTP 599: Timeout while connecting
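To get a rough idea of how the connection-pool warnings compare in volume with the HTTP 599 timeouts, the pod logs could be tallied by message type. The following is only a sketch: the function name `tally_by_pattern` and the labels are made up for illustration, and the substrings are taken from the excerpts above. Note that each pool warning shows up twice in the excerpt (once per log format), so raw counts for it would be doubled.

```python
from collections import Counter


def tally_by_pattern(lines, patterns):
    """Count log lines per label; `patterns` maps label -> substring to look for.

    A line is counted under the first label whose substring it contains,
    and under "other" if it is non-empty and matches nothing.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for label, needle in patterns.items():
            if needle in line:
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts


# Hypothetical labels; the substrings come from the log excerpts in this report.
PATTERNS = {
    "pool_full": "Connection pool is full",
    "timeout_599": "HTTP 599",
}
```

Assuming the output of `kubectl logs` for the hub pod has been saved to a file, the lines of that file can be passed to `tally_by_pattern(open(path), PATTERNS)`.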
Then I restarted the pod with:
kubectl get pod -o yaml hub-deployment-1381799904-b5g5j -n prod | kubectl replace --force -f -
After the restart, another issue appeared in the logs:
[E 2017-12-18 11:59:49.896 JupyterHub app:904] Failed to connect to db: sqlite:///jupyterhub.sqlite
In this case, it seems that something in this container is misconfigured.
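One way to narrow this down would be to check, from inside the container, whether the SQLite file can be opened at all. The sketch below is an assumption about how such a check might look, not something taken from the PAWS setup; `check_sqlite` is a hypothetical helper. Since `sqlite:///jupyterhub.sqlite` is a relative path, the check would have to be run from the hub process's working directory.

```python
import sqlite3
from pathlib import Path


def check_sqlite(path):
    """Try to open an existing SQLite file and run a trivial query.

    Returns (True, "ok") on success, or (False, <error text>) on failure.
    """
    # mode=rw refuses to create a missing file, so a wrong path is reported
    # instead of silently producing a new, empty database.
    uri = Path(path).resolve().as_uri() + "?mode=rw"
    try:
        conn = sqlite3.connect(uri, uri=True, timeout=5)
    except sqlite3.Error as exc:
        return False, str(exc)
    try:
        # Reading the schema table forces SQLite to actually read the file.
        conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()
    return True, "ok"
```

For example, `check_sqlite("jupyterhub.sqlite")` returning `(False, ...)` with an "unable to open" message would point at a missing file, a missing volume or a permissions problem rather than at JupyterHub itself.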