Since 16th December 2017 the PAWS system has been misbehaving. First, @Andrew detected a "500 Internal Server Error, Redirect loop detected" issue at https://paws.wmflabs.org/.
On Monday 18th December I detected a 502 Bad Gateway.
I started looking at the deployment and reading the docs at https://wikitech.wikimedia.org/wiki/PAWS/Tools/Admin, but a fix is not obvious.
I checked several hosts and nginx (both in the paws and in the tools project) and ended up reading the logs of the hub-deployment pod:
2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
[I 2017-12-17 04:08:17.581 JupyterHub app:1228] noxski still running
2017-12-17 04:08:17,585 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,586 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,586 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,605 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:20,601 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
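To get a feel for how often the pool overflows, the warnings can be tallied per second and per host. The following is only a rough sketch: the function name tally_pool_warnings is made up for illustration, and it assumes the log keeps the exact format shown in the excerpt above. Each warning appears twice in the log (once with a timestamp, once with the WARNING:urllib3.connectionpool: prefix), so only the timestamped form is counted to avoid double counting.

```python
import re
from collections import Counter

# Matches only the timestamped form of the warning, for example:
# 2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
# The untimestamped "WARNING:urllib3.connectionpool:..." duplicate is skipped on purpose.
POOL_FULL = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ WARNING "
    r"Connection pool is full, discarding connection: (\S+)"
)


def tally_pool_warnings(lines):
    """Count pool-full warnings per (second, host). Hypothetical helper."""
    counts = Counter()
    for line in lines:
        match = POOL_FULL.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

The input could be the lines of a saved copy of the pod logs, for instance tally_pool_warnings(open("hub.log")), where hub.log is a placeholder file name.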
Also:
Traceback (most recent call last):
File "/usr/local/bin/cull_idle_servers.py", line 88, in <module>
loop.run_sync(cull)
File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 458, in run_sync
return future_cell[0].result()
File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/local/bin/cull_idle_servers.py", line 55, in cull_idle
resp = yield client.fetch(req)
File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 316, in wrapped
ret = fn(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tornado/simple_httpclient.py", line 307, in _on_timeout
raise HTTPError(599, error_message)
tornado.httpclient.HTTPError: HTTP 599: Timeout while connecting
Then I restarted the pod with:
kubectl get pod -o yaml hub-deployment-1381799904-b5g5j -n prod | kubectl replace --force -f -
After the restart, another issue appeared in the logs:
[E 2017-12-18 11:59:49.896 JupyterHub app:904] Failed to connect to db: sqlite:///jupyterhub.sqlite
In this case, it looks like something is misconfigured in this container.
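One thing worth checking is the database path itself. JupyterHub opens its database through SQLAlchemy, where a URL of the form sqlite:///jupyterhub.sqlite is a path relative to the working directory of the hub process. The sketch below is a hypothetical diagnostic, not part of PAWS or JupyterHub: it would have to be run inside the hub container from the same working directory, and the helper name check_sqlite_path is invented here. It reports whether the directory exists and is writable, and whether the file opens as a SQLite database, which may or may not turn out to be the actual cause.

```python
import os
import pathlib
import sqlite3


def check_sqlite_path(db_path="jupyterhub.sqlite"):
    """Report why a SQLite file at db_path might fail to open. Hypothetical helper."""
    abs_path = os.path.abspath(db_path)
    parent = os.path.dirname(abs_path)
    report = {
        "path": abs_path,
        "dir_exists": os.path.isdir(parent),
        "dir_writable": os.access(parent, os.W_OK),
        "file_exists": os.path.isfile(abs_path),
        "opens": False,
        "error": None,
    }
    if not report["file_exists"]:
        # Return early so the check never creates a new, empty database file.
        return report
    try:
        # Open read-only through a file: URI so the check has no side effects.
        uri = pathlib.Path(abs_path).as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            conn.execute("SELECT name FROM sqlite_master LIMIT 1").fetchall()
            report["opens"] = True
        finally:
            conn.close()
    except sqlite3.Error as exc:
        report["error"] = str(exc)
    return report
```

Printing check_sqlite_path() from the hub's working directory would show at a glance whether the relative path resolves to a missing directory, a read-only volume, or a file that is not a valid database.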