
paws: 502 bad gateway
Closed, Resolved · Public

Description

Since 16th December 2017 the PAWS system has been misbehaving. First, @Andrew noticed a 500 Internal Server Error ("Redirect loop detected") at https://paws.wmflabs.org/.
On Monday 18th December I started getting a 502 Bad Gateway instead.

I started looking at the deployment and reading the docs at https://wikitech.wikimedia.org/wiki/PAWS/Tools/Admin, but a fix is not obvious.

I checked several hosts and nginx (both in the paws and tools projects) and ended up reading the logs of the hub-deployment pod.
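
For reference, the hub logs can be pulled with something like the following (a sketch; the pod name and the prod namespace are the ones used for the restart further down):

$ kubectl --namespace=prod get pod | grep hub-deployment
$ kubectl --namespace=prod logs hub-deployment-1381799904-b5g5j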

2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
[I 2017-12-17 04:08:17.581 JupyterHub app:1228] noxski still running
2017-12-17 04:08:17,585 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,586 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,586 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:17,605 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1
2017-12-17 04:08:20,601 WARNING Connection pool is full, discarding connection: 10.96.0.1
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: 10.96.0.1

Also:

Traceback (most recent call last):
  File "/usr/local/bin/cull_idle_servers.py", line 88, in <module>
    loop.run_sync(cull)
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 458, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/bin/cull_idle_servers.py", line 55, in cull_idle
    resp = yield client.fetch(req)
  File "/usr/local/lib/python3.5/dist-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/local/lib/python3.5/dist-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 316, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/simple_httpclient.py", line 307, in _on_timeout
    raise HTTPError(599, error_message)
tornado.httpclient.HTTPError: HTTP 599: Timeout while connecting

Then I restarted the pod with kubectl get pod -o yaml hub-deployment-1381799904-b5g5j -n prod | kubectl replace --force -f -. After the restart, another issue appeared in the logs:

[E 2017-12-18 11:59:49.896 JupyterHub app:904] Failed to connect to db: sqlite:///jupyterhub.sqlite

In this case, it seems like something was not well configured in this container.
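
One way to sanity-check that (a sketch; <new-hub-pod> is a placeholder for whatever the restarted hub pod is called, and the assumption that the SQLite file sits in the hub's working directory is mine):

$ kubectl --namespace=prod get pod | grep hub-deployment
$ kubectl --namespace=prod exec -it <new-hub-pod> -- sh -c 'ls -l jupyterhub.sqlite*'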

Event Timeline

So the underlying symptoms: docker won't start, the apiserver seems to have gone away along with it, and so kubelet is confused and unable to do anything.

@yuvipanda was able to take a look and suspects that somehow we picked up a change that disabled the overlayfs module.

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/base/manifests/kernel.pp

The following gets docker to start on the master:

182  2017-12-19 21:24:49 puppet agent --disable 'chase'

remove ban manually

184  2017-12-19 21:25:02 vim /etc/modprobe.d/blacklist-wmf.conf
191  2017-12-19 21:26:20 modprobe overlayfs
192  2017-12-19 21:26:25 echo "overlay" > /etc/modules-load.d/overlay.conf
193  2017-12-19 21:26:56 /sbin/reboot
195  2017-12-19 21:29:15 lsmod | grep overlay
196  2017-12-19 21:29:20 service docker restart
197  2017-12-19 21:29:29 service kubelet restart
199  2017-12-19 21:29:38 docker ps

I don't understand why this one isn't in the same bad state:

root@tools-paws-worker-1001:~# lsmod | grep overlay
overlay                49152  17
root@tools-paws-worker-1001:~# puppet agent --test
Warning: Setting configtimeout is deprecated.
   (at /usr/lib/ruby/vendor_ruby/puppet/settings.rb:1146:in `issue_deprecation_warning')
Info: Using configured environment 'future'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-paws-worker-1001.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1513719360'
Notice: Applied catalog in 6.24 seconds
root@tools-paws-worker-1001:~# cat /etc/issue
Debian GNU/Linux 9 \n \l

root@tools-paws-worker-1001:~# service docker status
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-10-04 00:23:10 UTC; 2 months 16 days ago

/etc/modprobe.d/blacklist-wmf.conf

...

install overlay /bin/true
blacklist overlayfs
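
A quick way to compare the on-disk ban with what is actually loaded and what docker is using (just a sketch; the same commands should work on the master and the workers):

$ grep -i overlay /etc/modprobe.d/blacklist-wmf.conf
$ lsmod | grep overlay
$ docker info 2>/dev/null | grep -i 'storage driver'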

On tools-paws-master-01 as yuvi this now works:

kubectl --namespace=prod get pod

When and how did this module get banned on this instance and why did it suddenly take effect?

I suspect a debugging reboot picked up the pending ban... but that implies preexisting issues that would now surface.

Do we need overlay in other arenas, like the main Toolforge cluster?

I think we don't but that's a bit scary atm

After Chase's manual changes got us back to being able to talk to the service via HTTPS, I poked around and found that the tiller pod was in CrashLoopBackOff.

$ kubectl --namespace=kube-system get pod -o wide | grep -v Running
NAME                                           READY     STATUS             RESTARTS   AGE       IP             NODE
tiller-deploy-968903632-kln7f                  0/1       CrashLoopBackOff   17         12d       10.244.0.71    tools-paws-master-01
$ kubectl --namespace=kube-system describe po/tiller-deploy-968903632-kln7f
...
Error: failed to create symbolic link "/var/log/pods/75cb4062-db74-11e7-930a-fa163e582fc4/tiller_0.log" to the container log file "/var/lib/docker/containers/tiller/tiller-json.log" for container "tiller": symlink /var/lib/docker/containers/tiller/tiller-json.log /var/log/pods/75cb4062-db74-11e7-930a-fa163e582fc4/tiller_0.log: file exists
...
$ kubectl --namespace=kube-system delete po/tiller-deploy-968903632-kln7f
pod "tiller-deploy-968903632-kln7f" deleted
$ kubectl --namespace=kube-system get pod -o wide | grep tiller
tiller-deploy-968903632-t7dz2                  1/1       Running   0          1m        10.244.5.103   tools-paws-worker-1006
$ kubectl --namespace=kube-system get pod -o wide|grep -v Running
NAME                                           READY     STATUS    RESTARTS   AGE       IP             NODE
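
Deleting the pod was the blunt fix; an alternative (a sketch only, not actually run) would have been to remove the stale symlink on the node by hand, using the path from the error above, so the next container start can recreate it:

$ ls -l /var/log/pods/75cb4062-db74-11e7-930a-fa163e582fc4/
$ rm /var/log/pods/75cb4062-db74-11e7-930a-fa163e582fc4/tiller_0.log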

A few minutes later I was able to start, stop, and restart my user's pod again via the web UI.

The website no longer shows the 502, but now shows the 500:

500 : Internal Server Error

Redirect loop detected.

Should we follow up in T175454?

I think it's intermittent... I got a 500 on my first connection, but subsequent loads work correctly for me.

Things were OK for a day or two, but then people began to report errors starting their pods again. I tried restarting the hub-deployment pod to see if that would fix things, but it seems to have only made everything worse.


I think this is fixed now, and PAWS is back up.

Hub pods were in CrashLoopBackOff; I looked at the hub pod logs and found this:

[E 2017-12-28 18:51:08.521 JupyterHub app:1626]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1624, in launch_instance_async
        yield self.start()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1572, in start
        yield self.proxy.check_routes(self.users, self._service_map)
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 293, in check_routes
        routes = yield self.get_all_routes()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 588, in get_all_routes
        resp = yield self.api_request('', client=client)
    tornado.curl_httpclient.CurlError: HTTP 599: Connection timed out after 20000 milliseconds

This looked like the hub was trying to talk to the proxy and failing. The hub and proxy pods don't have any debugging tools like dig and curl, but I found out from Yuvi that the user pods do.

On checking a user pod, I found that all networking was failing: curl to external hosts like Google failed, as did curl to internal IPs. This pointed to pod networking being broken.
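
For anyone reproducing this, the check amounts to something like the following (a sketch; <user-pod> stands for any running user notebook pod in the prod namespace, and 10.96.0.1 is the in-cluster service IP seen in the hub logs above):

$ kubectl --namespace=prod exec -it <user-pod> -- curl -sI https://www.google.com
$ kubectl --namespace=prod exec -it <user-pod> -- curl -skI https://10.96.0.1/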

This has happened once before, due to https://github.com/kubernetes/kubernetes/issues/40182; the symptom then was that the iptables FORWARD chain policy defaulted to DROP instead of ACCEPT on the PAWS nodes. On looking at the paws-master, I found that this was indeed the case.

Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

I ran sudo iptables -P FORWARD ACCEPT, which switched the FORWARD chain to policy ACCEPT, and that in turn seems to have fixed everything. I'm not sure why this policy mysteriously flipped again.
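
For the record, the current policy can be confirmed with a one-liner; after the change the first line of output reads -P FORWARD ACCEPT:

$ sudo iptables -S FORWARD | head -n 1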

The paws-workers seem to have the right rules applied to them for the most part - https://phabricator.wikimedia.org/P6505. Worker 1017 is SchedulingDisabled, so that's okay; 1001 has no rules in its FORWARD chain, which seems weird.
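
A rough way to survey the whole fleet (a sketch; it assumes the workers are named tools-paws-worker-1001 through -1017 and are reachable over ssh):

$ for n in $(seq 1001 1017); do echo "tools-paws-worker-$n:"; ssh "tools-paws-worker-$n" 'sudo iptables -S FORWARD | head -n 1'; done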

For now though, things seem to be happy.

madhuvishy lowered the priority of this task from High to Medium. Dec 28 2017, 8:00 PM
madhuvishy claimed this task.

With T184018, this should be resolved for now. We plan to upgrade to a newer k8s version soon to mitigate the iptables rule issue. Let's reopen or make a new task if this issue happens again.
