
wikiloves goes in 504 since webservice restart
Closed, Resolved · Public


I restarted the webservice (webservice restart; since then I have also done a stop / start).

The tool now returns, on any URL:

504 Gateway Time-out

The uWSGI logs do not report anything unusual:

*** Starting uWSGI 2.0.7-debian (64bit) on [Wed Mar 27 12:39:42 2019] ***
compiled with version: 4.9.2 on 17 March 2018 15:40:38
os: Linux-4.9.0-0.bpo.6-amd64 #1 SMP Debian 4.9.88-1+deb9u1~bpo8+1 (2018-05-13)
nodename: wikiloves-1326522376-rywwi
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /data/project/wikiloves
detected binary path: /usr/bin/uwsgi-core
your memory page size is 4096 bytes
detected max file descriptor number: 65536
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :8000 fd 3
Python version: 2.7.9 (default, Sep 25 2018, 20:46:16)  [GCC 4.9.2]
Set PythonHome to /data/project/wikiloves/www/python/venv
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x21363b0
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 363840 bytes (355 KB) for 4 cores
*** Operational MODE: preforking ***
mounting /data/project/wikiloves/www/python/src/ on /wikiloves

WSGI app 0 (mountpoint='/wikiloves') ready in 16 seconds on interpreter 0x21363b0 pid: 1 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 9, cores: 1)
spawned uWSGI worker 2 (pid: 10, cores: 1)
spawned uWSGI worker 3 (pid: 11, cores: 1)
spawned uWSGI worker 4 (pid: 12, cores: 1)

Event Timeline

Is there any chance I could have a peek at the nginx logs?

JeanFred renamed this task from wikiloves goes in 504 to wikiloves goes in 504 since webservice restart.Mar 27 2019, 1:08 PM
JeanFred claimed this task.
JeanFred triaged this task as High priority.

I connected into the running container to see if it was working properly:

$ sudo become wikiloves
$ kubectl get po
NAME                         READY     STATUS    RESTARTS   AGE
wikiloves-1326522376-0w0dn   1/1       Running   0          1h
$ kubectl exec -it wikiloves-1326522376-0w0dn /bin/bash
$ curl localhost:8000
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">
<html xmlns="" xml:lang="en" lang="en">

The Kubernetes pod itself, and the Python 2 uWSGI container inside it, seem to be working as expected.

Mentioned in SAL (#wikimedia-cloud) [2019-03-27T15:53:23Z] <bd808> Restarted webservice to see if that fixes connectivity with proxy (T219377)

The kube2proxy service, which registers these webservices with the proxy, had crashed. Thank you for finding this!

I've started it back up, and that seems to be good enough for now.

For reference and analysis later: P8296

Apparently, the Kubernetes API server was down at the time:

Mar 27 11:24:48 tools-k8s-master-01 systemd[1]: Stopping Kubernetes API Server...

and then everything goes to chaos on the k8s end. Then:

Mar 27 11:24:53 tools-k8s-master-01 systemd[1]: Started Kubernetes API Server.

The fact that the server was down for just 5 seconds suggests I can adjust the systemd params on the kube2proxy service so it waits longer before restarting itself. That should make it more resilient to an API server restart. I don't know just now why the API server restarted today, but since that's all it really was, kube2proxy should be able to weather such an event without going hard down.
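The adjustment described above would amount to something like the following systemd drop-in. This is a minimal sketch based on the patch summary only; the unit name, file path, and exact directives are assumptions, not the contents of the actual Puppet change:

```ini
# Hypothetical drop-in: /etc/systemd/system/kube2proxy.service.d/restart.conf
# Keep restarting the service after a failure, but wait 10 seconds
# between attempts so that a brief API-server restart does not
# exhaust systemd's restart burst limit and leave the unit failed.
[Service]
Restart=on-failure
RestartSec=10
```

After dropping this in, a `systemctl daemon-reload` followed by `systemctl restart kube2proxy` would pick up the new settings.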

Change 499538 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kube2proxy: Set a 10 sec wait between service restarts on failure

Change 499538 merged by Bstorm:
[operations/puppet@production] kube2proxy: Set a 10 sec wait between service restarts on failure

Puppet couldn't start the service either, because it was broken for our proxies.

Actually, that last patch seems to have left something still mixed up, and Puppet needs to be working to keep this sane.

Change 499569 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dynamicproxy: expressly convert to integers for error page sizes

Change 499569 abandoned by Bstorm:
dynamicproxy: expressly convert to integers for error page sizes

Fixed by 0e0778b

The change is deployed; closing now.