I'm currently receiving a 500 Internal Server Error message.
Description
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2020-08-04T05:45:12Z] <wm-bot> <root> restarted webservice (T259560)
I restarted it and it's up for now, but I'm going to leave this task open for @bd808 to investigate why it went down.
The log has basically two sets of error messages:
...
2020-08-04 03:38:32: (gw_backend.c.335) child signalled: 9
2020-08-04 03:38:33: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.sal-1 after connect failed: Connection refused
...
2020-08-04 05:43:47: (http-header-glue.c.1250) read(): Connection reset by peer
2020-08-04 05:43:47: (gw_backend.c.2149) response not received, request sent: 1199 on socket: unix:/var/run/lighttpd/php.socket.sal-1 for /index.php?, closing connection
The first message in the logs about the fcgi container being down starts at 2020-07-27 07:18:35. There is nothing material in the error.log leading up to those errors. It looks like only one of the 2 fcgi processes died, so I would guess that ~50% of requests were returning errors between 2020-07-27 07:18:35 and 2020-08-04 05:44:29.
Ideally I think we would figure out a proper health check for the Toolforge Kubernetes PHP containers that makes the container restart when lighttpd dies (this already happens now because it is PID 1 in the container) or when the fcgi container dies (what happened here). Even better might be to separate the fcgi processes into their own container within the pod, like https://github.com/jitesoft/docker-lighttpd does.
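To make the health check idea a bit more concrete, here is a rough sketch of a probe script. It is not an existing Toolforge check. It assumes the probe can reach the webservice over HTTP from inside the container; the URL, port 8000, and the attempt count of 4 are placeholders I made up. It sends several requests rather than one, on the assumption that with only one of the two fcgi processes dead a single request could still succeed, and it treats any 5xx response or missing response as a failure.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, reset by peer, timeout, malformed response, etc.
        return None


def is_healthy(statuses):
    """Healthy only if at least one probe ran and none failed or returned 5xx."""
    return bool(statuses) and all(s is not None and s < 500 for s in statuses)


def main(argv):
    # Hypothetical defaults: the real URL and port depend on the container setup.
    url = argv[1] if len(argv) > 1 else "http://localhost:8000/"
    attempts = int(argv[2]) if len(argv) > 2 else 4
    statuses = [fetch_status(url) for _ in range(attempts)]
    # Exit code 0 means healthy, 1 means unhealthy.
    return 0 if is_healthy(statuses) else 1
```

A small wrapper calling sys.exit(main(sys.argv)) could then be run as a Kubernetes exec liveness probe, since a non-zero exit code counts as a probe failure and repeated failures make the kubelet restart the container. Whether a handful of requests reliably reaches the dead backend depends on how lighttpd balances across the fcgi processes, which I have not verified.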