Page MenuHomePhabricator is down
Closed, Resolved, Public


I'm currently receiving a 500 Internal Server Error message.

Event Timeline

sbassett triaged this task as Low priority.

...and back up again.

Well, it now appears to be down again.

Mentioned in SAL (#wikimedia-cloud) [2020-08-04T05:45:12Z] <wm-bot> <root> restarted webservice (T259560)

I restarted it and it's up for now, but I'm going to leave this task open for @bd808 to investigate why it went down.

The log has basically two sets of error messages:

2020-08-04 03:38:32: (gw_backend.c.335) child signalled: 9 
2020-08-04 03:38:33: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.sal-1 after connect failed: Connection refused 
2020-08-04 05:43:47: (http-header-glue.c.1250) read(): Connection reset by peer 7 8 
2020-08-04 05:43:47: (gw_backend.c.2149) response not received, request sent: 1199 on socket: unix:/var/run/lighttpd/php.socket.sal-1 for /index.php?, closing connection
bd808 assigned this task to Legoktm.
bd808 edited projects, added Stashbot; removed Tools.

The first message in the logs about the fcgi container being down starts at 2020-07-27 07:18:35. There is nothing material in the error.log leading up to those errors. It looks like only one of the 2 fcgi processes died, so I would guess that ~50% of requests were returning errors between 2020-07-27 07:18:35 and 2020-08-04 05:44:29.
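To put a number on that window, here is a small sketch of the arithmetic. It uses only the two timestamps quoted above. The ~50% figure assumes that requests were spread evenly across the 2 fcgi processes, which is a guess and has not been verified against the lighttpd configuration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the comment above.
first_error = datetime.strptime("2020-07-27 07:18:35", FMT)
last_error = datetime.strptime("2020-08-04 05:44:29", FMT)

# Length of the period during which one backend was dead.
window = last_error - first_error
print(window)  # 7 days, 22:25:54

# Assumption: requests are distributed evenly across backends,
# so the failing share equals dead backends / total backends.
total_backends = 2
dead_backends_count = 1
failed_fraction = dead_backends_count / total_backends
print(f"estimated share of failing requests: {failed_fraction:.0%}")
```

So the partial outage would have lasted just under 8 days before the restart.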

Ideally I think we would figure out a proper health check for the Toolforge Kubernetes PHP containers that makes the container restart when lighttpd dies (this happens now because it is PID 1 in the container) or when the fcgi container dies (what happened here). Maybe even more ideal would be to separate the fcgi processes into their own container within the pod like does.
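As a rough illustration of the health check idea, the sketch below tries to connect to each fcgi Unix socket and reports the ones that refuse. That is the same symptom as the "Connection refused" line in the log. This is not an existing Toolforge check. The function names are made up for the example, and wiring it up as a Kubernetes exec-style liveness probe is an assumption about how it might be used.

```python
import socket


def socket_accepts(path, timeout=2.0):
    """Return True if a process accepts connections on the Unix socket at `path`."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, "Connection refused", and timeouts.
        return False
    finally:
        s.close()


def dead_backends(paths):
    """Return the socket paths that did not accept a connection."""
    return [p for p in paths if not socket_accepts(p)]


def probe_exit_code(paths):
    """Return 0 if every backend socket is reachable, 1 otherwise."""
    return 1 if dead_backends(paths) else 0
```

A probe script could call `sys.exit(probe_exit_code(sys.argv[1:]))` with the socket paths as arguments, for example `/var/run/lighttpd/php.socket.sal-1` from the log above, so that a nonzero exit makes the container restart. Note that this only checks that something is listening on the socket. It does not send a real FastCGI request, so a hung but still listening process would pass.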