
sal.toolforge.org is down
Closed, Resolved, Public

Description

I'm currently receiving a 500 Internal Server Error message.

Event Timeline

sbassett created this task. Aug 3 2020, 9:42 PM
Restricted Application added a subscriber: Aklapper. Aug 3 2020, 9:42 PM
sbassett closed this task as Invalid. Aug 3 2020, 9:43 PM
sbassett triaged this task as Low priority.

...and back up again.

sbassett reopened this task as Open. Aug 3 2020, 9:44 PM

Well, it now appears to be down again.

Nintendofan885 edited projects, added Tools; removed Toolforge. Aug 3 2020, 9:44 PM
Nintendofan885 added a subscriber: Nintendofan885.

Mentioned in SAL (#wikimedia-cloud) [2020-08-04T05:45:12Z] <wm-bot> <root> restarted webservice (T259560)

I restarted it and it's up for now, but I'm going to leave this task open for @bd808 to investigate why it went down.

The log has basically two sets of error messages:

error.log
...
2020-08-04 03:38:32: (gw_backend.c.335) child signalled: 9 
2020-08-04 03:38:33: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.sal-1 after connect failed: Connection refused 
...
2020-08-04 05:43:47: (http-header-glue.c.1250) read(): Connection reset by peer 7 8 
2020-08-04 05:43:47: (gw_backend.c.2149) response not received, request sent: 1199 on socket: unix:/var/run/lighttpd/php.socket.sal-1 for /index.php?, closing connection
bd808 closed this task as Resolved. Aug 4 2020, 4:25 PM
bd808 assigned this task to Legoktm.
bd808 edited projects, added Stashbot; removed Tools.

The first message in the logs about the fcgi container being down starts at 2020-07-27 07:18:35. There is nothing material in the error.log leading up to those errors. It looks like only one of the 2 fcgi processes died, so I would guess that ~50% of requests were returning errors between 2020-07-27 07:18:35 and 2020-08-04 05:44:29.
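
For anyone repeating this kind of log check, here is a minimal sketch of how the first backend error could be located in a lighttpd error.log. The line format is assumed from the excerpts quoted in this task, and `first_backend_failure` is a hypothetical helper name, not part of any existing tool.

```python
import re
from datetime import datetime

# Line shape assumed from the error.log excerpts quoted in this task:
#   "2020-08-04 03:38:32: (gw_backend.c.335) child signalled: 9"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): \(([^)]+)\) (.*)$"
)


def first_backend_failure(lines):
    """Return (timestamp, message) for the first gw_backend.c log line, or None.

    Lines that do not match the assumed format (such as "...") are skipped.
    """
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2).startswith("gw_backend.c"):
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            return stamp, match.group(3)
    return None
```

It can be fed an open file object directly, for example `first_backend_failure(open("error.log"))`, since it only iterates over lines.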

Ideally I think we would figure out a proper health check for the Toolforge Kubernetes PHP containers that makes the container restart when lighttpd dies (this happens now because it is PID 1 in the container) or when the fcgi container dies (what happened here). Even better might be to separate the fcgi processes into their own container within the pod, like https://github.com/jitesoft/docker-lighttpd does.
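
To illustrate what such a health check might look like, here is a rough sketch and not an existing Toolforge mechanism. It assumes requests are spread evenly across the fcgi processes, so a single probe would miss a half-dead backend about half the time; probing several times lowers that chance. The names `fetch_status`, `is_healthy`, and `miss_probability`, and the default of 4 attempts, are all hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def is_healthy(url, attempts=4, fetch=fetch_status):
    """Probe url several times; report unhealthy on any 5xx or failed request."""
    for _ in range(attempts):
        status = fetch(url)
        if status is None or status >= 500:
            return False
    return True


def miss_probability(dead_fraction, attempts):
    """Chance that every probe hits a live backend.

    Assumes independent probes and even load distribution (an assumption,
    not something confirmed for lighttpd here).
    """
    return (1 - dead_fraction) ** attempts
```

Under those assumptions, with one of 2 fcgi processes dead, `miss_probability(0.5, 4)` is 0.0625, so four probes per check would catch the failure most of the time. A wrapper script that exits non-zero when `is_healthy` returns False could then serve as the command for a container liveness check.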