
Webservice no longer restarted automatically (lighttpd)
Closed, ResolvedPublic

Description

Until approximately 16 March, if a tool's webservice stopped working, it would (eventually) be restarted automatically by a Toolforge backend process of some kind. This is no longer happening, which means my tool's web interfaces are unavailable for long periods of time. (/data/project/dplbot/service.log contains an entry for each automatic restart or attempt; the last entry in that logfile is for 2019-03-16T13:12:53.903120.)

Event Timeline

russblau created this task.Mar 21 2019, 4:27 PM
Restricted Application added a subscriber: Aklapper.Mar 21 2019, 4:28 PM
bd808 added a subscriber: bd808.Mar 24 2019, 8:16 PM

dplbot seems to be running a lighttpd grid engine webservice. The current $HOME/service.log seems to have thousands of lines about attempted service restarts in March:

$ grep 2019-03- service.log | wc -l
2317
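A hedged variation on the same grep: bucketing the attempts by day makes it easier to see whether the failures cluster. This assumes service.log lines begin with an ISO 8601 timestamp, as in the excerpts in this task; the pipeline itself is only a suggestion.

```shell
# Count restart attempts per date so spikes stand out. Splitting on the
# 'T' in the ISO timestamp leaves just the YYYY-MM-DD date in field 1.
cut -dT -f1 service.log | sort | uniq -c | sort -rn | head
```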
bd808 closed this task as Resolved.Jul 1 2019, 4:38 AM
bd808 claimed this task.

/data/project/dplbot/service.log shows watchdog restarts working as expected.

russblau reopened this task as Open.Dec 24 2019, 3:50 PM

This issue has resurfaced. /data/project/dplbot/service.log shows that the last attempted restart was at 2019-12-22T06:24:17.778139; it is now Tue Dec 24 15:49:50 and the webservice has been down for over 48 hours. I can restart it manually, but it will inevitably crash again sometime in the near future.

bd808 removed bd808 as the assignee of this task.Dec 29 2019, 9:14 PM
/data/project/dplbot/service.log
2019-12-22T00:15:24.537358 No running webservice job found, attempting to start it
2019-12-22T00:24:19.558222 No running webservice job found, attempting to start it
2019-12-22T01:15:35.921747 No running webservice job found, attempting to start it
2019-12-22T01:24:24.037877 No running webservice job found, attempting to start it
2019-12-22T02:15:29.975260 No running webservice job found, attempting to start it
2019-12-22T02:24:17.061066 No running webservice job found, attempting to start it
2019-12-22T03:15:29.159078 No running webservice job found, attempting to start it
2019-12-22T03:24:19.906177 No running webservice job found, attempting to start it
2019-12-22T04:15:25.168456 No running webservice job found, attempting to start it
2019-12-22T04:24:17.065994 No running webservice job found, attempting to start it
2019-12-22T05:15:38.132504 No running webservice job found, attempting to start it
2019-12-22T05:24:16.897726 No running webservice job found, attempting to start it
2019-12-22T06:15:32.938537 No running webservice job found, attempting to start it
2019-12-22T06:24:17.778139 No running webservice job found, attempting to start it

The first thing I would suggest investigating is why the webservice crashes so often. Hopefully there are some clues in the 40M of $HOME/error.log output that has been generated in the month of December. There appears to be a pattern of log entries like this for most (all?) of the restart log lines:

error.log
2019-12-22 00:15:25: (server.c.1751) [note] graceful shutdown started
2019-12-22 00:15:25: (server.c.1828) server stopped by UID = 51290 PID = 20204
2019-12-22 00:15:43: (log.c.217) server started

This is a sign that the lighttpd process was actually running, but that the job grid had marked the job state as something other than "r(unning)", "s(uspended)", "w(aiting)", or "h(old)". Unfortunately, the current watchdog script does not record the state it finds.
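A minimal sketch of the missing logging step described above: before deciding to restart, a watchdog could record the grid engine job state it actually observed. The `check_job` name, the state codes, and the log format here are assumptions for illustration; the real Toolforge watchdog script may look quite different.

```python
from datetime import datetime, timezone

# States the task above describes as "healthy" to the watchdog:
# r(unning), s(uspended), w(aiting), h(old).
HEALTHY_STATES = {"r", "s", "w", "h"}

def check_job(state, audit_log):
    """Record the observed state, then return True if a restart is needed."""
    healthy = state in HEALTHY_STATES
    audit_log.append(
        "%s observed job state=%r healthy=%s"
        % (datetime.now(timezone.utc).isoformat(), state, healthy)
    )
    return not healthy

audit = []
check_job("r", audit)    # running: leave it alone
check_job("Eqw", audit)  # error state: restart, with the state on record
```

Keeping the observed state in the audit trail would have answered the open question here: whether the grid had flagged the job as errored while lighttpd was still serving requests.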

The current /data/project/dplbot/service.manifest shows the webservice for this tool running on the Kubernetes backend. That backend uses a completely different mechanism for monitoring and restarting a webservice. The home-grown monitoring service that tries to keep grid engine webservice jobs running is known to be imperfect; the Kubernetes system is much more robust and more likely to succeed in restarting crashed webservices. It is not magical, however: it will only restart a webservice container whose "master" process (lighttpd, in the case of this PHP 5.6 container) has exited. This means it is still possible for a Kubernetes-powered webservice to become unresponsive to client requests due to an internal deadlock or resource-exhaustion issue in the application that does not also crash the lighttpd process itself.
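For illustration of the mechanism described above, this is roughly what the relevant part of a Kubernetes container spec looks like. It is hypothetical: Toolforge manages these manifests for tools, so a tool maintainer would not normally edit this, and the port and probe values are assumptions.

```yaml
spec:
  containers:
    - name: webservice
      # Kubernetes restarts the container whenever its main process
      # (lighttpd here) exits. A livenessProbe like the one below can
      # additionally catch a lighttpd that is alive but no longer
      # answering HTTP, which plain process supervision cannot.
      livenessProbe:
        httpGet:
          path: /
          port: 8000
        periodSeconds: 30
        failureThreshold: 3
```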

russblau closed this task as Resolved.Jan 2 2020, 3:45 PM
russblau claimed this task.

This issue is no longer relevant, since I have been able to start the webservice under Kubernetes and no longer rely on the grid servers.