Limit webservice manifest restarts
Closed, Resolved, Public

Description

When the service cannot start for one reason or another, the service should not be restarted every 15 seconds. In this case, kasparbot had incorrect permissions on error.log:

(from emails sent to root@tools)

Job 1418711 caused action: Job 1418711 set to ERROR
 User        = tools.kasparbot
 Queue       = webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed opening input/output file:08/04/2015 13:39:34 [52604:12097]: error: can't open output file "/data/project/kasparbot/error.log"

yet the webservice is restarted every 15s (webservice.log):

2015-08-04T12:59:19.576720 No running webservice job found, attempting to start it
2015-08-04T12:59:34.525900 No running webservice job found, attempting to start it
2015-08-04T12:59:49.692731 No running webservice job found, attempting to start it
2015-08-04T13:00:07.003549 No running webservice job found, attempting to start it
2015-08-04T13:00:22.404999 No running webservice job found, attempting to start it
2015-08-04T13:00:43.220261 No running webservice job found, attempting to start it
2015-08-04T13:00:58.744333 No running webservice job found, attempting to start it
2015-08-04T13:01:14.976039 No running webservice job found, attempting to start it
2015-08-04T13:01:46.150506 Timed out attempting to start webservice (15s)
2015-08-04T13:01:59.616354 No running webservice job found, attempting to start it
2015-08-04T13:02:15.885807 No running webservice job found, attempting to start it
2015-08-04T13:02:31.134630 No running webservice job found, attempting to start it
2015-08-04T13:02:46.190989 No running webservice job found, attempting to start it
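
To check the "every 15s" cadence, the gaps between consecutive start attempts can be computed from the timestamps. This is only an illustrative sketch: the sample lines are copied from the first four entries of the log above, and the variable names are made up for the example.

```python
from datetime import datetime

# First four entries copied from the webservice.log excerpt above.
log_lines = [
    "2015-08-04T12:59:19.576720 No running webservice job found, attempting to start it",
    "2015-08-04T12:59:34.525900 No running webservice job found, attempting to start it",
    "2015-08-04T12:59:49.692731 No running webservice job found, attempting to start it",
    "2015-08-04T13:00:07.003549 No running webservice job found, attempting to start it",
]

# The timestamp is the first whitespace-separated field of each line.
stamps = [
    datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")
    for line in log_lines
]

# Seconds elapsed between each attempt and the next one.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]

for gap in gaps:
    print(f"{gap:.1f}s since previous attempt")
```

For these entries every gap falls between roughly 15 and 17 seconds.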

Event Timeline

valhallasw raised the priority of this task from to Medium.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added a subscriber: valhallasw.
Restricted Application added a subscriber: Aklapper.

Change 239377 had a related patch set uploaded (by coren):
webservicemonitor: some improvements

https://gerrit.wikimedia.org/r/239377

Change 239377 merged by coren:
webservicemonitor: some improvements

https://gerrit.wikimedia.org/r/239377

If the webservice fails 3 times with the same error it should not be restarted again.

Note that this causes large amounts of outgoing mail (one per 20 seconds, to the same destination), which is not very nice.

bd808 raised the priority of this task from Medium to High. Apr 13 2017, 10:41 PM
bd808 added a subscriber: bd808.

framabot was in an infinite loop of restarts (28799 in the last 7 days) because of a bad .lighttpd.conf file. We need to do something at least like bigbrother's 10-and-done restart limit.

Change 479181 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/software/tools-manifest@master] Limit manifest starts (max 10)

https://gerrit.wikimedia.org/r/479181

Why is max 3 times an hour limit not good enough?

Edit: I see that was added in the patch above, but it doesn't work or something? I think it's better to fix it and depend on a resetting state rather than a non-resetting state.

It turns out that the 3/hr limit doesn't work. There is a fundamental bug in the application logic that prevents the restart history from persisting from iteration to iteration of the restart loop. The same bug will render the proposed changes in https://gerrit.wikimedia.org/r/479181 moot as well.
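
For illustration only, here is one way a bug of this kind can arise. This is not the actual tools-manifest code; the class and method names are hypothetical. If the restart history is rebuilt on every pass of the loop, the count never grows and no limit can ever trigger. Keeping the history on a longer-lived object fixes that.

```python
class ForgetfulMonitor:
    """Hypothetical monitor that rebuilds its restart history on every pass."""

    def run_once(self, tool, now):
        restarts = {}  # bug: a fresh, empty history on each iteration
        restarts.setdefault(tool, []).append(now)
        return len(restarts[tool])


class PersistentMonitor:
    """Hypothetical monitor that keeps the history between passes."""

    def __init__(self):
        self.restarts = {}  # lives as long as the monitor object

    def run_once(self, tool, now):
        self.restarts.setdefault(tool, []).append(now)
        return len(self.restarts[tool])


forgetful = ForgetfulMonitor()
persistent = PersistentMonitor()

# Four loop iterations, 15 seconds apart, for a made-up tool name.
forgetful_counts = [forgetful.run_once("example-tool", t) for t in (0, 15, 30, 45)]
persistent_counts = [persistent.run_once("example-tool", t) for t in (0, 15, 30, 45)]

print(forgetful_counts)   # [1, 1, 1, 1]
print(persistent_counts)  # [1, 2, 3, 4]
```

With the forgetful version, any check along the lines of "3 or more restarts" sees a count of 1 forever, which matches the symptom of a limit that never takes effect.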

Change 484849 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-manifest@master] Preserve restart attempt timestamps between runs

https://gerrit.wikimedia.org/r/484849

Change 479181 abandoned by GTirloni:
Limit manifest starts (max 10)

Reason:
Not needed once the 3/hour limit is fixed

https://gerrit.wikimedia.org/r/479181

Change 484849 merged by Bstorm:
[operations/software/tools-manifest@master] Preserve restart attempt timestamps between runs

https://gerrit.wikimedia.org/r/484849

After some testing (live!) I can see that this needs more work. Right now it will really only rate limit if the tool is crashing immediately on startup. This is not the most common scenario. It's more common that startup works but a crash happens as soon as the service receives traffic. I think we want to rate limit restarts for that too.

Change 486417 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-manifest@master] Refactor and simplify python package

https://gerrit.wikimedia.org/r/486417

Change 486417 merged by Bstorm:
[operations/software/tools-manifest@master] Refactor and simplify python package

https://gerrit.wikimedia.org/r/486417

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:08:25Z] <bd808> Built tools-manifest_0.17_all.deb and published to aptly repos (T107878)

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:12:20Z] <bd808> Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878)

Change 489752 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-manifest@master] Fix graphite metric naming

https://gerrit.wikimedia.org/r/489752

Change 489752 merged by jenkins-bot:
[operations/software/tools-manifest@master] Fix graphite metric naming

https://gerrit.wikimedia.org/r/489752

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:25:13Z] <bd808> Built tools-manifest_0.18_all.deb and published to aptly repos (T107878)

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:26:44Z] <bd808> Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878)

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:57:47Z] <bd808> Built tools-manifest_0.19_all.deb and published to aptly repos (T107878)

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T18:57:52Z] <bd808> Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878)

Mentioned in SAL (#wikimedia-cloud) [2019-02-11T19:08:59Z] <bd808> Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878)

I'm going to mark this resolved with the newly refactored code base. I did testing using my bd808-test2 tool and was able to verify that the 3 restarts per hour rate limit is applied to a chronically failing service (in this case one that is manually killed repeatedly) and that this limit is eventually lifted when past restart attempts age out of the 1-hour window.

From the point of view of the tool maintainer, the log messages in $HOME/service.log look like this:

2019-02-11T18:59:43.763975 No running webservice job found, attempting to start it
2019-02-11T19:00:52.117938 No running webservice job found, attempting to start it
2019-02-11T19:02:07.510411 No running webservice job found, attempting to start it
2019-02-11T19:03:21.835317 Throttled for 3 restarts in last 3600 seconds
...
2019-02-11T19:59:12.286910 Throttled for 3 restarts in last 3600 seconds
2019-02-11T20:00:19.385414 No running webservice job found, attempting to start it
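
The behaviour in this log can be sketched as a sliding-window throttle. This is a minimal illustration, assuming a limit of 3 restarts per 3600 seconds as described above; `RestartThrottle`, `allow`, and the tool name are hypothetical and not taken from tools-manifest.

```python
import collections


class RestartThrottle:
    """Allow at most max_restarts restart attempts per tool within window seconds."""

    def __init__(self, max_restarts=3, window=3600):
        self.max_restarts = max_restarts
        self.window = window
        # Maps tool name to the timestamps of its recent restart attempts.
        self.history = collections.defaultdict(list)

    def allow(self, tool, now):
        # Drop attempts that have aged out of the window.
        cutoff = now - self.window
        recent = [t for t in self.history[tool] if t > cutoff]
        self.history[tool] = recent
        if len(recent) >= self.max_restarts:
            return False  # throttled: too many recent restarts
        recent.append(now)  # record this attempt
        return True


throttle = RestartThrottle()
# Seconds since the first attempt; the monitor polls roughly once a minute.
decisions = [throttle.allow("example-tool", t) for t in (0, 70, 140, 210, 3590, 3601)]
print(decisions)  # [True, True, True, False, False, True]
```

Because old attempts age out of the window, the state resets on its own: once the first attempt is more than an hour old, another restart is allowed, as in the last log line above.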

I will send a short note to the cloud-announce list announcing this change.