giftbot webservice outages and/or issues
Closed, Declined (Public)

Description

Hi!

Several times a week the webservice on tools.giftbot becomes unavailable due to various errors. It doesn't matter whether it's running on the grid, locally, on cron-tools, or on k8s; nothing helps. And I don't know whether this affects only the webservice at tools.giftbot or others as well.

So I put this question on the labs list: is there a way to restart the webservice _automatically_ if an error occurs and the webservice is not available? Does it have to be monitored to find out what makes the webservice stop running? Questions about this problem on the Freenode channel Cloud-Services have not been answered.

Another attempt was to check the running status of the webservice by the error returned from a request, and then to restart the webservice automatically, but there is no way to restart the webservice from a script. We are now trying to run it on cron-tools, but without success.

@Magnus asked in his reply: "Is it the webservice that fails, or the bot part of giftbot?" I think I have already explained that it is the webservice itself that has the issues and outages. The bot, i.e. the tool, is running properly: https://tools.wmflabs.org/giftbot/weblinksuche.fcgi

My last monitoring shows:

  • I check the run status of the webservice every hour. The last OK report was at 2017-01-17 09:04:07 UTC.
  • The next OK report should have come at 2017-01-17 10:04:07 UTC, but it was missing.
  • At 2017-01-17 10:43:15 UTC I got a webservice restart report on k8s, but the webservice did not restart, possibly due to a connection timeout (I cannot retrieve the error itself) or because it is not possible to start the webservice from a simple script that does nothing more than "webservice restart" in bash.

Restarting the webservice manually with "webservice restart" in bash works, but because of the many outages the webservice should be restarted automatically, for example by a script, whenever an error is detected.
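
A minimal sketch of such a check-and-restart script, under stated assumptions: it runs on a host where the `webservice` command is actually available (which, per this task, is not the case everywhere), the URL is the one given above, and the 30-second timeout is an arbitrary choice. The function names `is_responsive` and `check_and_restart` are hypothetical and not part of any Toolforge tooling.

```python
import subprocess
import urllib.error
import urllib.request

# URL taken from the task description; the timeout below is an assumption.
CHECK_URL = "https://tools.wmflabs.org/giftbot/weblinksuche.fcgi"


def is_responsive(url, timeout=30, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def check_and_restart(url=CHECK_URL, probe=is_responsive, run=subprocess.run):
    """Restart the webservice when the probe fails.

    Returns True if a restart was issued, False if the service answered.
    """
    if probe(url):
        return False
    # Assumes the `webservice` command exists on the host running this script.
    run(["webservice", "restart"], check=True)
    return True
```

Calling `check_and_restart()` once per hour from cron would match the monitoring interval described above. The `probe` and `run` parameters exist only so the logic can be exercised without touching the network or the real command.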

Thank you very much for an explanatory reply ...

@Aklapper: please assign this error report, because we need a properly running webservice for a service that is offered to the entire dewiki community and is used to search for defective weblinks. Thank you ...

Event Timeline

doctaxon updated the task description.

https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid#Bigbrother

The webservice system uses manifest monitors to provide similar functionality automatically.

See also T90561: Replace bigbrother and ssh-cron-thingy with service manifests. I'm pretty sure webservices should restart automatically.

Have you checked why the webservice errors? When a program is error-prone, ideally you should fix it, not depend on automatic restarts.

I'm not aware of any script errors. Maybe there are db queries or other things that cause it to hang.

I see two possibilities:

  • Let us run the control script on a tools bastion (would the dev bastion be ok?) or
  • Provide the 'webservice' script on the submit host/the grid nodes (it fails when it tries to restart the webservice there)

Bigbrother and the restart system that is built into webservice both only monitor process existence. Functionally, what both systems do is ask the execution container (either Open Grid Engine or Kubernetes) whether a process named X is currently active. If there is one, it is left alone. If there is not, a new process is submitted.
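
As a rough illustration of that behaviour (not the actual Bigbrother or webservice code), the check can be sketched as follows. `ensure_running`, `is_active`, and `submit` are hypothetical names standing in for the query to the execution container and the job submission.

```python
def ensure_running(job_name, is_active, submit):
    """Resubmit the job only when the execution container reports no such process."""
    if is_active(job_name):
        # The process exists, so nothing is done, even if it no longer
        # answers HTTP requests.
        return "left alone"
    submit(job_name)
    return "resubmitted"
```

A hung process still counts as active under this check, so it is never replaced.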

It sounds like you have some conditions under which the process is running from the point of view of the execution container, but the HTTP layer of the application is not responsive to some/all requests. Kubernetes has a concept of "liveness probes" which can be used to define an HTTP request to make to the application along with how frequently to check that such a request returns a successful HTTP status code. When the probe fails Kubernetes will kill the current process and spawn another one to replace it. We don't expose this bit of Kubernetes configuration in the current webservice wrapper, but in theory we could figure out how to let a tool define such a check and pass that definition on to the Kubernetes backend.
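
To make the probe idea concrete, here is a small model of the decision it makes, written as a sketch rather than as Kubernetes code. The class name `LivenessProbe` and its `record` method are hypothetical. The threshold of three consecutive failures before a restart is an assumption for illustration, loosely mirroring the `failureThreshold` setting of Kubernetes probes.

```python
from dataclasses import dataclass


@dataclass
class LivenessProbe:
    """Tracks consecutive failed checks, loosely modelled on a liveness probe."""

    failure_threshold: int = 3  # assumed value for illustration
    consecutive_failures: int = 0

    def record(self, status_code):
        """Record one check result; return True when the process should be replaced.

        `status_code` is the HTTP status of the check, or None if the request
        timed out or could not connect.
        """
        # Status codes from 200 up to (not including) 400 count as success.
        if status_code is not None and 200 <= status_code < 400:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Start counting afresh for the replacement process.
            self.consecutive_failures = 0
            return True
        return False
```

Unlike a process-existence check, this catches a process that is alive but no longer answering requests. Requiring several consecutive failures avoids restarting on a single slow response.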

Having automatic restarts for non-responsive processes can help with perceived availability, but they come at a cost. Without having more detail about what is going wrong inside giftbot when it locks up it is hard to say definitively if automated restarts will do more than hide a deeper problem in the application logic.

bd808 renamed this task from Webservice outages and/or issues to giftbot webservice outages and/or issues. Jan 18 2017, 5:50 PM
scfc triaged this task as Low priority. Feb 16 2017, 11:14 PM
scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.