`webservice restart` does not always wait for service to stop before trying to start again
Closed, Resolved, Public

Description

Steps I've found to reliably reproduce, assuming the tool uses Kubernetes:

  • `webservice restart` outputs the normal "Restarting webservice..." message
  • Refresh the page in your browser. It hangs for a few seconds before returning a 502
  • Attempting `webservice restart` again does not work
  • Running `webservice stop`, waiting a few seconds, then running `webservice start` successfully brings the service back up
  • Running `webservice restart` shortly thereafter works just fine and does not bring the tool down like it did the first time
  • Wait some amount of time (in my case ~24 hours), and the issue will surface again

Some IRC discussion that might help:

...
16:23:05 <bd808> and it didn't give you an error message at the cli when it failed?
16:23:12 <chasemp> it seems like the restart vs stop/start is a red herring, it's just different state post stop/start regardless of mechanism
16:23:45 <bd808> there is a "# FIXME: Treat pending state differently" in start() that might be somehow related
16:24:15 <musikanimal> I didn't see any errors, no
16:25:15 <bd808> oh... i see a way it could go badly
16:25:23 <bd808> stop() waits max 15s and then returns
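The failure mode hinted at in the log (a stop that gives up after a fixed wait, followed by a start that runs anyway) can be sketched as follows. This is a minimal illustration, not the actual tools-webservice code: `FakeBackend`, `request_stop`, `is_gone`, and the `stop`/`restart` signatures are hypothetical names chosen for the example, and only the 15-second limit comes from the discussion above. The sketch shows one way to avoid the race, by having `stop()` report whether teardown really finished and having `restart()` refuse to start if it did not.

```python
import time


class FakeBackend:
    """Hypothetical stand-in for the Kubernetes backend (not the real code)."""

    def __init__(self, teardown_seconds):
        # How long the simulated service takes to disappear after a stop request.
        self.teardown_seconds = teardown_seconds
        self.stop_requested_at = None

    def request_stop(self):
        self.stop_requested_at = time.monotonic()

    def is_gone(self):
        if self.stop_requested_at is None:
            return False
        return time.monotonic() - self.stop_requested_at >= self.teardown_seconds


def stop(backend, timeout=15.0, poll_interval=0.5):
    """Request a stop and poll for up to `timeout` seconds.

    Returns True only if the service is really gone, so the caller can
    tell a clean stop apart from a timed-out one.
    """
    backend.request_stop()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if backend.is_gone():
            return True
        time.sleep(poll_interval)
    return backend.is_gone()


def restart(backend, timeout=15.0, poll_interval=0.5):
    """Start again only after stop() confirms the old service is gone."""
    if not stop(backend, timeout, poll_interval):
        # Surface the problem instead of silently starting on top of
        # objects that are still being deleted.
        raise RuntimeError(
            "service did not stop within %.1fs; not starting" % timeout
        )
    return "started"
```

If `stop()` instead returned nothing after the timeout, `restart()` would have no way to know the old objects were still being torn down, which matches the "no error at the CLI" symptom described above.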

Event Timeline

I just did 'webservice restart' on the lolrrit-wm tool, and it came back up immediately with no issues...

This could possibly be because of T140262? I fixed that earlier today, and it might have been related.

I just tried restarting some of the tools and had no issues :) Maybe T140262 is what did it!

bd808 renamed this task from Apparent issue with restarting Kubernetes webservice to `webservice restart` does not always wait for service to stop before trying to start again. Mar 26 2017, 7:14 PM

Was the alias (I'm assuming it's an alias) updated after the k8s setup? If it was not, it may not be set up properly to work with k8s pods and such. Regardless, simply deleting the pod that webservice is using with kubectl will restart it (or at least stop it).

Change 444879 had a related patch set uploaded (by Nehajha; owner: Nehajha):
[operations/software/tools-webservice@master] Providing users more clue when kuberenetes is unable to delete all the objects

https://gerrit.wikimedia.org/r/444879

Change 444879 merged by jenkins-bot:
[operations/software/tools-webservice@master] Providing users more clue when kuberenetes is unable to delete all the objects

https://gerrit.wikimedia.org/r/444879

Change 563624 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] k8s: Don't restart all k8s machinery to reboot a basic webservice

https://gerrit.wikimedia.org/r/563624

Change 563624 merged by Bstorm:
[operations/software/tools-webservice@master] k8s: Don't restart all k8s machinery to reboot a basic webservice

https://gerrit.wikimedia.org/r/563624

Bstorm claimed this task.
Bstorm subscribed.

At this point, the restart function simply kills the pods. The new cluster also responds differently in general.