
Restarting tools after NFS issues
Open, Medium · Public

Assigned To
None
Authored By
Magnus, Jun 29 2017

Description

I understand there are NFS issues being fixed today. I understand that this is necessary.

But for the second time in about a month, my tools fail with variations of "can't find that file", and the webservice needs to be restarted.

Manually.

For each of my tools.

Could we find a way that (a) doesn't require a webservice restart, or (b) does the restart automatically for all tools, after NFS work has finished?

Event Timeline

Magnus created this task. Jun 29 2017, 12:44 PM
Restricted Application added a project: Cloud-Services. Jun 29 2017, 12:44 PM
Restricted Application added a subscriber: Aklapper.

Adding the Tools tag in the hope that this helps find people who could help address this issue.

I'm a heavy user of several of Magnus's tools, and their being down impedes a lot of my activities. I fully understand that he does not want to restart any of them manually (let alone all of them) in such cases, so some pointers on how to move forward with this would be appreciated.

bd808 added a comment. Edited Jul 4 2017, 11:23 PM

Could we find a way that (a) doesn't require a webservice restart

This would require that we first know what is causing the problem for your tools and then somehow avoid it. Certainly possible, but we'd need a much better understanding of the problem.

or (b) does the restart automatically for all tools, after NFS work has finished?

We could probably come up with something that can restart all Kubernetes webservices en masse. We can do something similar on grid engine already by rescheduling all continuous jobs.

Somewhere on wikitech we have a procedure equivalent to https://wikitech.wikimedia.org/wiki/Portal:Tool_Labs/Admin#Restarting_all_webservices for restarting all webservices running on k8s, but I have not yet found it. @yuvipanda any idea where that lives?

Jeff_G added a subscriber: Jeff_G. Jul 6 2017, 12:33 AM

When I was doing it, I'd just do some shell scripting to delete all the pods in all namespaces that aren't paws. k8s will start them back up.

You can get a list of all pods with kubectl get --all-namespaces pods and then do bash magic from there.

All tools should have predefined, documented ways to recover once a failed dependency returns to service; the more automated, the better. Where the recovery scripts are started manually, the people who run the dependencies should be allowed to run them to restart the dependent tools.

bd808 edited projects, added Toolforge; removed Cloud-VPS. Jul 6 2017, 12:55 AM

When I was doing it, I'd just do some shell scripting to delete all the pods in all namespaces that aren't paws. k8s will start them back up.

You can get a list of all pods with kubectl get --all-namespaces pods and then do bash magic from there.

This makes sense and is pretty much the equivalent of the grid engine procedure. The Kubernetes replica controller for each deployment would notice that the expected pod count was not met and then spawn the pods.
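For illustration, the "list all pods, skip paws, delete the rest" approach described above could be sketched roughly as follows. This is only a sketch and not the procedure actually used on Toolforge: the function names (parse_pods, delete_commands, restart_all) are hypothetical, it assumes the default table output of kubectl get --all-namespaces pods with the namespace and pod name in the first two columns, and it assumes the caller has permission to delete pods in every namespace. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Namespaces to leave alone; "paws" comes from the discussion above.
EXCLUDED_NAMESPACES = {"paws"}


def parse_pods(listing):
    """Turn `kubectl get --all-namespaces pods` text into (namespace, pod) pairs."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (NAMESPACE NAME READY ...).
        if len(fields) < 2 or fields[0] == "NAMESPACE":
            continue
        pods.append((fields[0], fields[1]))
    return pods


def delete_commands(listing, excluded=EXCLUDED_NAMESPACES):
    """Build one `kubectl delete pod` command per pod outside the excluded namespaces."""
    return [
        ["kubectl", "delete", "pod", "--namespace", namespace, name]
        for namespace, name in parse_pods(listing)
        if namespace not in excluded
    ]


def restart_all(dry_run=True):
    """List every pod, then delete (or, in a dry run, just print) the non-excluded ones."""
    listing = subprocess.run(
        ["kubectl", "get", "--all-namespaces", "pods"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for command in delete_commands(listing):
        if dry_run:
            print(" ".join(command))
        else:
            # Deleting the pod relies on Kubernetes to spawn a replacement.
            subprocess.run(command, check=True)
```

Calling restart_all() prints the delete commands for review; restart_all(dry_run=False) would actually run them. Keeping the parsing separate from the kubectl calls makes it possible to check the namespace filter against a saved listing before touching the cluster.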

Bstorm triaged this task as Medium priority. Feb 11 2020, 4:08 PM