
Restarting tools after NFS issues
Closed, ResolvedPublic

Authored By
Magnus
Jun 29 2017, 12:44 PM
Tokens
"Mountain of Wealth" token, awarded by Jeff_G."Piece of Eight" token, awarded by Daniel_Mietchen."Like" token, awarded by Pigsonthewing."Orange Medal" token, awarded by Jane023.

Description

I understand there are NFS issues being fixed today. I understand that this is necessary.

But for the second time in about a month, my tools fail with variations of "can't find that file", and the webservice needs to be restarted.

Manually.

For each of my tools.

Could we find a way that (a) doesn't require a webservice restart, or (b) does the restart automatically for all tools, after NFS work has finished?

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jane023 rescinded a token.
Jane023 awarded a token.

Adding the Tools tag in the hope that this helps find people who can address this issue.

I'm a heavy user of several of Magnus's tools, so their being down impedes a lot of my activities. I fully understand that he does not want to restart any of them manually (let alone all of them) in such cases, so some pointers on how to move forward with this would be appreciated.

Could we find a way that (a) doesn't require a webservice restart

This would require that we first know what is causing the problem for your tools and then somehow avoid it. Certainly possible, but we'd need a much better understanding of the problem.

or (b) does the restart automatically for all tools, after NFS work has finished?

We could probably come up with something that can restart all Kubernetes webservices en masse. We can already do something similar on grid engine by rescheduling all continuous jobs.
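As a rough sketch of the grid engine side (assuming the standard SGE tools qstat and qmod are available, and that continuous jobs live in a queue actually named "continuous"; adjust to the real queue names on the cluster):

    # Reschedule every running continuous job so grid engine restarts it
    # once the NFS work has finished. The first two qstat output lines are headers.
    for job in $(qstat -u '*' -q continuous -s r | awk 'NR>2 {print $1}'); do
        qmod -rj "$job"    # -rj asks SGE to reschedule the running job
    done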

Somewhere on wikitech we have a procedure equivalent to https://wikitech.wikimedia.org/wiki/Portal:Tool_Labs/Admin#Restarting_all_webservices for restarting all webservices running on k8s, but I have not yet found it. @yuvipanda any idea where that lives?

When I was doing it, I'd just do some shell scripting to delete all the pods in all namespaces that aren't paws. k8s will start them back up.

You can get a list of all pods with kubectl get --all-namespaces pods and then do bash magic from there.
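A minimal sketch of that, assuming kubectl has cluster-admin credentials and that paws is the only namespace to skip:

    # List every pod with its namespace, drop the paws namespace, and delete
    # the rest; Kubernetes will start replacements.
    kubectl get pods --all-namespaces --no-headers |
      awk '$1 != "paws" {print $1, $2}' |
      while read -r ns pod; do
          kubectl delete pod "$pod" --namespace "$ns"
      done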

All tools should have predefined documented ways to recover from the return to service of formerly failed dependencies, the more automated the better. For manually initiated scripts, the people who run the dependencies should be allowed to run the scripts to restart the dependent tools.

When I was doing it, I'd just do some shell scripting to delete all the pods in all namespaces that aren't paws. k8s will start them back up.

You can get a list of all pods with kubectl get --all-namespaces pods and then do bash magic from there.

This makes sense and is pretty much the equivalent of the grid engine procedure. The Kubernetes replica controller for each deployment would notice that the expected pod count was not met and then spawn the pods.
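This is easy to observe for a single tool (the namespace tool-example and pod name below are placeholders): delete one pod and watch the controller bring the replica count back up.

    kubectl delete pod example-pod --namespace tool-example
    kubectl get pods --namespace tool-example --watch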

Bstorm triaged this task as Medium priority.Feb 11 2020, 4:08 PM
Bstorm claimed this task.
Bstorm subscribed.

There was a delay during the failover of NFS today (the root cause has been fixed, so from here on NFS failover should be smoother than it has ever been), and I'm interested to hear whether anyone's tools are currently broken by that. We are on a new Kubernetes cluster, which has been more stable so far. We have also made improvements over time to the webservice monitor for the grid (which has also been upgraded since this ticket was created).

I am hoping that the actions requested here have largely been automated away. A spot check of the tools on https://tools.wmflabs.org/magnustools/ shows that most or all are live. https://tool-db-usage.toolforge.org/ accesses the database and is live.

So I think this is largely deprecated at this point... I hope! I am going to close this, but please feel free to re-open if you see the problem again in a widespread fashion.