
Restarting tools after NFS issues
Closed, ResolvedPublic

Authored By
Magnus
Jun 29 2017, 12:44 PM
Tokens
"Mountain of Wealth" token, awarded by Jeff_G."Piece of Eight" token, awarded by Daniel_Mietchen."Like" token, awarded by Pigsonthewing."Orange Medal" token, awarded by Jane023.

Description

I understand there are NFS issues being fixed today. I understand that this is necessary.

But for the second time in about a month, my tools fail with variations of "can't find that file", and the webservice needs to be restarted.

Manually.

For each of my tools.

Could we find a way that (a) doesn't require a webservice restart, or (b) does the restart automatically for all tools, after NFS work has finished?

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jane023 rescinded a token.
Jane023 awarded a token.

Adding the Tools tag in the hope that this helps find people who can address this issue.

I'm a heavy user of several of Magnus's tools, so their being down impedes a lot of my activities. I fully understand that he does not want to restart any of them manually (let alone all of them) in such cases, so some pointers on how to move forward with this would be appreciated.

Could we find a way that (a) doesn't require a webservice restart

This would require that we first know what is causing the problem for your tools and then somehow avoid it. Certainly possible, but we'd need a much better understanding of the problem.

or (b) does the restart automatically for all tools, after NFS work has finished?

We could probably come up with something that can restart all Kubernetes webservices en masse. We can already do something similar on grid engine by rescheduling all continuous jobs.
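As a rough sketch of the grid engine side (assuming the standard SGE tools qstat and qmod are available, and that continuous jobs live in a queue actually named "continuous"; adjust to the real queue names on the cluster):

    # Reschedule every running continuous job so grid engine restarts it
    # once the NFS work has finished. The first two qstat output lines are headers.
    for job in $(qstat -u '*' -q continuous -s r | awk 'NR>2 {print $1}'); do
        qmod -rj "$job"    # -rj asks SGE to reschedule the running job
    done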

Somewhere on wikitech we have a procedure equivalent to https://wikitech.wikimedia.org/wiki/Portal:Tool_Labs/Admin#Restarting_all_webservices for restarting all webservices running on k8s, but I have not yet found it. @yuvipanda any idea where that lives?

When I was doing it, I'd just do some shell scripting to delete all the pods in all namespaces that aren't paws. k8s will start them back up.

You can get a list of all pods with kubectl get --all-namespaces pods and then do bash magic from there.
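A minimal sketch of that, assuming kubectl has cluster-admin credentials and that paws is the only namespace to skip:

    # List every pod with its namespace, drop the paws namespace, and delete
    # the rest; Kubernetes will start replacements.
    kubectl get pods --all-namespaces --no-headers |
      awk '$1 != "paws" {print $1, $2}' |
      while read -r ns pod; do
          kubectl delete pod "$pod" --namespace "$ns"
      done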

All tools should have predefined documented ways to recover from the return to service of formerly failed dependencies, the more automated the better. For manually initiated scripts, the people who run the dependencies should be allowed to run the scripts to restart the dependent tools.

When I was doing it, I'd just do some shell scripting to delete all the pods in all namespaces that aren't paws. k8s will start them back up.

You can get a list of all pods with kubectl get --all-namespaces pods and then do bash magic from there.

This makes sense and is pretty much the equivalent of the grid engine procedure. The Kubernetes replica controller for each deployment would notice that the expected pod count was not met and then spawn the pods.
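This is easy to observe for a single tool (the namespace tool-example and pod name below are placeholders): delete one pod and watch the controller bring the replica count back up.

    kubectl delete pod example-pod --namespace tool-example
    kubectl get pods --namespace tool-example --watch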

Bstorm triaged this task as Medium priority.Feb 11 2020, 4:08 PM
Bstorm claimed this task.
Bstorm subscribed.

There was a delay during the failover of NFS today (the root cause has been fixed, so from here on NFS failover should be smoother than it has ever been), and I'm interested to hear whether anyone's tools are currently broken by that. We are on a new Kubernetes cluster, which has been more stable so far. We have also made improvements over time to the webservice monitor for the grid (which has also been upgraded since this ticket was created).

I am hoping that the actions requested here have largely been automated away. A spot check of the tools on https://tools.wmflabs.org/magnustools/ shows that most or all are live. https://tool-db-usage.toolforge.org/ accesses the database and is live.

So I think this is largely deprecated at this point... I hope! I am going to close this, but please feel free to re-open if you see the problem again in a widespread fashion.