quarry-nfs-1 went down; quarry is offline
Closed, Resolved, Public

Description

13:07:04 <+wmcs-alerts> (InstanceDown) firing: Project quarry instance quarry-nfs-1 is down   - https://prometheus-alerts.wmcloud.org
13:26:04 <+wmcs-alerts> (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project quarry   - https://prometheus-alerts.wmcloud.org

Event Timeline

RhinosF1 triaged this task as Unbreak Now! priority. Feb 19 2022, 1:38 PM
RhinosF1 created this task.
Restricted Application added a subscriber: Aklapper.
taavi lowered the priority of this task from Unbreak Now! to High. Feb 19 2022, 2:11 PM
taavi subscribed.

Rebooting the NFS server seems to have solved the immediate issue.

Leaving this task open so that we can investigate why this happened and how to prevent it. The VM was fully unresponsive; even the serial console didn't react at all. The full console log is on P21049.

RhinosF1 raised the priority of this task from High to Unbreak Now!. Feb 20 2022, 7:17 PM

Crashed again

For the record:

19:02:04 <wmcs-alerts> (InstanceDown) firing: Project quarry instance quarry-nfs-1 is down  - https://prometheus-alerts.wmcloud.org
19:19:04 <wmcs-alerts> (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project quarry  - https://prometheus-alerts.wmcloud.org

Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:23:35Z] <taavi> hard rebooted quarry-nfs-1 again T302154

Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:49:50Z] <andrewbogott> moving nfs service from quarry-nfs-1 (bullseye) to quarry-nfs-2 (buster), testing to see if T302154 is a kernal or nfs-version issue

The VM was fully unresponsive; even the serial console didn't react at all.

Is it possible to send a sysrq? If so, can you send a sysrq-l and a sysrq-w next time? That should force a backtrace, so we can see which kernel bug this is.
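
For reference, a rough sketch of how those two key combinations might be injected from the hypervisor side when the guest itself no longer responds. It assumes the VM is a libvirt domain reachable through `virsh send-key`, which this task does not confirm. The domain name is hypothetical, and the guest needs `kernel.sysrq` enabled for the keys to have any effect. The script only builds the commands and does not run them.

```python
# Sketch only: builds (does not execute) virsh commands that send magic
# SysRq key combinations to a guest from the hypervisor.
# Assumptions: the VM is a libvirt domain managed with `virsh`; the domain
# name used below is hypothetical and not taken from this task.

SYSRQ_ACTIONS = {
    "l": "backtrace of all active CPUs",
    "w": "dump tasks in uninterruptible (blocked) state",
}


def sysrq_command(domain: str, key: str) -> list[str]:
    """Return the virsh argv that sends Alt+SysRq+<key> to `domain`."""
    if key not in SYSRQ_ACTIONS:
        raise ValueError(f"unsupported sysrq key: {key!r}")
    return [
        "virsh", "send-key", domain,
        "KEY_LEFTALT", "KEY_SYSRQ", f"KEY_{key.upper()}",
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for key, meaning in SYSRQ_ACTIONS.items():
        print(" ".join(sysrq_command("hypothetical-domain", key)), "#", meaning)
```

If the keys are accepted, the resulting backtraces go to the guest's kernel log, so they would be expected to show up in the console log rather than in the command output.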

zhuyifei1999 lowered the priority of this task from Unbreak Now! to High. Feb 23 2022, 5:42 AM

This has not happened since. Closing per IRC.

Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:49:50Z] < @Andrew > moving nfs service from quarry-nfs-1 (bullseye) to quarry-nfs-2 (buster), testing to see if T302154 is a kernal or nfs-version issue

@Andrew can quarry-nfs-1 be deleted then? I don't see anything stored on it: /srv/quarry/project is empty and has no volume mounted.
All data is on quarry-nfs-2, which has the external volume mounted on it.
I turned off the quarry-nfs-1 instance to verify that it's not used.

We have a cookbook to failover NFS between an active server and a passive one, which includes moving the volume attachment. So nfs-1 is there as a safety net. I wouldn't insist that it's a /necessary/ safety net but it also seems like a harmless one.

Ah, sorry, I should've read back further in the task! Yes, that host can+should be deleted.

thanks, deleted.