13:07:04 <+wmcs-alerts> (InstanceDown) firing: Project quarry instance quarry-nfs-1 is down - https://prometheus-alerts.wmcloud.org
13:26:04 <+wmcs-alerts> (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project quarry - https://prometheus-alerts.wmcloud.org
Description
Related Objects
- Mentioned Here
- P21049 (An Untitled Masterwork)
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2022-02-19T14:04:05Z] <taavi> reboot quarry-nfs-1 T302154
Rebooting the NFS server seems to have solved the immediate issue.
Leaving this task open so that we can investigate why this happened and how to prevent it. The VM was fully unresponsive; even the serial console didn't react at all. The full console log is in P21049.
For the record:
19:02:04 <wmcs-alerts> (InstanceDown) firing: Project quarry instance quarry-nfs-1 is down - https://prometheus-alerts.wmcloud.org
19:19:04 <wmcs-alerts> (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project quarry - https://prometheus-alerts.wmcloud.org
Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:23:35Z] <taavi> hard rebooted quarry-nfs-1 again T302154
Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:49:50Z] <andrewbogott> moving nfs service from quarry-nfs-1 (bullseye) to quarry-nfs-2 (buster), testing to see if T302154 is a kernel or nfs-version issue
The VM was fully unresponsive; even the serial console didn't react at all.
Is it possible to send a sysrq? If so, can you send a sysrq-l and a sysrq-w next time? That should force a backtrace, so we can see which kernel bug this is.
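For reference, a rough sketch of how those two sysrq requests could be sent from the hypervisor side. It assumes the hypervisor is managed with libvirt and that `virsh send-key` can reach the guest; the domain name `example-domain` and the helper names `sysrq_command` and `send_sysrq` are made up for illustration. The guest kernel also needs sysrq enabled (`kernel.sysrq`) for the keys to have any effect. `l` asks for a backtrace of all active CPUs and `w` dumps tasks in the blocked (uninterruptible) state.

```python
import subprocess
from typing import List

# Sysrq letters requested in this task, mapped to the Linux key names
# that `virsh send-key` accepts with its default codeset:
#   l = backtrace of all active CPUs
#   w = dump tasks in uninterruptible (blocked) state
SYSRQ_KEYS = {"l": "KEY_L", "w": "KEY_W"}


def sysrq_command(domain: str, letter: str) -> List[str]:
    """Build the argv that sends Alt+SysRq+<letter> to a libvirt guest."""
    if letter not in SYSRQ_KEYS:
        raise ValueError(f"unsupported sysrq letter: {letter!r}")
    return ["virsh", "send-key", domain,
            "KEY_LEFTALT", "KEY_SYSRQ", SYSRQ_KEYS[letter]]


def send_sysrq(domain: str, letters: str = "lw", dry_run: bool = True) -> List[List[str]]:
    """Send each sysrq letter in order; with dry_run=True only build the commands."""
    commands = [sysrq_command(domain, letter) for letter in letters]
    if not dry_run:
        for cmd in commands:
            # Assumption: runs on the hypervisor with rights to talk to libvirt.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "example-domain" is a placeholder, not the real libvirt domain name.
    for cmd in send_sysrq("example-domain"):
        print(" ".join(cmd))
```

The default is a dry run, so the commands can be checked before anything is sent to a wedged VM. The resulting backtraces would show up in the guest's kernel log, so the console log should be captured afterwards.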
@Andrew can quarry-nfs-1 be deleted then? I don't see anything stored on it, /srv/quarry/project is empty and has no volume mounted.
All data is on quarry-nfs-2, which also has the external volume mounted on it.
I turned off the quarry-nfs-1 instance to verify that it's not used.
We have a cookbook to failover NFS between an active server and a passive one, which includes moving the volume attachment. So nfs-1 is there as a safety net. I wouldn't insist that it's a /necessary/ safety net but it also seems like a harmless one.
Ah, sorry, I should've read back further in the task! Yes, that host can+should be deleted.