quarry-nfs-1 went down; quarry is offline
Closed, Resolved · Public

Description

13:07:04 <+wmcs-alerts> (InstanceDown) firing: Project quarry instance quarry-nfs-1 is down   - https://prometheus-alerts.wmcloud.org
13:26:04 <+wmcs-alerts> (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project quarry   - https://prometheus-alerts.wmcloud.org

Event Timeline

RhinosF1 triaged this task as Unbreak Now! priority. Feb 19 2022, 1:38 PM
RhinosF1 created this task.
Restricted Application added a subscriber: Aklapper.
taavi lowered the priority of this task from Unbreak Now! to High. Feb 19 2022, 2:11 PM
taavi added a subscriber: taavi.

Rebooting the NFS server seems to have solved the immediate issue.

Leaving this task open so that we can investigate why this happened and how to prevent it. The VM was fully unresponsive; even the serial console didn't react at all. The full console log is in P21049.

RhinosF1 raised the priority of this task from High to Unbreak Now!. Feb 20 2022, 7:17 PM

Crashed again

For the record:

19:02:04 <wmcs-alerts> (InstanceDown) firing: Project quarry instance quarry-nfs-1 is down  - https://prometheus-alerts.wmcloud.org
19:19:04 <wmcs-alerts> (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project quarry  - https://prometheus-alerts.wmcloud.org

Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:23:35Z] <taavi> hard rebooted quarry-nfs-1 again T302154

Mentioned in SAL (#wikimedia-cloud) [2022-02-20T19:49:50Z] <andrewbogott> moving nfs service from quarry-nfs-1 (bullseye) to quarry-nfs-2 (buster), testing to see if T302154 is a kernel or nfs-version issue

The VM was fully unresponsive; even the serial console didn't react at all.

Is it possible to send a sysrq? If so, can you send sysrq-l and sysrq-w next time? That should force a backtrace, so we can see which kernel bug this is.
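
For reference, one way the sysrq request above might be carried out from the hypervisor side is sketched below. This is only an illustration, not how the cloud admins necessarily do it. It assumes the VM runs under libvirt and that `virsh` is available on the hypervisor; the domain name in the usage note is a placeholder, and the helper names `sysrq_command` and `send_sysrq` are made up for this sketch. The guest also needs sysrq enabled (the `kernel.sysrq` sysctl) for the keys to have any effect.

```python
import subprocess

# Map the sysrq letters requested in this task to Linux keycode names.
# sysrq-l: backtrace of all active CPUs; sysrq-w: dump of blocked tasks.
SYSRQ_KEYS = {"l": "KEY_L", "w": "KEY_W"}


def sysrq_command(domain, key):
    """Build the virsh argv that sends Alt+SysRq+<key> to a libvirt domain.

    `domain` is whatever name or ID libvirt uses for the VM (an assumption;
    it is usually not the same as the instance hostname).
    """
    if key not in SYSRQ_KEYS:
        raise ValueError("unsupported sysrq key: %r" % (key,))
    return [
        "virsh", "send-key", domain,
        "--codeset", "linux",
        "KEY_LEFTALT", "KEY_SYSRQ", SYSRQ_KEYS[key],
    ]


def send_sysrq(domain, keys=("l", "w"), dry_run=True):
    """Send each sysrq key in turn; with dry_run=True only build the commands."""
    commands = [sysrq_command(domain, k) for k in keys]
    if not dry_run:
        for argv in commands:
            # check=True raises CalledProcessError if virsh reports a failure.
            subprocess.run(argv, check=True)
    return commands
```

Calling `send_sysrq("<libvirt-domain>", dry_run=False)` on the hypervisor would send both keys; any resulting backtraces should then show up in the guest's kernel log or on its console, provided the kernel is still alive enough to handle the key presses.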

zhuyifei1999 lowered the priority of this task from Unbreak Now! to High. Feb 23 2022, 5:42 AM

This has not happened since. Closing per IRC.