
labstore1006 - Various VMs are stuck
Closed, Resolved · Public

Description

from dmesg -T:

[Wed Mar 13 18:10:39 2019] nfs: server labstore1006.wikimedia.org not responding, timed out
...
[Thu Mar 14 18:56:26 2019] nfs: server labstore1006.wikimedia.org not responding, timed out

This has caused various processes to get stuck in the uninterruptible sleep state (D).
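
For anyone checking a VM for the same symptom, here is a minimal sketch of one way to list processes in the D state by reading /proc/<pid>/stat on Linux. It is not the method used on this task; the function names `parse_stat_state` and `find_d_state_processes` are hypothetical, chosen only for illustration.

```python
import os


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen].strip())
    comm = stat_line[lparen + 1:rparen]
    # The state letter is the first field after the closing parenthesis.
    state = stat_line[rparen + 1:].split()[0]
    return pid, comm, state


def find_d_state_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """Return (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or file is unreadable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in find_d_state_processes():
        print(pid, comm)
```

Reading /proc does not touch the NFS mount itself, so a check like this should not hang on the unresponsive server the way a command that accesses the mounted path can.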

Event Timeline

GTirloni triaged this task as High priority. Mar 14 2019, 6:58 PM
GTirloni created this task.
Restricted Application added a subscriber: Aklapper. Mar 14 2019, 6:58 PM
GTirloni updated the task description. Mar 14 2019, 7:02 PM

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T19:04:13Z] <gtirloni> bstorm started nfsd on labstore1006 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T19:08:30Z] <gtirloni> rebooted tools-sgewebgrid-lighttpd-0914 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T19:08:37Z] <gtirloni> rebooted tools-worker-1028 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T21:23:00Z] <gtirloni> rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T21:32:41Z] <gtirloni> rebooted tools-exec-1020 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T21:38:51Z] <gtirloni> rebooted tools-sgewebgrid-generic-0904 (T218341)

It seems the load avg situation is resolved now. I don't see any processes stuck in 'D' state.

GTirloni closed this task as Resolved. Mar 14 2019, 9:42 PM