
labstore1006 - Various VMs are stuck
Closed, Resolved, Public

Description

From dmesg -T:

[Wed Mar 13 18:10:39 2019] nfs: server labstore1006.wikimedia.org not responding, timed out
...
[Thu Mar 14 18:56:26 2019] nfs: server labstore1006.wikimedia.org not responding, timed out

This has caused various processes to get stuck in the uninterruptible sleep state (D).
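For anyone checking other VMs for the same symptom, here is a minimal Python sketch that lists processes currently in the D state by reading /proc/<pid>/stat on Linux. It is not the tooling used during this incident; the function names parse_stat and find_d_state are hypothetical, made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) parsed from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')' in the line.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]  # first field after the command name
    return int(pid_str), comm, state


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the entry is unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling find_d_state() on an affected VM and printing the result gives a quick list of stuck PIDs and command names. A process that stays in D across repeated scans while the NFS server is unreachable is a likely candidate for the hang described above; a single sighting is not conclusive, since processes pass through D briefly during normal I/O.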

Screenshot from 2019-03-14 16-01-41.png (688×543 px, 76 KB)

Event Timeline

GTirloni created this task.

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T19:04:13Z] <gtirloni> bstorm started nfsd on labstore1006 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T19:08:30Z] <gtirloni> rebooted tools-sgewebgrid-lighttpd-0914 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T19:08:37Z] <gtirloni> rebooted tools-worker-1028 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T21:23:00Z] <gtirloni> rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T21:32:41Z] <gtirloni> rebooted tools-exec-1020 (T218341)

Mentioned in SAL (#wikimedia-cloud) [2019-03-14T21:38:51Z] <gtirloni> rebooted tools-sgewebgrid-generic-0904 (T218341)

The load average situation seems to be resolved now. I don't see any processes stuck in the 'D' state.