
summary: Puppet agent failure detected on instance quarry-worker-04 in project quarry
Closed, Resolved, Public

Description

From alert:
https://prometheus-alerts.wmcloud.org/?q=

alertname: PuppetAgentFailure
project: quarry
summary: Puppet agent failure detected on instance quarry-worker-04 in project quarry
instance: quarry-worker-04
severity: warn
@receiver: cloud-admin-feed

Event Timeline

dcaro triaged this task as High priority. Mar 25 2022, 11:53 AM
dcaro created this task.

It seems the instance is stuck trying to contact NFS, and that makes puppet fail:

root@quarry-worker-04:~# dmesg -T
...
[Fri Mar 25 02:34:35 2022] nfs: server quarry-nfs.svc.quarry.eqiad1.wikimedia.cloud not responding, timed out
root@quarry-worker-04:~# run-puppet-agent
...
Error: '/usr/bin/timeout -k 5s 20s /bin/mkdir -p /mnt/nfs/labstore-secondary-project' returned 124 instead of one of [0]
Error: /Stage[main]/Profile::Wmcs::Nfsclient/Labstore::Nfs_mount[project-on-labstore-secondary]/Exec[create-/mnt/nfs/labstore-secondary-project]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/timeout -k 5s 20s /bin/mkdir -p /mnt/nfs/labstore-secondary-project' returned 124 instead of one of [0] (corrective)
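For reference, the 124 in the puppet error is the exit status GNU `timeout` uses when the time limit expires before the command finishes, so the `mkdir` on the NFS path hung rather than failing outright. A minimal sketch of the same kind of check from a shell follows; the helper names `classify_timeout_status` and `probe` are made up for illustration and are not part of puppet or of the Wikimedia tooling.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Map an exit status returned by GNU coreutils `timeout` to a short label.
# 124 is what `timeout` returns when the time limit was hit.
classify_timeout_status() {
    case "$1" in
        0)   echo "ok" ;;
        124) echo "timed out" ;;
        *)   echo "failed with status $1" ;;
    esac
}

# Run a command under a time limit and report how it ended.
# Usage: probe LIMIT COMMAND [ARGS...]
probe() {
    limit="$1"
    shift
    # Discard the command's own output; only its exit status matters here.
    timeout -k 5s "$limit" "$@" >/dev/null 2>&1
    classify_timeout_status "$?"
}
```

For example, `probe 20s stat /mnt/nfs/labstore-secondary-project` would be expected to report a timeout while the mount is hung, assuming `stat` blocks on it the same way `mkdir` did.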

Looking into it, might just reboot to get it unstuck.

OK, so the new NFS server (quarry-nfs-2) is the correct one, and it is the one the service IP points to. I tried a lazy unmount (umount -l) and mounting again, but that did not work, so I'm rebooting to see if that gets it unstuck.

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T12:04:09Z] <dcaro> rebooting quarry-worker-04.quarry.eqiad1.wikimedia.cloud due to stuck nfs (T304681)

That did it, now it's mounting nfs from quarry-nfs-2, closing.
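One way to confirm which server a mount point is actually using is to read its source from the kernel's mount table. This is a rough sketch, not the procedure used on this task: the function names `mount_source` and `nfs_server_of` are hypothetical, and it assumes the usual `server:/export` source format (it does not handle bracketed IPv6 addresses).

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the source (first field) of the mount at MOUNTPOINT.
# Usage: mount_source MOUNTPOINT [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts; returns non-zero if nothing is mounted there.
mount_source() {
    awk -v mp="$1" '
        $2 == mp { src = $1 }   # keep the last match, i.e. the most recent mount
        END { if (src != "") print src; else exit 1 }
    ' "${2:-/proc/mounts}"
}

# Print only the server part of an NFS source such as "server:/export".
# Usage: nfs_server_of MOUNTPOINT [MOUNTS_FILE]
nfs_server_of() {
    src=$(mount_source "$@") || return 1
    printf '%s\n' "${src%%:*}"
}
```

Running `nfs_server_of /mnt/nfs/labstore-secondary-project` on the worker would print the host part of the mounted source, which can then be compared against what the service name resolves to.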