NFS broken for new labs instances
Closed, ResolvedPublic

Description

On tools-worker-04, NFS is broken (both /project and /home). Puppet fails with:

Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Notice: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: clnt_create: RPC: Program not registered
Error: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/tools 180 returned 2 instead of one of [0]
Error: /Stage[main]/Role::Labs::Instance/Exec[block-for-home-export]/returns: change from notrun to 0 failed: /usr/local/sbin/block-for-export labstore.svc.eqiad.wmnet project/tools 180 returned 2 instead of one of [0]
Notice: /Stage[main]/Role::Labs::Instance/Mount[/home]: Dependency Exec[block-for-home-export] has failures: true
Warning: /Stage[main]/Role::Labs::Instance/Mount[/home]: Skipping because of failed dependencies

I restarted the whole instance to no avail.

Event Timeline

yuvipanda raised the priority of this task from to High.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: yuvipanda, coren, Andrew, chasemp.
chasemp raised the priority of this task from High to Unbreak Now!.
chasemp set Security to None.

I'm not seeing the issue at all at this time, and that instance - as far as I can tell - has working NFS mounts.

It's entirely possible that this was a transient issue that was fixed when I cleaned up old snapshots on labstore1001 this morning. One of those snapshots had a problem that prevented it from being cleanly deactivated, and I gave mountd a kick in the process of getting it to release the mountpoint. mountd would be the RPC endpoint involved in the error reported in this task.

The message itself, "clnt_create: RPC: Program not registered", is fairly opaque because it is rare in practice. It means that RPC was up but that one of the endpoints was not responsive. This is consistent with mountd having been stuck on that snapshot. (Having NFS down or the export not available would have given a different, clearer error.)

I don't think there is anything more to do here unless the original symptom (stuck snapshot) recurs, at which point investigating that in depth is the better course of action.
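
If the symptom does come back, one quick way to tell "RPC is up but mountd is not registered" apart from other failures is to look at what the server's portmapper currently lists. The following is only a rough sketch, not something that was run as part of this task: it assumes `rpcinfo -p <host>` is available on the client, and the helper names (`registered_programs`, `mountd_registered`, `query_rpcinfo`) are made up for illustration. 100005 is the standard ONC RPC program number for mountd.

```python
import subprocess
from typing import Set

# Standard ONC RPC program number assigned to mountd.
MOUNTD_PROGRAM = 100005


def registered_programs(rpcinfo_output: str) -> Set[int]:
    """Collect the program numbers listed in `rpcinfo -p` style output."""
    programs = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric program number; the header row does not.
        if fields and fields[0].isdigit():
            programs.add(int(fields[0]))
    return programs


def mountd_registered(rpcinfo_output: str) -> bool:
    """True if mountd appears among the registered RPC programs."""
    return MOUNTD_PROGRAM in registered_programs(rpcinfo_output)


def query_rpcinfo(host: str, timeout: int = 30) -> str:
    """Hypothetical helper: run `rpcinfo -p` against a host and return its output."""
    result = subprocess.run(
        ["rpcinfo", "-p", host],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling `mountd_registered(query_rpcinfo("labstore.svc.eqiad.wmnet"))` from an affected instance would then return False in the situation described above, where the portmapper answers but mountd is missing from its list.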