
showmount not working on labstore1004 & labstore1005
Closed, Resolved · Public · 0 Story Points

Description

At around 15:14 UTC on July 31, this alarm went off on toolschecker:

PROBLEM - toolschecker: showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/secondary_cluster_showmount - 177 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker

The issue is that showmount -e nfs-tools-project.svc.eqiad.wmnet returns an error:

rpc mount export: RPC: Remote system error

The same happens against localhost on labstore1004. Restarting NFS didn't help, but NFS on this host doesn't restart cleanly without a reboot. The log messages from around 15:45 are from me restarting NFS.

The alert is ACKed, and perhaps we should schedule a reboot of the server. Doing it immediately seems needlessly disruptive.

Event Timeline

Bstorm triaged this task as Normal priority. Jul 31 2019, 5:01 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Jul 31 2019, 5:01 PM
Bstorm moved this task from Backlog to Shared Storage on the Data-Services board.

Mentioned in SAL (#wikimedia-cloud) [2019-08-13T13:41:52Z] <jeh> Set Icinga downtime for toolschecker labs showmount T229448

Mentioned in SAL (#wikimedia-operations) [2019-08-28T21:39:16Z] <bd808> Set downtime/ack for showmount on labstore1004 (T229448)

The RPC portmapper is out of sync with the NFS server. NFS will need to be restarted to resolve this, but unfortunately that will cause a brief interruption in client IO.
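One way to see what the portmapper currently advertises on a host is `rpcinfo -p`. The following is a minimal sketch, assuming `rpcinfo` and `awk` are available on the machine running it; the helper name `rpc_registered` is hypothetical and not part of any patch on this task. Since `showmount` queries the mountd service, that registration is the one of most interest here.

```shell
# Hypothetical helper: succeed if a named RPC service is listed by the
# portmapper on the given host, fail otherwise.
rpc_registered() {
    host="$1"
    service="$2"
    # rpcinfo -p prints one header line, then rows of:
    #   program vers proto port service
    rpcinfo -p "$host" 2>/dev/null |
        awk -v svc="$service" 'NR > 1 && $5 == svc { found = 1 } END { exit !found }'
}

# Example use (not run here):
#   for svc in portmapper mountd nfs; do
#       rpc_registered labstore1004 "$svc" && echo "$svc registered" || echo "$svc missing"
#   done
```

This only reports what is registered with the portmapper; it does not by itself confirm that the service behind the registration is answering requests.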

bd808 renamed this task from showmount not working on labstore1004 to showmount not working on labstore1004 & labstore1005. Wed, Aug 28, 9:49 PM

Change 533220 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] labstore: check nfs v4 cluster status with rpcinfo

https://gerrit.wikimedia.org/r/533220

Since showmount is not reliable with NFS v4 (as also seen in T171508), we could check NFS connectivity over the cluster IP with the Nagios rpcinfo wrapper:

$ /usr/lib/nagios/plugins/check_rpc -H 10.64.37.19 -C nfs -c4 -t
OK: RPC program nfs version 4 tcp running
$ /usr/lib/nagios/plugins/check_rpc -H 10.64.37.20 -C nfs -c4 -t
CRITICAL: RPC program nfs  version 4 tcp is not running
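For hosts without the Nagios plugin, roughly the same probe can be made with `rpcinfo` directly. This is a minimal sketch, assuming `rpcinfo` is installed on the checking host; the function name `check_nfs_v4` and its messages are illustrative (modelled on the plugin output above) and are not taken from the patch.

```shell
# Hypothetical helper: probe NFS version 4 over TCP on a host.
# Return codes follow the Nagios convention (0 = OK, 2 = CRITICAL).
check_nfs_v4() {
    host="$1"
    # rpcinfo -t calls procedure 0 (the null procedure) of the given
    # program and version over TCP and exits non-zero on failure.
    if rpcinfo -t "$host" nfs 4 >/dev/null 2>&1; then
        echo "OK: RPC program nfs version 4 tcp running on $host"
        return 0
    fi
    echo "CRITICAL: RPC program nfs version 4 tcp is not running on $host"
    return 2
}

# Example use (not run here):
#   check_nfs_v4 10.64.37.19
```

Unlike `showmount -e`, this does not depend on mountd answering an export query; it only confirms that the NFS program itself responds at version 4.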

That sounds like a great idea!

Change 533220 merged by Jhedden:
[operations/puppet@production] labstore: check nfs v4 cluster status with rpcinfo

https://gerrit.wikimedia.org/r/533220

JHedden closed this task as Resolved. Thu, Aug 29, 3:49 PM
JHedden claimed this task.

Change 533302 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] labstore: Open NFS between secondary servers

https://gerrit.wikimedia.org/r/533302

Change 533302 merged by Jhedden:
[operations/puppet@production] labstore: Open NFS between secondary servers

https://gerrit.wikimedia.org/r/533302

Change 533318 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] toolschecker: remove showmount check

https://gerrit.wikimedia.org/r/533318

Change 533318 merged by Jhedden:
[operations/puppet@production] toolschecker: remove showmount check

https://gerrit.wikimedia.org/r/533318