
labstore1006 nfsd not started after reboot
Open, Needs Triage · Public

Description

labstore1006 apparently had a spontaneous reboot, which surfaced two issues:

  • nfsd wasn't running after the reboot
  • The resulting NFS alert in Icinga did not page or notify the WMCS team

Related: T217473

Event Timeline

GTirloni created this task. Mar 2 2019, 12:19 PM
Restricted Application added a subscriber: Aklapper. Mar 2 2019, 12:19 PM

Ensured it's enabled:

root@labstore1006:~# systemctl enable nfs-kernel-server
Synchronizing state for nfs-kernel-server.service with sysvinit using update-rc.d...
Executing /usr/sbin/update-rc.d nfs-kernel-server defaults
insserv: warning: current start runlevel(s) (empty) of script `nfs-kernel-server' overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `nfs-kernel-server' overrides LSB defaults (0 1 6).

root@labstore1006:~# ls -l /etc/rc5.d/*nfs*
lrwxrwxrwx 1 root root 27 Feb 16  2018 /etc/rc5.d/S01nfs-kernel-server -> ../init.d/nfs-kernel-server

And Puppet disables it:

root@labstore1006:~# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labstore1006.wikimedia.org
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1551531636'
Notice: /Stage[main]/Profile::Dumps::Distribution::Nfs/Service[nfs-kernel-server]/enable: enable changed 'true' to 'false'
Notice: Applied catalog in 13.41 seconds

Related commit: 195b6fa644f5d3083bbbf9e755bac5598cbee030

+    # Manage state manually
+    service { 'nfs-kernel-server':
+        enable => false,
+    }

Bstorm added a subscriber: Bstorm. Mar 2 2019, 4:42 PM

This was required by design on the main project NFS, and I think it was generalized to this cluster during the build for no strong reason. Both servers are in use for NFS (either by Cloud or Analytics) at all times.

I think it is well worth enabling the service going forward, and disabling Puppet during maintenance, on this cluster especially. On the project NFS cluster, NFS must only be managed manually. There may also have been build steps that made that setting practical. IIRC, all of our NFS servers have the service managed manually in Puppet.

I'm open to other ideas, but that's my feeling.

Bstorm added a comment. Mar 4 2019, 3:19 PM

To temper my optimistic response over the weekend, I should mention that during the RAID expansion, it would have been problematic, and likely caused corruption, if nfsd had started on reboot. I'm just thinking we could also disable Puppet and deliberately disable nfsd before rebooting during that kind of thing.
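
The maintenance sequence described above could look roughly like the following. This is a sketch, not a tested runbook: the function names `pre_maintenance` and `post_maintenance`, the `run` helper and the `DRY_RUN` switch are made up for illustration, and `post_maintenance` assumes the Puppet manifest no longer sets `enable => false` (otherwise the next Puppet run would disable the service again).

```shell
#!/bin/bash
# Sketch only: steps around a reboot during storage maintenance
# (for example a RAID expansion). All names here are hypothetical.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

pre_maintenance() {
    local reason="$1"
    # Keep Puppet from changing service state during the maintenance window.
    run puppet agent --disable "$reason"
    # Make sure nfsd is neither running now nor started on the next boot.
    run systemctl stop nfs-kernel-server
    run systemctl disable nfs-kernel-server
}

post_maintenance() {
    # Bring NFS back first, then hand control back to Puppet.
    # Assumes Puppet no longer forces enable => false for this service.
    run systemctl enable nfs-kernel-server
    run systemctl start nfs-kernel-server
    run puppet agent --enable
}
```

With the default `DRY_RUN=1`, calling `pre_maintenance "RAID expansion"` only prints the three commands it would run; setting `DRY_RUN=0` executes them and needs root.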

Bstorm claimed this task. Mar 5 2019, 6:10 PM

After discussing it at this past (admittedly small) team meeting, I got more context on this. I'm going to check whether there's any chance that this could be confused with servers where it absolutely should not be done and perhaps enable the service.

However, with the current config, we should have been paged for NFS being down. This seems to be the more important item. If NFS had started on reboot, we would never have known that this error happened at all. Luckily, there's already a ticket for that part.

Dzahn added a subscriber: Dzahn. Mar 14 2019, 9:07 AM

labstore1006
NFS - CRITICAL 2019-03-14 09:05:41 0d 14h 56m 33s 3/3 connect to address 208.80.154.7 and port 2049: Connection refused

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labstore1006&service=NFS
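
For reproducing that result by hand, a TCP connect test against port 2049 is enough. The following is only a sketch and not the actual Icinga check definition: `check_tcp_port` is a hypothetical helper that relies on bash's `/dev/tcp` redirection and coreutils `timeout`, and it borrows the usual monitoring-plugin convention of exit code 2 for CRITICAL.

```shell
#!/bin/bash
# Sketch only: report whether a TCP port accepts connections.
# check_tcp_port is a made-up name, not part of Icinga or any package.
check_tcp_port() {
    local host="$1" port="$2" wait_secs="${3:-5}"
    # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
    # timeout bounds how long an unreachable host can block the check.
    if timeout "$wait_secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "OK: $host:$port accepts connections"
        return 0
    fi
    echo "CRITICAL: cannot connect to $host port $port"
    return 2
}
```

Running `check_tcp_port 208.80.154.7 2049` from a host that is allowed to reach the NFS port would show whether nfsd is listening again.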

Yeah, it's out of service for T217473

I'll ack it.

Ah, it's already acked :)

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM