labstore1006 nfsd not started after reboot
Closed, Resolved · Public

Description

labstore1006 apparently had a spontaneous reboot, which surfaced two issues:

  • NFSd wasn't running after the reboot
  • NFSd error in Icinga did not cause a page/notification to be sent to the WMCS team

Related: T217473

Event Timeline

GTirloni created this task. Mar 2 2019, 12:19 PM
Restricted Application added a subscriber: Aklapper. Mar 2 2019, 12:19 PM

Ensured it's enabled:

root@labstore1006:~# systemctl enable nfs-kernel-server
Synchronizing state for nfs-kernel-server.service with sysvinit using update-rc.d...
Executing /usr/sbin/update-rc.d nfs-kernel-server defaults
insserv: warning: current start runlevel(s) (empty) of script `nfs-kernel-server' overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `nfs-kernel-server' overrides LSB defaults (0 1 6).

root@labstore1006:~# ls -l /etc/rc5.d/*nfs*
lrwxrwxrwx 1 root root 27 Feb 16  2018 /etc/rc5.d/S01nfs-kernel-server -> ../init.d/nfs-kernel-server
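
For reference, on a systemd host the same thing can be checked without reading the sysvinit symlinks directly; a minimal check (systemd may also print sysv-generator warnings alongside the answer):

root@labstore1006:~# systemctl is-enabled nfs-kernel-server
enabled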

And Puppet disables it:

root@labstore1006:~# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for labstore1006.wikimedia.org
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1551531636'
Notice: /Stage[main]/Profile::Dumps::Distribution::Nfs/Service[nfs-kernel-server]/enable: enable changed 'true' to 'false'
Notice: Applied catalog in 13.41 seconds

Related commit 195b6fa644f5d3083bbbf9e755bac5598cbee030:

+    # Manage state manually
+    service { 'nfs-kernel-server':
+        enable => false,
+    }
Bstorm added a subscriber: Bstorm. Mar 2 2019, 4:42 PM

This was required by design on the main project NFS, and I think it was generalized to this cluster during the build for no strong reason. Both servers are in use for NFS (either by Cloud or Analytics) at all times.

I think it is well worth it to enable the service going forward, and to disable puppet during maintenance on this cluster especially. NFS must only be managed manually on the project NFS cluster. There may also have been build steps that made that setting practical. I believe all of our NFS servers have the service managed manually in puppet.

I'm open to other ideas, but that's my feeling.

Bstorm added a comment. Mar 4 2019, 3:19 PM

To temper my optimistic response over the weekend, I should mention that during the RAID expansion it would have been problematic, and likely corrupting, if the nfsd daemons had started on reboot. I'm just thinking we could also disable puppet and deliberately disable nfsd before rebooting during that kind of work, as sketched below.
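
A minimal sketch of that pre-maintenance sequence, assuming it runs as root and that nothing else on the host still needs nfsd (the disable message is illustrative):

# Keep Puppet from flipping the service state mid-maintenance
puppet agent --disable 'RAID expansion, T217473'

# Stop nfsd now and keep it from starting on the coming reboot
systemctl disable --now nfs-kernel-server

# ... perform the maintenance and reboot ...

# Afterwards, bring everything back
systemctl enable --now nfs-kernel-server
puppet agent --enable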

Bstorm claimed this task. Mar 5 2019, 6:10 PM

After discussing it at this past (admittedly small) team meeting, I got more context on this. I'm going to check whether there's any chance that this could be confused with servers where it absolutely should not be done, and perhaps enable the service.

However, with the current config, we should have been paged for NFS being down. That seems to be the more important item: if NFS had started on reboot, we would never have known this error happened at all. Luckily, there's already a ticket for that part.

Dzahn added a subscriber: Dzahn. Mar 14 2019, 9:07 AM

labstore1006
NFS - CRITICAL 2019-03-14 09:05:41 0d 14h 56m 33s 3/3 connect to address 208.80.154.7 and port 2049: Connection refused

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labstore1006&service=NFS

Yeah, it's out of service for T217473

I'll ack it.

Ah, it's already acked :)

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM
bd808 moved this task from Backlog to Shared Storage on the Data-Services board. May 30 2019, 7:03 PM

Change 517117 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: enable the service for nfs to start on reboot

https://gerrit.wikimedia.org/r/517117
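
Presumably this just flips the resource from the commit quoted above; a sketch of the intended end state (not the actual diff):

    # Start nfsd at boot; both dumps distribution servers serve NFS at all times
    service { 'nfs-kernel-server':
        enable => true,
    }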

Andrew added a subscriber: Andrew. Jun 17 2019, 12:45 PM

Just to double-check: IIRC, back in the day we avoided this because we had multiple controllers attached to a shared shelf and if two controllers ran at the same time then terrible, terrible things happened. Is it safe to say that there's no current situation where having 'too many' nfs services running at once causes harm?

I checked the IRC logs to see what Icinga told us then:

[09:25:21] <icinga-wm>	 PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:13] <icinga-wm>	 RECOVERY - Host labstore1006 is UP: PING WARNING - Packet loss = 28%, RTA = 36.35 ms
[09:31:51] <icinga-wm>	 PROBLEM - NFS on labstore1006 is CRITICAL: connect to address 208.80.154.7 and port 2049: Connection refused

Can we make sure that labstore1006/7 going down is CRITICAL and pages? If so then I'm ok with this patch going in.

That I believe is done. We can always do something that will make it page, to be sure...
But the hiera works on the other NFS servers. It has, unfortunately, already tested itself.

In lieu of pages, I looked at the catalog (/var/lib/puppet/client_data/catalog/<hostname>.json) and compared the entries for labmon1001 (which pages) and labstore1006. Yeah, I'm paranoid...

labstore1006:

{
    "exported": false,
    "file": "/etc/puppet/modules/monitoring/manifests/host.pp",
    "line": 130,
    "parameters": {
        "address": "208.80.154.7",
        "check_command": "check_ping!500,20%!2000,100%",
        "check_period": "24x7",
        "contact_groups": "admins",
        "ensure": "present",
        "host_name": "labstore1006",
        "hostgroups": "wmcs_eqiad,asw2-a-eqiad",
        "icon_image": "vendors/debian.png",
        "max_check_attempts": 2,
        "notification_interval": 0,
        "notification_options": "d,u,r,f",
        "notification_period": "24x7",
        "notifications_enabled": "1",
        "parents": "asw2-a-eqiad",
        "statusmap_image": "vendors/debian.gd2",
        "vrml_image": "vendors/debian.png"
    },
    "tags": [
        "monitoring::exported_nagios_host",
        "monitoring",
        "exported_nagios_host",
        "labstore1006",
        "monitoring::host",
        "host",
        "class",
        "base::monitoring::host",
        "base",
        "profile::base",
        "profile",
        "standard",
        "profile::standard",
        "role::dumps::distribution::server",
        "role",
        "dumps",
        "distribution",
        "server"
    ],
    "title": "labstore1006",
    "type": "Monitoring::Exported_nagios_host"
},

labmon1001:

{
    "exported": false,
    "file": "/etc/puppet/modules/monitoring/manifests/host.pp",
    "line": 130,
    "parameters": {
        "address": "10.64.37.13",
        "check_command": "check_ping!500,20%!2000,100%",
        "check_period": "24x7",
        "contact_groups": "admins,sms,admins",
        "ensure": "present",
        "host_name": "labmon1001",
        "hostgroups": "labs_eqiad,asw2-c-eqiad",
        "icon_image": "vendors/debian.png",
        "max_check_attempts": 2,
        "notification_interval": 0,
        "notification_options": "d,u,r,f",
        "notification_period": "24x7",
        "notifications_enabled": "1",
        "parents": "asw2-c-eqiad",
        "statusmap_image": "vendors/debian.gd2",
        "vrml_image": "vendors/debian.png"
    },
    "tags": [
        "monitoring::exported_nagios_host",
        "monitoring",
        "exported_nagios_host",
        "labmon1001",
        "monitoring::host",
        "host",
        "class",
        "base::monitoring::host",
        "base",
        "profile::base",
        "profile",
        "standard",
        "profile::standard",
        "role::wmcs::monitoring",
        "role",
        "wmcs"
    ],
    "title": "labmon1001",
    "type": "Monitoring::Exported_nagios_host"
},

labmon1001 lists sms in the contact groups for the host being down (not pingable), but labstore1006 does not.
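
For what it's worth, that difference can also be pulled out of each catalog directly rather than eyeballed; a sketch assuming jq is installed and that the compiled catalog keeps its resources under a top-level .resources array:

jq -r '.resources[] | select(.type == "Monitoring::Exported_nagios_host") | .parameters.contact_groups' /var/lib/puppet/client_data/catalog/<hostname>.json

Per the entries above, that prints admins for labstore1006 and admins,sms,admins for labmon1001.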

I found:

[ariel@bigtrouble hieradata]$ more hosts/labmon1001.yaml
statsite::instance::graphite_host: 'localhost'
statsite::instance::extended_counters: 1
profile::base::notifications: critical

but ./common/profile/dumps/distribution.yaml has these keys, which don't get read because they have the wrong prefix (?):
profile::base::check_disk_critical: true
profile::base::notifications: critical

The rest of the keys in there match the profile and will get processed properly.

Proof: this change https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/517693/
produces this change in the catalog: https://puppet-compiler.wmflabs.org/compiler1002/16994/labstore1006.wikimedia.org/

I thought I'd set them on the role rather than the profile? Checking, it appears they aren't set in the right place. I'd rather it be on the role. Lemme check if that will work right. Thanks for looking!
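
If it does end up on the role, the hiera change would presumably look something like this (the file path is a guess based on the role::dumps::distribution::server tag above; the actual change is in the patch below):

# hieradata/role/common/dumps/distribution/server.yaml (path assumed)
profile::base::check_disk_critical: true
profile::base::notifications: critical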

Change 517705 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: move the alert settings to the right place

https://gerrit.wikimedia.org/r/517705

Change 517705 merged by Bstorm:
[operations/puppet@production] dumps distribution: move the alert settings to the right place

https://gerrit.wikimedia.org/r/517705

Just to double-check: IIRC, back in the day we avoided this because we had multiple controllers attached to a shared shelf and if two controllers ran at the same time then terrible, terrible things happened. Is it safe to say that there's no current situation where having 'too many' nfs services running at once causes harm?

Yes, I think that's safe to say. NFS is started via the service in general, and many processes get started. I don't think we can end up with anything quite like that from this particular change, at least.

Change 517117 merged by Bstorm:
[operations/puppet@production] dumps distribution: enable the service for nfs to start on reboot

https://gerrit.wikimedia.org/r/517117

Ok, all it did was enable the service, which was the idea. On the next reboot it will hopefully not go poorly :)

Bstorm closed this task as Resolved. Jun 19 2019, 2:54 PM