Page MenuHomePhabricator

puppet: profile::auto_restarts::service: have a way to don't deploy the systemd timers
Closed, ResolvedPublic

Description

In cloud realm VMs we get plenty of systemd timers deployed via profile::auto_restarts::service that don't belong there. We don't need or desire them.
There is some semantic interactions between the puppet manifests for debdeploy and auto_restarts that simply wont allow to disable systemd timers as they are today.

Actually they are harmful in some cases because they often run lsof trying to inspect a NFS filesystem which we may have many cloud VMs which can cause other problems.

The only known way to prevent the NFS filesystem from being hit by lsof is by using profile::debdeploy::client::exclude_filesystems: ['nfsv4'], ie. this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/920644/2
However, it is not really a valid solution because the default for cloud realm is to don't deploy this configuration via profile::debdeploy::client::ensure: absent, so the setting gets ignored.

Another apporach would be to have some hiera way to prevent the systemd timers from being deployed. That's the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/920648

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 920644 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: wmf-auto-restart: exclude NFS filesystems

https://gerrit.wikimedia.org/r/920644

Change 920648 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] profile::auto_restarts: allow the systemd timer to not be installed

https://gerrit.wikimedia.org/r/920648

Change 920654 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] profile::auto_restarts::service: add some spec tests

https://gerrit.wikimedia.org/r/920654

I think you should simply add nfsv4 to profile::debdeploy::client::exclude_filesystems, that will entire make the issues vanish. We have similar config snippets in prouction for systems with an HFDS mount (such as the stat* nodes).

The auto restarts are needed in Cloud VPS as well; with unattended-upgrades installing a security update of libfoo, this ensures that all services using a copy of libfoo get restarted, otherwise they'd continue to use the old library.

The auto restarts are needed in Cloud VPS as well; with unattended-upgrades installing a security update of libfoo, this ensures that all services using a copy of libfoo get restarted, otherwise they'd continue to use the old library.

i agree with this however currently the wmf-auto-restart.py config options depends on the debdeploy config file as such i think we need to refactor and decouple this dependency before we can configure options in the cloud environment. Worth noting that the script will work however /etc/debdeploy-client/config.json will not exist as such the relevant settings in hiera will not apply

The auto restarts are needed in Cloud VPS as well; with unattended-upgrades installing a security update of libfoo, this ensures that all services using a copy of libfoo get restarted, otherwise they'd continue to use the old library.

i agree with this however currently the wmf-auto-restart.py config options depends on the debdeploy config file as such i think we need to refactor and decouple this dependency before we can configure options in the cloud environment. Worth noting that the script will work however /etc/debdeploy-client/config.json will not exist as such the relevant settings in hiera will not apply

Ah, indeed, that's a good point. I hadn't thought about the dependency on the Puppet level.

jbond triaged this task as Medium priority.May 25 2023, 11:18 AM
fgiunchedi subscribed.

As far as I know this hasn't caused practical problems in the recent past (?) Boldly declining, though reopen as needed

I stand corrected, lsof from wmf-auto-restart does get occasionally stuck on nfs:

root     1818049  0.0  0.0  17980 13352 ?        Ss   00:27   0:00 /usr/bin/python3 /usr/local/sbin/wmf-auto-restart -s lldpd
root     1818061  0.0  0.0   6120  3316 ?        S    00:27   0:00  \_ /usr/bin/lsof +c 15 -nXd DEL
root     1818062  0.0  0.0   5564   308 ?        D    00:27   0:00      \_ /usr/bin/lsof +c 15 -nXd DEL
# cat /proc/1818062/stack 
[<0>] folio_wait_bit_common+0x13d/0x350
[<0>] folio_wait_writeback+0x28/0x80
[<0>] __filemap_fdatawait_range+0x90/0x120
[<0>] filemap_write_and_wait_range+0x5f/0x80
[<0>] nfs_getattr+0x422/0x470 [nfs]
[<0>] vfs_statx+0xc2/0x180
[<0>] vfs_fstatat+0x51/0x70
[<0>] __do_sys_newfstatat+0x3f/0x80
[<0>] do_syscall_64+0x55/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

I don't think nfs filesystems should be ever looked at by wmf-auto-restart with our current architecture, as such I suggest nfs filesystems to be excluded by default.

Yeah, that sounds good, we even have a Hiera option for this already; profile::debdeploy::client::exclude_filesystems

The auto restarts are needed in Cloud VPS as well; with unattended-upgrades installing a security update of libfoo, this ensures that all services using a copy of libfoo get restarted, otherwise they'd continue to use the old library.

i agree with this however currently the wmf-auto-restart.py config options depends on the debdeploy config file as such i think we need to refactor and decouple this dependency before we can configure options in the cloud environment. Worth noting that the script will work however /etc/debdeploy-client/config.json will not exist as such the relevant settings in hiera will not apply

The refactor/decoupling work is now tracked in T404164: Untangle /etc/debdeploy-client/config.json from debdeploy and wmf-auto-restarts

Change #1186937 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: flip debdeploy::client::ensure to present

https://gerrit.wikimedia.org/r/1186937

Change #1186938 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: exclude nfs/nfs4 from debdeploy::client in cloud

https://gerrit.wikimedia.org/r/1186938

Yeah, that sounds good, we even have a Hiera option for this already; profile::debdeploy::client::exclude_filesystems

I looked at the puppet code and in practice I see no reason why we shouldn't default to ensuring present debdeploy::client and its config, which would make this setting also work as expected

Change #1186937 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: flip debdeploy::client::ensure to present

https://gerrit.wikimedia.org/r/1186937

Change #1186938 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: exclude nfs/nfs4 from debdeploy::client in cloud

https://gerrit.wikimedia.org/r/1186938

fgiunchedi claimed this task.

Ok change deployed, current situation:

filippo@tools-k8s-worker-nfs-12:~$ cat /etc/debdeploy-client/config.json
{
  "exclude_mounts": [

  ],
  "exclude_filesystems": [
    "nfs",
    "nfs4"
  ],
  "filter_services": {
    "qemu-system-x86": [
      "*"
    ]
  }
}

Not exactly what the task originally was about, though I think Good Enough™. Resolving again for now, if sth goes wrong we'll know via alerts and we can reopen this task