
confd setup left without configuration doesn't stop confd
Open, Medium, Public

Description

I've found by chance that we have 34 hosts where confd no longer has any configuration files or templates, but the process is still running and logs every 3 seconds:

/usr/bin/confd[3971103]: WARNING Found no templates

The 34 hosts with that setup are:

arclamp2001.codfw.wmnet,arclamp1001.eqiad.wmnet,cloudgw[2002-2003]-dev.codfw.wmnet,cloudgw[1001-1002].eqiad.wmnet,cloudlb[2001-2003]-dev.codfw.wmnet,cuminunpriv1001.eqiad.wmnet,ganeti[2033-2034].codfw.wmnet,ganeti-test[2001-2003].codfw.wmnet,idm-test1001.wikimedia.org,moscovium.eqiad.wmnet,netbox-dev2002.codfw.wmnet,netboxdb2002.codfw.wmnet,netboxdb1002.eqiad.wmnet,netmon[1003,2002].wikimedia.org,people2003.codfw.wmnet,people1004.eqiad.wmnet,pybal-test2003.codfw.wmnet,sretest[2003-2005].codfw.wmnet,sretest[1002-1003].eqiad.wmnet,testreduce1002.eqiad.wmnet,testvm[2002,2004-2005].codfw.wmnet

Maybe we should do something on the Puppet side to ensure confd is stopped if no config/template is present:

/etc/confd
├── conf.d
└── templates
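
As a stopgap until Puppet handles this, the "no config present" check could be sketched in shell roughly as follows. The function name confd_has_config is hypothetical, and the sketch assumes confd's default layout, where template resources are *.toml files directly under /etc/confd/conf.d.

```shell
#!/bin/sh
# Hypothetical helper (not part of confd or of the Puppet module).
# Succeeds (exit 0) if the given confd directory contains at least one
# template resource, i.e. a *.toml file directly under conf.d.
confd_has_config() {
    confdir="${1:-/etc/confd}"
    # A missing conf.d directory counts as "no config": find's error is
    # discarded and the resulting empty output makes the test fail.
    [ -n "$(find "$confdir/conf.d" -maxdepth 1 -type f -name '*.toml' 2>/dev/null | head -n 1)" ]
}

# Example use, left commented out so the sketch has no side effects:
# confd_has_config /etc/confd || sudo systemctl stop confd
```

This only stops the running process once; unless Puppet also stops managing the service, the next Puppet run may start it again.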

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-02-07T19:19:52Z] <mutante> people1004 systemctl stop confd; running puppet; checking to remove confd remnants from people* hosts - T356296

Seems to me this has to do with the profile::firewall migration from iptables to nftables.

What these hosts have in common is profile::firewall::provider: nftables in hieradata.

And confd is pulled in from profile::firewall for request-ipblocks/abuse

From inside profile::firewall:

    case $provider {
        'ferm': {
            if $defs_from_etcd {
                # unmanaged files under /etc/ferm/conf.d are purged
                # so we define the file to stop it being deleted
                file { '/etc/ferm/conf.d/00_defs_requestctl':
                    ensure => file,
..

        'nftables': {
..
            if $defs_from_etcd and $defs_from_etcd_nft {
                confd::file { '/etc/nftables/sets/requestctl.nft':

See what happens when I pretend to change the firewall provider back to ferm for people hosts:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/998532/4/hieradata/role/common/microsites/peopleweb.yaml

-->

https://puppet-compiler.wmflabs.org/output/998532/1334/people1004.eqiad.wmnet/index.html

Mentioned in SAL (#wikimedia-operations) [2024-06-26T23:26:07Z] <mutante> people1004 - stopped confd which logs every 3 seconds that it can't find any templates (T356296)

Change #1050080 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: set profile::firewall::defs_from_etcd to false

https://gerrit.wikimedia.org/r/1050080

Change #1050080 abandoned by Dzahn:

[operations/puppet@production] peopleweb: set profile::firewall::defs_from_etcd to false

Reason:

per Moritz' comment

https://gerrit.wikimedia.org/r/1050080

I'm a little confused about this one. We have defs_from_etcd_nft set to false by default in Hiera for the firewall profile:

cmooney@wikilap:~/repos/puppet$ grep defs_from_etcd_nft hieradata/common/profile/firewall.yaml 
profile::firewall::defs_from_etcd_nft: false

Due to a separate issue, which causes nftables to fail completely if this is set to true (mixed IPv4 and IPv6 networks get included in that case, which isn't compatible with how the rules are defined), we actually need to make sure this does not evaluate to true for any production hosts right now.

Taking arclamp2001 as an example I don't see the /etc/nftables/sets/requestctl.nft file, which would be created if defs_from_etcd_nft was true. But at the same time I do see that confd.service is defined and is running (and logging the warnings). Perhaps a change was made but our current puppet config doesn't properly remove the confd service?

@cmooney

In profile::firewall there is an if $defs_from_etcd and $defs_from_etcd_nft. So if both are true, that installs confd::file { '/etc/nftables/sets/requestctl.nft': as you say.

There is also another if $defs_from_etcd { which installs confd::file { '/etc/ferm/conf.d/00_defs_requestctl': and whose condition is true by default. This part is outside the case $provider stanza but has an ensure => stdlib::ensure($provider == 'ferm'),.

What I think happens is that this second confd::file pulls in confd regardless of what the provider is set to. Then, since the provider is not ferm anymore, the file /etc/ferm/conf.d/00_defs_requestctl gets absented.

But merely absenting a confd file does not mean the confd service and package get removed.

So the result would be what we see, none of the requestctl files exist but confd is still there and without config.

Yea, so this:

if $defs_from_etcd {
    confd::file { '/etc/ferm/conf.d/00_defs_requestctl':
        ensure          => stdlib::ensure($provider == 'ferm'),

$defs_from_etcd is true.. (from common/profile/firewall.yaml) ..so confd gets pulled.

But the provider is not ferm, so the config file gets removed.

But there is nothing in confd::file that would remove confd the service. All it does is remove the config file.
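
The combination of conditions described above can be summarised as a small decision function. This is only a model of the analysis in this thread, not the Puppet code itself; the function name confd_outcome and its output labels are made up for illustration.

```shell
#!/bin/sh
# Hypothetical model of the profile::firewall behaviour described in this
# thread. Arguments: provider, defs_from_etcd, defs_from_etcd_nft.
confd_outcome() {
    provider="$1"
    defs_from_etcd="$2"
    defs_from_etcd_nft="$3"

    # Without defs_from_etcd, no confd::file is declared by the profile,
    # so the profile does not pull in confd at all.
    if [ "$defs_from_etcd" != "true" ]; then
        echo "not-pulled-in"
        return
    fi

    # confd is pulled in from here on; what remains to decide is whether
    # any template is actually left on disk.
    if [ "$provider" = "ferm" ]; then
        echo "running-with-template"
    elif [ "$provider" = "nftables" ] && [ "$defs_from_etcd_nft" = "true" ]; then
        echo "running-with-template"
    else
        # The ferm template is absented and the nft one is never declared.
        echo "running-empty"
    fi
}

# The affected hosts: nftables provider with the Hiera defaults.
confd_outcome nftables true false   # prints "running-empty"
```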

Change #1057264 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] firewal: if provider is nft and not pulling requestctl, remove confd

https://gerrit.wikimedia.org/r/1057264

This is an attempt to fix it per logic "if the provider is nft and we do NOT pull requestctl data.. THEN ... remove confd".

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1057264

puppet compiler shows on arclamp2001 it would remove confd:

https://puppet-compiler.wmflabs.org/output/1057264/3431/arclamp2001.codfw.wmnet/index.html

Here is what happens when compiling it on all of profile::firewall:

https://puppet-compiler.wmflabs.org/output/1057264/3432/ (still running)

Change #1057264 abandoned by Dzahn:

[operations/puppet@production] firewall: if provider is nft and not pulling requestctl, remove confd

https://gerrit.wikimedia.org/r/1057264

JMeybohm subscribed.

This came up during T374366: Race condition in iptables rules during puppet runs on k8s nodes - currently it is not possible to disable defs_from_etcd in a clean way.

Scott_French subscribed.

Indeed, there's currently no mechanism I'm aware of to automatically absent a confd instance from a host when there are no longer templates configured.

While the specific history of changes to profile::firewall may have made this particularly noisy, it's a more general issue.

I'll give some thought to whether this is something we can address in a straightforward way. In the meantime, if anyone has an estimate of how much of the fleet is running "empty" confd instances, that would be interesting to quantify.

@Scott_French from a quick check, unless I made some lazy mistake, I'd say around 733
(the second grep is just to allow cumin to aggregate them all stripping all the time, host, pid info)

$ sudo cumin 'R:package = confd' "journalctl -u confd -g 'Found no templates' -n1 | grep -o 'Found no templates'"
[...snip...]
===== NODE GROUP =====
(733) an-test-druid1001.eqiad.wmnet,an-test-presto1001.eqiad.wmnet,aphlict2001.codfw.wmnet,aphlict1002.eqiad.wmnet,arclamp2001.codfw.wmnet,arclamp1001.eqiad.wmnet,cephosd[2001-2003].codfw.wmnet,cephosd[1001-1005].eqiad.wmnet,cloudcumin2001.codfw.wmnet,cloudcumin1001.eqiad.wmnet,cloudgw[2002-2003]-dev.codfw.wmnet,cloudgw[1003-1004].eqiad.wmnet,cloudidp2001-dev.codfw.wmnet,cloudlb[2002-2004]-dev.codfw.wmnet,cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet,cuminunpriv1001.eqiad.wmnet,dborch1002.wikimedia.org,debmonitor-dev2001.codfw.wmnet,doc2003.codfw.wmnet,doc1004.eqiad.wmnet,durum[2001-2002].codfw.wmnet,durum[6001-6002].drmrs.wmnet,durum[1001-1002].eqiad.wmnet,durum[5001-5002].eqsin.wmnet,durum[3005-3006].esams.wmnet,durum7004.magru.wmnet,durum[4001-4002].ulsfo.wmnet,etherpad2002.codfw.wmnet,etherpad1004.eqiad.wmnet,ganeti[2025-2050].codfw.wmnet,ganeti[6001-6004].drmrs.wmnet,ganeti[1023-1054].eqiad.wmnet,ganeti[5004-5007].eqsin.wmnet,ganeti[3005-3008].esams.wmnet,ganeti[7001-7004].magru.wmnet,ganeti[4005-4008].ulsfo.wmnet,ganeti-test[2001-2003].codfw.wmnet,gerrit[1003,2002-2003].wikimedia.org,gitlab[1003-1004,2002-2003].wikimedia.org,hcaptcha[1001-1002,2001-2002].wikimedia.org,hcaptcha-proxy[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002,6001-6002,7001-7002].wikimedia.org,idm-test1001.wikimedia.org,idp-test[1005,2005].wikimedia.org,irc[1003,2003].wikimedia.org,krb2002.codfw.wmnet,krb1002.eqiad.wmnet,kubestage[2001-2004].codfw.wmnet,kubestage[1003-1006].eqiad.wmnet,kubestagemaster[2003-2005].codfw.wmnet,kubestagemaster[1003-1005].eqiad.wmnet,lists[1004,2001].wikimedia.org,maps[2011-2014].codfw.wmnet,maps[1011-1014].eqiad.wmnet,maps-test2001.codfw.wmnet,mc-gp[2004-2006].codfw.wmnet,mc-gp[1004-1006].eqiad.wmnet,mc-wf1001.eqiad.wmnet,netbox-dev2003.codfw.wmnet,netboxdb2003.codfw.wmnet,netboxdb1003.eqiad.wmnet,netmon[1003,2002].wikimedia.org,people2004.codfw.wmnet,people1005.eqiad.wmnet,phab2002.codfw.wmnet,phab[1004-1005].eqiad.wmnet,planet2003.codfw.wmnet,planet[1003-1004].eqiad.wmnet,pybal-test2003.codfw.wmnet,sretest[2001,2003-2004,2006,2009-2010].codfw.wmnet,sretest[1002-1003,1005-1006].eqiad.wmnet,stewards2001.codfw.wmnet,stewards1001.eqiad.wmnet,tcp-proxy[2001-2002].codfw.wmnet,tcp-proxy[6001-6002].drmrs.wmnet,tcp-proxy[1001-1002].eqiad.wmnet,tcp-proxy[5001-5002].eqsin.wmnet,tcp-proxy[3001-3002].esams.wmnet,tcp-proxy[7001-7002].magru.wmnet,tcp-proxy[4001-4002].ulsfo.wmnet,testreduce1002.eqiad.wmnet,testvm[2002,2004-2007].codfw.wmnet,testvm7001.magru.wmnet,testvm2008.wikimedia.org,vrts2002.codfw.wmnet,vrts[1003-1004].eqiad.wmnet,wikikube-ctrl[2001-2003].codfw.wmnet,wikikube-ctrl[1002-1004].eqiad.wmnet,wikikube-worker[2001-2002,2005-2006,2011-2018,2033-2039,2041-2042,2044,2046,2049-2051,2055-2062,2064-2065,2067-2078,2087-2095,2102-2115,2124-2179,2184-2215,2242-2243,2248-2330].codfw.wmnet,wikikube-worker[1002-1007,1011-1012,1015-1016,1019-1021,1029-1031,1034-1168,1240-1289,1291-1327].eqiad.wmnet,wikikube-worker-exp2001.codfw.wmnet,wikikube-worker-exp1001.eqiad.wmnet,zuul2002.codfw.wmnet
----- OUTPUT of 'journalctl -u co...nd no templates'' -----
Found no templates
================
[...snip...]

In an attempt to answer your question how many and which are still affected now:

sudo cumin 'C:profile::confd' 'grep -c "WARNING Found no templates" /var/log/syslog'

===== NODE GROUP =====
(1) kubestagemaster1005.eqiad.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27369
===== NODE GROUP =====
(1) wikikube-ctrl1003.eqiad.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27368
===== NODE GROUP =====
(2) wikikube-ctrl[1002,1004].eqiad.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27367
===== NODE GROUP =====
(3) kubestagemaster[2004-2005].codfw.wmnet,kubestagemaster1004.eqiad.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27370
===== NODE GROUP =====
(1) wikikube-ctrl2001.codfw.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27366
===== NODE GROUP =====
(2) kubestagemaster2003.codfw.wmnet,kubestagemaster1003.eqiad.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27371
===== NODE GROUP =====
(2) wikikube-ctrl[2002-2003].codfw.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
27365
===== NODE GROUP =====
(45) alert[1002,2002].wikimedia.org,aux-k8s-ctrl[2002-2003].codfw.wmnet,aux-k8s-ctrl[1002-1003].eqiad.wmnet,cloudlb[1001-1002].eqiad.wmnet,config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet,deploy2002.codfw.wmnet,deploy1003.eqiad.wmnet,dns[1004-1006,2004-2006,3003-3004,4003-4004,5003-5004,6001-6002,7001-7002].wikimedia.org,dse-k8s-ctrl[2001-2002].codfw.wmnet,dse-k8s-ctrl[1001-1002].eqiad.wmnet,ml-serve-ctrl[2001-2002].codfw.wmnet,ml-serve-ctrl[1001-1002].eqiad.wmnet,ml-staging-ctrl[2001-2002].codfw.wmnet,puppetmaster1001.eqiad.wmnet,puppetserver[2001-2002,2004].codfw.wmnet,puppetserver[1001-1003].eqiad.wmnet
----- OUTPUT of 'grep -c "WARNING... /var/log/syslog' -----
0

edit: submitted before seeing that Volans also submitted above. it's more than that because mine was limited to hosts actually using the confd profile.

Thank you both for beating me to assembling a similar command myself, heh.

Indeed, I tend to use C:confd to find all the hosts where the class is instantiated (as in profile::confd) or included (as in confd::file), which yields the same number of hosts as R:package = confd. In any case, that means 733 of 2369 relevant hosts (about 31%) are in this state, which is definitely informative in terms of the scope of the problem.
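
For reference, the share of affected hosts can be recomputed from the two counts quoted above. A minimal sketch; the helper name pct is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a / b as a percentage with one decimal place.
pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b * 100 }'
}

# 733 hosts logging "Found no templates" out of 2369 hosts with confd.
pct 733 2369   # prints 30.9
```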

On balance, I will note that carrying the additional "latent" confd hosts does not put any additional query load on etcd, since indeed there are no keys polled.

Does this have any relation to T417458, T116224 over in cloud VPSes?

Change #1296537 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] confd: Add condition to prevent starting without configs

https://gerrit.wikimedia.org/r/1296537
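
The contents of that patch are not shown in this thread. Purely as an assumption about what such a condition could look like, systemd's ConditionDirectoryNotEmpty= directive can make a unit skip starting when a directory has no entries. The sketch below writes a drop-in to a directory passed as an argument; the function name and the drop-in file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch, not the content of the Gerrit change above.
# Writes a systemd drop-in that makes the unit skip startup when
# /etc/confd/conf.d is empty or missing.
write_confd_condition() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/only-with-config.conf" <<'EOF'
[Unit]
ConditionDirectoryNotEmpty=/etc/confd/conf.d
EOF
}

# Example use, left commented out so the sketch has no side effects:
# write_confd_condition /etc/systemd/system/confd.service.d
# systemctl daemon-reload
```

A condition like this is only evaluated when the unit is started, so it would not stop a confd process that is already running; those hosts would still need a restart or an explicit stop.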