
remove old puppet certificates from puppet master
Closed, Resolved, Public

Description

A recent alert for expiring puppet certificates triggered for a host (dbstore2001.codfw.wmnet) which has been decommissioned for a long time. This is likely a host that was decommissioned before the current decommissioning scripts, which force a puppet clean. As such we should manually remove any certs for hosts that no longer exist.

Event Timeline

jbond triaged this task as Medium priority. Aug 4 2022, 10:29 AM
jbond created this task.

For this I used the following script, run from a cumin host. The puppet_hosts file was generated on the puppet masters with: ls -1 /var/lib/puppet/server/ssl/ca/signed | sed 's/\.pem$//' > puppet_hosts

from pathlib import Path

from spicerack import Spicerack
from spicerack.netbox import NetboxHostNotFoundError

spicerack = Spicerack(verbose=False)

# Check every signed cert name against Netbox and print the ones with no
# matching host, i.e. certs left behind by old decommissions.
for host in Path('puppet_hosts').read_text().splitlines():
    # Skip discovery records and etcd SRV-style cert names.
    if host.endswith('discovery.wmnet') or host.startswith('_etcd-server-ssl'):
        continue
    host = host.split('.')
    # Skip service (svc) addresses, which have no Netbox host entry.
    if len(host) > 1 and host[1] == 'svc':
        continue
    try:
        # Netbox is queried by short hostname, not FQDN.
        nb_host = spicerack.netbox_server(host[0])
    except NetboxHostNotFoundError:
        print('.'.join(host))

From the script above we get the following list:

db2051.codfw.wmnet - T230778
db2057.codfw.wmnet - T230394
db2063.codfw.wmnet - T230704
kafka1001.eqiad.wmnet - T121553
kafka1002.eqiad.wmnet - T121553
kafka1003.eqiad.wmnet
kafka2001.codfw.wmnet
kafka2002.codfw.wmnet
kafka2003.codfw.wmnet
mw1259.eqiad.wmnet - T187466
mw1260.eqiad.wmnet - T187466
orespoolcounter1002.eqiad.wmnet

We also had the following certs, which look strange:
contint.wikimedia.org
default-staging-certificate.wmnet
istio-egressgateway.istio-system.svc.cluster.local
kafka_broker_kafka-jumbo1001
kafka_broker_kafka-jumbo1002
kafka_broker_kafka-jumbo1003
kafka_broker_kafka-jumbo1004
kafka_broker_kafka-jumbo1005
kafka_broker_kafka-jumbo1006
kafka_client_test1
kafka_fundraising_client
kafka_jumbo_broker
kafka_jumbo-eqiad_broker
kafka_logging-codfw_broker
kafka_logging-eqiad_broker
kafka_main-codfw_broker
kafka_main-eqiad_broker
kafka_mirror_maker
kafka_test-eqiad_broker
kserve-webhook-server-service.kserve.svc.cluster.local
labtest-puppetmaster.wikimedia.org
ldap.wikimedia.org
puppet.pem.expired
purged
query-preview.wikidata.org
restbase-test2003.codfw.wmnet.pem.expired
swift_codfw
swift_eqiad
tegola-vector-tiles
varnishkafka
yarn.wikimedia.org
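
To see which of these stray certs are actually expired (or close to it), something like the sketch below could be run on a puppet master. This is an illustrative sketch, not the check used here; it assumes the python3 cryptography library is available and reads the same signed-cert directory that puppet_hosts was generated from above.

from datetime import datetime
from pathlib import Path

from cryptography import x509

# Directory of signed certs on the puppet master (the one listed above).
SIGNED_DIR = Path('/var/lib/puppet/server/ssl/ca/signed')

for pem in sorted(SIGNED_DIR.glob('*.pem')):
    cert = x509.load_pem_x509_certificate(pem.read_bytes())
    expiry = cert.not_valid_after
    status = 'EXPIRED' if expiry < datetime.utcnow() else 'ok'
    print(f'{pem.stem}: expires {expiry:%Y-%m-%d} ({status})')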

This is likely a host that was decommissioned before the current decommissioning scripts, which force a puppet clean.

Based on the logs at T220002#5574262, that doesn't seem to be the case. I can see some options for what happened:

  • the decom script "lied" to us: it had a bug and didn't delete the certs, even though it told us that it did
  • the host was entered into puppet again after the wipe
  • some kind of synchronization issue between servers, or data store corruption, leading to certs reappearing after deletion
  • the cert was duplicated - e.g. after an aborted install

We also had the following certs, which look strange

Maybe these were attempts at puppet-signed certs for generic services before you created certmanager/other signing methods? I don't know which of these are valid/in use though, or whether services were already migrated to other signing methods.

jbond claimed this task.

the decom script "lied" to us: it had a bug and didn't delete the certs, even though it told us that it did

This, I think, is the most likely. Either way, it seems newer decommissions have worked correctly. I have cleaned out the old certs and will resolve this for now.
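
For reference, a minimal sketch of what the manual cleanup can look like on the puppet CA host (an assumption about the mechanism, not the exact commands run here). It presumes the Puppet 5 puppet cert interface; on Puppet 6+ puppetservers the equivalent is puppetserver ca clean --certname <fqdn>.

import subprocess

# Certs to remove, taken from the lists above (truncated here).
STALE_CERTS = [
    'db2051.codfw.wmnet',
    'db2057.codfw.wmnet',
    'kafka1001.eqiad.wmnet',
    # ...
]

for name in STALE_CERTS:
    # `puppet cert clean` revokes and deletes the signed cert for `name`.
    subprocess.run(['puppet', 'cert', 'clean', name], check=True)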

Aklapper renamed this task from remove old puppet certificates fom puppet master to remove old puppet certificates from puppet master. Aug 4 2022, 11:40 AM