A recent alert for expiring puppet certificates triggered for a host (dbstore2001.codfw.wmnet) which has been decommissioned for a long time. This is likely a host that was decommissioned before the current decommissioning scripts, which force a puppet clean. As such we should manually remove any certs for hosts that no longer exist.
Event Timeline
For this I used the following script, run from a cumin host. puppet_hosts was generated on the puppet masters with: ls -1 /var/lib/puppet/server/ssl/ca/signed | sed 's/\.pem$//' > puppet_hosts
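from pathlib import Path

from spicerack import Spicerack
from spicerack.netbox import NetboxHostNotFoundError

spicerack = Spicerack(verbose=False)

for host in Path('puppet_hosts').read_text().splitlines():
    # Skip discovery records and etcd SSL certs, which are not hosts.
    if host.endswith('discovery.wmnet') or host.startswith('_etcd-server-ssl'):
        continue
    host = host.split('.')
    # Skip service names (second label is 'svc').
    if len(host) > 1 and host[1] == 'svc':
        continue
    try:
        # Look the short hostname up in Netbox; only the failure case matters here.
        nb_host = spicerack.netbox_server(host[0])
    except NetboxHostNotFoundError:
        # Signed cert exists but the host is not in Netbox: print the full name.
        print('.'.join(host))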
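The name filtering in that script can be separated from the Netbox lookup so it can be checked offline. What follows is only a minimal sketch, assuming the same skip rules as the script. short_names_to_check is a hypothetical helper name (not part of spicerack), and the sample entries marked as hypothetical are made up for illustration.

```python
def short_names_to_check(cert_names):
    """Yield (short_name, cert_name) for every name that would be looked up in Netbox."""
    for name in cert_names:
        # Same skip rules as the script: discovery records and etcd SSL certs.
        if name.endswith('discovery.wmnet') or name.startswith('_etcd-server-ssl'):
            continue
        parts = name.split('.')
        # Skip service names, i.e. names whose second label is 'svc'.
        if len(parts) > 1 and parts[1] == 'svc':
            continue
        yield parts[0], name


sample = [
    'dbstore2001.codfw.wmnet',
    'example.discovery.wmnet',        # hypothetical name
    '_etcd-server-ssl._tcp.example',  # hypothetical name
    'example.svc.eqiad.wmnet',        # hypothetical name
    'swift_codfw',
]
result = list(short_names_to_check(sample))
```

Names without a dot, such as swift_codfw, pass the filter unchanged and are looked up as-is, which is one way non-host certificates can end up in the output.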
The script above produced the following list:
db2051.codfw.wmnet - T230778
db2057.codfw.wmnet - T230394
db2063.codfw.wmnet - T230704
kafka1001.eqiad.wmnet - T121553
kafka1002.eqiad.wmnet - T121553
kafka1003.eqiad.wmnet
kafka2001.codfw.wmnet
kafka2002.codfw.wmnet
kafka2003.codfw.wmnet
mw1259.eqiad.wmnet - T187466
mw1260.eqiad.wmnet - T187466
orespoolcounter1002.eqiad.wmnet
We also had the following certs which look strange
contint.wikimedia.org
default-staging-certificate.wmnet
istio-egressgateway.istio-system.svc.cluster.local
kafka_broker_kafka-jumbo1001
kafka_broker_kafka-jumbo1002
kafka_broker_kafka-jumbo1003
kafka_broker_kafka-jumbo1004
kafka_broker_kafka-jumbo1005
kafka_broker_kafka-jumbo1006
kafka_client_test1
kafka_fundraising_client
kafka_jumbo_broker
kafka_jumbo-eqiad_broker
kafka_logging-codfw_broker
kafka_logging-eqiad_broker
kafka_main-codfw_broker
kafka_main-eqiad_broker
kafka_mirror_maker
kafka_test-eqiad_broker
kserve-webhook-server-service.kserve.svc.cluster.local
labtest-puppetmaster.wikimedia.org
ldap.wikimedia.org
puppet.pem.expired
purged
query-preview.wikidata.org
restbase-test2003.codfw.wmnet.pem.expired
swift_codfw
swift_eqiad
tegola-vector-tiles
varnishkafka
yarn.wikimedia.org
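A rough way to separate entries like these from ordinary host certificates is sketched below. It is a heuristic only. HOST_DOMAINS is an assumption based on the two host domains that appear in this task, and strip_pem and looks_like_host are hypothetical helper names. The sketch also assumes the files in the signed directory carry a .pem suffix, as the sed expression used to build puppet_hosts implies.

```python
import re

# Assumption: host certificates end in one of these domains (the only two seen in this task).
HOST_DOMAINS = ('.eqiad.wmnet', '.codfw.wmnet')


def strip_pem(filename):
    """Mirror sed 's/\\.pem$//': remove only a trailing '.pem'."""
    return re.sub(r'\.pem$', '', filename)


def looks_like_host(name):
    """Heuristic: True if the name ends in a known host domain."""
    return name.endswith(HOST_DOMAINS)


files = [
    'db2051.codfw.wmnet.pem',
    'puppet.pem.expired',
    'restbase-test2003.codfw.wmnet.pem.expired',
    'swift_codfw.pem',
    'contint.wikimedia.org.pem',
]
names = [strip_pem(f) for f in files]
odd = [n for n in names if not looks_like_host(n)]
```

Because only a trailing .pem is stripped, names such as puppet.pem.expired survive unchanged, which would explain why they show up in the list with their suffix intact.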
This is likely a host that was decommissioned before the current decommissioning scripts, which force a puppet clean.
Based on logs at T220002#5574262 that didn't seem to be the case. I can see some options of what happened:
- the decom script "lied" to us and had a bug and didn't delete the certs, even if it told us that it did
- the host was entered again into puppet after wipe
- Some kind of synchronization issue between servers/data store corruption leading to reappearing after deletion
- The cert was duplicated - e.g. after an aborted install
We also had the following certs which look strange
Maybe attempts for a puppet-signed cert for generic services before you created certmanager/other signing methods? I don't know which of these are valid/in use though, or if services were migrated to other signing methods already.
the decom script "lied" to us and had a bug and didn't delete the certs, even if it told us that it did
This, I think, is the most likely. Either way, it seems newer decommissions have worked correctly. I have cleaned out the old certs and will resolve this for now.