Decommission wdqs100[3-5]
Closed, Resolved (Public)

Description

Per T303459, these hosts are slated to be refreshed.

Creating this ticket to:

  • run decom cookbook
  • create DC Ops decom tickets as needed.
  • remove references to wdqs1005 as the bigdata ldf endpoint host and replace with wdqs1016 (arbitrarily chosen)
    • TLS Cert updated accordingly in private puppet repo
  • wdqs1003 is no longer listed as canary in deploy repo's scap config (scap/wdqs-canary); instead wdqs1016 should be.
  • dc-ops tickets created (full decom)

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-08-25T21:03:47Z] <inflatador> bking@cumin1001 shutting off wdqs1005 in preparation for decommission T344198

wdqs1005 is alerting for ipmi, and based on this search result it seems like it might have hardware issues.

Since it's already scheduled for refresh, I've shut it off and we'll continue the decom process on Monday.

Change 952864 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs1005: Disable notifications

https://gerrit.wikimedia.org/r/952864

Mentioned in SAL (#wikimedia-operations) [2023-08-28T17:00:35Z] <inflatador> bking@cumin1001 depool wdqs1005 for decom T344198

RKemper renamed this task from Decommission wdqs10[03-05] to Decommission wdqs100[3-5]. Aug 28 2023, 9:30 PM

Change 952864 merged by Ryan Kemper:

[operations/puppet@production] wdqs1005: Disable notifications and remove from lvs

https://gerrit.wikimedia.org/r/952864

Mentioned in SAL (#wikimedia-operations) [2023-08-31T19:03:21Z] <ryankemper> T344198 Temporarily disabling puppet on all wdqs* hosts in preparation for wdqs.discovery.wmnet certificate revocation

Mentioned in SAL (#wikimedia-operations) [2023-08-31T19:03:41Z] <ryankemper> T344198 on ryankemper@cumin1001: sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T344198"'

Change 954123 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: new wqds.discovery cert

https://gerrit.wikimedia.org/r/954123

Change 954123 merged by Ryan Kemper:

[operations/puppet@production] wdqs: new wqds.discovery cert

https://gerrit.wikimedia.org/r/954123

Generated new cergen certs for wdqs.discovery.wmnet that include wdqs1016 in the alt_names instead of wdqs1005. Followed the steps below:

INSTRUCTIONS
(1) Edit /srv/private/modules/secret/secrets/certificates/certificate.manifests.d/wdqs.certs.yaml
# Stop and disable puppet on all wdqs* hosts here
# !log T344198 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation
# !log T344198 on `ryankemper@cumin1001`: `sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T344198"'`
(2) on `ryankemper@cumin1001`: `sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T344198"'`
(3) sudo puppet cert clean wdqs.discovery.wmnet
(4) sudo rm -fv /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/truststore.jks
(5) sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
(6) The new cert should be under `/srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem`; add it to puppet under `files/ssl/wdqs.discovery.wmnet.crt`
(7) <commit+merge the public and private updates>
sudo chown -R gitpuppet:gitpuppet modules/secret/secrets/certificates/wdqs.discovery.wmnet/
sudo chown gitpuppet:gitpuppet /srv/private/modules/secret/secrets/ssl/wdqs.discovery.wmnet.key
(8) <run puppet on one public wdqs server and verify queries work by tunneling in>
# <re-enable puppet on all wdqs* and `sudo run-puppet-agent`>
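
Before re-enabling puppet everywhere, it may be worth confirming that the regenerated cert really lists the new host and no longer lists the old one. The following is a minimal sketch, not part of the procedure that was run: the helper names `list_dns_sans` and `check_sans` are hypothetical, and the fully qualified SAN names in the usage note are assumptions about how the alt_names are written in `wdqs.certs.yaml`. Both helpers read the text output of `openssl x509 -noout -text` on stdin.

```shell
# Print the DNS entries of the subjectAltName extension, one per line.
# Input: the output of `openssl x509 -noout -text` on stdin.
list_dns_sans() {
    awk '/Subject Alternative Name/ { getline; print }' \
        | tr ',' '\n' \
        | sed -n 's/^[[:space:]]*DNS:\([^[:space:]]*\).*$/\1/p'
}

# Succeed only if the wanted host ($1) is listed and the retired host ($2) is not.
check_sans() {
    cs_sans=$(list_dns_sans)
    if ! printf '%s\n' "$cs_sans" | grep -qxF "$1"; then
        echo "missing SAN: $1" >&2
        return 1
    fi
    if printf '%s\n' "$cs_sans" | grep -qxF "$2"; then
        echo "stale SAN still present: $2" >&2
        return 1
    fi
    return 0
}
```

Assuming the cert path from step (6) and FQDN alt_names, a check could look like: `openssl x509 -in /srv/private/modules/secret/secrets/certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem -noout -text | check_sans wdqs1016.eqiad.wmnet wdqs1005.eqiad.wmnet`. A non-zero exit status would suggest the manifest edit in step (1) did not take effect.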

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs1005.eqiad.wmnet

  • wdqs1005.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

RKemper updated the task description.
RKemper moved this task from In Progress to Blocked / Waiting on the Data-Platform-SRE board.

Made the decom ticket for wdqs1005 and ran the cookbook. We will decom 1003/1004 later. Moving this to blocked/waiting for now.

Change 955858 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[wikidata/query/deploy@master] wdqs1016 is taking over as canary

https://gerrit.wikimedia.org/r/955858

Change 955858 merged by Bking:

[wikidata/query/deploy@master] wdqs1016 is taking over as canary

https://gerrit.wikimedia.org/r/955858

Change 958572 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: decom old canary wdqs1003

https://gerrit.wikimedia.org/r/958572

Running decom cookbook for wdqs100[3,4]. DC Ops ticket is here: https://phabricator.wikimedia.org/T346699

Change 958572 merged by Ryan Kemper:

[operations/puppet@production] wdqs: decom wdqs100[3,4]

https://gerrit.wikimedia.org/r/958572

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs1003.eqiad.wmnet

  • wdqs1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs1004.eqiad.wmnet

  • wdqs1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB