
decommission/replace bast4001.wikimedia.org
Closed, ResolvedPublic

Description

Please note this decom CANNOT take place until the new bastion for ulsfo is online.

bast4001.wikimedia.org (WMF5799) is well out of warranty. New systems were purchased for misc use on T160936; one of those misc systems should take over the bastion role for ulsfo.

Once the new misc system is online in that role, bast4001 should be decommissioned. It and a few cp systems (already accounted for) are the last remaining old ulsfo systems. Once they are all offline, we can get rid of the whole lot of old systems.

bast4001:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - update status in netbox (inventory for decom, planned for spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - unwire and move out of the way in the rack to make space for new hosts to come online.
  • - mgmt dns entries removed. (Systems remain in the rack with no power/network/mgmt connections, because there is no storage in ulsfo and the office has no storage for us during the relocation.)
  • - switch port config removed

The remainder cannot happen until we are done with ALL the old CP/bastion/lvs systems, so that they can be unracked in a batch.
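The DebMonitor cleanup step in the checklist above is normally handled by wmf-decommission-host. As a rough sketch only, the manual curl invocation quoted in that step could be assembled for a given host as shown here. The URL, certificate path, and key path are copied from the checklist; the function name `debmonitor_delete_cmd` and the basic FQDN sanity check are hypothetical additions for illustration, not part of any WMF tooling.

```python
import shlex

# Values copied from the checklist step; not verified against a live setup.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/{fqdn}"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_cmd(fqdn):
    """Return the argv list for the manual DebMonitor removal (hypothetical helper)."""
    # Reject obviously malformed host names before building the URL.
    if not fqdn or "/" in fqdn or any(ch.isspace() for ch in fqdn):
        raise ValueError("invalid host FQDN: %r" % (fqdn,))
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL.format(fqdn=fqdn),
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(debmonitor_delete_cmd("bast4001.wikimedia.org")))
```

The sketch only prints the command so it can be reviewed before anyone runs it on neodymium/sarin.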

Event Timeline

Note that this host has also been emitting SMART errors for the past two days. This is not worth investigating further, as the host is going to be decommissioned.

Change 478785 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decommission bast4001

https://gerrit.wikimedia.org/r/478785

Change 478785 merged by RobH:
[operations/puppet@production] decommission bast4001

https://gerrit.wikimedia.org/r/478785

wmf-decommission-host was executed by robh for bast4001.wikimedia.org and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor
RobH updated the task description.

The wipe is in progress via USB live image boot. It will take 24-48 hours to complete, so I'll check it when I'm next onsite.

So, this is on asw2-ulsfo:ge-2/0/12

robh@asw2-ulsfo# show | compare 
[edit interfaces interface-range vlan-public1-ulsfo]
-    member ge-2/0/12;

There wasn't a disabled range to add it to, so this switch stack doesn't seem to be set up like the others. @ayounsi, please advise?
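For noting the switch port assignment on the task, the `show | compare` output above can be checked mechanically. The following is a minimal sketch, assuming the diff has the same shape as the one pasted in this comment (an `[edit interfaces interface-range NAME]` header followed by `-    member PORT;` lines). The function name `removed_members` is hypothetical.

```python
import re

HEADER_RE = re.compile(r"\[edit interfaces interface-range (\S+)\]")
REMOVED_RE = re.compile(r"-\s+member (\S+);")


def removed_members(diff_text):
    """Map each interface-range name to the member ports removed from it."""
    removed = {}
    current_range = None
    for raw_line in diff_text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.match(line)
        if header:
            # Remember which interface-range the following lines belong to.
            current_range = header.group(1)
            continue
        member = REMOVED_RE.match(line)
        if member and current_range is not None:
            removed.setdefault(current_range, []).append(member.group(1))
    return removed


if __name__ == "__main__":
    diff = (
        "[edit interfaces interface-range vlan-public1-ulsfo]\n"
        "-    member ge-2/0/12;\n"
    )
    print(removed_members(diff))
```

It only reads pasted diff text and does not talk to the switch.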

Note that bast4001 no longer works for login?

RobH mentioned this in Unknown Object (Task). Jun 25 2019, 4:20 PM

Change 519061 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom bast4001

https://gerrit.wikimedia.org/r/519061

Change 519061 merged by RobH:
[operations/puppet@production] decom bast4001

https://gerrit.wikimedia.org/r/519061

Change 519062 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] removing bast4001 dns

https://gerrit.wikimedia.org/r/519062

Change 519062 merged by RobH:
[operations/dns@master] removing bast4001 dns

https://gerrit.wikimedia.org/r/519062

RobH removed a project: Patch-For-Review.