
rename cloudswift1002 as cloudlb1002
Closed, Resolved, Public

Description

Netbox device: https://netbox.wikimedia.org/dcim/devices/3525

Procedure: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

  • decommission
  • netbox: edit the device name, and set its status from DECOMMISSIONING to PLANNED.
  • re-add the DNS Name field for the management interface
  • run sre.dns.netbox cookbook
  • run sre.network.configure-switch-interfaces cookbook
  • reimage server with new name
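
The procedure above can be written down as an ordered checklist. The following is only a minimal sketch for illustration: `Step` and `rename_plan` are hypothetical names, not part of Spicerack or the cookbooks, and the code calls no real tooling. It simply encodes the order of the steps listed in this task.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One step of the rename procedure (hypothetical helper type)."""
    tool: str    # where the step is carried out
    action: str  # what is done


def rename_plan(old_name: str, new_name: str) -> List[Step]:
    """Return the rename-while-reimaging steps in the order listed in this task."""
    return [
        Step("cookbook sre.hosts.decommission", f"decommission {old_name}"),
        Step("netbox", f"rename {old_name} to {new_name}, status DECOMMISSIONING -> PLANNED"),
        Step("netbox", "re-add the DNS Name field for the management interface"),
        Step("cookbook sre.dns.netbox", "run the cookbook"),
        Step("cookbook sre.network.configure-switch-interfaces", "run the cookbook"),
        Step("cookbook sre.hosts.reimage", f"reimage server as {new_name}"),
    ]


if __name__ == "__main__":
    # Print the checklist for the rename tracked in this task.
    for i, step in enumerate(rename_plan("cloudswift1002", "cloudlb1002"), start=1):
        print(f"{i}. [{step.tool}] {step.action}")
```

The order matters: the device is decommissioned under its old name first, and the reimage only runs once Netbox, DNS and the switch interfaces already reflect the new name.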

Event Timeline

aborrero changed the task status from Open to In Progress. (Jul 6 2023, 11:09 AM)
aborrero triaged this task as Medium priority.
aborrero created this task.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudswift1002.eqiad.wmnet

  • cloudswift1002.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 936019 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb1001/1002: add role

https://gerrit.wikimedia.org/r/936019

Change 936019 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb1001/1002: add role

https://gerrit.wikimedia.org/r/936019

Change 936022 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: eqiad: bootstrap hiera data

https://gerrit.wikimedia.org/r/936022

aborrero changed the task status from Stalled to In Progress. (Jul 7 2023, 3:59 PM)

No longer blocked!

Change 936022 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: eqiad: bootstrap hiera data

https://gerrit.wikimedia.org/r/936022

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudlb1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudlb1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudlb1002.eqiad.wmnet with OS bullseye completed:

  • cloudlb1002 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307101028_aborrero_4018022_cloudlb1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
aborrero updated the task description.