
reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet
Closed, Resolved (Public)

Description

The machine WMF11584 (netbox 5353) that was procured in T368918 and has been racked as gerrit1004.wikimedia.org in T369671 should be renamed.

Please:

  • move from external IP to internal IP (wikimedia.org -> eqiad.wmnet)
  • reimage it as phab1005.eqiad.wmnet, OS still bookworm
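The requested change touches both the hostname and the domain. As a rough sketch only, the snippet below spells out what differs between the old and new FQDN. The helper names (`split_fqdn`, `describe_rename`) are hypothetical, and treating `wikimedia.org` as "external" and `*.wmnet` as "internal" is an assumption based on the wording of this task, not on any tooling.

```python
# Hypothetical helper: summarise what a rename changes, based only on the FQDNs.
PUBLIC_DOMAIN = "wikimedia.org"   # assumption: hosts here use external IPs
PRIVATE_SUFFIX = ".wmnet"         # assumption: hosts here use internal IPs


def split_fqdn(fqdn):
    """Split 'host.domain.tld' into ('host', 'domain.tld')."""
    host, _, domain = fqdn.partition(".")
    if not host or not domain:
        raise ValueError(f"not a fully qualified name: {fqdn!r}")
    return host, domain


def describe_rename(old_fqdn, new_fqdn):
    """Return a dict of booleans describing what the rename changes."""
    old_host, old_domain = split_fqdn(old_fqdn)
    new_host, new_domain = split_fqdn(new_fqdn)
    return {
        "hostname_changes": old_host != new_host,
        "domain_changes": old_domain != new_domain,
        "public_to_private": (
            old_domain == PUBLIC_DOMAIN and new_domain.endswith(PRIVATE_SUFFIX)
        ),
    }


plan = describe_rename("gerrit1004.wikimedia.org", "phab1005.eqiad.wmnet")
print(plan)
```

For this task all three flags come out true, which is why the request involves both a new name and a VLAN/IP change rather than a plain reimage.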

Budget-wise there should be no effects because this has been budgeted as "Phabricator / Gerrit / Contint spare host" and it is still one of those.

We are just changing which we use for Gerrit and which for Phabricator.

Thanks, and sorry for the change after racking.

Rename & VLAN change checklist

  • make note of server's network port and cable ID
  • run the decom script to remove all instances of the host (check whether this removes the mgmt IP, because we don't want that)
  • update netbox with new hostname
  • add the network data back for the host, check that mgmt updated to the new hostname
  • run dns update script
  • reimage host with new hostname
  • create sub-task to this task with ops-eqiad tag for onsites to change the hostname tags on the host.


Event Timeline

Dzahn updated the task description.
Dzahn updated the task description.
Dzahn renamed this task from reimage gerrit1004 as phab1003 to reimage gerrit1004 as phab1005. (Aug 19 2024, 6:55 PM)
Dzahn renamed this task from reimage gerrit1004 as phab1005 to reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet.
Dzahn updated the task description.

Change #1063870 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: rename gerrit1004 to phab1005

https://gerrit.wikimedia.org/r/1063870

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm

@Dzahn please update the preseed.yaml file for software RAID on this server. The reimage fails without it.

Change #1063870 merged by Dzahn:

[operations/puppet@production] site: rename gerrit1004 to phab1005

https://gerrit.wikimedia.org/r/1063870

@Jclark-ctr I don't think it's preseed.yaml. Both existing gerrit1004 and phab* are set to standard/raid1-2dev.cfg.

I think it was that the host wasn't in site.pp with a role. But now it is.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm executed with errors:

  • phab1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console phab1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm

@Dzahn phab1005 is still failing to image: it is not picking up an IP address for PXE booting. Would you be able to assist?

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm executed with errors:

  • phab1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console phab1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm

Was this a bug in the cookbook?

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm executed with errors:

  • phab1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console phab1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet with OS bookworm completed:

  • phab1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409041937_jclark_2394602_phab1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Jclark-ctr updated the task description.
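The failed and passing cookbook summaries above share the same step list, so one quick way to see where a failed run stopped is to look for the first step of the passing run that the failed run never logged. The following is only a sketch: `parse_run` and `first_missing_step` are hypothetical helpers written for this comment, not part of the reimage cookbook, and the two sample summaries are abridged from the output posted in this task.

```python
import re

# Matches the per-host status line, e.g. "phab1005 (FAIL)".
STATUS_RE = re.compile(r"^(?P<host>\S+) \((?P<status>PASS|FAIL)\)$")


def parse_run(text):
    """Parse one bulleted cookbook summary into (host, status, steps)."""
    host = status = None
    steps = []
    for raw in text.splitlines():
        line = raw.strip().lstrip("•").strip()  # drop indentation and bullet
        if not line:
            continue
        match = STATUS_RE.match(line)
        if match:
            host, status = match["host"], match["status"]
        elif host is not None:
            steps.append(line)
    return host, status, steps


def first_missing_step(failed_steps, passed_steps):
    """Return the first step of the passing run absent from the failed run."""
    done = set(failed_steps)
    for step in passed_steps:
        if step not in done:
            return step
    return None


# Abridged copies of the summaries posted above.
FAILED_RUN = """\
  • phab1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
"""

PASSED_RUN = """\
  • phab1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
"""

host, status, failed_steps = parse_run(FAILED_RUN)
_, _, passed_steps = parse_run(PASSED_RUN)
print(host, status, "stopped before:", first_missing_step(failed_steps, passed_steps))
```

Applied to the first failed run in this task (which ends at "Host rebooted via IPMI"), the same comparison would point at "Host up (Debian installer)", consistent with the report that the host was not getting an IP address during PXE boot.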

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: gerrit1004.wikimedia.org

I ran the decom cookbook (without and with --force) but it errors out with

spicerack.netbox.NetboxHostNotFoundError: gerrit1004

I guess it's just the cert. In the past this would have been fixable with a manual "puppet cert clean" on the puppetmaster.

Looking...

[puppetserver1001:~] $ sudo puppet node clean gerrit1004.wikimedia.org
Notice: Certificate for gerrit1004.wikimedia.org has been revoked
Notice: Cleaned files related to gerrit1004.wikimedia.org

This is now fixed for real after also running a "puppet node deactivate" on puppetmaster and puppetserver. The next Puppet run on Icinga removed it from monitoring.
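For reference, the manual cleanup described above boils down to two `puppet node` subcommands for the stale name. The sketch below only builds and prints the command lines and does not execute anything. `stale_puppet_cleanup` is a hypothetical helper, the use of `sudo` for the deactivate step is an assumption carried over from the clean step shown earlier, and which server each command has to run on is left to the operator.

```python
import shlex


def stale_puppet_cleanup(fqdn, use_sudo=True):
    """Return the argv lists for cleaning up a stale Puppet node name.

    Nothing is executed here; the caller decides where and whether to run them.
    """
    prefix = ["sudo"] if use_sudo else []
    return [
        prefix + ["puppet", "node", "clean", fqdn],       # revoke and remove the cert
        prefix + ["puppet", "node", "deactivate", fqdn],  # mark the node inactive in PuppetDB
    ]


commands = stale_puppet_cleanup("gerrit1004.wikimedia.org")
for argv in commands:
    print(shlex.join(argv))
```

Keeping this as a dry run is deliberate: the decom cookbook is the normal path, and per the last comment here it has since been fixed to handle this case.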

Wasn't aware this was also fixed in the cookbook a few days after it was used here (https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1071588).