cloudgw2004-dev service implementation
Closed, Resolved (Public)

Description

There is new hardware here to replace cloudgw2002-dev.

Related Objects

Status: Resolved
Assigned: Andrew

Event Timeline

Andrew triaged this task as Medium priority.

Change #1248004 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] site: Use nftables insetup role for cloudgw2004-dev

https://gerrit.wikimedia.org/r/1248004

Change #1248004 merged by Andrew Bogott:

[operations/puppet@production] site: Use nftables insetup role for cloudgw2004-dev

https://gerrit.wikimedia.org/r/1248004

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudgw2004-dev.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudgw2004-dev.codfw.wmnet with OS trixie executed with errors:

  • cloudgw2004-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603092108_andrew_964969_cloudgw2004-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudgw2004-dev.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudgw2004-dev.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudgw2004-dev.codfw.wmnet with OS trixie completed:

  • cloudgw2004-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603092144_andrew_975213_cloudgw2004-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1250638 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Replace cloudgw2002-dev with cloudgw2004-dev

https://gerrit.wikimedia.org/r/1250638

Change #1250638 merged by Andrew Bogott:

[operations/puppet@production] Replace cloudgw2002-dev with cloudgw2004-dev

https://gerrit.wikimedia.org/r/1250638

Andrew added a subscriber: taavi.

cloudgw2004-dev is up and working now, thanks to @taavi and a reboot.