Q4:rack/setup/install kafka-main200[6789] & kafka-main2010
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of kafka-main200[6789] and kafka-main2010.

Hostname / Racking / Installation Details

Hostnames: kafka-main200[6789] and kafka-main2010
Racking Proposal: Row A (A4 if possible), Row B (B4 if possible), Row C (C7 if possible), Row D (D7 if possible), Row D (D4 if possible)
Networking Setup: 1 connection, 1G
VLAN: Private, AAAA records: Y, Additional IP records (Cassandra)? No
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level: raid10-4dev.cfg
OS Distro: Bullseye
Sub-team Technical Contact: @akosiaris

Per host setup checklist

kafka-main2006:
  • Receive in system on procurement task T361362 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
kafka-main2007:
  • Receive in system on procurement task T361362 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
kafka-main2008:
  • Receive in system on procurement task T361362 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
kafka-main2009:
  • Receive in system on procurement task T361362 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
kafka-main2010:
  • Receive in system on procurement task T361362 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
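
The checklist repeats the same ordered steps for each of the five hosts. As a rough sketch only (the helper names `expand_hosts` and `all_hosts` are hypothetical and not part of any Wikimedia tooling), the bracket shorthand from the task title can be expanded and paired with the cookbook order listed above:

```python
import re

# Cookbook names copied from the checklist above, in the order they are listed.
# The Netbox script and the operations/puppet update are manual steps and are
# intentionally not represented here.
CHECKLIST_COOKBOOKS = [
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hardware.upgrade-firmware",
    "sre.hosts.reimage",
]


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket character class, e.g. 'kafka-main200[6789]'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: already a plain hostname
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{char}{suffix}" for char in chars]


def all_hosts() -> list[str]:
    """Hostnames covered by this task, as written in the task title."""
    return expand_hosts("kafka-main200[6789]") + expand_hosts("kafka-main2010")


if __name__ == "__main__":
    for host in all_hosts():
        for cookbook in CHECKLIST_COOKBOOKS:
            print(f"{host}: {cookbook}")
```

This only prints the plan; it does not run any cookbook.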

Event Timeline

RobH renamed this task from Q#:rack/setup/install X to Q#:rack/setup/install kafka-main200[6789] & kafka-main2010. Apr 23 2024, 7:21 PM
RobH renamed this task from Q#:rack/setup/install kafka-main200[6789] & kafka-main2010 to Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.
RobH added a project: ops-codfw.
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2008.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2010.codfw.wmnet to get a root shell, but depending on the failure this may not work.

@Papaul, this was the last screen I got. The servers all have the OS installed, but the reimage failed at the certificate stage. I think it's because I used Puppet 7 instead of 5. When I attempt to retry with 5, it fails:

@cumin2002:~$ sudo secure-cookbook sre.hosts.reimage  -t T363209 --os bullseye kafka-main2006 -p 5 --new. 
==> ATTENTION: Destructive action for kafka-main2006. Proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Starting reimage on kafka-main2006. You can check progress via serial console or by running `install-console kafka-main2006.codfw.wmnet` on any cumin host
==> Host kafka-main2006.codfw.wmnet was found in PuppetDB but --new was set. Are you sure you want to proceed? The --new option will be unset
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
The option --new has been unset
Exception raised while initializing the Cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 199, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 101, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 181, in __init__
    self.new_puppet_server = self._get_puppet_server()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 227, in _get_puppet_server
    has_puppet7 = self.puppet_server.hiera_lookup(self.fqdn, "profile::puppet::agent::force_puppet7")
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 599, in hiera_lookup
    result = self.server_host.run_sync(command, is_safe=True, print_output=False, print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 514, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

All insetup roles default to Puppet 7 these days (as does the kafka-main role itself), so these should be installed with Puppet 7.

I think the installation failed because there is no entry in site.pp yet.
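
For reference, a small sketch of what the per-host invocations would look like with Puppet 7. It only assembles command strings from the flags visible in the transcript above (`-t`, `--os`, `-p`, `--new`) and assumes that `-p 7` is accepted in the same position as the `-p 5` shown there. The function name `reimage_command` is hypothetical, and nothing is executed.

```python
import shlex

TASK_ID = "T363209"  # task ID exactly as it appears in the transcript above
HOSTS = [
    "kafka-main2006",
    "kafka-main2007",
    "kafka-main2008",
    "kafka-main2009",
    "kafka-main2010",
]


def reimage_command(host: str, puppet_version: int = 7, new: bool = True) -> str:
    """Build (but do not run) a reimage command line for one host.

    Argument order mirrors the transcript. Per that transcript, the cookbook
    asks for confirmation and unsets --new if the host is already in PuppetDB,
    so new=False may be the better choice on a retry (assumption).
    """
    argv = [
        "sudo", "secure-cookbook", "sre.hosts.reimage",
        "-t", TASK_ID,
        "--os", "bullseye",
        host,
        "-p", str(puppet_version),
    ]
    if new:
        argv.append("--new")
    return shlex.join(argv)


if __name__ == "__main__":
    for host in HOSTS:
        print(reimage_command(host))
```

Each printed line would still have to be run by hand on a cumin host, and the cookbook itself asks for an interactive "go" before doing anything destructive.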

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2006.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye completed:

  • kafka-main2006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151325_jhancock_2409379_kafka-main2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye completed:

  • kafka-main2007 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151422_jhancock_2467812_kafka-main2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye completed:

  • kafka-main2008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151424_jhancock_2468073_kafka-main2008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye completed:

  • kafka-main2010 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151426_jhancock_2468124_kafka-main2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

kafka-main2009 is on hold until I can take another crack at it. It is stuck on this even though the rest passed:

[6/10, retrying in 160.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for kafka-main2009.codfw.wmnet
[7/10, retrying in 320.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for kafka-main2009.codfw.wmnet
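
The two log lines above show the wait doubling between attempts (160 s after attempt 6, 320 s after attempt 7), which is consistent with exponential backoff. The sketch below illustrates that pattern; the 5-second base delay and the factor of 2 are inferred from those two lines only, and this is not spicerack's actual retry code. `backoff_delays` and `retry` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(tries: int = 10, base: float = 5.0, factor: float = 2.0) -> list[float]:
    """Delay slept after each failed attempt 1..tries-1 (no sleep after the last)."""
    return [base * factor ** (attempt - 1) for attempt in range(1, tries)]


def retry(
    func: Callable[[], T],
    tries: int = 10,
    base: float = 5.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, sleeping exponentially longer between tries."""
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            delay = base * factor ** (attempt - 1)
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] {exc}")
            sleep(delay)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    delays = backoff_delays()
    print(delays[5], delays[6])  # waits after attempts 6 and 7
    print(sum(delays))           # total wait if every attempt fails
```

Under these assumed parameters a run that never finds the CSR spends over 40 minutes waiting before it gives up, so checking where the CSR actually went is quicker than waiting out the retries.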

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Jhancock.wm subscribed.

@Papaul I'm still having trouble with the same spot as noted before. Can you take a look at it?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Change #1032857 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] The resue option is making re-image to fail so testing without it.

https://gerrit.wikimedia.org/r/1032857

Change #1032857 merged by Papaul:

[operations/puppet@production] The resue option is making re-image to fail so testing without it.

https://gerrit.wikimedia.org/r/1032857

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Change #1032871 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add back reuse option after testing

https://gerrit.wikimedia.org/r/1032871

Change #1032871 abandoned by Papaul:

[operations/puppet@production] Add back reuse option after testing

Reason:

https://gerrit.wikimedia.org/r/1032871

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye completed:

  • kafka-main2009 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405172346_pt1979_1649208_kafka-main2009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

@Jhancock.wm thank you for working on this. As I mentioned to you this morning, kafka-main2009 was failing because it was contacting the wrong Puppet server for its certificate request (see below). I deleted the certificate request on puppetmaster1001 and restarted the reimage.

@akosiaris please see my comment above. We have been having this issue for some time now: the node sends its certificate request to puppetmaster1001 even though the cookbook was launched with Puppet 7. Thanks. This is all yours.

pt1979@puppetmaster1001:~$ sudo puppet cert --list
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
  "kafka-main2009.codfw.wmnet" (SHA256) B5:5F:81:9E:67:18:D4:E1:CF:AA:C7:A7:8B:DE:B1:8B:D3:04:C3:E7:04:64:D1:6A:27:04:23:EF:24:E3:37:46
pt1979@puppetmaster1001:~$ sudo puppet cert clean kafka-main2009.codfw.wmnet
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Removing file Puppet::SSL::CertificateRequest kafka-main2009.codfw.wmnet at '/var/lib/puppet/server/ssl/ca/requests/kafka-main2009.codfw.wmnet.pem'
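
To spot this situation without reading the output by eye, the listing above could be parsed for pending requests. This is a sketch under assumptions: the line format is taken from the single entry shown in the transcript, and `pending_csrs` is a hypothetical helper, not an existing tool. The transcript also warns that `puppet cert` is deprecated, so the format may differ on newer servers.

```python
import re

# Matches lines such as:
#   "kafka-main2009.codfw.wmnet" (SHA256) B5:5F:...:46
# Format inferred from the one entry in the transcript above (assumption).
CSR_LINE = re.compile(
    r'^\s*"(?P<host>[^"]+)"\s+\((?P<algo>[A-Za-z0-9]+)\)\s+(?P<fingerprint>[0-9A-F:]+)\s*$'
)


def pending_csrs(output: str) -> dict[str, str]:
    """Map each host with a pending CSR to its fingerprint.

    Warning and location lines do not match the pattern and are skipped.
    """
    pending = {}
    for line in output.splitlines():
        match = CSR_LINE.match(line)
        if match:
            pending[match.group("host")] = match.group("fingerprint")
    return pending


SAMPLE = """\
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
  "kafka-main2009.codfw.wmnet" (SHA256) B5:5F:81:9E:67:18:D4:E1:CF:AA:C7:A7:8B:DE:B1:8B:D3:04:C3:E7:04:64:D1:6A:27:04:23:EF:24:E3:37:46
"""

if __name__ == "__main__":
    for host in pending_csrs(SAMPLE):
        print(f"stale CSR pending for {host}")
```

A host that is meant to use Puppet 7 but shows up in this list on puppetmaster1001 is in the state described above.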
Papaul updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2008.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2010.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2010.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye completed:

  • kafka-main2009 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405201604_akosiaris_1154020_kafka-main2009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye completed:

  • kafka-main2007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405201703_akosiaris_1163483_kafka-main2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye completed:

  • kafka-main2010 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405201709_akosiaris_1163663_kafka-main2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye completed:

  • kafka-main2008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405201713_akosiaris_1163582_kafka-main2008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye completed:

  • kafka-main2006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405201716_akosiaris_1163423_kafka-main2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

As an update, I had to reimage these servers as I had messed up the original recipe.
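
For anyone skimming the cookbook output above to confirm where each host ended up, here is a minimal sketch that tallies the per-host summary lines. It is not part of the cookbook tooling: the function names `tally_reimage_results` and `final_status` are hypothetical, the sample text is a shortened stand-in for the log, and it assumes each summary line has the form `• <host> (PASS)` or `• <host> (FAIL)` as pasted in this task.

```python
import re
from collections import defaultdict

# Matches summary lines such as "  • kafka-main2010 (FAIL)".
# Step lines ("    • Forced PXE for next reboot") do not match because
# the second token is not a parenthesised PASS/FAIL marker.
STATUS_RE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")


def tally_reimage_results(log_text):
    """Return {host: [status, ...]} in the order results appear (hypothetical helper)."""
    results = defaultdict(list)
    for line in log_text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            host, status = match.groups()
            results[host].append(status)
    return dict(results)


def final_status(results):
    """Keep only the most recent status recorded for each host."""
    return {host: statuses[-1] for host, statuses in results.items()}


# Shortened, illustrative sample in the same shape as the log above.
sample = """\
  • kafka-main2010 (FAIL)
    • Forced PXE for next reboot
    • Host up (Debian installer)
  • kafka-main2009 (PASS)
  • kafka-main2010 (PASS)
"""

results = tally_reimage_results(sample)
print(final_status(results))
```

Keeping the full status history per host, rather than only the latest value, makes it easy to see that a host such as kafka-main2010 failed before it eventually passed.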