Q2:rack/setup/install es104[1-6]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of es104[1-6]
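
The `es104[1-6]` shorthand stands for six hostnames. A minimal sketch of how such a bracketed range could be expanded follows; `expand_hosts` is a hypothetical helper written for illustration, not part of any Wikimedia tooling, and it assumes at most one numeric range per pattern.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single bracketed numeric range, e.g. 'es104[1-6]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: treat the pattern as one literal hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding if the range uses it
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("es104[1-6]")
print(hosts)
```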

Hostname / Racking / Installation Details

Hostnames: es104[1-6]
Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row? (Note EQIAD now has rows A-F.)
es1020 → will be replaced by es1041, server can be installed in A3
es1021 → will be replaced by es1042, server can be installed in B3
es1022 → will be replaced by es1043, server can be installed in C5
es1023 → will be replaced by es1044, server can be installed in D6
es1024 → will be replaced by es1045, server can be installed in A5
es1025 → will be replaced by es1046, server can be installed in B5

This is essentially the same topology as their older counterparts, which should minimize the risk of misplacement.
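
To see how this proposal spreads the hosts across rows, the mapping above can be grouped by row. This is only an illustrative sketch: the data is copied from the list above, and the assumption that the first letter of a rack name is its row is ours.

```python
from collections import defaultdict

# (old host, replacement, rack) as listed in the racking proposal.
PLAN = [
    ("es1020", "es1041", "A3"),
    ("es1021", "es1042", "B3"),
    ("es1022", "es1043", "C5"),
    ("es1023", "es1044", "D6"),
    ("es1024", "es1045", "A5"),
    ("es1025", "es1046", "B5"),
]


def hosts_per_row(plan):
    """Group the new hostnames by row (assumed to be the rack's first letter)."""
    rows = defaultdict(list)
    for _old, new, rack in plan:
        rows[rack[0]].append(new)
    return dict(rows)


rows = hosts_per_row(PLAN)
for row in sorted(rows):
    print(row, rows[row])
```

Under this proposal rows A and B each hold two of the new hosts, while rows E and F are unused.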

Networking Setup: # of Connections:1/2 - Speed:1G/10G. - VLAN:Private/Public/Other(Specify) :
OS Distro: Bookworm (default unless otherwise specified)
Sub-team Technical Contact: @ABran-WMF

Per host setup checklist

es1041
  • Receive in system on procurement task T376157 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
es1042
  • Receive in system on procurement task T376157 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
es1043
  • Receive in system on procurement task T376157 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
es1044
  • Receive in system on procurement task T376157 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
es1045
  • Receive in system on procurement task T376157 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
es1046
  • Receive in system on procurement task T376157 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@ABran-WMF,

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-site staff will claim this task for racking and setup.

Thank you!

Change #1083758 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: add 12 new es hosts

https://gerrit.wikimedia.org/r/1083758

ABran-WMF added subscribers: Marostegui, Ladsgroup.

I've tried to reproduce what's been done in T355269 which is quite close to what we're doing here. I might be lacking some info though. @Ladsgroup @Marostegui I'd be happy to have a sanity check (also on T378146)

To my understanding, ES hosts are basically the exact same databases as other db clusters, so there isn't anything radically different (especially with regard to pool/depool). The only things to keep in mind are (off the top of my head):

  • They are large, obviously. So transfer.py might not work or might be a bit slow. @jcrespo told me about this edge case but I might be misremembering it. So pinging him.
  • Some ES hosts are also backup sources, I think the RO ones. Please coordinate with Jaime to make sure the two don't happen at the same time.
  • Similarly, the RO dbs don't have replication set up, so the clone cookbook wouldn't work with them (and/or we need to make the cookbook work with them)

T262388 is the bug, but I couldn't fix it because I couldn't reproduce it at the time. I highly recommend double-checking the data after the copy to make sure nothing was lost (counting tables and comparing file sizes is a fast way to do it). transfer does a checksum, but one never knows if other errors could be missed.
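
The "count and compare file sizes" check can be sketched roughly as below. This is an assumption-laden illustration, not part of transfer.py: `tree_sizes` and `compare_trees` are hypothetical helpers, and both directory trees are assumed to be readable from the machine running the script (in practice the listing would be collected on each host and then compared). Sizes are only comparable if the source was not being written to during the copy.

```python
import os


def tree_sizes(root: str) -> dict[str, int]:
    """Map each file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def compare_trees(source: str, target: str) -> list[str]:
    """Return a list of differences; an empty list means names and sizes match."""
    src, dst = tree_sizes(source), tree_sizes(target)
    problems = []
    for path in sorted(src.keys() - dst.keys()):
        problems.append(f"missing on target: {path}")
    for path in sorted(dst.keys() - src.keys()):
        problems.append(f"unexpected on target: {path}")
    for path in sorted(src.keys() & dst.keys()):
        if src[path] != dst[path]:
            problems.append(f"size differs: {path} ({src[path]} != {dst[path]})")
    return problems
```

An empty result does not prove the copy is correct; it only rules out missing files and truncated files, which is the cheap first check suggested above.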

Thanks for digging out the bug!

I've tried to reproduce what's been done in T355269 which is quite close to what we're doing here. I might be lacking some info though. @Ladsgroup @Marostegui I'd be happy to have a sanity check (also on T378146)

What would you need?

Basically a validation of the chosen positions. I stuck to the existing topology, as there was a 1:1 match between hosts and each rack had enough rack units available.

So, we have 6 rows available, so let's place one per row.
For A3, there's already an external store host there, so if there's any other rack in row A with no es*, let's put it there.
es1042 in B3 looks good
es1043 in C5 looks good.
es1044 in D6 looks good.

For es1045 and es1046, I'd propose just looking for a rack in E and F with no other es hosts and using those.

so, 1020 would go in A2
es1045 in E1
es1046 in F1

I assumed that there was no constraint on the electrical side; those racks are quite full.

@ABran-WMF these have been racked, cabled, and configured per the racking instructions that were in the Racking Proposal, and just need puppet updated for the OS install.

Power-wise, the most effective option is to keep these in rows A-D and avoid E-F, since you requested keeping them in the same racks as the hosts they replace (1G racks).

Change #1099696 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: add new ES hosts

https://gerrit.wikimedia.org/r/1099696

Change #1083758 abandoned by Arnaudb:

[operations/puppet@production] mariadb: add 12 new es hosts

Reason:

replaced by: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1099696

https://gerrit.wikimedia.org/r/1083758

Change #1099696 merged by Arnaudb:

[operations/puppet@production] mariadb: add new ES hosts

https://gerrit.wikimedia.org/r/1099696

Thanks @Jclark-ctr! The puppet patch has been merged.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1042.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1042.eqiad.wmnet with OS bookworm completed:

  • es1042 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412040040_jclark_2053163_es1042.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
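
The reimage reports on this task all share the same `• host (STATUS)` summary line, so outcomes per host can be tallied mechanically. A minimal sketch, assuming the lines are available as plain text exactly as rendered here; `RESULT_RE` and `tally` are hypothetical names, and the sample lines are copied from reports on this task.

```python
import re
from collections import Counter

# Matches summary lines such as "  • es1042 (PASS)" but not the step lines below them.
RESULT_RE = re.compile(r"^\s*•\s+(?P<host>[a-z0-9-]+) \((?P<status>PASS|FAIL|WARN)\)\s*$")


def tally(lines):
    """Count (host, status) pairs found in cookbook report lines."""
    counts = Counter()
    for line in lines:
        match = RESULT_RE.match(line)
        if match:
            counts[(match["host"], match["status"])] += 1
    return counts


sample = [
    "  • es1042 (PASS)",
    "    • Removed from Debmonitor if present",
    "  • es1041 (FAIL)",
    "  • es1041 (PASS)",
]
for (host, status), count in sorted(tally(sample).items()):
    print(host, status, count)
```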

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm executed with errors:

  • es1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1041.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1046.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm executed with errors:

  • es1044 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1044.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1041.eqiad.wmnet with OS bookworm completed:

  • es1041 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412040122_jclark_2071163_es1041.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1046.eqiad.wmnet with OS bookworm completed:

  • es1046 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412040139_jclark_2072675_es1046.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm executed with errors:

  • es1045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1045.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm executed with errors:

  • es1045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1045.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm executed with errors:

  • es1044 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1044.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1045.eqiad.wmnet with OS bookworm executed with errors:

  • es1045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1045.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1044.eqiad.wmnet with OS bookworm completed:

  • es1044 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412042238_jclark_2259035_es1044.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Running into issues with the last two, @ABran-WMF: es1043 is imaged but will not pass the puppet certificate step, and es1045 will not PXE. @Jhancock.wm, if you get a chance, can you take a look at these two and see if you can tell what's missing?

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1045.eqiad.wmnet with OS bookworm

Got es1045 to PXE: re-running the provisioning script fixed whatever that was. es1043 is still not passing the certificate step despite re-running some cookbooks. Waiting for it to fail, and then I will take another look. We had a similar issue with one of the es20xx servers we got in.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1045.eqiad.wmnet with OS bookworm completed:

  • es1045 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202412051437_jhancock_1873095_es1045.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm

es1043 is gonna fail again

C&P from @Papaul

puppetmaster1001:~$ sudo puppet cert --list
Warning: puppet cert is deprecated and will be removed in a future release.
(location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
"es1043.eqiad.wmnet" (SHA256)

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm

@elukey we're having an issue with this last server. es1043 keeps going to the puppetmaster server for its certificate instead of the others.

Generating a new Puppet certificate on 1 hosts: es1043.eqiad.wmnet
PASS |██████████████████████████████████| 100% (1/1) [00:03<00:00,  3.34s/hosts]
FAIL |                                          |   0% (0/1) [00:03<?, ?hosts/s]
Generated CSR for host es1043.eqiad.wmnet: 
Generated Puppet certificate
[1/10, retrying in 5.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
[2/10, retrying in 10.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
[3/10, retrying in 20.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
[4/10, retrying in 40.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
[5/10, retrying in 80.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
[6/10, retrying in 160.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
[7/10, retrying in 320.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for es1043.eqiad.wmnet
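
The delays in the log above double on each attempt (5 s, 10 s, 20 s, ...), which is a standard exponential backoff. The sketch below reproduces that schedule for illustration only; it is not spicerack's implementation, and the `tries`, `base` and `factor` values are inferred from the log output rather than taken from the code.

```python
import time


def backoff_delays(tries=10, base=5.0, factor=2.0):
    """Delays between attempts; there is one fewer delay than there are tries."""
    return [base * factor**i for i in range(tries - 1)]


def retry(func, tries=10, base=5.0, factor=2.0, sleep=time.sleep):
    """Call func until it succeeds, sleeping with exponential backoff between failures."""
    delays = backoff_delays(tries, base, factor)
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            delay = delays[attempt - 1]
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] raised: {exc}")
            sleep(delay)


print(backoff_delays()[:7])
```

With these assumed parameters, the seven delays shown in the log add up to 635 seconds, so each failing reimage spends more than ten minutes waiting for a CSR before it gets that far.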

Papaul has removed it multiple times, only for it to fail again. Is there a way we can force this to one of the puppet servers? I've tried every trick I know, including re-provisioning it with all the tags you showed me last time. Ty for your help!

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm executed with errors:

  • es1043 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es1043.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

@Jhancock.wm hi! I reprovisioned and reimaged es1043 with Bookworm and didn't see any issue; the host is ready to go. I also checked the cookbook logs for previous runs and don't see any clear issues or culprits. We default to puppet 7 in most places, so ending up with 5 is really unlikely. Not sure how to repro the issue :(

Weird. When I ran the cookbook it was defaulting to puppet 7 since it was bookworm, so I'm not sure why it would do that. But I'm not gonna question it if it works now. Ty for your help!

Jhancock.wm updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcontrol1011.eqiad.wmnet with OS bookworm executed with errors:

  • cloudcontrol1011 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudcontrol1011.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.