Q4:rack/setup/install clouddb102[2-5]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of clouddb102[2-5].

Hostname / Racking / Installation Details

Hostnames: clouddb102[2-5].eqiad.wmnet
Racking Proposal: Ideally each host in a separate row. Failing that, at least keep 1022/1023 in separate rows from each other and 1024/1025 separate as well.
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private
OS Distro: Bookworm (default unless otherwise specified)
Boot Method: Legacy BIOS
Sub-team Technical Contact: Francesco Negri (preferred) or Andrew

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

clouddb1022
  • Receive in system on procurement task T391363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
clouddb1023
  • Receive in system on procurement task T391363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
clouddb1024
  • Receive in system on procurement task T391363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
clouddb1025
  • Receive in system on procurement task T391363 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
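
Since the same checklist is pasted once per host, it can also be generated. The following is a minimal sketch, not part of the DC Ops tooling: the step text is abbreviated from the checklist above, and `render_checklists` is a hypothetical helper name.

```python
# Checklist steps, abbreviated from the task description above.
CHECKLIST_STEPS = [
    "Receive in system on procurement task T391363 & in Coupa",
    "Rack system with proposed racking plan & update Netbox",
    "Run the Provision a server's network attributes Netbox script",
    "Immediately run the sre.dns.netbox cookbook",
    "Immediately run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo (preseed.yaml, site.pp)",
    "Run the sre.hosts.reimage cookbook",
]


def render_checklists(hosts, steps=CHECKLIST_STEPS):
    """Return one plain-text checklist per host, in the layout used above."""
    lines = []
    for host in hosts:
        lines.append(host)
        lines.extend(f"  • {step}" for step in steps)
    return "\n".join(lines)


# The four hosts tracked by this task: clouddb1022 through clouddb1025.
hosts = [f"clouddb{n}" for n in range(1022, 1026)]
print(render_checklists(hosts))
```

The step order is kept as a list on purpose, because the DNS and provision cookbooks have to follow the Netbox script directly.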

Event Timeline

RobH assigned this task to fnegri.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yml for partition info.

If possible, please reference this task number in your patch set, so it is clear when complete. Once complete, just un-assign yourself (leaving no assignee) for this task, and once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

Change #1143871 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm

https://gerrit.wikimedia.org/r/1143871

Change #1143871 merged by FNegri:

[operations/puppet@production] site.pp: Add clouddb102[1-4] as insetup::wmcs_ferm

https://gerrit.wikimedia.org/r/1143871

fnegri removed fnegri as the assignee of this task. May 9 2025, 5:28 PM
fnegri subscribed.

update the site.pp file with the insetup role for your team

Done in https://gerrit.wikimedia.org/r/1143871

and add the new servers to preseed.yml for partition info

This was not necessary as preseed.yml already contains an entry for clouddb1* that matches both old and new hosts.
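
To illustrate why no new entry was needed, here is a small sketch that treats the `clouddb1*` entry as a shell-style glob. That interpretation is an assumption; the actual matching is done by the install server configuration and may differ in detail. `clouddb2001` is a hypothetical hostname used only to show a non-match.

```python
from fnmatch import fnmatchcase

# Pattern as quoted in the comment above; glob semantics are assumed.
PATTERN = "clouddb1*"


def matches_entry(hostname, pattern=PATTERN):
    """Return True if the hostname is covered by the existing entry."""
    return fnmatchcase(hostname, pattern)


# New hosts from this task, plus one hypothetical host that should not match.
for host in ["clouddb1022", "clouddb1025", "clouddb2001"]:
    print(host, matches_entry(host))
```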

fnegri renamed this task from Q4:rack/setup/install clouddb102[1-4] to Q4:rack/setup/install clouddb102[2-5]. May 9 2025, 5:29 PM
fnegri updated the task description.

Please note that we should skip clouddb1021 as it existed in the past and was decom'd in T368518: decommission clouddb1021.

I updated the task to rename the 4 new hosts clouddb1022 to clouddb1025.
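
The effect of the rename can be checked by expanding both bracket ranges. This is a minimal sketch assuming the `name[a-b]` notation means a single-digit inclusive range; `expand_range` is a hypothetical helper, not an existing tool.

```python
import re


def expand_range(expr):
    """Expand 'prefix[a-b]suffix' (single digits) into a list of names."""
    match = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", expr)
    if match is None:
        # Not a simple single-digit range: return the expression unchanged.
        return [expr]
    prefix, low, high, suffix = match.groups()
    return [f"{prefix}{d}{suffix}" for d in range(int(low), int(high) + 1)]


old_hosts = expand_range("clouddb102[1-4]")
new_hosts = expand_range("clouddb102[2-5]")

# The old range reuses the decommissioned name; the new one avoids it.
print("clouddb1021" in old_hosts, "clouddb1021" in new_hosts)
```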

Change #1143876 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] site.pp: skip clouddb1021 as it existed in the past

https://gerrit.wikimedia.org/r/1143876

Change #1143876 merged by FNegri:

[operations/puppet@production] site.pp: skip clouddb1021 as it existed in the past

https://gerrit.wikimedia.org/r/1143876

clouddb1022
Rack A4
U33
CableID: 20220030
Port: 43

clouddb1023
Rack B2
U32
CableID: 5253
Port: 39

clouddb1024
Rack E8
U35
CableID: 240707900050
Port: 35

clouddb1025
Rack F8
U24
CableID: 240707900035
Port: 24
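
The rack assignments above can be compared against the racking proposal in the task description. The sketch below assumes that the leading letter of a rack name (for example `A` in `A4`) identifies the row; that naming convention is not stated in this task.

```python
# Rack assignments as reported in the comment above.
RACKS = {
    "clouddb1022": "A4",
    "clouddb1023": "B2",
    "clouddb1024": "E8",
    "clouddb1025": "F8",
}


def row_of(rack):
    """Assumption: the first character of the rack name is the row."""
    return rack[0]


def check_placement(racks):
    """Return (ideal, minimum) for the two levels of the racking proposal."""
    rows = {host: row_of(rack) for host, rack in racks.items()}
    # Ideal: every host sits in a different row.
    ideal = len(set(rows.values())) == len(rows)
    # Minimum: 1022/1023 in different rows, and 1024/1025 in different rows.
    minimum = (
        rows["clouddb1022"] != rows["clouddb1023"]
        and rows["clouddb1024"] != rows["clouddb1025"]
    )
    return ideal, minimum


print(check_placement(RACKS))
```

Under that assumption the hosts land in rows A, B, E and F, so both the ideal and the fallback condition hold.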

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

While attempting to image this server (clouddb1022), I got this error.

kVh9WNyR.png (296×500 px, 32 KB)

I think the partman recipe is incompatible with the new servers; I'll look into it.

Change #1173974 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] installserver: setup new hosts clouddb102[2-5]

https://gerrit.wikimedia.org/r/1173974

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

Tried to run reimage on clouddb1022, to no avail. Running through clouddb1023 to see if there is a difference.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm

I hit the same issue with the same error on clouddb1023.

Jclark-ctr moved this task from Decommission to Blocked on the ops-eqiad board.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

@Papaul the reimage of clouddb1022 will fail until my patch above is merged; I'm waiting for a review from Data-Persistence.

@fnegri thanks, I was about to ping you on that as well.

Change #1173974 merged by FNegri:

[operations/puppet@production] installserver: setup new hosts clouddb102[2-5]

https://gerrit.wikimedia.org/r/1173974

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1022 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1022.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1022.eqiad.wmnet with OS bookworm completed:

  • clouddb1022 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507301847_vriley_3936284_clouddb1022.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm completed:

  • clouddb1023 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507301907_vriley_3959389_clouddb1023.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Was able to run through 2 of these. Running into issues with the BMC password.

clouddb1022 - Finished, no issues
clouddb1023 - Finished, Password was set to the BMC default
clouddb1024 and clouddb1025 - Tried multiple combinations of usernames and passwords with no luck.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1025.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host clouddb1024.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1025.eqiad.wmnet with OS bookworm completed:

  • clouddb1025 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507310213_vriley_4070441_clouddb1025.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1024.eqiad.wmnet with OS bookworm completed:

  • clouddb1024 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507310218_vriley_4072907_clouddb1024.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

This has been completed

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host clouddb1023.eqiad.wmnet with OS bookworm executed with errors:

  • clouddb1023 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1023.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

I've expanded their logical volume to use most of the disk, as we normally do:

root@clouddb1022:~# pvs
  PV         VG   Fmt  Attr PSize  PFree
  /dev/sda3  tank lvm2 a--  <8.69t 56.36g
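
As a rough check of "most of the disk", the free share can be computed from the `pvs` figures above. This sketch assumes the default LVM convention that lowercase unit suffixes are binary (`t` = TiB, `g` = GiB) and that a leading `<` only marks a value rounded up for display.

```python
# Conversion factors to GiB for lowercase (binary) LVM unit suffixes.
UNIT_GIB = {"g": 1.0, "t": 1024.0}


def to_gib(field):
    """Convert a pvs size field such as '<8.69t' or '56.36g' to GiB."""
    field = field.lstrip("<")  # '<' only marks a rounded display value
    return float(field[:-1]) * UNIT_GIB[field[-1]]


psize = to_gib("<8.69t")  # PSize column
pfree = to_gib("56.36g")  # PFree column
free_pct = 100 * pfree / psize

print(f"{free_pct:.2f}% of the physical volume is unallocated")
```

With these inputs well under 1% of the physical volume remains free, which is consistent with the comment above.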

Change #1198064 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb102[25]: Add hieradata file

https://gerrit.wikimedia.org/r/1198064

Change #1198064 merged by Marostegui:

[operations/puppet@production] clouddb102[25]: Add hieradata file

https://gerrit.wikimedia.org/r/1198064