Q2:rack/setup/install ganeti-jumbo100[1-3]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ganeti-jumbo100[1-3]

Hostname / Racking / Installation Details

Hostnames: ganeti-jumbo100[1-3]
Racking Proposal: Different rack per host
Networking Setup: 1 connection, 10G, private VLAN
OS Distro: Trixie
Boot Method: UEFI
Sub-team Technical Contact: EU hours: Ben Tullis (@BTullis, btullis on IRC); US hours: Brian King (@bking, inflatador on IRC)
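
The bracketed hostname range above can be expanded into individual hostnames for scripting. The following is a minimal sketch; `expand_host_range` is a hypothetical helper written for illustration and is not part of any Wikimedia tooling.

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'name100[1-3]'.

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range present: treat the pattern as one literal hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as [01-03]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("ganeti-jumbo100[1-3]")
print(hosts)
```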

Per host setup checklist

ganeti-jumbo1001
  • Receive in system on procurement task T404778 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti-jumbo1002
  • Receive in system on procurement task T404778 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti-jumbo1003
  • Receive in system on procurement task T404778 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script. Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
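
The one ordering constraint the checklist states explicitly is that the sre.dns.netbox and sre.hosts.provision cookbooks run immediately after the Netbox provisioning script. A minimal sketch of checking that constraint against a record of performed steps follows. The step label `NETBOX_SCRIPT` and the function `followups_ok` are hypothetical names chosen for illustration; only the cookbook names come from the checklist.

```python
# Hypothetical label for the Netbox provisioning script step.
NETBOX_SCRIPT = "netbox: provision server network attributes"

# Cookbooks the checklist says must run immediately afterwards, in order.
FOLLOWUPS = ["sre.dns.netbox", "sre.hosts.provision"]


def followups_ok(performed: list[str]) -> bool:
    """Return True if the follow-up cookbooks directly follow the Netbox script.

    If the Netbox script has not been run yet, there is nothing to check.
    """
    if NETBOX_SCRIPT not in performed:
        return True
    index = performed.index(NETBOX_SCRIPT)
    following = performed[index + 1 : index + 1 + len(FOLLOWUPS)]
    return following == FOLLOWUPS


good = [NETBOX_SCRIPT, "sre.dns.netbox", "sre.hosts.provision",
        "sre.hardware.upgrade-firmware", "sre.hosts.reimage"]
bad = [NETBOX_SCRIPT, "sre.hardware.upgrade-firmware",
       "sre.dns.netbox", "sre.hosts.provision"]
print(followups_ok(good), followups_ok(bad))
```

The sketch deliberately does not enforce a strict top-to-bottom order for the remaining steps, since in this task the operations/puppet change was merged before the hosts were racked.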

Event Timeline

RobH assigned this task to bking.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

@bking,

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when complete. Once complete, just un-assign yourself (leaving no assignee) for this task, and once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

RobH mentioned this in Unknown Object (Task). Sep 29 2025, 7:31 PM
RobH added a parent task: Unknown Object (Task).

Change #1196935 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: Add ganeti-jumbo hosts

https://gerrit.wikimedia.org/r/1196935

Change #1196952 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] ganeti-jumbo: Add hosts and partman recipe

https://gerrit.wikimedia.org/r/1196952

Change #1196935 abandoned by Bking:

[operations/puppet@production] site.pp: Add ganeti-jumbo hosts

Reason:

superseded by 1196952

https://gerrit.wikimedia.org/r/1196935

Change #1196952 merged by Bking:

[operations/puppet@production] ganeti-jumbo: Add hosts and partman recipe

https://gerrit.wikimedia.org/r/1196952

bking changed the task status from Open to In Progress. Oct 27 2025, 1:56 PM
bking changed the task status from In Progress to Stalled. Oct 27 2025, 1:59 PM

Hello DC Ops,

I've added the hosts and partman recipe to Puppet as requested. Please note that the partman recipe is untested, so if the reimage fails more than once, please ping me on IRC (inflatador) and/or assign this task to me. The same goes for T405964, as the hosts in that ticket are identical to these.

Thanks for your help!

bking removed bking as the assignee of this task. Nov 20 2025, 2:47 PM

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1001.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1002.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1001.eqiad.wmnet with OS trixie completed:

  • ganeti-jumbo1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202512112009_jclark_3604593_ganeti-jumbo1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1002.eqiad.wmnet with OS trixie executed with errors:

  • ganeti-jumbo1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti-jumbo1002.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1002.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ganeti-jumbo1003.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1003.eqiad.wmnet with OS trixie completed:

  • ganeti-jumbo1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202512121418_jclark_4024268_ganeti-jumbo1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ganeti-jumbo1002.eqiad.wmnet with OS trixie completed:

  • ganeti-jumbo1002 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202512121404_jclark_4010144_ganeti-jumbo1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually
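
The per-host result lines in the cookbook summaries above (PASS, FAIL, WARN) can be reduced to the most recent status per host. This is a minimal sketch that assumes the summary line format shown in this task; `parse_status` and `latest_status` are hypothetical helpers, not part of Spicerack or the cookbooks.

```python
import re

# Matches summary lines such as "  • ganeti-jumbo1002 (WARN)".
STATUS_LINE = re.compile(
    r"^\s*•\s*(?P<host>\S+)\s+\((?P<status>PASS|WARN|FAIL)\)\s*$"
)


def parse_status(line: str):
    """Return (host, status) for a summary line, or None for any other line."""
    match = STATUS_LINE.match(line)
    if match is None:
        return None
    return match["host"], match["status"]


def latest_status(lines):
    """Map each host to the status of its last summary line in `lines`."""
    result = {}
    for line in lines:
        parsed = parse_status(line)
        if parsed is not None:
            host, status = parsed
            result[host] = status  # later runs overwrite earlier ones
    return result


# Summary lines in the order they appear in this task.
timeline = [
    "  • ganeti-jumbo1001 (PASS)",
    "  • ganeti-jumbo1002 (FAIL)",
    "    • Host rebooted via Redfish",
    "  • ganeti-jumbo1003 (PASS)",
    "  • ganeti-jumbo1002 (WARN)",
]
print(latest_status(timeline))
```

For this task the result shows ganeti-jumbo1002 ending in WARN, which corresponds to the final log line asking for sre.puppet.sync-netbox-hiera to be run manually.
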
Jclark-ctr updated the task description.