
Q2:rack/setup/install ganeti103[5-8]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of 4 ganeti nodes for expansion in eqiad.

Hostname / Racking / Installation Details

Hostnames: ganeti1035.eqiad.wmnet, ganeti1036.eqiad.wmnet, ganeti1037.eqiad.wmnet, ganeti1038.eqiad.wmnet
Racking Proposal: One in each row of A, B, C and D
Networking Setup: 10G - VLAN setup like existing Ganeti nodes
Partitioning/Raid: partman/custom/ganeti-raid5.cfg
OS Distro: Bullseye
Sub-team Technical Contact: @MoritzMuehlenhoff
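
The bracket notation in the task title, ganeti103[5-8], is shorthand for the four hostnames listed above. The following is a minimal sketch of that expansion, assuming a single numeric range in brackets at the end of the name. The function name expand_range is hypothetical, and this is not the parser used by cumin or Netbox tooling.

```python
import re


def expand_range(pattern: str, domain: str = "eqiad.wmnet") -> list[str]:
    """Expand 'prefix[a-b]' into fully qualified hostnames.

    Only a single trailing numeric range is handled; this is an
    illustrative helper, not the expansion used by production tooling.
    """
    match = re.fullmatch(r"(?P<prefix>[a-z0-9-]*?)\[(?P<lo>\d+)-(?P<hi>\d+)\]", pattern)
    if match is None:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    lo, hi = int(match["lo"]), int(match["hi"])
    if lo > hi:
        raise ValueError("range start exceeds range end")
    # Keep the zero-padding width of the lower bound (e.g. [05-08]).
    width = len(match["lo"])
    return [f"{match['prefix']}{n:0{width}d}.{domain}" for n in range(lo, hi + 1)]
```

Calling expand_range("ganeti103[5-8]") yields the four eqiad.wmnet hostnames from the Hostnames line, which makes it easy to cross-check a checklist against the title.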

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ganeti1035:
  • - receive in system on procurement task T348023 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti1036:
  • - receive in system on procurement task T348023 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti1037:
  • - receive in system on procurement task T348023 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ganeti1038:
  • - receive in system on procurement task T348023 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).

@MoritzMuehlenhoff,

The parent purchasing task for 4 nodes in eqiad has been escalated to order without racking details. Would you please provide racking details in the task description for this task on the incoming ganeti nodes for this site?

Once done, please assign back to me so I can fill out the checklist for the nodes and escalate accordingly. Thanks!

MoritzMuehlenhoff renamed this task from Q2:rack/setup/install ganeti expansion to Q2:rack/setup/install ganeti for eqiad. (Nov 1 2023, 4:03 PM)
MoritzMuehlenhoff reassigned this task from MoritzMuehlenhoff to RobH.
MoritzMuehlenhoff updated the task description.

I've filled in the details in the task description. Let me know if you need anything else.

ganeti1035.eqiad.wmnet
Rack: A2
Position: U33
Port: 41
Cableid: 230304500230

ganeti1036.eqiad.wmnet
Rack: B4
Position: U29
Port: 43
Cableid: 230304500136

ganeti1037.eqiad.wmnet
Rack: C7
Position: U42
Port: 42
Cableid: 230304500155

ganeti1038.eqiad.wmnet
Rack: D4
Position: U37
Port: 9
Cableid: 230304500231

RobH renamed this task from Q2:rack/setup/install ganeti for eqiad to Q2:rack/setup/install ganeti103[5-8]. (Nov 17 2023, 6:33 PM)
RobH reassigned this task from RobH to VRiley-WMF.
RobH updated the task description.
RobH unsubscribed.

Please also enable virtualisation for these in the BIOS, as they will serve as virt servers.

Mentioned in SAL (#wikimedia-operations) [2023-11-20T10:55:14Z] <volans@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management records for ganeti103[5-8] - T349925 - volans@cumin1001"

Mentioned in SAL (#wikimedia-operations) [2023-11-20T10:56:04Z] <volans@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management records for ganeti103[5-8] - T349925 - volans@cumin1001"

The hosts were set up in Netbox with a public VLAN and FQDN (wikimedia.org), but they should have been set up with the private one (eqiad.wmnet FQDNs).
The changes were not committed to the DNS (by running the sre.dns.netbox cookbook); as a result, Icinga has been alerting for Uncommitted DNS changes in Netbox since Friday.
I've noticed that the provision cookbook was run for all the hosts and failed for all of them. That's because the connection to the Redfish API of the iDRAC goes via IP address, but the subsequent check that remote IPMI works uses DNS, and the management DNS records were not committed.

To unblock other changes in Netbox and prevent committing wrong data (the public IP DNS records), I've deleted the public IPs in Netbox for the above hosts, left the management ones, which have probably already been set up in the iDRACs, and run the sre.dns.netbox cookbook to commit the mgmt records.
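
Since the IPMI check depends on the management names resolving, a quick resolution check before re-running provisioning can catch the uncommitted-DNS case early. This is a minimal sketch: the helper names mgmt_fqdn and unresolved_names are hypothetical, and the <host>.mgmt.eqiad.wmnet naming pattern is an assumption rather than something stated in this task.

```python
import socket
from typing import Callable


def mgmt_fqdn(host: str, site: str = "eqiad") -> str:
    """Build a management FQDN (naming pattern is an assumption)."""
    return f"{host}.mgmt.{site}.wmnet"


def unresolved_names(
    names: list[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> list[str]:
    """Return the subset of names that fail to resolve in DNS."""
    missing = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

For example, unresolved_names([mgmt_fqdn(f"ganeti{n}") for n in range(1035, 1039)]) should return an empty list once the mgmt records have been committed. The resolve parameter exists only so the check can be exercised without network access.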

The sre.hosts.provision cookbook will need to be re-run on those hosts once the final IPs are set with the proper FQDNs, both to ensure it finishes and to enable virtualization, as the earlier runs did not use the --enable-virtualization flag.
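
As a rough illustration of the re-run described above, the sketch below only builds the command lines and does not execute anything. The cookbook name and the --enable-virtualization flag come from the comment above; the "sudo cookbook" invocation form and passing the short hostname as a positional argument are assumptions, and provision_command is a hypothetical helper.

```python
import shlex


def provision_command(host: str, enable_virtualization: bool = True) -> list[str]:
    """Build (but do not run) an argv list for the provision cookbook.

    The 'sudo cookbook' prefix and the positional host argument are
    assumptions about how the cookbook is invoked.
    """
    argv = ["sudo", "cookbook", "sre.hosts.provision"]
    if enable_virtualization:
        argv.append("--enable-virtualization")
    argv.append(host)
    return argv


def provision_commands(hosts: list[str]) -> list[str]:
    """Return one shell-quoted command string per host."""
    return [shlex.join(provision_command(host)) for host in hosts]
```

Printing provision_commands(["ganeti1035", "ganeti1036", "ganeti1037", "ganeti1038"]) gives one line per host to review before running them from a cumin host.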

@Volans Thank you for the information! I have run through these again, and with the help of @RobH these should be corrected. Also, virtualization has been enabled.

Change 981396 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] Add ganeti103[5-8] to site.pp

https://gerrit.wikimedia.org/r/981396

Change 981396 merged by Jclark-ctr:

[operations/puppet@production] Add ganeti103[5-8] to site.pp

https://gerrit.wikimedia.org/r/981396

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ganeti1035.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ganeti1036.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ganeti1037.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ganeti1038.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ganeti1038.eqiad.wmnet with OS bullseye completed:

  • ganeti1038 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312080019_jclark_2715178_ganeti1038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ganeti1035.eqiad.wmnet with OS bullseye completed:

  • ganeti1035 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312080021_jclark_2714538_ganeti1035.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ganeti1037.eqiad.wmnet with OS bullseye completed:

  • ganeti1037 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312080024_jclark_2714936_ganeti1037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ganeti1036.eqiad.wmnet with OS bullseye completed:

  • ganeti1036 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312080026_jclark_2714899_ganeti1036.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
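
The four reimage summaries above differ only in the overall status and in the downtime step for ganeti1036. A minimal sketch of how such a summary could be scanned for the host, its status, and any step that reports a problem follows. It assumes the bullet layout shown in this task and treats any step starting with "Unable" as a problem; summarize_reimage is a hypothetical helper, not part of the cookbook.

```python
import re


def summarize_reimage(report: str) -> tuple[str, str, list[str]]:
    """Extract (host, status, problem steps) from a pasted reimage summary."""
    host = status = None
    problems = []
    for raw in report.splitlines():
        # Drop indentation and the leading bullet character.
        line = raw.strip().lstrip("•").strip()
        header = re.fullmatch(r"(\S+) \(([A-Z]+)\)", line)
        if header:
            host, status = header.groups()
        elif line.startswith("Unable"):
            problems.append(line)
    if host is None or status is None:
        raise ValueError("no 'host (STATUS)' line found")
    return host, status, problems
```

Applied to the ganeti1036 summary, this returns the WARN status together with the single step about the sre.hosts.downtime cookbook returning 99; for the other three hosts the problem list is empty.
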
Jclark-ctr claimed this task.
Jclark-ctr updated the task description.

Change 991593 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti1035 a Ganeti node

https://gerrit.wikimedia.org/r/991593

Change 991593 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti1035 a Ganeti node

https://gerrit.wikimedia.org/r/991593

Change 992088 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti1035 to eqiad ganeti nodes

https://gerrit.wikimedia.org/r/992088

Change 992088 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti1035 to eqiad ganeti nodes

https://gerrit.wikimedia.org/r/992088

Change 992099 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti1036 a ganeti node

https://gerrit.wikimedia.org/r/992099

Change 992099 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti1036 a ganeti node

https://gerrit.wikimedia.org/r/992099

Change 992418 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti1037 a Ganeti node

https://gerrit.wikimedia.org/r/992418

Change 992418 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti1037 a Ganeti node

https://gerrit.wikimedia.org/r/992418

Change 992689 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make ganeti1038 a Ganeti node

https://gerrit.wikimedia.org/r/992689

Change 992689 merged by Muehlenhoff:

[operations/puppet@production] Make ganeti1038 a Ganeti node

https://gerrit.wikimedia.org/r/992689