Q1:rack/setup/install ganeti103[34]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ganeti103[34]

Hostname / Racking / Installation Details

Hostnames: ganeti1033.eqiad.wmnet, ganeti1034.eqiad.wmnet
Racking Proposal: Please add these two in row D, in two different racks other than D8, D6 or D3 if possible. D8 should be avoided in any case, since there are already two Ganeti servers in there.
Networking Setup: same VLAN/IP setup as existing Ganeti servers
Partitioning/Raid: partman/custom/ganeti-raid5.cfg
OS Distro: Bullseye
Sub-team Technical Contact: Moritz (or IF in general)
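
The `ganeti103[34]` shorthand in the task title maps to the two hostnames listed above. A minimal sketch of that expansion, assuming each character inside the brackets is one alternative (the helper `expand_hosts` is hypothetical and is not part of any cookbook or of Cumin's host selection syntax):

```python
import itertools
import re


def expand_hosts(pattern: str, domain: str = "") -> list[str]:
    """Expand bracketed character sets, e.g. 'ganeti103[34]'."""
    # re.split keeps the captured group, so literals land on even indexes
    # and the characters found inside [...] land on odd indexes.
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [list(part) if i % 2 else [part] for i, part in enumerate(parts)]
    names = ["".join(combo) for combo in itertools.product(*choices)]
    # Append the domain only when one was given.
    return [f"{name}.{domain}" if domain else name for name in names]


print(expand_hosts("ganeti103[34]", "eqiad.wmnet"))
```

Ranges such as `[33-34]` are deliberately not handled here; the sketch only covers the single-character form used in this task.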

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ganeti1033:
  • - receive in system on procurement task 311754 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details - automatic setup run with enable virtualization set
  • - firmware update (idrac, bios, network, raid controller) - bios already at newest 2.15.1, idrac remains at 5.10.50.00, nic downgraded from 22.00.07.60 to 21.85.21.92
  • - Enable "Virtualization technology" under "System BIOS" -> "Processor Settings" - this should be done by the automated provisioning and bios setup
  • - operations/puppet update - https://gerrit.wikimedia.org/r/855100
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook. - pending fix to reimage script for staging/active change.
ganeti1034:
  • - receive in system on procurement task 311754 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details - automatic setup run with enable virtualization set
  • - firmware update (idrac, bios, network, raid controller) - bios already at newest 2.15.1, idrac remains at 5.10.50.00, nic downgraded from 22.00.07.60 to 21.85.21.92
  • - Enable "Virtualization technology" under "System BIOS" -> "Processor Settings" - this should be done by the automated provisioning and bios setup, doublechecked
  • - operations/puppet update - https://gerrit.wikimedia.org/r/855100
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
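
The firmware step in the checklists above records a NIC change from 22.00.07.60 to 21.85.21.92. A small sketch of how such dotted version strings could be compared numerically to confirm the direction of a change, assuming every component is a plain integer (the helpers `parse_version` and `describe_change` are hypothetical and unrelated to the sre.hardware.upgrade-firmware cookbook):

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted firmware version such as '22.00.07.60' into integers."""
    return tuple(int(part) for part in version.strip().split("."))


def describe_change(old: str, new: str) -> str:
    """Classify a firmware change by comparing versions component-wise."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v > old_v:
        return "upgrade"
    if new_v < old_v:
        return "downgrade"
    return "unchanged"


# Versions taken from the checklist entries above.
print(describe_change("22.00.07.60", "21.85.21.92"))
print(describe_change("5.10.50.00", "5.10.50.00"))
```

Comparing tuples of integers avoids the pitfalls of plain string comparison, where for example "9.0" would sort after "10.0".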

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

ganeti1033 D2 U34 Port 34 Cableid 20220010
ganeti1034 D4 U30 Port 38 Cableid 20220038

Touched base with @Cmjohnson, who will work on this later today. Thanks, Willy

@Jclark-ctr I ran the netbox provisioning script, but I am not able to ping the mgmt IP for either server. Can you verify that the mgmt cables are connected?

@Cmjohnson ganeti1033 had a bad cable replaced. ganeti1034 is connected properly and has link for management

When I attempt to pull up https://ganeti1033.mgmt.eqiad.wmnet I get 'Bad Request'

Bad Request

Your browser sent a request that this server could not understand.

Additionally, a 400 Bad Request error was encountered while trying to use an ErrorDocument to handle the request.

I had the same issue on ganeti1034, but racreset fixed it there. Racreset and a full power removal haven't fixed it on ganeti1033.

Proposed solution: Use the sre.hardware.upgrade-firmware cookbook to flash to newest idrac, and then use the fix on T322419 to fix the https console from that upgrade.

I am not sure why this is happening with the 5.10 version of idrac, but it's odd and I've not seen it before. I'm also not sure of the syntax to successfully use sre.hardware.upgrade-firmware.

Perhaps @Papaul could provide us with an example command to update the idrac firmware on ganeti1033?

Change 855100 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] site.pp update for ganeti103[34]

https://gerrit.wikimedia.org/r/855100

Change 855100 merged by RobH:

[operations/puppet@production] site.pp update for ganeti103[34]

https://gerrit.wikimedia.org/r/855100

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 855105 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] adding ganeti103[34] netboot

https://gerrit.wikimedia.org/r/855105

Change 855105 merged by RobH:

[operations/puppet@production] adding ganeti103[34] netboot

https://gerrit.wikimedia.org/r/855105

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211092251_robh_1713692_ganeti1034.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • The reimage failed, see the cookbook logs for the details

@MoritzMuehlenhoff I recall you stating that the puppet run fails in the installer, but that it can just be re-run after a fix and it's fine? If so, ganeti1034 is ready for you to rerun its install. Since it failed, I'm not sure what the exact fix is, but it's calling into puppet and is showing in icinga.

There's some bug related to ifupdown enabling ipv6, which can interrupt the first puppet run when it happens. But that can simply be retried and then it works fine. Thanks for racking and setting up 1034, I've now added it to the production Ganeti cluster in eqiad.

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1034 (FAIL)
    • [...SNIP...]
    • The reimage failed, see the cookbook logs for the details

@RobH please do not assume that a reimage is complete just because Puppet runs. There are various steps in the reimage cookbook after that, which might cause issues later down the line if not completed. Either something is wrong with the host or its configuration and needs to be fixed, or it can be a bug in the cookbook.

In this case the last failure was a bug in Spicerack that I indirectly caused when migrating out from the staged status in Netbox in T320696. I'm sending a fix and will release a new Spicerack right away.
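
The point above, that a successful Puppet run does not by itself mean the reimage completed, can be illustrated by checking the overall status line of the cookbook summary instead of individual steps. This is only a sketch that parses the summary text as it is posted in this task (`host (PASS)` or `host (FAIL)`); the helpers `reimage_status` and `reimage_completed` are hypothetical and not a Spicerack API:

```python
import re

# Matches the per-host status line, e.g. "  • ganeti1034 (FAIL)".
STATUS_RE = re.compile(r"^\s*•?\s*(?P<host>[\w.-]+) \((?P<status>PASS|FAIL)\)\s*$")


def reimage_status(summary: str) -> dict[str, str]:
    """Map each host in a posted cookbook summary to PASS or FAIL."""
    statuses = {}
    for line in summary.splitlines():
        match = STATUS_RE.match(line)
        if match:
            statuses[match["host"]] = match["status"]
    return statuses


def reimage_completed(summary: str, host: str) -> bool:
    """Only an explicit PASS counts; individual step lines are ignored."""
    return reimage_status(summary).get(host) == "PASS"


summary = """\
  • ganeti1034 (FAIL)
    • Automatic Puppet run was successful
    • Icinga status is optimal
    • The reimage failed, see the cookbook logs for the details
"""
print(reimage_completed(summary, "ganeti1034"))
```

In the sample, Puppet and Icinga both look healthy, yet the check reports the reimage as incomplete because the host status line says FAIL.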

RobH updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1033.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1033.eqiad.wmnet with OS bullseye completed:

  • ganeti1033 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211101918_robh_2175557_ganeti1033.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
RobH updated the task description.

@MoritzMuehlenhoff : ganeti1033 is all ready for you, resolving this setup task.

Change 855973 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti1033

https://gerrit.wikimedia.org/r/855973

Change 855973 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti1033

https://gerrit.wikimedia.org/r/855973

ganeti1033/1034 have been added to the eqiad cluster (group D)