
Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of ms-fe1013 - ms-fe1014, thanos-fe1004

Hostname / Racking / Installation Details

Hostnames: ms-fe1013 - ms-fe1014, thanos-fe1004
Racking Proposal:
ms-fe: One host per row, preferably one host each in rows D and F as these rows don't have Swift frontend hosts yet
thanos-fe: A row without a thanos-fe host (D, E or F)
Networking Setup: 10G production network
Partitioning/Raid: Same as existing ms-fe hosts
OS Distro: Bullseye
Sub-team Technical Contact: @MatthewVernon

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-fe1013:
  • - receive in system on procurement task T324219 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::serviceops.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-fe1014:
  • - receive in system on procurement task T324219 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::serviceops.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
thanos-fe1004:
  • - receive in system on procurement task T324219 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::serviceops.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
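
The per-host checklists above differ only in the hostname, so they could be generated from the hostname list rather than pasted by hand. The following is a minimal sketch, not part of any Wikimedia tooling: `expand_hosts` and `render_checklist` are hypothetical names, and the step texts are abbreviated from the checklist in this task.

```python
import re

# Step texts abbreviated from the checklist in this task.
STEPS = [
    "receive in system on procurement task T324219 & in coupa",
    "rack system with proposed racking plan & update netbox",
    "add mgmt dns and production dns entries in netbox, run cookbook sre.dns.netbox",
    "network port setup via netbox, run homer from an active cumin host to commit",
    "bios/drac/serial setup/testing",
    "firmware update (idrac, bios, network, raid controller)",
    "operations/puppet update (netboot.pp and site.pp)",
    "OS installation & initial puppet run via sre.hosts.reimage cookbook",
]


def expand_hosts(spec):
    """Expand a spec such as 'ms-fe1013 - ms-fe1014, thanos-fe1004' into hostnames."""
    hosts = []
    for part in spec.split(","):
        part = part.strip()
        # A range looks like '<prefix><number> - <prefix><number>'.
        match = re.fullmatch(r"(\D+)(\d+)\s+-\s+(\D+)(\d+)", part)
        if match and match.group(1) == match.group(3):
            prefix, first, _, last = match.groups()
            width = len(first)
            for number in range(int(first), int(last) + 1):
                hosts.append(f"{prefix}{number:0{width}d}")
        else:
            # Not a range: treat the part as a single hostname.
            hosts.append(part)
    return hosts


def render_checklist(host):
    """Render the checklist for one host in the layout used in this task."""
    lines = [f"{host}:"] + [f"  • - {step}" for step in STEPS]
    return "\n".join(lines)


if __name__ == "__main__":
    for host in expand_hosts("ms-fe1013 - ms-fe1014, thanos-fe1004"):
        print(render_checklist(host))
```

Generating the lists from a single template keeps the three checklists from drifting apart when a step is reworded.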

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

ms-fe1013 D2 U35 PORT 13 4902
ms-fe1014 F1 U38 PORT 38 20220049
thanos-fe1004 F1 U39 PORT 39 20220018
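
The racking locations above can be checked against the racking proposal in the task description. The sketch below is illustrative only: it assumes that a rack name such as `D2` means row D, rack 2, and `check_racking` is a hypothetical helper, not an existing tool.

```python
# Racking locations as reported in this task: hostname -> rack.
RACKS = {
    "ms-fe1013": "D2",
    "ms-fe1014": "F1",
    "thanos-fe1004": "F1",
}

# Rows named in the racking proposal.
MS_FE_PREFERRED_ROWS = {"D", "F"}
THANOS_FE_ALLOWED_ROWS = {"D", "E", "F"}


def row_of(rack):
    """Return the row letter of a rack name (assumption: first character is the row)."""
    return rack[0]


def check_racking(racks):
    """Return a list of problems; an empty list means the proposal is met."""
    problems = []
    ms_fe_rows = [row_of(rack) for host, rack in racks.items() if host.startswith("ms-fe")]
    # Proposal: one ms-fe host per row.
    if len(set(ms_fe_rows)) != len(ms_fe_rows):
        problems.append("ms-fe hosts share a row")
    for host, rack in racks.items():
        row = row_of(rack)
        if host.startswith("ms-fe") and row not in MS_FE_PREFERRED_ROWS:
            problems.append(f"{host} is in row {row}, not a preferred row")
        if host.startswith("thanos-fe") and row not in THANOS_FE_ALLOWED_ROWS:
            problems.append(f"{host} is in row {row}, not an allowed row")
    return problems


if __name__ == "__main__":
    print(check_racking(RACKS))  # -> []
```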

Hi @Jclark-ctr could you give me an update on timescales for getting this hardware ready to go, please? From an operational perspective, it'd be great to have these nodes in production ASAP. Thanks :)

Hi @Jclark-ctr any news on getting these frontends ready for use, please?

Jclark-ctr subscribed.

@MatthewVernon I have asked Chris to help with the installs; reassigning to him for assistance.

@MatthewVernon working on these now. I will let you know if I run into any blockers.

@MatthewVernon the RAID configuration states "Partitioning/Raid: Same as existing ms-fe hosts". Can you be more specific: is this h/w RAID? There is a controller, so how do you want me to set the disks up? Thanks!

@Cmjohnson these are just JBOD I think - at least that's how ms-fe1012 appears to me (and I think that's what I expect for a swift frontend) - we do software-raid on these systems.

Change 898922 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding ms-fe1013-4 and thanos-fe1004 to site.pp

https://gerrit.wikimedia.org/r/898922

Change 898922 merged by Cmjohnson:

[operations/puppet@production] Adding ms-fe1013-4 and thanos-fe1004 to site.pp

https://gerrit.wikimedia.org/r/898922

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

I am not able to do the initial installs: ms-fe1013 and ms-fe1014 fail immediately (maybe there is a DHCP error), and thanos-fe1004 doesn't get a lease.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

@Cmjohnson & @Papaul - can you guys provide an ETR on this one? Thanks, Willy

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

@wiki_willy ms-fe1013 and thanos-fe1004 both installed but did not set up their Puppet certificates correctly, and now both just fail when I try to reimage with --new or --no-pxe.

ms-fe1014 is failing immediately, reporting a DHCP error. I verified that all 3 hosts have the correct network firmware installed, set the 10G NICs to PXE boot, and disabled PXE on the gigabit connection; the RAID configuration is correct.

Papaul says he is having a similar issue with new db servers and is working on it.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

@Cmjohnson I'm taking over the task to look into it.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye completed:

  • ms-fe1013 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303311330_pt1979_733261_ms-fe1013.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe1014.eqiad.wmnet with OS bullseye

On ms-fe1014, IPMI was disabled; that is why it was failing.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe1014.eqiad.wmnet with OS bullseye completed:

  • ms-fe1014 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303311451_pt1979_799826_ms-fe1014.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
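
Comparing the completed steps of a failed run with the step order of the successful ms-fe1014 run above gives a hint about where the failure occurred. This is only a sketch: `EXPECTED_STEPS` is copied from the first steps of that successful run, `next_expected_step` is a hypothetical helper, and the cookbook may skip steps depending on how it is invoked (the successful ms-fe1013 run in this task lists no PXE steps), so the result points to where to look in the cookbook logs rather than giving a diagnosis.

```python
# Step order taken from the successful ms-fe1014 run in this task (first steps only).
EXPECTED_STEPS = [
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Host up (new fresh bullseye OS)",
    "Generated Puppet certificate",
    "Signed new Puppet certificate",
]


def next_expected_step(completed):
    """Return the first expected step missing from a run's completed steps.

    Steps that are not in EXPECTED_STEPS are ignored.
    Returns None if every expected step was completed.
    """
    done = set(completed)
    for step in EXPECTED_STEPS:
        if step not in done:
            return step
    return None


if __name__ == "__main__":
    # Completed steps of one failed thanos-fe1004 run from this task.
    failed_thanos_fe1004 = [
        "Removed from Puppet and PuppetDB if present",
        "Deleted any existing Puppet certificate",
        "Removed from Debmonitor if present",
        "Forced PXE for next reboot",
        "Host rebooted via IPMI",
    ]
    print(next_expected_step(failed_thanos_fe1004))  # -> Host up (Debian installer)

    # Completed steps of one failed ms-fe1013 run from this task.
    failed_ms_fe1013 = [
        "Removed from Puppet and PuppetDB if present",
        "Deleted any existing Puppet certificate",
        "Removed from Debmonitor if present",
    ]
    print(next_expected_step(failed_ms_fe1013))  # -> Forced PXE for next reboot
```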

Change 904826 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] update thanos-fe1004 entry in site.pp

https://gerrit.wikimedia.org/r/904826

Change 904826 merged by Papaul:

[operations/puppet@production] update thanos-fe1004 entry in site.pp

https://gerrit.wikimedia.org/r/904826

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Change 904830 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Fix typo on role for thanos-fe1004

https://gerrit.wikimedia.org/r/904830

Change 904830 merged by Papaul:

[operations/puppet@production] Fix typo on role for thanos-fe1004

https://gerrit.wikimedia.org/r/904830

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe1004.eqiad.wmnet with OS bullseye completed:

  • thanos-fe1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303311748_pt1979_936484_thanos-fe1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Papaul updated the task description.

The problem with thanos-fe1004 was a wrong entry in site.pp. All the servers are now ready.