Q3:(Need By: TBD) rack/setup/install stat1009
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of stat1009

Hostname / Racking / Installation Details

Hostnames: stat100* (next available) stat1009
Racking Proposal: Replacing stat1006, no restrictions.
Networking/Subnet/VLAN/IP: Single 1G private1 vlan connection
Partitioning/Raid: HW RAID10 of HDDs, HW RAID1 of SSDs (same as other config I hosts)
OS Distro: buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

stat1009
  • - receive in system on procurement task T297738 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH renamed this task from (Need By: TBD) rack/setup/install stat1009 to Q3:(Need By: TBD) rack/setup/install stat1009. (Jan 19 2022, 5:20 PM)
RobH unsubscribed.
Jclark-ctr subscribed.

stat1009 B1 U17 cableid 1181 port 5

@Cmjohnson Port 1 is damaged: it is missing the plastic clip that retains the network cable, so the cable is connected to port 2.

Change 808547 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] updating site.pp for stat1009 and stat1010

https://gerrit.wikimedia.org/r/808547

Change 808547 merged by Cmjohnson:

[operations/puppet@production] updating site.pp for stat1009 and stat1010

https://gerrit.wikimedia.org/r/808547

Change 808870 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a new partman recipe for the new H750 based stat servers

https://gerrit.wikimedia.org/r/808870

Change 808870 merged by Btullis:

[operations/puppet@production] Add a new partman recipe for the new H750 based stat servers

https://gerrit.wikimedia.org/r/808870

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster executed with errors:

  • stat1009 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye executed with errors:

  • stat1009 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
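Both failed runs above stop right after "Host up (Debian installer)". For comparing pasted cookbook summaries like these, here is a minimal Python sketch. It assumes the bullet layout shown in this task (a `host (PASS|FAIL)` bullet followed by one bullet per step) and that the last bullet of a FAIL report is the failure notice. The names `ReimageReport`, `parse_reports`, and `last_completed_step` are hypothetical and not part of the cookbook or spicerack.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ReimageReport:
    host: str
    status: str  # "PASS" or "FAIL", as printed in the summary
    steps: list[str] = field(default_factory=list)


# Assumed layout: "• stat1009 (FAIL)" opens a report, further bullets are steps.
HEADER = re.compile(r"^\s*[•*-]\s*(?P<host>[\w.-]+)\s+\((?P<status>PASS|FAIL)\)\s*$")
STEP = re.compile(r"^\s*[•*-]\s*(?P<step>.+?)\s*$")


def parse_reports(text: str) -> list[ReimageReport]:
    """Collect every host summary found in a pasted task log."""
    reports: list[ReimageReport] = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = ReimageReport(header["host"], header["status"])
            reports.append(current)
            continue
        step = STEP.match(line)
        if step and current is not None:
            current.steps.append(step["step"])
        elif line.strip():
            # Any non-blank, non-bullet line ends the current report.
            current = None
    return reports


def last_completed_step(report: ReimageReport):
    """Return the last step that completed, or None if there is none.

    Assumption: in a FAIL report the final bullet is the failure notice,
    so the step before it is the last one that completed.
    """
    steps = report.steps[:-1] if report.status == "FAIL" else report.steps
    return steps[-1] if steps else None


SAMPLE = """\
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster executed with errors:

  • stat1009 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
"""

if __name__ == "__main__":
    for report in parse_reports(SAMPLE):
        print(report.host, report.status, "->", last_completed_step(report))
```

The summary only shows where a run stopped; the cause is still in the cookbook logs that the failure notice points to.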

@BTullis @RobH @Papaul I set the RAID up so the RAID 1 SSDs were first and used the install script for buster. Buster fails to see the disks, so I attempted to install with Bullseye. The entire install went through as it should and the server rebooted. However, it did not boot from the hard drive and defaulted to PXE boot again. I did verify that the BIOS boot settings have the hard drive first and PXE second. I have seen something similar on the an-presto servers as well. I am assuming this has something to do with the new RAID controller cards. Have any of you experienced this yet, and do you have any suggestions?

@BTullis just read your response on an-presto and see that you're experiencing this with stat1010. Thank you for digging into it more.

Thanks @Cmjohnson - yes, I think that this is very likely to be the same issue. It's useful that you've experienced exactly the same outcome on this as I did with stat1010.
Let's stick to bullseye for these servers from now on (if we can). Hopefully we'll have a working partman recipe or BIOS/RAID/whatever configuration soon. @fgiunchedi and I are hoping to get to the bottom of it.

We will have to rebuild hadoop for Bullseye, eh? T310643: Build Bigtop 1.5 Hadoop packages for Bullseye

Yep, looks that way.

@Cmjohnson I think that this should now work if you tweak the RAID controller configuration as described here: T297913#8041258

Let me know if it doesn't behave and I'll look into it again. Thanks.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye completed:

  • stat1009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207012118_cmjohnson_1378019_stat1009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

resolved