Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ms-fe2013 - ms-fe2014, thanos-fe2004

Hostname / Racking / Installation Details

Hostnames: ms-fe2013 - ms-fe2014, thanos-fe2004
Racking Proposal:
ms-fe: One host per row
thanos-fe: Row C as it doesn't have a thanos-fe host yet
Partitioning/Raid: Same as existing ms-fe hosts
OS Distro: Bullseye
Sub-team Technical Contact: @MatthewVernon

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ms-fe2013: Rack: A2 - U32 - Port 31
  • - receive in system on procurement task T324220 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::serviceops.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ms-fe2014: Rack: B2 - U33 - Port 32
  • - receive in system on procurement task T324220 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::serviceops.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
thanos-fe2004: Rack: C7 - U7 - Port 6
  • - receive in system on procurement task T324220 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::serviceops.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
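
The per-host checklists above are identical apart from the hostname and rack position, so they could be generated rather than copied by hand. The following is only a sketch: `render_checklist`, `HOSTS`, and `STEPS` are hypothetical names introduced for illustration, and the step texts are abbreviated from the checklist above rather than reproduced in full.

```python
# Hypothetical helper: render one checklist per host.
# Host names and rack positions are taken from this task;
# the step wording is abbreviated for illustration.
HOSTS = {
    "ms-fe2013": "A2 - U32 - Port 31",
    "ms-fe2014": "B2 - U33 - Port 32",
    "thanos-fe2004": "C7 - U7 - Port 6",
}

STEPS = [
    "receive in system on procurement task & in coupa",
    "rack system & update netbox",
    "add mgmt and production dns entries in netbox",
    "network port setup via netbox",
    "bios/drac/serial setup/testing",
    "firmware update",
    "operations/puppet update",
    "OS installation & initial puppet run",
]


def render_checklist(hosts, steps):
    """Return the checklist text: one header line per host, then one bullet per step."""
    lines = []
    for host, rack in hosts.items():
        lines.append(f"{host}: Rack: {rack}")
        lines.extend(f"  • - {step}" for step in steps)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_checklist(HOSTS, STEPS))
```

Keeping the steps in a single list means a wording fix only has to be made once instead of once per host.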

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

Hi @Papaul could you give me an update on timescales for getting this hardware ready to go, please? From an operational perspective, it'd be great to have these nodes in production ASAP. Thanks :)

@jbond

poweredge-r450: picking DellDriverCategory.BIOS update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r450/BIOS/BIOS_G7K8G_WN64_1.8.2_01.EXE
1: /srv/firmware/poweredge-r450/BIOS/BIOS_DHRG5_WN64_1.7.5.EXE
2: Download new file
==> Please select the entry you want
> 0
User input is: "0"
ms-fe2013.codfw.wmnet (BIOS): target_version: 1.8.2, current_version: 1.8.2
ms-fe2013.codfw.wmnet (BIOS): Skipping already at target version 1.8.2
ms-fe2013.codfw.wmnet: no job_id for BIOS update
Resetting chassis power status for ms-fe2013 to ForceOff
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2013']
pt1979@cumin2002:~$

Change 891914 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Ad new ms-fe and thanos-fe node to site.pp

https://gerrit.wikimedia.org/r/891914

Change 891914 merged by Papaul:

[operations/puppet@production] Ad new ms-fe and thanos-fe node to site.pp

https://gerrit.wikimedia.org/r/891914

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2013.codfw.wmnet with OS bullseye completed:

  • ms-fe2013 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302242153_pt1979_990154_ms-fe2013.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

@jbond

poweredge-r450: picking DellDriverCategory.BIOS update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r450/BIOS/BIOS_G7K8G_WN64_1.8.2_01.EXE
1: /srv/firmware/poweredge-r450/BIOS/BIOS_DHRG5_WN64_1.7.5.EXE
2: Download new file
==> Please select the entry you want
> 0
User input is: "0"
ms-fe2013.codfw.wmnet (BIOS): target_version: 1.8.2, current_version: 1.8.2
ms-fe2013.codfw.wmnet (BIOS): Skipping already at target version 1.8.2
ms-fe2013.codfw.wmnet: no job_id for BIOS update
Resetting chassis power status for ms-fe2013 to ForceOff
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-fe2013']
pt1979@cumin2002:~$

From the output and the logs, specifically the following line, it looks like the upgrade succeeded; however, there is an issue with how the cookbook handles the command that powers the server back down.

ms-fe2013.codfw.wmnet (BIOS): target_version: 1.8.2, current_version: 1.8.2

If you see this, it's safe to assume that the firmware has been upgraded and that there is probably some other minor issue with the cookbook.
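
To make the reasoning above concrete, here is a minimal sketch of how an exit status could treat "already at target version" as success instead of failure. This is not the actual code of sre.hardware.upgrade-firmware nor the change in 892393; `UpdateResult` and `exit_code` are hypothetical names, and the rule shown (a missing job ID only counts as an error when the versions differ) is an assumption based on the log lines in this task.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UpdateResult:
    """Outcome of one component update attempt (hypothetical structure)."""
    component: str
    current_version: str
    target_version: str
    job_id: Optional[str]  # None when no update job was queued


def exit_code(results: List[UpdateResult]) -> int:
    """Return 0 unless a component needed an update and no job was queued."""
    for result in results:
        if result.current_version == result.target_version:
            # Already at the target version: the absence of a job ID is expected.
            continue
        if result.job_id is None:
            # An update was needed but nothing was scheduled: a real failure.
            return 1
    return 0


if __name__ == "__main__":
    bios = UpdateResult("BIOS", "1.8.2", "1.8.2", None)
    print(exit_code([bios]))
```

With the values from the log (current and target both 1.8.2, no job ID), this sketch reports success, which matches the reading that the firmware was already upgraded.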

Mentioned in SAL (#wikimedia-operations) [2023-02-27T11:28:10Z] <jbond@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on ms-fe2013.codfw.wmnet with reason: testing redfish T326848

Mentioned in SAL (#wikimedia-operations) [2023-02-27T11:28:25Z] <jbond@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-fe2013.codfw.wmnet with reason: testing redfish T326848

Change 892393 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/cookbooks@master] sre.hardware.upgrade-firmware: incorrect error status

https://gerrit.wikimedia.org/r/892393

@MatthewVernon the firmware, BIOS and network have all been upgraded, so this should be good to proceed.

Change 892393 merged by jenkins-bot:

[operations/cookbooks@master] sre.hardware.upgrade-firmware: incorrect error status

https://gerrit.wikimedia.org/r/892393

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ms-fe2014.codfw.wmnet with OS bullseye completed:

  • ms-fe2014 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303021905_pt1979_2640155_ms-fe2014.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host thanos-fe2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host thanos-fe2004.codfw.wmnet with OS bullseye completed:

  • thanos-fe2004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303022012_pt1979_2653067_thanos-fe2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
Papaul updated the task description.

This is complete.