
Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ml-cache200[1-3].

Hostname / Racking / Installation Details

Hostnames: ml-cache200[1-3]
Racking Proposal: Spread across rows as much as possible; bonus points if they don't share anything with the ml-serve200* nodes.
Networking/Subnet/VLAN/IP: 10G preferred if available; otherwise 1G is fine. Private VLAN.
Partitioning/Raid: We'll likely need a root partition plus a srv partition. If a standard recipe already fits these nodes we'll use it; otherwise we'll find one.
OS Distro: Bullseye (default unless otherwise specified)
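
The bracket notation ml-cache200[1-3] used throughout this task is shorthand for three hostnames. As a minimal illustrative sketch (the helper name `expand_hosts` is hypothetical and not part of any tooling mentioned here), the expansion could look like this in Python:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one numeric bracket range, e.g. 'ml-cache200[1-3]'.

    Hypothetical helper for illustration only; it handles a single
    [start-end] range and returns the pattern unchanged otherwise.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("ml-cache200[1-3]")
print(hosts)
```

For this task the result is the three hosts listed in the per-host checklist: ml-cache2001, ml-cache2002, and ml-cache2003.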

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ml-cache2001: A2 U10
  • - receive in system on procurement task T297640 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ml-cache2002: B2 U10
  • - receive in system on procurement task T297640 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
ml-cache2003: C2 U20
  • - receive in system on procurement task T297640 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
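
The racking proposal asks for the hosts to be spread across rows, and the locations recorded above are A2, B2, and C2. A small sketch of how that spread could be checked follows. It assumes the first character of the rack name is the row letter, and the helper names are hypothetical; the real source of truth for locations is Netbox, which is not queried here.

```python
def rack_row(location: str) -> str:
    """Return the row letter from a location such as 'A2 U10'.

    Assumption: the rack name comes first and its first character
    identifies the row.
    """
    rack = location.split()[0]
    return rack[0]


def shared_rows(placements: dict[str, str]) -> dict[str, list[str]]:
    """Group hosts by row and keep only rows holding more than one host."""
    by_row: dict[str, list[str]] = {}
    for host, location in placements.items():
        by_row.setdefault(rack_row(location), []).append(host)
    return {row: hosts for row, hosts in by_row.items() if len(hosts) > 1}


# Locations copied from the checklist above.
placements = {
    "ml-cache2001": "A2 U10",
    "ml-cache2002": "B2 U10",
    "ml-cache2003": "C2 U20",
}
print(shared_rows(placements))  # empty dict: no row holds two hosts
```

This only checks the three ml-cache hosts against each other. Checking for overlap with the ml-serve200* nodes would need their locations as well, which this task does not list.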

Event Timeline

RobH mentioned this in Unknown Object (Task). Jan 18 2022, 5:45 PM
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.

@elukey can you please get me the Partitioning/Raid information?

Thanks

@Papaul Hi! IIRC these nodes have two 2TB disks, so I'd go for the standard raid1 recipe: echo partman/standard.cfg partman/raid1-2dev

Lemme know if it makes sense.

Change 763518 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: add partman recipe for ml-cache nodes

https://gerrit.wikimedia.org/r/763518

Change 763518 merged by Elukey:

[operations/puppet@production] install_server: add partman recipe for ml-cache nodes

https://gerrit.wikimedia.org/r/763518

Went ahead and merged the change. I've also run puppet across the install nodes, so you can install the OS whenever you want :)

Change 763818 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add ml-cache200[1-3] to site.pp

https://gerrit.wikimedia.org/r/763818

Change 763818 merged by Papaul:

[operations/puppet@production] Add ml-cache200[1-3] to site.pp

https://gerrit.wikimedia.org/r/763818

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bullseye

@elukey can you please double-check the partman recipe again? I am getting the error below:

Failed to retrieve the preconfiguration file
The file needed for preconfiguration could not be retrieved from http://apt.wikimedia.org/autoinstall/partman/raid1-2dev. The installation will proceed in non-automated mode.

Change 763837 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] partman: add missing .cfg file extension to recipe used by ml-cache

https://gerrit.wikimedia.org/r/763837

Change 763837 merged by Dzahn:

[operations/puppet@production] partman: add missing .cfg file extension to recipe used by ml-cache

https://gerrit.wikimedia.org/r/763837

@Papaul I deployed the fix and ran puppet on apt1001. Try again now.
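
The failure above came from a recipe name without the .cfg extension: the installer tried to fetch partman/raid1-2dev, which could not be retrieved. A minimal sketch of a check that would flag such names follows. The function name is hypothetical, this is not how operations/puppet validates recipes, and the corrected name partman/raid1-2dev.cfg is an assumption inferred from the commit message of change 763837.

```python
def missing_cfg_extension(recipe_line: str) -> list[str]:
    """Return the recipe paths in a space-separated line that lack '.cfg'.

    Hypothetical helper for illustration; it only inspects the names and
    does not check that the files exist on the install server.
    """
    return [name for name in recipe_line.split() if not name.endswith(".cfg")]


# Recipe line as originally proposed in this task.
original = "partman/standard.cfg partman/raid1-2dev"
# Assumed corrected form, based on the "add missing .cfg" commit message.
corrected = "partman/standard.cfg partman/raid1-2dev.cfg"

print(missing_cfg_extension(original))
print(missing_cfg_extension(corrected))
```

Run against the original line, the check reports partman/raid1-2dev, the same path that appears in the installer error. The assumed corrected line reports nothing.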

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bullseye executed with errors:

  • ml-cache2002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bullseye completed:

  • ml-cache2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202182124_pt1979_4095899_ml-cache2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2003.codfw.wmnet with OS bullseye completed:

  • ml-cache2003 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202182159_pt1979_4100767_ml-cache2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2001.codfw.wmnet with OS bullseye completed:

  • ml-cache2001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202182246_pt1979_4108914_ml-cache2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Papaul updated the task description.

@elukey complete