
(Need By: TBD) rack/setup/install ml-serve200[5-8]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of ml-serve200[5-8]

Hostname / Racking / Installation Details

Hostnames: ml-serve200[5-8]
Racking Proposal: The only constraint is, if possible, not to rack them where other ml-serve2* nodes already are
Networking/Subnet/VLAN/IP: private vlan, 1G
Partitioning/Raid: Same as ml-serve
OS Distro: Buster
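
The bracket notation in the hostname range above can be expanded mechanically. The following is a minimal sketch, not part of any WMF tooling: the helper name expand_hosts is hypothetical, and the codfw.wmnet suffix is taken from the reimage log entries later in this task.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one numeric bracket range, e.g. 'ml-serve200[5-8]'.

    Hypothetical helper for illustration; it handles a single
    '[start-end]' group and returns the pattern unchanged otherwise.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("ml-serve200[5-8]")
# Domain suffix as it appears in the reimage log entries of this task.
fqdns = [f"{host}.codfw.wmnet" for host in hosts]
```

Here hosts holds the four short names tracked by this task and fqdns the matching fully qualified names.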

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ml-serve2005: A1-U35-36 ge-1/0/34

  • - receive in system on procurement task T291975 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host

ml-serve2006: B6-U28-29 ge-6/0/27

  • - receive in system on procurement task T291975 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host

ml-serve2007: C3-U3-4 ge-3/0/2

  • - receive in system on procurement task T291975 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host

ml-serve2008: D5-U8-9 ge-5/0/7

  • - receive in system on procurement task T291975 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
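
The per-host header lines above pack the rack location and switch port into one string. The sketch below splits them into fields and checks that the four new hosts are in different rows. It assumes that "A1" means row A, rack 1, and that "U35-36" is the occupied unit range; the Placement type and parse_placement helper are hypothetical names. It cannot verify the racking constraint against the existing ml-serve2* nodes, whose locations are not listed in this task.

```python
import re
from typing import NamedTuple


class Placement(NamedTuple):
    host: str
    row: str          # assumed: leading letter is the row
    rack: int         # assumed: digits after the row are the rack
    first_unit: int
    last_unit: int
    switch_port: str


LINE = re.compile(
    r"(?P<host>[\w-]+): (?P<row>[A-Z])(?P<rack>\d+)"
    r"-U(?P<lo>\d+)-(?P<hi>\d+) (?P<port>\S+)"
)


def parse_placement(line: str) -> Placement:
    """Parse a header such as 'ml-serve2005: A1-U35-36 ge-1/0/34'."""
    m = LINE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognised placement line: {line!r}")
    return Placement(
        m["host"], m["row"], int(m["rack"]),
        int(m["lo"]), int(m["hi"]), m["port"],
    )


lines = [
    "ml-serve2005: A1-U35-36 ge-1/0/34",
    "ml-serve2006: B6-U28-29 ge-6/0/27",
    "ml-serve2007: C3-U3-4 ge-3/0/2",
    "ml-serve2008: D5-U8-9 ge-5/0/7",
]
placements = [parse_placement(line) for line in lines]
rows = [p.row for p in placements]
# True when no two of the new hosts share a row.
all_rows_distinct = len(set(rows)) == len(rows)
```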

Related Objects

Status: Resolved
Assigned: Papaul

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

Change 758968 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add ml-server200[5-6] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/758968

Change 758968 abandoned by Papaul:

[operations/puppet@production] Add ml-server200[5-6] to site.pp and netboot.cfg

Reason:

https://gerrit.wikimedia.org/r/758968

Change 758968 restored by Papaul:

[operations/puppet@production] Add ml-server200[5-6] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/758968

Change 758968 abandoned by Papaul:

[operations/puppet@production] Add ml-server200[5-6] to site.pp and netboot.cfg

Reason:

https://gerrit.wikimedia.org/r/758968

Change 758968 restored by Papaul:

[operations/puppet@production] Add ml-server200[5-6] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/758968

Change 758968 merged by Papaul:

[operations/puppet@production] Add ml-server200[5-6] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/758968

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2005.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2005.codfw.wmnet with OS buster completed:

  • ml-serve2005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202012221_pt1979_1135834_ml-serve2005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2006.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2006.codfw.wmnet with OS buster completed:

  • ml-serve2006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202020017_pt1979_1151975_ml-serve2006.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buster executed with errors:

  • ml-serve2007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buster executed with errors:

  • ml-serve2007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2007.codfw.wmnet with OS buster completed:

  • ml-serve2007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202020122_pt1979_1160633_ml-serve2007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2008.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2008.codfw.wmnet with OS buster completed:

  • ml-serve2008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202020157_pt1979_1163949_ml-serve2008.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
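
The reimage log paths reported above follow a visible pattern of timestamp, user, a numeric identifier, and short hostname. The sketch below splits such a path into those fields. It is inferred from the four paths in this task only: the meaning of the numeric field is not stated here, so it is kept as an opaque run_id, and the helper name parse_reimage_log is hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path: str) -> dict:
    """Split a reimage log path into its underscore-separated fields.

    Assumes the layout <YYYYMMDDHHMM>_<user>_<number>_<host>.out,
    as seen in the log paths quoted in this task.
    """
    stem = PurePosixPath(path).stem
    stamp, user, run_id, host = stem.split("_", 3)
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": int(run_id),  # meaning not documented in this task
        "host": host,
    }


info = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202202012221_pt1979_1135834_ml-serve2005.out"
)
```

For the first path, info["started"] corresponds to 2022-02-01 22:21 and info["host"] to ml-serve2005.
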
Papaul added a subscriber: elukey.

@elukey all yours. I'm leaving the task open since I don't have the packing slip to receive the servers in Coupa.

Papaul updated the task description.

complete