
Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of mw241[2-9].codfw.wmnet

Hostname / Racking / Installation Details

Hostnames: mw241[2-9].codfw.wmnet
Racking Proposal: C3-codfw
Networking/Subnet/VLAN/IP: 1G, internal vlan, single connection
Partitioning/Raid: sw raid, standard, raid1-2dev
OS Distro: Buster (default unless otherwise specified)
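
The hostname field uses bracketed range notation: mw241[2-9].codfw.wmnet stands for the eight hosts mw2412 through mw2419 (the later Gerrit change writes the same set as mw24[12-19]). The minimal Python sketch below expands a single numeric range under that assumption; `expand_hosts` is a hypothetical helper, and real tooling such as ClusterShell's NodeSet (which cumin builds on) supports the full syntax.

```python
import re

# Matches one bracketed numeric range such as "[2-9]" or "[12-19]".
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single "[start-end]" range in a host pattern.

    Hypothetical helper: only one range is supported, and the
    zero-padding width of the start value is preserved.
    """
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start_text, end_text = match.groups()
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"empty range in {pattern!r}")
    width = len(start_text)
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(start, end + 1)]


hosts = expand_hosts("mw241[2-9].codfw.wmnet")
print(len(hosts), hosts[0], hosts[-1])  # 8 mw2412.codfw.wmnet mw2419.codfw.wmnet
```

Both spellings describe the same hosts: `expand_hosts("mw24[12-19]")` yields the same short names as the pattern above.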

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

mw2412:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2413:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2414:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2415:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2416:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2417:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2418:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

mw2419:

  • - receive in system on procurement task T286516 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state set to planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update: this should include updates to install_server dhcp and netboot, and site.pp role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
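
The resolution rule above (every step ticked for every host) can be sketched as follows. The `HostChecklist` model and the shortened step names are hypothetical; they paraphrase the checklist in this task and are not part of any Wikimedia tool.

```python
from dataclasses import dataclass, field

# Shortened, hypothetical names paraphrasing the per-host checklist above.
STEPS = (
    "receive in system",
    "rack and update netbox",
    "bios/drac/serial setup",
    "mgmt and production dns",
    "network port setup",
    "firmware update",
    "operations/puppet update",
    "OS installation and initial puppet run",
    "netbox state set to staged",
)


@dataclass
class HostChecklist:
    hostname: str
    done: set[str] = field(default_factory=set)

    def pending(self) -> list[str]:
        """Return the steps not yet ticked, in checklist order."""
        return [step for step in STEPS if step not in self.done]


def can_resolve(checklists: list[HostChecklist]) -> bool:
    """The task is resolvable only when no host has a pending step."""
    return all(not c.pending() for c in checklists)


hosts = [HostChecklist(f"mw24{n}") for n in range(12, 20)]
for checklist in hosts:
    checklist.done.update(STEPS)
hosts[-1].done.discard("netbox state set to staged")  # mw2419 not yet staged
print(can_resolve(hosts), hosts[-1].pending())
```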

Event Timeline

RobH renamed this task from (Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet to Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet. Sep 1 2021, 6:40 PM
RobH assigned this task to Papaul.
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH unsubscribed.
RobH mentioned this in Unknown Object (Task). Sep 1 2021, 6:44 PM

Change 724415 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add mw24[12-19] MAC address and to site.pp

https://gerrit.wikimedia.org/r/724415

Change 724415 merged by Papaul:

[operations/puppet@production] Add mw24[12-19] MAC address and to site.pp

https://gerrit.wikimedia.org/r/724415

Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2412.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage executed with errors:

  • mw2412 (FAIL)
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2412.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage completed:

  • mw2412 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB
    • Removed from Debmonitor
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/wmf-auto-reimage/202109281645_pt1979_236935_mw2412.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2413.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage executed with errors:

  • mw2413 (FAIL)
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

@Volans mw2413 failed with the same error

Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2413.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage completed:

  • mw2413 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB
    • Removed from Debmonitor
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/wmf-auto-reimage/202109281735_pt1979_244529_mw2413.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
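
Both hosts show the same pattern: the first cookbook run ends in FAIL, and the rerun completes with WARN (the warning being that Puppet could not be disabled because the host may have been unreachable). A small sketch that reduces such plain-text summaries to the last reported status per host; `summarize` is hypothetical and assumes the exact layout pasted into this task, not any cookbook output API.

```python
import re

# Matches a host header such as "  • mw2412 (FAIL)". Step lines such as
# "• Host up (Debian installer)" do not match, because their first word
# is not immediately followed by an upper-case "(STATUS)".
_HEADER = re.compile(r"^\s*(?:•\s*)?(?P<host>[\w.-]+) \((?P<status>[A-Z]+)\)\s*$")


def summarize(report: str) -> dict[str, str]:
    """Map each host in pasted cookbook summaries to its last status."""
    status: dict[str, str] = {}
    for line in report.splitlines():
        match = _HEADER.match(line)
        if match:
            # A later run for the same host overwrites the earlier status.
            status[match["host"]] = match["status"]
    return status


report = """\
  • mw2412 (FAIL)
    • Forced PXE for next reboot
    • Host up (Debian installer)
  • mw2412 (WARN)
    • Downtimed on Icinga
    • Icinga status is optimal
"""
print(summarize(report))  # {'mw2412': 'WARN'}
```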

Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts:

mw2414.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109282134_pt1979_274485_mw2414_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts:

mw2415.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109282140_pt1979_276513_mw2415_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2414.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts:

mw2416.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109282159_pt1979_279172_mw2416_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2415.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts:

mw2417.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109282207_pt1979_282626_mw2417_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2416.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts:

mw2418.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109282228_pt1979_284985_mw2418_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2417.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw2418.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts:

mw2419.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202109282255_pt1979_290876_mw2419_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2419.codfw.wmnet']

and were ALL successful.
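
The wmf-auto-reimage log names above appear to follow a fixed pattern: a YYYYMMDDHHMM start time, the user, a number, and the FQDN with dots replaced by underscores. The sketch below parses them under that assumption; treating the number as a process ID is a guess, and `parse_log_name` is a hypothetical helper. Applied to the six launches above, it shows the start times span 81 minutes.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath
from typing import NamedTuple

LOG_DIR = "/var/log/wmf-auto-reimage"


class ReimageLog(NamedTuple):
    started: datetime
    user: str
    run_id: int  # assumed to be a process ID; the task does not say
    host: str


def parse_log_name(path: str) -> ReimageLog:
    """Split a log name of the assumed form
    <YYYYMMDDHHMM>_<user>_<number>_<host, dots as underscores>.<ext>."""
    stem = PurePosixPath(path).stem
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=int(run_id),
        host=host.replace("_", "."),
    )


names = [
    "202109282134_pt1979_274485_mw2414_codfw_wmnet.log",
    "202109282140_pt1979_276513_mw2415_codfw_wmnet.log",
    "202109282159_pt1979_279172_mw2416_codfw_wmnet.log",
    "202109282207_pt1979_282626_mw2417_codfw_wmnet.log",
    "202109282228_pt1979_284985_mw2418_codfw_wmnet.log",
    "202109282255_pt1979_290876_mw2419_codfw_wmnet.log",
]
# NamedTuples sort field by field, so this orders by start time.
logs = sorted(parse_log_name(f"{LOG_DIR}/{name}") for name in names)
span = logs[-1].started - logs[0].started
print(span)  # 1:21:00
```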

@jijiki @Dzahn this is all ready for service

Thank you.

Papaul updated the task description.

complete

Change 785147 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] site: use appserver in codfw C3, cleanup duplicate insetup definition

https://gerrit.wikimedia.org/r/785147

Change 785147 merged by Dzahn:

[operations/puppet@production] site: use appserver in codfw C3, cleanup duplicate insetup definition

https://gerrit.wikimedia.org/r/785147

Change 785918 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] conftool-date: add mw2412 through mw2419 as new appservers

https://gerrit.wikimedia.org/r/785918

Change 785918 merged by Dzahn:

[operations/puppet@production] conftool-date: add mw2412 through mw2419 as new appservers

https://gerrit.wikimedia.org/r/785918

Mentioned in SAL (#wikimedia-operations) [2022-04-26T19:48:18Z] <mutante> mw2419 - set weight to 25 in conftool, scap pull, first time in production, jobrunner/videoscaler T290192
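
Setting a weight in conftool controls how much traffic a pooled server receives relative to its peers. Assuming a proportional scheduler (weighted round-robin style) where only pooled hosts count, a host's expected share is its weight divided by the sum of the weights of all pooled hosts. The pool below and the `traffic_share` helper are hypothetical; only mw2419's weight of 25 comes from the SAL entry.

```python
from fractions import Fraction


def traffic_share(weights: dict[str, int], pooled: set[str], host: str) -> Fraction:
    """Expected fraction of requests for `host` under proportional weighting.

    Assumes only pooled hosts receive traffic; a depooled host gets none.
    """
    if host not in pooled:
        return Fraction(0)
    total = sum(weights[h] for h in pooled)
    if total == 0:
        return Fraction(0)
    return Fraction(weights[host], total)


# Hypothetical pool: three placeholder servers plus mw2419, all at weight 25.
weights = {"mw-a": 25, "mw-b": 25, "mw-c": 25, "mw2419": 25}
share = traffic_share(weights, pooled=set(weights), host="mw2419")
print(share)  # 1/4
```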

Dzahn claimed this task.
Dzahn removed projects: ops-codfw, DC-Ops.
Dzahn added a subscriber: Papaul.
Dzahn edited subscribers, added: Jelto; removed: Papaul.

@Dzahn I think it is best to create another task for this issue rather than reopen the rack/setup task. Thanks

@Papaul I removed you from the ticket and any tags related to dcops though. Still an issue?

replaced with T307255