Q2:(Need By: ASAP) rack/setup/install lvs10[17-20]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of lvs10[17-20].

These are required to make use of any services that are behind LVS in eqiad for servers in the new rows E and F, so the need-by date is ASAP.

Hostname / Racking / Installation Details

Hostnames: What are the hostnames, and have you updated https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions ?

lvs1017, lvs1018, lvs1019, lvs1020
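
The bracket notation in the task title, lvs10[17-20], is shorthand for the four hostnames listed above. A minimal Python sketch of that expansion follows, assuming a single numeric `[start-end]` group per expression; `expand_range` is a hypothetical helper written for illustration, not part of any Wikimedia tooling.

```python
import re


def expand_range(expr):
    """Expand one '[start-end]' numeric group, e.g. 'lvs10[17-20]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        # No bracket group: treat the expression as a single hostname.
        return [expr]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if any
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hostnames = expand_range("lvs10[17-20]")
print(hostnames)  # ['lvs1017', 'lvs1018', 'lvs1019', 'lvs1020']
```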

Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row?

Must be racked one-per-row in rows A (17), B (18), C (19), D (20). Should be in 10G-capable racks in each row. See below about network cables affecting rack placement.

Networking/Subnet/VLAN/IP: What are the network details? 1G or 10G? Only one network port connection, or more? Subnet/vlan and IP requirements per connect?

Each of the four hosts has 6x 10G ports that will eventually be hooked up. The primary/first 10G interface goes to its own ToR switch in its own row. The other 5 are for cross-row connections.

For the initial replacement/decom of lvs101[3456] - it would be advantageous to place each new server in the same rack as the one it's replacing, as near to the existing lvs server as possible: (new lvs1017 with old lvs1013 in A7, 18 with 14 in B7, 19 with 15 in C7, and 20 with 16 in D7).

We can then, one rack/server at a time, bring up the new lvs and move the existing x-row cables (for ABCD) to the new lvs and decom the old one from service. Then the only truly-new x-row connections will be for future rows E/F (which may not be ready when we initially install these LVSes, which is fine!).
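
The replacement plan above can be written down as data, which makes the one-rack-at-a-time sequence easy to review. The sketch below is illustrative only: the pairings and racks come from this description, while the row-by-row ordering (A, then B, C, D) and the names `REPLACEMENTS` and `decom_order` are assumptions, since the description does not fix an order.

```python
# Pairings and racks are taken from the task description above.
REPLACEMENTS = {
    "lvs1017": {"replaces": "lvs1013", "rack": "A7"},
    "lvs1018": {"replaces": "lvs1014", "rack": "B7"},
    "lvs1019": {"replaces": "lvs1015", "rack": "C7"},
    "lvs1020": {"replaces": "lvs1016", "rack": "D7"},
}


def decom_order(replacements):
    """Return (new, old, rack) tuples sorted by rack name (assumed order)."""
    return [
        (new, info["replaces"], info["rack"])
        for new, info in sorted(
            replacements.items(), key=lambda item: item[1]["rack"]
        )
    ]


for new, old, rack in decom_order(REPLACEMENTS):
    # One rack/server at a time: bring up new, move cables, decom old.
    print(f"{rack}: bring up {new}, move x-row cables from {old}, decom {old}")
```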

All of those racks seem to have at least one 1U slot open. If the cable runs can't reach the new slot, we could also consider a decom-then-install process (remove the old LVS from service, unrack it, rack the new LVS in the same slot, and hook the cables back up). That approach carries resiliency risks, so it would have to be completed in a single day, in as short a time window as we can manage.

Partitioning/Raid: Is this hardware or software raid and what raid levels should be applied to each disk? What are the partitioning requirements and is there an existing partman recipe?

Already correct in partman: lvs*) echo partman/standard.cfg partman/raid1-2dev.cfg ;; \
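
The partman line above is a shell `case` arm: any hostname matching the glob `lvs*` gets the two listed recipe files. As a rough illustration of that matching, here is a Python sketch using `fnmatch`; it models only this one arm, and `recipe_for` and the non-matching hostname `example1001` are hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern and recipe files copied from the partman line above.
PATTERN = "lvs*"
RECIPE = ["partman/standard.cfg", "partman/raid1-2dev.cfg"]


def recipe_for(hostname):
    """Return the recipe files if the hostname matches lvs*, else None."""
    return RECIPE if fnmatchcase(hostname, PATTERN) else None


for host in ("lvs1017", "lvs1020", "example1001"):
    print(host, recipe_for(host))
```

Because the pattern is a prefix glob, the new hosts are covered without any change to the recipe mapping, which is why the answer above is "already correct".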

OS Distro: Buster (default unless otherwise specified)

Buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

lvs1017:

  • - receive in system on procurement task T293128 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via script

lvs1018:

  • - receive in system on procurement task T293128 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via script

lvs1019:

  • - receive in system on procurement task T293128 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via script

lvs1020:

  • - receive in system on procurement task T293128 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via script

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH renamed this task from (Need By: TBD) rack/setup/install lvs10[17-20] to Q2:(Need By: TBD) rack/setup/install lvs10[17-20]. Nov 16 2021, 7:28 PM
RobH raised the priority of this task from Medium to High.
RobH updated Other Assignee, added: Cmjohnson.
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
RobH renamed this task from Q2:(Need By: TBD) rack/setup/install lvs10[17-20] to Q2:(Need By: ASAP) rack/setup/install lvs10[17-20]. Nov 16 2021, 7:30 PM
RobH updated the task description.
RobH updated the task description.
RobH added a parent task: Unknown Object (Task). Nov 16 2021, 9:40 PM
RobH added a parent task: Unknown Object (Task). Nov 16 2021, 9:42 PM

lvs1017 A7 U9 id# 1206202101 Port#26
lvs1018 B7 U29 id# 1206202102 Port#4
lvs1019 C7 U25 id# 1206202103 Port#30
lvs1020 D7 U41 id# 1206202104 Port#23

iDRACs are set up; the hosts still need the f/w update and OS install.
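
The racking notes above follow a regular shape, so they can be parsed and checked against the one-per-row requirement from the description. This is only a sketch: the field names (`rack`, `unit`, `id`, `port`) are guesses at what each column means, and `parse_racking` is a hypothetical helper.

```python
import re

# Racking notes copied verbatim from the comment above.
RACKING_NOTES = """\
lvs1017 A7 U9 id# 1206202101 Port#26
lvs1018 B7 U29 id# 1206202102 Port#4
lvs1019 C7 U25 id# 1206202103 Port#30
lvs1020 D7 U41 id# 1206202104 Port#23
"""

LINE_RE = re.compile(
    r"(?P<host>\S+) (?P<rack>[A-Z]\d+) U(?P<unit>\d+) "
    r"id# (?P<id>\d+) Port#(?P<port>\d+)"
)


def parse_racking(text):
    """Parse each note line into a dict; lines that do not match are skipped."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match:
            record = match.groupdict()
            record["row"] = record["rack"][0]  # row letter, e.g. 'A' from 'A7'
            record["unit"] = int(record["unit"])
            record["port"] = int(record["port"])
            records.append(record)
    return records


records = parse_racking(RACKING_NOTES)
rows = [record["row"] for record in records]
# The description requires one host per row, so rows must be distinct.
print("one per row:", len(set(rows)) == len(rows))
```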

Change 746927 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Add new lvs servers to site.pp role (insetup)

https://gerrit.wikimedia.org/r/746927

Change 746927 merged by Cmjohnson:

[operations/puppet@production] Add new lvs servers to site.pp role (insetup)

https://gerrit.wikimedia.org/r/746927

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1017 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1017 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112131814_cmjohnson_30417_lvs1017.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1020 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1018.eqiad.wmnet with OS bullseye completed:

  • lvs1018 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112131827_cmjohnson_31402_lvs1018.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1019.eqiad.wmnet with OS bullseye completed:

  • lvs1019 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112131828_cmjohnson_31506_lvs1019.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1017 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112131850_cmjohnson_5823_lvs1017.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS bullseye executed with errors:

  • lvs1020 (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112131851_cmjohnson_5927_lvs1020.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Change 746946 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1017-20: switch to insetup_noferm role

https://gerrit.wikimedia.org/r/746946

Change 746946 merged by BBlack:

[operations/puppet@production] lvs1017-20: switch to insetup_noferm role

https://gerrit.wikimedia.org/r/746946

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster executed with errors:

  • lvs1017 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster executed with errors:

  • lvs1020 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster completed:

  • lvs1017 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112132055_cmjohnson_31317_lvs1017.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1020.eqiad.wmnet with OS buster completed:

  • lvs1020 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112132055_cmjohnson_31264_lvs1020.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1018.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1019.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1018.eqiad.wmnet with OS buster completed:

  • lvs1018 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112141447_bblack_12925_lvs1018.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1019.eqiad.wmnet with OS buster completed:

  • lvs1019 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112141449_bblack_13117_lvs1019.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The servers are finished with racking and initial setup; cross-row connections should be handled in a separate task.

Change 747173 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1020: lvs role and iface/addr metadata

https://gerrit.wikimedia.org/r/747173

Change 747175 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/homer/public@master] lvs1020: add to homer lvs_neighbors

https://gerrit.wikimedia.org/r/747175

Change 747173 merged by BBlack:

[operations/puppet@production] lvs1020: lvs role and iface/addr metadata

https://gerrit.wikimedia.org/r/747173

Change 747192 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1020: add interface_tweaks data

https://gerrit.wikimedia.org/r/747192

Change 747192 merged by BBlack:

[operations/puppet@production] lvs1020: add interface_tweaks data

https://gerrit.wikimedia.org/r/747192

Change 747175 merged by jenkins-bot:

[operations/homer/public@master] eqiad lvs_neighbors: swap lvs1020 for lvs1016

https://gerrit.wikimedia.org/r/747175

Change 747203 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] pybal: peer all eqiad lvses with eqiad routers

https://gerrit.wikimedia.org/r/747203

Change 747203 merged by BBlack:

[operations/puppet@production] pybal: peer all eqiad lvses with eqiad routers

https://gerrit.wikimedia.org/r/747203

Change 747515 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] lvs1016: unconfig lvs, move to insetup

https://gerrit.wikimedia.org/r/747515

Change 747515 merged by BBlack:

[operations/puppet@production] lvs1016: unconfig lvs, move to insetup

https://gerrit.wikimedia.org/r/747515

Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1016.eqiad.wmnet with OS buster completed:

  • lvs1016 (WARN)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112151725_bblack_19360_lvs1016.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
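
With this many reimage attempts in the timeline, a quick tally of outcomes per host and OS can help when reading it back. The sketch below parses a short excerpt of the cookbook summaries above (step lists omitted); it assumes the summary wording shown in this task, and `summarize` is a hypothetical helper rather than anything provided by the cookbook.

```python
import re
from collections import Counter

# Excerpt of cookbook summaries from the timeline above.
LOG_EXCERPT = """\
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS bullseye executed with errors:
  • lvs1017 (FAIL)
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1018.eqiad.wmnet with OS bullseye completed:
  • lvs1018 (PASS)
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host lvs1017.eqiad.wmnet with OS buster completed:
  • lvs1017 (PASS)
Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1016.eqiad.wmnet with OS buster completed:
  • lvs1016 (WARN)
"""

HEADER_RE = re.compile(r"for host (?P<host>[\w.-]+) with OS (?P<os>\w+)")
RESULT_RE = re.compile(r"•\s+(?P<name>\w+) \((?P<status>PASS|FAIL|WARN)\)")


def summarize(log_text):
    """Pair each result line with the OS named in the preceding header."""
    results = []
    current_os = None
    for line in log_text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current_os = header.group("os")
            continue
        result = RESULT_RE.search(line)
        if result and current_os is not None:
            results.append(
                (result.group("name"), current_os, result.group("status"))
            )
            current_os = None
    return results


results = summarize(LOG_EXCERPT)
counts = Counter(status for _, _, status in results)
print(results)
print(dict(counts))
```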