Q4: rack/setup/install dse-k8s-worker100[5-8]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of dse-k8s-worker100[5-8]
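
The bracket notation in the hostname is shorthand for a numeric range. A minimal sketch of expanding it, assuming one `[start-end]` group per pattern; the helper name `expand_hosts` is hypothetical and not part of any existing tooling:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single '[start-end]' numeric range, e.g. 'host100[5-8]'."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]  # no range present: treat as a single hostname
    start, end = int(match.group(1)), int(match.group(2))
    width = len(match.group(1))  # preserve zero padding, if any
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(start, end + 1)]


print(expand_hosts("dse-k8s-worker100[5-8]"))
```

This yields the four hostnames dse-k8s-worker1005 through dse-k8s-worker1008 that the rest of the task refers to.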

Hostname / Racking / Installation Details

Hostnames: dse-k8s-worker100[5-8]
Racking Proposal: The hosts will be expanding the existing dse-k8s-worker pool, so it might make sense to have:

  • dse-k8s-worker1005 -> row e
  • dse-k8s-worker1006 -> row f
  • dse-k8s-worker1007 -> row a - not rack 2
  • dse-k8s-worker1008 -> row b - not rack 2

@BTullis added a screenshot from netbox to show the existing host placement.

Networking/Subnet/VLAN/IP: 10G single connection, private1 vlan
Partitioning/Raid: Same as dse-k8s-worker100[1-4] nodes - sw raid1
OS Distro: Bullseye

[Screenshot: existing dse-k8s-worker racking]

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

dse-k8s-worker1005:
  • - receive in system on procurement task T303432 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dse-k8s-worker1006:
  • - receive in system on procurement task T303432 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dse-k8s-worker1007:
  • - receive in system on procurement task T303432 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
dse-k8s-worker1008:
  • - receive in system on procurement task T303432 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
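
Since the same checklist is copied once per host, it can also be generated rather than pasted by hand. A rough sketch, assuming the step text is abbreviated from the checklist above; `STEPS` and `render_checklists` are hypothetical names used only for illustration:

```python
# Step text abbreviated from the per-host checklist in this task.
STEPS = [
    "receive in system on procurement task & in coupa",
    "rack system per racking plan & update netbox",
    "add mgmt and production dns entries in netbox",
    "network port setup via netbox, run homer",
    "bios/drac/serial setup/testing",
    "firmware update (idrac, bios, network, raid controller)",
    "operations/puppet update (netboot.pp, site.pp)",
    "OS installation & initial puppet run via sre.hosts.reimage",
]


def render_checklists(hosts, steps=STEPS):
    """Return one checklist block per host, in the layout used above."""
    lines = []
    for host in hosts:
        lines.append(f"{host}:")
        lines.extend(f"  • - {step}" for step in steps)
    return "\n".join(lines)


hosts = [f"dse-k8s-worker{n}" for n in range(1005, 1009)]
print(render_checklists(hosts))
```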

Related Objects

Status    Subtype  Assigned   Task
Resolved           Cmjohnson

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH unsubscribed.

@BTullis Hey Ben, as of right now I do not have any space in 10G racks for 2U hosts in rows A-D. Can all 4 of these go in rows E & F, not sharing racks?
dse-k8s-worker1005 -> row e1
dse-k8s-worker1006 -> row e3
dse-k8s-worker1007 -> row f1
dse-k8s-worker1008 -> row f3

@Jclark-ctr - Many thanks. Yes, that racking proposal is fine. Please feel free to proceed to rack them all in rows E & F.

dse-k8s-worker1005 e1 U33 port 33 Cableid 20220052
dse-k8s-worker1006 e3 U33 port 33 Cableid 20220060
dse-k8s-worker1007 f1 U33 port 33 Cableid 20220053
dse-k8s-worker1008 f3 U34 port 34 Cableid 20220051
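
The racking lines above follow a fixed layout (host, rack, rack unit, switch port, cable ID), so the "not sharing racks" requirement can be checked mechanically. A small sketch, assuming that exact layout; the regex and function name are illustrative and not taken from Netbox or any other tool:

```python
import re
from collections import Counter

RACKING = """\
dse-k8s-worker1005 e1 U33 port 33 Cableid 20220052
dse-k8s-worker1006 e3 U33 port 33 Cableid 20220060
dse-k8s-worker1007 f1 U33 port 33 Cableid 20220053
dse-k8s-worker1008 f3 U34 port 34 Cableid 20220051
"""

LINE_RE = re.compile(
    r"(?P<host>\S+) (?P<rack>[a-z]\d+) U(?P<unit>\d+) "
    r"port (?P<port>\d+) Cableid (?P<cable>\d+)"
)


def parse_racking(text):
    """Parse each racking line into a dict of string fields."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.fullmatch(line.strip())
        if match:
            records.append(match.groupdict())
    return records


records = parse_racking(RACKING)
rack_counts = Counter(record["rack"] for record in records)
# Racks that hold more than one of the new hosts.
shared = sorted(rack for rack, count in rack_counts.items() if count > 1)
print("racks:", sorted(rack_counts))
print("shared racks:", shared)
```

For the four lines above, every host lands in a different rack, so the list of shared racks is empty.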

Change 808924 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding dse-k8-100[7-8] to netboot and site.pp

https://gerrit.wikimedia.org/r/808924

Change 808924 merged by Cmjohnson:

[operations/puppet@production] Adding dse-k8-100[7-8] to netboot and site.pp

https://gerrit.wikimedia.org/r/808924

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye executed with errors:

  • dse-k8s-worker1008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1007.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206271632_cmjohnson_111711_dse-k8s-worker1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1005.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206271633_cmjohnson_112027_dse-k8s-worker1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1008.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206271630_cmjohnson_111534_dse-k8s-worker1008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host dse-k8s-worker1006.eqiad.wmnet with OS bullseye completed:

  • dse-k8s-worker1006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206271632_cmjohnson_111861_dse-k8s-worker1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
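
The reimage attempts logged above can be tallied per host from the `(PASS)` / `(FAIL)` summary bullets. A minimal sketch, assuming those bullets are copied in timeline order into a string; `tally` is a hypothetical helper, not part of the cookbook:

```python
import re
from collections import defaultdict

# Summary bullets from the timeline above, in order of appearance.
SUMMARY_LINES = """\
  • dse-k8s-worker1008 (FAIL)
  • dse-k8s-worker1007 (FAIL)
  • dse-k8s-worker1006 (FAIL)
  • dse-k8s-worker1005 (FAIL)
  • dse-k8s-worker1008 (FAIL)
  • dse-k8s-worker1008 (FAIL)
  • dse-k8s-worker1007 (PASS)
  • dse-k8s-worker1005 (PASS)
  • dse-k8s-worker1008 (PASS)
  • dse-k8s-worker1006 (PASS)
"""

RESULT_RE = re.compile(r"(?P<host>[\w-]+) \((?P<status>PASS|FAIL)\)")


def tally(text):
    """Count PASS and FAIL results per host."""
    counts = defaultdict(lambda: {"PASS": 0, "FAIL": 0})
    for match in RESULT_RE.finditer(text):
        counts[match["host"]][match["status"]] += 1
    return {host: dict(result) for host, result in counts.items()}


for host, result in sorted(tally(SUMMARY_LINES).items()):
    print(f"{host}: {result['FAIL']} failed, {result['PASS']} passed")
```

On this timeline, dse-k8s-worker1008 failed three times before passing, and the other three hosts failed once each.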

Cmjohnson updated the task description.

These are installed. It took a few extra steps: there is a RAID card, and the disks had to be changed to non-RAID, which triggers the BIOS to set network boot as the primary option. I then had to reboot and change the BIOS settings to boot from the hard drive. T