Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of kubernetes10[18-21]

Hostname / Racking / Installation Details

Hostnames: kubernetes10[18-21]
Racking Proposal: 4 hosts, 1 per row (A, B, C, D)
Networking/Subnet/VLAN/IP: Single 1G production internal vlan connection
Partitioning/Raid: match kubernetes1017
OS Distro: Stretch
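
The bracketed range in the hostname is shorthand for four individual hosts. As a minimal sketch, the expansion and the one-host-per-row proposal could be expressed as follows. The helper name `expand_range` is hypothetical, and pairing hosts with rows in ascending order is an assumption; the proposal only says one host per row.

```python
import re


def expand_range(pattern: str) -> list[str]:
    """Expand a pattern such as 'kubernetes10[18-21]' into hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracketed range, return the name unchanged
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding for ranges like [08-11]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_range("kubernetes10[18-21]")
# Assumption: hosts are assigned to rows A-D in ascending order.
rows = dict(zip(hosts, "ABCD"))
```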

Per host setup checklist

kubernetes1018:

  • - receive in system on procurement task T286585 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

kubernetes1019:

  • - receive in system on procurement task T286585 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

kubernetes1020:

  • - receive in system on procurement task T286585 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

kubernetes1021:

  • - receive in system on procurement task T286585 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.
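
The resolution rule above (every checkbox ticked for every host) can be sketched as a small data model. This is an illustration only: the class and function names are hypothetical, and the step labels are shortened paraphrases of the checklist items, not identifiers used by any Wikimedia tooling.

```python
from dataclasses import dataclass, field

# Shortened labels for the nine per-host checklist items, in order.
STEPS = (
    "receive in system",
    "rack and update netbox",
    "bios/drac/serial setup",
    "dns entries",
    "network port setup",
    "firmware update",
    "puppet update",
    "os installation",
    "netbox state staged",
)


@dataclass
class HostChecklist:
    hostname: str
    done: set[str] = field(default_factory=set)

    def complete(self, step: str) -> None:
        """Tick one checklist item, rejecting unknown step names."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self) -> list[str]:
        """Return unticked items in checklist order."""
        return [step for step in STEPS if step not in self.done]


def can_resolve(checklists: list[HostChecklist]) -> bool:
    """The task can be resolved only when no host has items left."""
    return all(not checklist.remaining() for checklist in checklists)


checklists = [HostChecklist(f"kubernetes10{n}") for n in range(18, 22)]
```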

Event Timeline

RobH created this task.
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH mentioned this in Unknown Object (Task).
RobH unsubscribed.

The latest kubernetes node is kubernetes1017, so I'd say the new nodes should be kubernetes10[18-21].
We also need them to run Stretch unfortunately.

wiki_willy renamed this task from Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] to Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21]. Oct 1 2021, 4:52 AM
wiki_willy updated the task description.
wiki_willy subscribed.

Updated task description based on @JMeybohm's comment

kubernetes1018 A6 U28 Port 26 Cableid# 1949
kubernetes1019 B3 U29 Port 25 Cableid# 1925
kubernetes1020 C3 U11 Port 9 Cableid# 2865
kubernetes1021 D3 U33 Port 25 Cableid# 2572
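
For anyone transcribing these racking lines into an inventory, a rough parsing sketch follows. The field names are assumptions read off the line layout (rack location, rack unit, switch port, cable ID), the `Racking` type and `parse_racking` helper are hypothetical, and the pattern tolerates an optional space after `Port`.

```python
import re
from typing import NamedTuple


class Racking(NamedTuple):
    host: str
    rack: str      # row letter plus rack number, e.g. "A6"
    unit: int      # rack unit, from "U28"
    port: int      # assumed to be the switch port
    cable_id: int


LINE = re.compile(
    r"(\S+)\s+([A-D]\d+)\s+U(\d+)\s+Port\s*(\d+)\s+Cableid#\s*(\d+)"
)


def parse_racking(line: str) -> Racking:
    """Parse one racking line; raise ValueError if it does not match."""
    match = LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised racking line: {line!r}")
    host, rack, unit, port, cable_id = match.groups()
    return Racking(host, rack, int(unit), int(port), int(cable_id))


entry = parse_racking("kubernetes1018 A6 U28 Port26 Cableid# 1949")
```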

The BIOS and iDRACs are set up, but kubernetes1020 would not power on. @Jclark-ctr, can you call Dell about 1020? I am on holiday next week. I will install the first 3.

kubernetes1020 is now powered on; it might have just been delayed.

Change 728494 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding new kubernetes host to site.pp, dhcpd, and netboot.cfg

https://gerrit.wikimedia.org/r/728494

Change 728494 merged by Cmjohnson:

[operations/puppet@production] Adding new kubernetes host to site.pp, dhcpd, and netboot.cfg

https://gerrit.wikimedia.org/r/728494

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kubernetes1018.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110081920_cmjohnson_18343_kubernetes1018_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kubernetes1019.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110081930_cmjohnson_19199_kubernetes1019_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kubernetes1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110081931_cmjohnson_19309_kubernetes1020_eqiad_wmnet.log.
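
The reimage log names above appear to follow a `<timestamp>_<user>_<number>_<host>.log` layout. A minimal sketch for splitting one apart is shown below. It assumes the timestamp is `YYYYMMDDHHMM`, that the numeric field is a process or run identifier (the task does not say), that usernames contain no underscores, and that underscores in the host part stand for dots. The function name `parse_reimage_log` is hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path: str) -> dict:
    """Split a wmf-auto-reimage log path into its apparent fields."""
    stem = PurePosixPath(path).stem  # drops the directory and ".log"
    stamp, user, number, *host_parts = stem.split("_")
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": int(number),  # assumption: a PID or run counter
        "host": ".".join(host_parts),  # assumption: "_" replaced "."
    }


info = parse_reimage_log(
    "/var/log/wmf-auto-reimage/"
    "202110081920_cmjohnson_18343_kubernetes1018_eqiad_wmnet.log"
)
```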

Completed auto-reimage of hosts:

['kubernetes1018.eqiad.wmnet']

Of which those FAILED:

['kubernetes1018.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

kubernetes1018.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202110081949_cmjohnson_24465_kubernetes1018_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['kubernetes1019.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['kubernetes1020.eqiad.wmnet']

and were ALL successful.

Confirmed: Service Request 1072368852 was successfully submitted for kubernetes1021.

kubernetes1018-1020 are fully installed. Once we figure out and fix the issue with 1021, we'll be able to close the task.

Completed auto-reimage of hosts:

['kubernetes1018.eqiad.wmnet']

and were ALL successful.

kubernetes1021 is up and its BIOS is configured.

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch executed with errors:

  • kubernetes1021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kubernetes1021.eqiad.wmnet with OS stretch completed:

  • kubernetes1021 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110132004_robh_26282_kubernetes1021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
RobH claimed this task.
RobH updated the task description.
RobH added a subscriber: Cmjohnson.