
Q3:rack/setup/install an-worker11[49-56]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of an-worker11[49-56]

Hostname / Racking / Installation Details

Hostnames: an-worker11[49-56]
Racking Proposal: Spread anywhere across rows A-F, trying to avoid having too many other hadoop workers (analytics1* or an-worker1*) in a single rack.
Networking Setup: # of Connections: 1, Speed: 10G, Vlan: Analytics, AAAA records: Y, Additional IP records: N.
Partitioning/Raid: HW Raid: Y (only the O/S disks), Partman recipe and/or desired Raid Level: partman/custom/analytics-flex.cfg
OS Distro: Bullseye
Sub-team Technical Contact: @BTullis
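
The bracket notation an-worker11[49-56] stands for eight individual hosts. As a minimal sketch (the helper name expand_host_range is hypothetical and not part of any Wikimedia tooling), the range could be expanded like this:

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand a pattern such as 'an-worker11[49-56]' into individual hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracketed range present: treat the pattern as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range bounds
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_host_range("an-worker11[49-56]")
print(hosts)
```

The per-host checklist below has one entry for each of the expanded names.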

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-worker1149:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1150:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1151:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1152:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1153:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1154:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1155:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-worker1156:
  • - receive in system on procurement task T325206 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

an-worker1149 A7 U1 port 1 CableId 4899
an-worker1150 B7 U38 port 34 CableId 5013
an-worker1151 C7 U32 port 10 CableId 2965
an-worker1152 D7 U28 port 27 CableId 20220032
an-worker1153 E1 U15 port 15 CableId 5070
an-worker1154 E3 U15 port 15 CableId 5076
an-worker1155 F1 U15 port 15 CableId 20220023
an-worker1156 F3 U15 port 15 CableId 20220224
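
The racking proposal asks that workers not pile up in one rack. A rough sketch of how a list in the format above could be parsed and checked is shown here; the field names and the helper functions are assumptions made for illustration, and the check only sees the hosts passed to it, not the hadoop workers already racked (that would need Netbox data).

```python
import re
from collections import Counter

# Assumed line format: "<host> <rack> U<unit> port <port> CableId <id>"
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s+(?P<rack>[A-F]\d+)\s+U(?P<unit>\d+)"
    r"\s+port\s+(?P<port>\d+)\s+CableId\s*(?P<cable>\d+)$"
)


def parse_racking(lines):
    """Return one dict per line that matches the assumed racking format."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


def hosts_per_rack(records):
    """Count how many of the listed hosts landed in each rack."""
    return Counter(record["rack"] for record in records)


sample = [
    "an-worker1149 A7 U1 port 1 CableId 4899",
    "an-worker1150 B7 U38 port 34 CableId 5013",
    "an-worker1153 E1 U15 port 15 CableId 5070",
]
records = parse_racking(sample)
counts = hosts_per_rack(records)
crowded = [rack for rack, count in counts.items() if count > 1]
print(counts, crowded)
```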

Hi @Jclark-ctr

Apologies for any omission on my part.

For these servers we use RAID1 for the OS, based on the two disks in the rear-mounted flex bay.

All of the remaining 12 disks are left unconfigured.

We set them up as single-disk RAID0 volumes using this cookbook: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hadoop/init-hadoop-workers.py

Hope that helps.

Configured raid on an-worker11(49,51,52,53,54,55,56)

Unable to log in to management on 50.
need to verify psu1 on 49

Reset iDRAC; still unable to log in to an-worker1150.
Fixed psu1 on 49

Jhancock.wm added subscribers: Jclark-ctr, Jhancock.wm.
This comment was removed by Jhancock.wm.

Change 930724 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add an-worker11[49-56] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/930724

Change 930724 merged by Papaul:

[operations/puppet@production] Add an-worker11[49-56] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/930724

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1149 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1156 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1149 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1156 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
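
With several attempts per host, it can help to tally the outcomes from the pasted cookbook summaries. The following is only a sketch that matches the "• host (PASS|FAIL)" summary lines as they appear in this task; the function name count_results is hypothetical, and the cookbook logs remain the authoritative record.

```python
import re
from collections import Counter

# Matches the top-level summary bullet, e.g. "• an-worker1149 (FAIL)".
RESULT_RE = re.compile(r"•\s+(?P<host>[\w-]+)\s+\((?P<status>PASS|FAIL)\)")


def count_results(text: str) -> Counter:
    """Count (host, status) pairs found in pasted reimage summaries."""
    return Counter(
        (match.group("host"), match.group("status"))
        for match in RESULT_RE.finditer(text)
    )


pasted = """
  • an-worker1149 (FAIL)
    • Forced PXE for next reboot
  • an-worker1156 (FAIL)
    • Host rebooted via IPMI
  • an-worker1149 (FAIL)
  • an-worker1156 (FAIL)
"""
results = count_results(pasted)
for (host, status), attempts in sorted(results.items()):
    print(f"{host}: {attempts} x {status}")
```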

Got the firmware updated on all the servers (BIOS, iDRAC, network, SAS) last night. Still finding time and connectivity to make sure all the BIOS settings are correct and the RAIDs are configured correctly.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye

@Jclark-ctr I need your assistance next time you're onsite at Eqiad. These servers do not have a network connection on the 1st port of the NIC. Can you please change or connect as needed? TY!
an-worker1150 in B7 at U38
an-worker1151 in C7 at U32
an-worker1154 in E3 at U15

edit: an-worker1152 in D7 at U28

@Papaul
I'm having a problem on an-worker1149 where the boot order will not update after applying changes and rebooting. The network boot stays on Embedded rather than changing to Integrated. I've ensured the firmware version of the NIC is 21.85. It's not one of the servers with no connection I tagged John on, so it's not likely physical. Please lmk if you know what's causing this. thanks!

@Jclark-ctr sorry missed one. edited my previous comment with the addition to keep all the info together.

@Papaul finished double checking that I got everything like we discussed. All the firmware is up to date and the NIC issues have been solved. Can you please assist me by performing this step on all 8 servers:

  • operations/puppet update - this should include updates to netboot.pp, and site.pp role::insetup::data_engineering.

Thanks.

@Jhancock.wm that step was already done on June 15, see the link below, so you should be good to proceed with the OS install.
Thanks
https://gerrit.wikimedia.org/r/c/operations/puppet/+/930724/

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye completed:

  • an-worker1156 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307202138_jhancock_1553686_an-worker1156.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
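
The log path reported in the summary above appears to encode when and by whom the run was started. As a hedged sketch, the filename could be split as below; the meaning of the numeric third field (called run_id here) is an assumption, as is the interpretation of the first field as a YYYYMMDDHHMM timestamp with no stated time zone.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path: str) -> dict:
    """Split a reimage log path into its apparent components."""
    stem = PurePosixPath(path).stem  # drops the directory and the ".out" suffix
    # Hostnames contain hyphens but no underscores, so three splits suffice.
    stamp, user, run_id, host = stem.split("_", 3)
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": int(run_id),  # assumed to be a numeric run or process id
        "host": host,
    }


info = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202307202138_jhancock_1553686_an-worker1156.out"
)
print(info)
```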

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1155.eqiad.wmnet with OS bullseye completed:

  • an-worker1155 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307202232_jhancock_1611525_an-worker1155.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1154.eqiad.wmnet with OS bullseye completed:

  • an-worker1154 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307202316_jhancock_1668076_an-worker1154.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1153.eqiad.wmnet with OS bullseye completed:

  • an-worker1153 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307210003_jhancock_1719874_an-worker1153.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

finished 53-56. should have time to finish the last 4 tomorrow afternoon

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1152.eqiad.wmnet with OS bullseye completed:

  • an-worker1152 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307211848_jhancock_2890713_an-worker1152.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1151.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1151.eqiad.wmnet with OS bullseye completed:

  • an-worker1151 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307211924_jhancock_2927613_an-worker1151.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1150.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1150.eqiad.wmnet with OS bullseye completed:

  • an-worker1150 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307212002_jhancock_2966140_an-worker1150.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye completed:

  • an-worker1149 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202307212038_jhancock_3007509_an-worker1149.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jhancock.wm updated the task description.

@BTullis finally finished. thanks for your patience.