
Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of an-coord100[3,4] and an-mariadb100[1,2].

Hostname / Racking / Installation Details

We would like two different hostname schemes for these four servers please:

  • an-coord100[3,4]
  • an-mariadb100[1,2]
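
The bracket notation above is shorthand for individual hostnames. A minimal sketch of how it expands, assuming a single bracket group that holds a comma list (as here) or a dash range (as in the later commit title `an-mariadb100[1-2]`). The function name `expand_hosts` is hypothetical, and this is not the parser used by Cumin or any other production tool.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group, e.g. 'an-coord100[3,4]' or 'an-mariadb100[1-2]'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: treat the input as a single hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        part = part.strip()
        if "-" in part:
            # Dash range such as '1-2'; keep the zero padding of the start value.
            start, end = part.split("-", 1)
            width = len(start)
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


for scheme in ("an-coord100[3,4]", "an-mariadb100[1,2]"):
    print(scheme, "->", expand_hosts(scheme))
```

Both notations produce the same four hostnames that appear in the per-host checklists in this task.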

Racking Proposal: an-coord1003 and an-mariadb1001 in row E; an-coord1004 and an-mariadb1002 in row F.
It doesn't matter whether or not the two classes share the same rack.
Networking Setup: # of Connections: 1, Speed: 10G, Vlan: Analytics, AAAA records: Y, Additional IP records (Cassandra)?
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level: partman/raid1-2dev.cfg
OS Distro: Bullseye
Sub-team Technical Contact: Ben Tullis - Data Engineering SREs

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-coord1003:
  • - receive in system on procurement task T319437 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-coord1004:
  • - receive in system on procurement task T319437 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-mariadb1001:
  • - receive in system on procurement task T319437 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
an-mariadb1002:
  • - receive in system on procurement task T319437 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
Jclark-ctr subscribed.

an-coord1003 E1 U36 Port 36 Cableid # 20220001
an-mariadb1001 E1 U37 Port 37 Cableid # 20220002
an-coord1004 F1 U36 Port 36 Cableid # 20220003
an-mariadb1002 F1 U37 Port 37 Cableid # 20220004
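
The four racking lines above follow a fixed layout (hostname, rack, rack unit, switch port, cable ID), so they can be checked against the racking proposal mechanically. A minimal sketch, assuming that layout holds for every line; `RackEntry` and `parse_rack_line` are hypothetical names used only for illustration, and reading the first letter of the rack name as the row is an assumption based on the proposal in the description.

```python
import re
from typing import NamedTuple


class RackEntry(NamedTuple):
    host: str
    rack: str      # e.g. "E1"; the leading letter is taken to be the row
    unit: int      # rack unit, from "U36"
    port: int      # switch port, from "Port 36"
    cable_id: str  # kept as a string so it is stored exactly as written


LINE_RE = re.compile(
    r"^(\S+)\s+([A-Z]\d+)\s+U(\d+)\s+Port\s+(\d+)\s+Cableid\s+#\s*(\d+)$"
)


def parse_rack_line(line: str) -> RackEntry:
    """Parse one line such as 'an-coord1003 E1 U36 Port 36 Cableid # 20220001'."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised racking line: {line!r}")
    host, rack, unit, port, cable_id = match.groups()
    return RackEntry(host, rack, int(unit), int(port), cable_id)


RACKING_LINES = """\
an-coord1003 E1 U36 Port 36 Cableid # 20220001
an-mariadb1001 E1 U37 Port 37 Cableid # 20220002
an-coord1004 F1 U36 Port 36 Cableid # 20220003
an-mariadb1002 F1 U37 Port 37 Cableid # 20220004
"""

entries = [parse_rack_line(line) for line in RACKING_LINES.splitlines()]
rows = {entry.host: entry.rack[0] for entry in entries}
print(rows)
```

The resulting mapping places an-coord1003 and an-mariadb1001 in row E and an-coord1004 and an-mariadb1002 in row F, which matches the proposal.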

Change 879812 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add new an-coord and an-mariad to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/879812

Change 879812 merged by Papaul:

[operations/puppet@production] Add new an-coord and an-mariadb to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/879812

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-coord1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-coord1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-coord1003.eqiad.wmnet with OS bullseye completed:

  • an-coord1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301131434_pt1979_398365_an-coord1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
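
The log path reported in the step list above appears to encode when the run started, who ran it, and the target host. A minimal sketch for pulling those fields apart, assuming the file name is `<YYYYMMDDHHMM>_<user>_<number>_<host>.out`. The meaning of the numeric third field is not stated in this task, so it is kept under the neutral name `run_number`; `ReimageLog` and `parse_reimage_log` are hypothetical names.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime
    user: str
    run_number: int  # assumption: meaning of this field is not documented here
    host: str


NAME_RE = re.compile(r"(\d{12})_([^_]+)_(\d+)_(.+)\.out")


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a reimage log path into timestamp, user, number and host."""
    name = PurePosixPath(path).name
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    stamp, user, number, host = match.groups()
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, int(number), host)


log = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/202301131434_pt1979_398365_an-coord1003.out"
)
print(log.host, log.user, log.started.isoformat())
```

Parsing the timestamps this way makes it straightforward to order the reimage runs recorded in this task.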

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-mariadb1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-coord1004.eqiad.wmnet with OS bullseye completed:

  • an-coord1004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301131501_pt1979_403444_an-coord1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Change 879825 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] fix typo for an-mariadb node in netboot.cfg

https://gerrit.wikimedia.org/r/879825

Change 879825 merged by Papaul:

[operations/puppet@production] fix typo for an-mariadb node in netboot.cfg

https://gerrit.wikimedia.org/r/879825

Change 879847 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Fix an-mariadb100[1-2] in netboot.cfg

https://gerrit.wikimedia.org/r/879847

Change 879847 merged by Papaul:

[operations/puppet@production] Fix an-mariadb100[1-2] in netboot.cfg

https://gerrit.wikimedia.org/r/879847

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host an-mariadb1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-mariadb1001.eqiad.wmnet with OS bullseye completed:

  • an-mariadb1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301131519_pt1979_407195_an-mariadb1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host an-mariadb1002.eqiad.wmnet with OS bullseye completed:

  • an-mariadb1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301131922_pt1979_449461_an-mariadb1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Papaul added a subscriber: BTullis.

@BTullis this is done.