Q3:(Need By: TBD) rack/setup/install conf100[789]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of conf100[789]

Hostname / Racking / Installation Details

Hostnames: conf100[789]
Racking Proposal: conf1007 -> Row A (any 1G rack), conf1008 -> Row B (any 1G rack), conf1009 -> Row D (any 1G rack)
Networking/Subnet/VLAN/IP: single 1G connection, private1 vlan
Partitioning/Raid: sw raid1, standard 2dev
OS Distro: Bullseye (default unless otherwise specified)
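
The hostname shorthand conf100[789] used throughout this task can be expanded mechanically. The following is a minimal sketch, not an existing tool: the helper name expand_hosts is hypothetical, and it only handles a single bracketed character set (no ranges such as [7-9]).

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one [..] character set in a host pattern, e.g. 'conf100[789]'.

    Hypothetical helper for illustration; patterns without brackets are
    returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]
    prefix, chars, suffix = match.groups()
    # One hostname per character listed inside the brackets.
    return [f"{prefix}{char}{suffix}" for char in chars]


hosts = expand_hosts("conf100[789]")
print(hosts)  # ['conf1007', 'conf1008', 'conf1009']
```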

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

conf1007:
  • - receive in system on procurement task T297152 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
conf1008:
  • - receive in system on procurement task T297152 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
conf1009:
  • - receive in system on procurement task T297152 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp and site.pp, using role(insetup), or role(insetup::nofirm) for cp systems.
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
RobH mentioned this in Unknown Object (Task). Feb 8 2022, 6:28 PM
RobH renamed this task from (Need By: TBD) rack/setup/install conf100[789] to Q3:(Need By: TBD) rack/setup/install conf100[789]. Feb 23 2022, 6:19 PM

FYI I don't believe there is any reason E/F would be ruled out for these, if space/power is tight in the existing rows.

No, E/F shouldn't be ruled out, but I do have one question regarding the new rows: do we have latency numbers (avg, max) for traffic to other rows? etcd (which is what runs on conf* boxes) is sensitive to high latency and jitter. My gut feeling is that latency to other rows will be sub-millisecond and we will be OK, but if we can proactively measure that, it would be awesome.
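
One way to take that measurement proactively could look like the sketch below. It is an assumption-laden illustration, not something run for this task: it uses TCP connect time as a rough stand-in for round-trip latency (a full handshake, so slightly pessimistic), the function names are hypothetical, and the numbers passed to summarize are placeholders rather than measured values.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 1.0) -> float:
    """Time one TCP connect to host:port, in milliseconds (rough RTT proxy)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Return avg, max, and jitter for a non-empty list of latency samples.

    Jitter is taken here as the mean absolute difference between
    consecutive samples, which is one common definition among several.
    """
    deltas = [abs(b - a) for a, b in zip(samples_ms, samples_ms[1:])]
    return {
        "avg": statistics.fmean(samples_ms),
        "max": max(samples_ms),
        "jitter": statistics.fmean(deltas) if deltas else 0.0,
    }


# Placeholder samples for illustration only; these are not measurements.
example = summarize([0.2, 0.4, 0.3])
```

In practice the samples would come from repeated calls such as tcp_connect_ms(host, port) between a host in an existing row and one in row E or F, with the host and port chosen by whoever runs the test.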

conf1007 A1 U17 port17 cableid2907
conf1008 B1 U22 port32 cableid2013339101789
conf1009 D3 U40 port45 cableid23000027
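
The racking lines above follow a regular "host rack unit port cableid" layout, so they can be parsed and checked against the racking proposal in the description. This is a sketch under that assumption; the Racking type, the regular expression, and the function names are hypothetical and not part of Netbox or any existing tooling.

```python
import re
from typing import NamedTuple


class Racking(NamedTuple):
    host: str
    rack: str      # e.g. "A1": row letter followed by rack number
    unit: int      # rack unit (the number after "U")
    port: int      # switch port number
    cable_id: str  # kept as a string, since it is an identifier


LINE_RE = re.compile(r"(\S+)\s+([A-Z]\d+)\s+U(\d+)\s+port(\d+)\s+cableid(\d+)")

# Rows from the racking proposal in the task description.
PROPOSED_ROW = {"conf1007": "A", "conf1008": "B", "conf1009": "D"}


def parse_racking(line: str) -> Racking:
    """Parse one line such as 'conf1007 A1 U17 port17 cableid2907'."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized racking line: {line!r}")
    host, rack, unit, port, cable_id = match.groups()
    return Racking(host, rack, int(unit), int(port), cable_id)


def matches_proposal(entry: Racking) -> bool:
    """True if the rack's row letter equals the proposed row for the host."""
    return PROPOSED_ROW.get(entry.host) == entry.rack[0]


lines = [
    "conf1007 A1 U17 port17 cableid2907",
    "conf1008 B1 U22 port32 cableid2013339101789",
    "conf1009 D3 U40 port45 cableid23000027",
]
entries = [parse_racking(line) for line in lines]
print(all(matches_proposal(entry) for entry in entries))  # True
```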

Change 808546 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding conf1007-9 to site.pp and and netboot.cfg

https://gerrit.wikimedia.org/r/808546

Change 808546 merged by Cmjohnson:

[operations/puppet@production] Adding conf1007-9 to site.pp and and netboot.cfg

https://gerrit.wikimedia.org/r/808546

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host conf1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host conf1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host conf1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host conf1007.eqiad.wmnet with OS bullseye completed:

  • conf1007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206262357_cmjohnson_4122658_conf1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host conf1009.eqiad.wmnet with OS bullseye completed:

  • conf1009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206270003_cmjohnson_4123371_conf1009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host conf1008.eqiad.wmnet with OS bullseye completed:

  • conf1008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206270003_cmjohnson_4123298_conf1008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.