
Q1:rack/setup/install pc101[56]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of pc101[56]
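
The `pc101[56]` shorthand in the title stands for the two hosts listed below. As a minimal sketch, assuming the brackets hold a set of single characters (the helper name `expand_hosts` is hypothetical and this is not the full host-selection syntax used by the tooling), the expansion could look like this:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed character set into individual hostnames."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: treat the pattern as a single hostname.
        return [pattern]
    prefix, chars, suffix = match.groups()
    # Assumption: each character inside the brackets is one alternative.
    return [f"{prefix}{c}{suffix}" for c in chars]


print(expand_hosts("pc101[56]"))
```

Ranges such as `[5-6]` or multiple bracket groups are deliberately out of scope for this sketch.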

Hostname / Racking / Installation Details

Hostnames: pc1015, pc1016
Racking Proposal: Any rack (NOT in A1, B1, C5, D6), as long as they are different rows.
Networking Setup: # Connections: 1, Speed: 1G, Vlan: Private, AAAA records: N
Partitioning/Raid: HW Raid: Y, Raid Level: 10; @Marostegui will assign the partman recipe in puppet
OS Distro: Bullseye
Sub-team Technical Contact: @Marostegui
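
The racking proposal above amounts to two rules: avoid the excluded racks, and put the hosts in different rows. A minimal sketch of that check follows, assuming rack names are a row letter followed by a rack number (for example `A6`). The function name `check_racking` is hypothetical, and the example placements are the ones recorded later in this task.

```python
EXCLUDED_RACKS = {"A1", "B1", "C5", "D6"}  # taken from the racking proposal


def check_racking(placements: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the plan fits the proposal."""
    problems = []
    rows_seen: dict[str, str] = {}
    for host, rack in placements.items():
        if rack in EXCLUDED_RACKS:
            problems.append(f"{host}: rack {rack} is excluded")
        row = rack[0]  # assumption: the first character of the rack name is the row
        if row in rows_seen:
            problems.append(f"{host}: shares row {row} with {rows_seen[row]}")
        else:
            rows_seen[row] = host
    return problems


print(check_racking({"pc1015": "A6", "pc1016": "C6"}))
```

An empty result means neither rule is violated. The check does not consult Netbox or any other inventory source.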

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

pc1015:
  • receive in system on procurement task T341271 & in coupa
  • rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • network port setup via netbox, run homer from an active cumin host to commit
  • bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • firmware update (idrac, bios, network, raid controller)
  • operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • OS installation & initial puppet run via sre.hosts.reimage cookbook.
pc1016:
  • receive in system on procurement task T341271 & in coupa
  • rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • network port setup via netbox, run homer from an active cumin host to commit
  • bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • firmware update (idrac, bios, network, raid controller)
  • operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).

Change 941042 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc1015,pc1016: New hosts to be set up

https://gerrit.wikimedia.org/r/941042

Change 941042 merged by Marostegui:

[operations/puppet@production] pc1015,pc1016: New hosts to be set up

https://gerrit.wikimedia.org/r/941042

Change 945535 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site: Add pc101[56] as in setup

https://gerrit.wikimedia.org/r/945535

Change 945535 merged by Marostegui:

[operations/puppet@production] site: Add pc101[56] as in setup

https://gerrit.wikimedia.org/r/945535

pc1015 - rack A6, U33, port 32, cable ID 2839

pc1016 - rack C6, U31, port 30, cable ID 3252

pc1016 - rack C6, U31, port 30, cable ID 3252 is having issues, will recheck cabling

@VRiley-WMF The serial was entered into netbox incorrectly. If you are not onsite, you can sometimes check the packing slip attached to the procurement ticket.

@VRiley-WMF If you have a screen session going, please check whether it is in the process of doing something on pc1015:
spicerack.dhcp.DHCPError: Snippet /etc/dhcp/automation/mgmt-eqiad/pc1015.mgmt.eqiad.wmnet.conf already exists, is there another operation in progress for the same device? If not you delete it and retry.
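
The error above points at a leftover per-host DHCP snippet. As a rough sketch, the check for such a file could be written as below. The path layout is inferred only from the single path in this error message, and the helper names `mgmt_snippet_path` and `snippet_in_the_way` are hypothetical, not part of spicerack.

```python
from pathlib import Path


def mgmt_snippet_path(host: str, dc: str, base: str = "/etc/dhcp/automation") -> Path:
    """Build the snippet path, assuming the layout seen in the error message."""
    return Path(base) / f"mgmt-{dc}" / f"{host}.mgmt.{dc}.wmnet.conf"


def snippet_in_the_way(host: str, dc: str, base: str = "/etc/dhcp/automation") -> bool:
    """Report whether a snippet for this host already exists on disk."""
    return mgmt_snippet_path(host, dc, base).exists()


print(mgmt_snippet_path("pc1015", "eqiad"))
```

The sketch only reports the file and never deletes it. Whether another operation is still running for the same device has to be confirmed by a person, as the error message itself asks.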

@Jclark-ctr I corrected the problem on pc1015. Also, I closed my screen on this device. Would you be able to try again?

Change 961191 had a related patch set uploaded (by Jclark-ctr; author: jclark):

[operations/puppet@production] add pc10(15-16) T342164

https://gerrit.wikimedia.org/r/961191

Change 961191 abandoned by Jclark-ctr:

[operations/puppet@production] add pc10(15-16) T342164

Reason:

Already added by Marostegui

https://gerrit.wikimedia.org/r/961191

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc1015.eqiad.wmnet with OS bullseye completed:

  • pc1015 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309262002_jhancock_719761_pc1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc1016.eqiad.wmnet with OS bullseye completed:

  • pc1016 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309262010_jhancock_719766_pc1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Jhancock.wm updated the task description.
Jhancock.wm subscribed.

@Marostegui finished up