Q1:rack/setup/install cloudservices1006.eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of cloudservices1006.eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: cloudservices1006.eqiad.wmnet
Racking Proposal: eqiad C8
Networking Setup: cloudsw switch, cloud-host vlan
Partitioning/Raid: HW Raid: Y/N, Partman recipe and/or desired Raid Level: standard
OS Distro: Bullseye (default unless otherwise specified)
Sub-team Technical Contact: WMCS @aborrero, @Andrew or SRE/IF @cmooney

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudservices1006:
  • receive in system on procurement task T341245 & in Coupa
  • rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • add mgmt DNS (asset tag and hostname) and production DNS entries in Netbox, run cookbook sre.dns.netbox.
  • network port setup via Netbox, run homer from an active cumin host to commit
  • bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • firmware update (idrac, bios, network, raid controller)
  • operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).

Change 941383 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices1006: prepare service

https://gerrit.wikimedia.org/r/941383

The server has not arrived yet; it should arrive any day now.

Jul 24, 2023 9:16 PM
Departed Terminal Location ASHBURN, 20147, US

@aborrero @Andrew Will this server be using single or dual network connection?

cloudservices1006: C8 U20, port 19 (cable ID 5321), port 43 (cable ID 5329)

@aborrero @Andrew I have connected two network ports so that remote work is not blocked, but I have entered only one into Netbox while we wait for an answer on the number of network connections.

RAID has not been configured on the server yet. Which RAID level is needed for this server, @aborrero @Andrew?

I'm sure that this only needs a single nic connected unless @aborrero has something truly ambitious in mind. Seems easy enough to leave the second cable attached until he's back in September though. Thank you!

Oh, raid-wise: the existing cloudservices use

echo partman/standard.cfg partman/raid1-2dev.cfg ;; \

That, or similar, is fine for this host as well. Nothing too strenuous will be happening on disk.

Single NIC is fine.

Is there any blocker for this server to get it ready to go?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye

@Jclark-ctr @VRiley-WMF I am taking over this task to finish it for the cloud team. Thanks.

Please use partman/raid10-4dev.cfg as it seems the most standard for 4 devices.
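
For context, the recipe choice follows the device count: the existing cloudservices hosts use the two-device recipe quoted earlier, while four devices call for the RAID 10 recipe. Below is a minimal Python sketch of that mapping. It assumes a netboot.cfg entry has the shape `hostname) echo <recipes> ;; \`; only the `echo ... ;; \` part is quoted in this task. The helper name, the mapping table, and the host pattern are illustrative and not taken from operations/puppet.

```python
# Hypothetical helper: maps the number of RAID devices to the partman
# recipe names mentioned in this task. Not part of operations/puppet.
RECIPES_BY_DEVICE_COUNT = {
    2: "partman/raid1-2dev.cfg",   # used by the existing cloudservices hosts
    4: "partman/raid10-4dev.cfg",  # requested for cloudservices1006
}


def netboot_entry(host_pattern: str, device_count: int) -> str:
    """Return one netboot.cfg-style case line (assumed shape)."""
    try:
        raid_recipe = RECIPES_BY_DEVICE_COUNT[device_count]
    except KeyError:
        raise ValueError(f"no recipe known for {device_count} devices") from None
    # partman/standard.cfg is combined with the RAID recipe, as in the
    # line quoted for the existing hosts.
    return f"{host_pattern}) echo partman/standard.cfg {raid_recipe} ;; \\"


print(netboot_entry("cloudservices1006", 4))
```

Raising on an unknown device count, rather than falling back to a default, keeps a mismatched recipe from reaching the installer unnoticed.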

Change 953299 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add cloudservices1006 to netboot.cfg

https://gerrit.wikimedia.org/r/953299

Change 953299 merged by Papaul:

[operations/puppet@production] Add cloudservices1006 to netboot.cfg

https://gerrit.wikimedia.org/r/953299

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
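
When a report like the one above ends in FAIL, the last step listed before the failure notice shows how far the run got. Below is a rough Python sketch that extracts this from pasted report text. The function name is illustrative, the sample is an abbreviated copy of the report above, and the parsing assumes the bullet layout shown in this task rather than any official cookbook output format.

```python
from typing import Optional


def last_completed_step(report: str) -> tuple[str, Optional[str]]:
    """Return (status, last step listed before the failure notice)."""
    status = "UNKNOWN"
    steps = []
    for raw in report.splitlines():
        # Drop indentation and the leading bullet character, if any.
        line = raw.strip().lstrip("•").strip()
        if not line:
            continue
        if line.endswith("(FAIL)") or line.endswith("(PASS)"):
            status = line[-5:-1]  # text between the parentheses
            continue
        if line.startswith("The reimage failed"):
            continue  # the failure notice itself is not a completed step
        steps.append(line)
    return status, (steps[-1] if steps else None)


# Abbreviated copy of the failed report in this task.
sample = """\
  • cloudservices1006 (FAIL)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
"""

print(last_completed_step(sample))
```

The cookbook logs remain the authoritative source for the actual error; this only narrows down which stage to look at.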

Change 953327 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add cloudservices1006 to site.pp

https://gerrit.wikimedia.org/r/953327

Change 953327 merged by Papaul:

[operations/puppet@production] Add cloudservices1006 to site.pp

https://gerrit.wikimedia.org/r/953327

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1006 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye completed:

  • cloudservices1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308291901_pt1979_2816826_cloudservices1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Papaul updated the task description.

@aborrero all yours.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudservices1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details

We are already working at service level with this box. We should coordinate reimage/reboots etc.