Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of wdqs101[456]

Hostname / Racking / Installation Details

Hostnames: wdqs101[456]
Racking Proposal: Where should these systems be racked? Can they share with any existing systems or should they avoid any other systems sharing their rack or row? Avoid rows A and B; any other rows are fine, with one host per row. NetOps notes that rows E and F would be ideal for placement, since these hosts have private1 VLAN requirements.
Networking/Subnet/VLAN/IP: 10G, single port, private1 VLAN
Partitioning/Raid: S/W RAID10 - partman: raid10-4dev.cfg
OS Distro: Buster
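
The bracket notation used for the hostnames (`wdqs101[456]` here, `wdqs101[4,5,6]` in the task title) stands for three hosts. A minimal sketch of that expansion follows; `expand_hosts` is a hypothetical helper written for illustration and is not the syntax or API of any particular tool:

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group such as 'wdqs101[456]' or 'wdqs101[4,5,6]'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: treat as a literal hostname
    prefix, body, suffix = match.groups()
    # Comma-separated items if commas are present, otherwise one character each.
    items = body.split(",") if "," in body else list(body)
    return [f"{prefix}{item.strip()}{suffix}" for item in items]


hosts = expand_hosts("wdqs101[456]")
print(hosts)
```

Both spellings of the pattern expand to the same three names, which is why the checklist below has exactly three per-host sections.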

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

wdqs1014:
  • - receive in system on procurement task T303459 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • [x] - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
wdqs1015:
  • - receive in system on procurement task T303459 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
wdqs1016:
  • - receive in system on procurement task T303459 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.
This comment was removed by Jclark-ctr.
Jclark-ctr subscribed.

wqds1014 C4 38 cableid 20220072 port 6
wqds1015 E3 30 cableid 20220071 port 30
wqds1016 F3 30 cableid 20220076 port 30

I updated all the names to wdqs on the switch and netbox.

Cmjohnson renamed this task from Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] to Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6]. Jun 15 2022, 3:54 PM

Change 805851 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding netboot and site.pp for new wdqs servers

https://gerrit.wikimedia.org/r/805851

Change 805851 merged by Cmjohnson:

[operations/puppet@production] Adding netboot and site.pp for new wdqs servers

https://gerrit.wikimedia.org/r/805851

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster

@Papaul more issues with installs: the host waits for an image but then fails to load it. Below is the error I received. Can you look into this as well, please?

It stalls after this, and a few minutes later it gives me a "failed to load" message.

CLIENT MAC ADDR: 5C 6F 69 C4 3D E0 GUID: 4C4C4544-004D-3610-805A-B3C04F575033
CLIENT IP: 10.64.32.188 MASK: 255.255.252.0 DHCP IP: 208.80.154.32
GATEWAY IP: 10.64.32.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Then

Failed to load ldlinux.c32
Boot failed: press a key to retry, or wait for reset...
...
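
As a sanity check on the lease shown in the output above, the client IP and gateway should fall in the subnet implied by the mask. A minimal sketch using Python's standard `ipaddress` module follows; the variable names are illustrative, and the values are copied from the PXE output:

```python
import ipaddress

# Values copied from the PXE client output above.
client_ip = "10.64.32.188"
netmask = "255.255.252.0"
gateway_ip = "10.64.32.1"
dhcp_ip = "208.80.154.32"

# Combine address and mask to derive the subnet the client was placed in.
iface = ipaddress.ip_interface(f"{client_ip}/{netmask}")
network = iface.network

gateway_in_subnet = ipaddress.ip_address(gateway_ip) in network
dhcp_in_subnet = ipaddress.ip_address(dhcp_ip) in network

print(network)            # 10.64.32.0/22
print(gateway_in_subnet)  # True
print(dhcp_in_subnet)     # False
```

The lease itself looks consistent: the gateway is inside the client's /22, and the DHCP server is on a different network, so the request was presumably relayed. That fits the failure appearing only later, when PXELINUX tries to fetch `ldlinux.c32`.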

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster executed with errors:

  • wdqs1014 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1015.eqiad.wmnet with OS buster

@Cmjohnson yes, this is the same issue that we had on backup1009. Just downgrade the 10G NIC firmware to version 21.83.83 and it will fix the issue.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1015.eqiad.wmnet with OS buster executed with errors:

  • wdqs1015 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1015.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed with errors:

  • wdqs1016 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster completed:

  • wdqs1014 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206152149_cmjohnson_1813381_wdqs1014.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1015.eqiad.wmnet with OS buster completed:

  • wdqs1015 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206152202_cmjohnson_1814100_wdqs1015.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

1014 and 1015 are installed; 1016 shows that no cables are connected. John will look at that in the morning.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed with errors:

  • wdqs1016 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed with errors:

  • wdqs1016 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed with errors:

  • wdqs1016 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster completed:

  • wdqs1016 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206241253_cmjohnson_3562960_wdqs1016.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cmjohnson updated the task description.
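
The reimage attempts recorded in the timeline can be tallied per host. The sketch below hard-codes the result lines in the order they appear above and counts them; `tally_results` is a hypothetical helper, and parsing the `host (PASS|FAIL)` summary line this way is an assumption about the text format, not part of the cookbook itself:

```python
import re
from collections import defaultdict

# Result lines in the order they appear in the event timeline above.
summary_lines = """\
• wdqs1014 (FAIL)
• wdqs1015 (FAIL)
• wdqs1016 (FAIL)
• wdqs1014 (PASS)
• wdqs1015 (PASS)
• wdqs1016 (FAIL)
• wdqs1016 (FAIL)
• wdqs1016 (FAIL)
• wdqs1016 (PASS)
"""


def tally_results(text: str) -> dict[str, dict[str, int]]:
    """Count PASS and FAIL outcomes per host from 'host (RESULT)' lines."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: {"PASS": 0, "FAIL": 0})
    for match in re.finditer(r"(\w+) \((PASS|FAIL)\)", text):
        host, result = match.groups()
        counts[host][result] += 1
    return dict(counts)


results = tally_results(summary_lines)
for host, outcome in sorted(results.items()):
    print(host, outcome)
```

Read this way, wdqs1014 and wdqs1015 each failed once before passing, while wdqs1016 failed four times before its successful reimage.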