(Need By: TBD) rack/setup/install an-test-coord1002
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of an-test-coord1002 using spare system wmf5178, allocated via T289784.

Hostname / Racking / Installation Details

Hostnames: an-test-coord1002.eqiad.wmnet
Racking Proposal: The peer system (an-test-coord1001) is in eqiad row D (rack D3), so this host should be placed in a different row for resilience.
Networking/Subnet/VLAN/IP: 1Gbps, one network port connection required, Analytics VLAN required
Partitioning/Raid: software RAID1 - existing recipe for an-test-coord100* is partman/custom/reuse-analytics-hadoop-test.cfg
OS Distro: Buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

an-test-coord1002:

  • - relocate system from D6 into any other 1G rack in rows A, B, or C. This needs to be in a different row than an-test-coord1001, which is located in D3
  • - update netbox with new rack location
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox. - system already has a mgmt ip with asset tag generated for it and will need its hostname added in
  • - network port setup via netbox, run homer to commit
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup); cp systems use role(insetup::nofirm) instead.
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH added a subscriber: Jclark-ctr.

@Jclark-ctr

This is a spare system we already have in netbox. It just needs to relocate from row D, as it's being allocated into service as redundant to a server in D3. This can go in any 1G rack outside of row D (rows A, B, and C all have analytics1 vlans available in them according to netbox).

@Jclark-ctr I noticed this moved out of D6. Can you update the task and netbox when you get a chance?

I am still not sure where this server is; I cannot find it. @Jclark-ctr is out this week.

Host was still in rack D6 U7. Verified location and relocated to B1 U29, Port 20, Cable ID #1935.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-coord1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • The reimage failed, see the cookbook logs for the details

Updated firmware and updated DNS in netbox. Running into errors with the install script.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-coord1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Updated site.pp and ran the script again. It made it to the Debian installer but failed on the RAID configuration.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster

This is the error I am getting. I verified there are disks in the server. I also checked the BIOS: it's set to auto, and I do see the disks. I am not sure what the problem is.

┌────────────┤ [!!] Partition disks ├─────────────┐
│                                                 │
│   reuse-parts: Recipe device matching failed    │
│ ERROR: =dev=mapper=*-root matches zero devices  │
│                                                 │
│ All devices:                                    │
│ =dev=sda                                        │
│ =dev=sdb                                        │
│                                                 │
│     <Go Back>                    <Continue>
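
A note for anyone reading the dialog above: it appears to print device paths with "=" in place of "/", so the pattern would read /dev/mapper/*-root and the visible devices /dev/sda and /dev/sdb. Under that assumption (it is an inference from the dialog, not something confirmed in this task), the minimal Python sketch below shows why a glob like that matches nothing when only the raw disks are present. The function names are illustrative, and the vg0-root device used in the usage note is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def decode_device(name: str) -> str:
    """Turn the dialog's '=dev=sda' form into '/dev/sda' (assumed encoding)."""
    return name.replace("=", "/")


def matching_devices(pattern: str, devices: Sequence[str]) -> List[str]:
    """Return the decoded device paths that match the decoded glob pattern."""
    decoded_pattern = decode_device(pattern)
    decoded_devices = [decode_device(d) for d in devices]
    return [d for d in decoded_devices if fnmatchcase(d, decoded_pattern)]


# Values copied from the error dialog.
pattern = "=dev=mapper=*-root"
devices = ["=dev=sda", "=dev=sdb"]

matches = matching_devices(pattern, devices)
print(matches)  # prints []
```

If a device such as the hypothetical =dev=mapper=vg0-root were present, the same call would return it. Blank disks have no such device, which is consistent with a recipe that expects to reuse an existing layout failing on a freshly allocated spare.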

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with errors:

  • an-test-coord1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Papaul or @RobH could you look at this and let me know what I am missing?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with errors:

  • an-test-coord1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
Cmjohnson added a subscriber: Ottomata.

@Ottomata Can you verify that this is using the correct partman recipe? The installer fails at the RAID configuration step.

I'm not familiar with what is going on with this node atm, pinging @BTullis!

I'm happy to look at it. It's likely that I've set the wrong partman recipe, so sincere apologies if I've wasted your time. I'll look at it asap.

@BTullis Have you had a chance to look and see if we're using the correct partman recipe?

I have gone through and made sure that the internal USB has been turned off, which has been known to cause issues during install. Double-checked BIOS settings. Both disks are being seen.

Change 752028 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Trying a different partman recipe for an-test-worker servers

https://gerrit.wikimedia.org/r/752028

Change 752028 merged by Cmjohnson:

[operations/puppet@production] Trying a different partman recipe for an-test-worker servers

https://gerrit.wikimedia.org/r/752028

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with errors:

  • an-test-coord1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

I've tried a different partman recipe. I do not know what is wrong or why the RAID setup fails.

Change 753110 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] install_server: fix netboot settings for an-test-coord1002

https://gerrit.wikimedia.org/r/753110

Change 753110 merged by Elukey:

[operations/puppet@production] install_server: fix netboot settings for an-test-coord1002

https://gerrit.wikimedia.org/r/753110

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster completed:

  • an-test-coord1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201111808_elukey_18220_an-test-coord1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

@Cmjohnson an-test-coord1002 is done. There was an issue with your partman patch (it targeted an-test-worker1002 instead of an-test-coord1002), but the reimage has now passed.
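
To illustrate why a patch aimed at the wrong hostname leaves the install unchanged, here is a minimal Python sketch. It assumes recipes are selected by hostname pattern, as the "an-test-coord100*" wording in the task description suggests. The table format, the lookup function, and the worker recipe name are hypothetical and do not reflect the actual netboot configuration.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence, Tuple

# Hypothetical hostname-pattern -> recipe table, NOT the real netboot config.
# The first entry stands in for a patch that targeted the worker host;
# its recipe name is a placeholder.
RECIPES: Sequence[Tuple[str, str]] = [
    ("an-test-worker1002", "partman/HYPOTHETICAL-new-recipe.cfg"),
    ("an-test-coord100*", "partman/custom/reuse-analytics-hadoop-test.cfg"),
]


def recipe_for(hostname: str,
               table: Sequence[Tuple[str, str]] = RECIPES) -> Optional[str]:
    """Return the recipe of the first pattern matching hostname, else None."""
    for pattern, recipe in table:
        if fnmatchcase(hostname, pattern):
            return recipe
    return None


# The coord host still resolves to the original reuse recipe.
print(recipe_for("an-test-coord1002"))
```

In this sketch the new entry only ever applies to an-test-worker1002, so an-test-coord1002 keeps getting the reuse recipe no matter how many times the reimage is retried.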

The host is set to staged in Netbox (https://netbox.wikimedia.org/dcim/devices/2140/), I think that we can close!

Thanks @elukey, resolving the task.