
Q3:rack/setup/install apus-be100[56]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of apus-be100[56]
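The bracket shorthand in the hostname (apus-be100[56]) stands for two hosts. As a rough illustration only, the sketch below expands a single bracket group into individual hostnames. The function name expand_hosts is hypothetical, and this is not the expansion logic of any tool mentioned in this task; it only covers the two forms that appear here, a character set such as [56] and a numeric range such as [1001-1002].

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group in a host pattern (hypothetical helper)."""
    match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    if "-" in body:
        # Numeric range such as [1001-1002]; keep the width of the start value.
        start, end = body.split("-", 1)
        width = len(start)
        items = [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
    else:
        # Character set such as [56]: one hostname per character.
        items = list(body)
    return [f"{prefix}{item}{suffix}" for item in items]
```

For example, expand_hosts("apus-be100[56]") yields apus-be1005 and apus-be1006, the two hosts tracked by this task.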

Hostname / Racking / Installation Details

Hostnames: apus-be100[56]
Racking Proposal: Where should these systems be racked? Avoid E8 and F3
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private
OS Distro: Bookworm
Boot Method: UEFI
Sub-team Technical Contact: @MatthewVernon
Tags: Please tag SRE-swift-storage on racking task

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

apus-be1005

  • Receive in system on procurement task T412706 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

apus-be1006

  • Receive in system on procurement task T412706 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the "Provision a server's network attributes" Netbox script - Note that you must run the DNS and Provision cookbooks after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

RobH assigned this task to MatthewVernon.
RobH mentioned this in Unknown Object (Task).
RobH added a parent task: Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself from this task (leaving no assignee); once the hardware arrives, on-site engineers will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

Change #1247937 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] preseed: all apus-be nodes are using boss cards

https://gerrit.wikimedia.org/r/1247937

Change #1247937 merged by MVernon:

[operations/puppet@production] preseed: all apus-be nodes are using boss cards

https://gerrit.wikimedia.org/r/1247937

These were finished. The reimage cookbook output was accidentally posted on the procurement ticket instead of this racking ticket.

Hi @Jclark-ctr could you take another look at the disks on these two systems, please? There should be 24 JBOD spinning disks visible to the OS, but neither host has that:
apus-be1005 has 23 (i.e. one missing)

mvernon@apus-be1005:~$ grep -c ' sd' /proc/partitions 
23

apus-be1006 has 13 (i.e. 11 missing)

mvernon@apus-be1006:~$ grep -c ' sd' /proc/partitions 
13
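The grep above counts every line of /proc/partitions that contains ' sd', so partition entries such as sda1 would be counted too if any disk were partitioned. A minimal sketch of a stricter count, which only matches whole-disk names (sd followed by letters), is shown below. The names count_sd_disks, missing_disks and EXPECTED_JBOD_DISKS are hypothetical; the value 24 is the expected disk count stated in the comment above.

```python
import re

# Expected number of JBOD disks per host, as stated in the comment above.
EXPECTED_JBOD_DISKS = 24

# Whole-disk names only: "sda", "sdab", ... but not partitions like "sda1".
WHOLE_DISK = re.compile(r"sd[a-z]+")


def count_sd_disks(partitions_text: str) -> int:
    """Count whole sd* disks in the contents of /proc/partitions."""
    count = 0
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data lines have four fields: major, minor, #blocks, name.
        if len(fields) == 4 and WHOLE_DISK.fullmatch(fields[3]):
            count += 1
    return count


def missing_disks(found: int, expected: int = EXPECTED_JBOD_DISKS) -> int:
    """Return how many disks are missing relative to the expected count."""
    return expected - found
```

On a host, the text would come from reading /proc/partitions, for example with open("/proc/partitions").read(). With the counts reported above, missing_disks(23) gives 1 and missing_disks(13) gives 11.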

I've had a poke through the web iDRAC, and I think I've found the offending disk on apus-be1005; apus-be1006 looks OK now too.

Change #1272713 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] apus: add two new storage nodes in eqiad

https://gerrit.wikimedia.org/r/1272713

Change #1272713 merged by MVernon:

[operations/puppet@production] apus: add two new storage nodes in eqiad

https://gerrit.wikimedia.org/r/1272713

Change #1275260 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] apus: move eqiad controller moss-be1001 -> apus-be1005

https://gerrit.wikimedia.org/r/1275260

Change #1275260 merged by MVernon:

[operations/puppet@production] apus: move eqiad controller moss-be1001 -> apus-be1005

https://gerrit.wikimedia.org/r/1275260

Mentioned in SAL (#wikimedia-operations) [2026-04-20T09:43:00Z] <Emperor> ceph orch host drain moss-be1001 T418901

Mentioned in SAL (#wikimedia-operations) [2026-04-20T10:02:34Z] <Emperor> ceph orch host drain moss-be1002 T418901

Change #1275366 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: remove two old apus backends for decom

https://gerrit.wikimedia.org/r/1275366

Change #1275366 merged by MVernon:

[operations/puppet@production] hiera: remove two old apus backends for decom

https://gerrit.wikimedia.org/r/1275366

cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: moss-be[1001-1002].eqiad.wmnet

  • moss-be1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet server and PuppetDB
  • moss-be1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet server and PuppetDB