Q4 object storage hardware tasks
Closed, Resolved (Public)

Description

There's quite a lot of hardware work being done in FY2024-25/Q4; this task is to keep track of it all.

Thanos refresh

Apus expansion

  • codfw frontend apus-fe2003 T390578
  • codfw backend apus-be2004 T388242 T392845 (installed)
  • eqiad frontend apus-fe1003 T388239 T389632 (installed)
  • eqiad backend apus-be1004 T388241 T392844 (installed)

MS frontend expansion

  • codfw frontend ms-fe201[56] T388887
  • eqiad frontend ms-fe101[56] T385040 T388886 (ready to be pooled)

MS hardware refresh (pulled forward from Q1 of next FY due to T392796, the failure of ms-be1060)

  • decommission ms-be1060 - handed off to DC-Ops in T393609
  • bring ms-be109[2-5] into service T393046 T393104 (being loaded into the rings)
  • decommission ms-be106[1-3] - handed off to DC-Ops in T401368
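
The host shorthand used in this task (ms-fe101[56], ms-fe101[5,6], ms-be109[2-5], ms-be[1061-1063].eqiad.wmnet) can be expanded mechanically when cross-checking which hosts a patch or cookbook run covers. The following is a minimal sketch only: expand_hosts is a hypothetical helper, not part of any Wikimedia tooling, and it assumes that a bracket with no comma or dash (such as [56]) lists single characters, which matches the SAL entry naming ms-fe1015 and ms-fe1016.

```python
import re

# Matches the first [...] group in a host pattern.
_BRACKET = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket expression in a host pattern (hypothetical helper)."""
    match = _BRACKET.search(pattern)
    if match is None:
        # No shorthand: the pattern is already a single host name.
        return [pattern]

    prefix = pattern[:match.start()]
    suffix = pattern[match.end():]
    body = match.group(1)

    if "," not in body and "-" not in body:
        # Assumption: "[56]" means the characters "5" and "6",
        # not the number 56.
        items = list(body)
    else:
        items = []
        for part in body.split(","):
            if "-" in part:
                start, end = part.split("-")
                width = len(start)  # keep any zero padding
                items.extend(
                    str(n).zfill(width)
                    for n in range(int(start), int(end) + 1)
                )
            else:
                items.append(part)

    return [f"{prefix}{item}{suffix}" for item in items]


if __name__ == "__main__":
    for shorthand in (
        "ms-fe101[56]",
        "ms-be109[2-5]",
        "ms-be[1061-1063].eqiad.wmnet",
    ):
        print(shorthand, "->", " ".join(expand_hosts(shorthand)))
```

Only the first bracket group is expanded, which is enough for every pattern that appears in this task; a general-purpose tool would need to handle multiple groups and malformed ranges.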

Event Timeline

Change #1140752 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: add ms-fe101[5,6] as new proxy nodes

https://gerrit.wikimedia.org/r/1140752

Change #1140752 merged by MVernon:

[operations/puppet@production] swift: add ms-fe101[5,6] as new proxy nodes

https://gerrit.wikimedia.org/r/1140752

Mentioned in SAL (#wikimedia-operations) [2025-05-07T15:04:01Z] <Emperor> pool ms-fe1015 ms-fe1016 new frontends T388886 T391354

Change #1143118 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: remove ms-be1060 from swift storagehosts

https://gerrit.wikimedia.org/r/1143118

Change #1143118 merged by MVernon:

[operations/puppet@production] hiera: remove ms-be1060 from swift storagehosts

https://gerrit.wikimedia.org/r/1143118

Change #1143821 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] apus: bring new frontend apus-fe1003 into service

https://gerrit.wikimedia.org/r/1143821

Change #1143821 merged by MVernon:

[operations/puppet@production] apus: bring new frontend apus-fe1003 into service

https://gerrit.wikimedia.org/r/1143821

Change #1148280 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] apus: add apus-be1004 to eqiad cluster as osd server

https://gerrit.wikimedia.org/r/1148280

Change #1148280 merged by MVernon:

[operations/puppet@production] apus: add apus-be1004 to eqiad cluster as osd server

https://gerrit.wikimedia.org/r/1148280

Change #1148296 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] cephadm: handle storage servers with BOSS card

https://gerrit.wikimedia.org/r/1148296

Change #1148296 merged by MVernon:

[operations/puppet@production] cephadm: handle storage servers with BOSS card

https://gerrit.wikimedia.org/r/1148296

Change #1151166 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: add apus-be2004 to codfw apus cluster

https://gerrit.wikimedia.org/r/1151166

Change #1151166 merged by MVernon:

[operations/puppet@production] hiera: add apus-be2004 to codfw apus cluster

https://gerrit.wikimedia.org/r/1151166

Mentioned in SAL (#wikimedia-operations) [2025-05-27T12:00:55Z] <Emperor> ceph orch apply to bring apus-be2004 into service T391354

Change #1153648 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] ms-be eqiad: add 2 new backends, drain 2 old ones

https://gerrit.wikimedia.org/r/1153648

Change #1153648 merged by MVernon:

[operations/puppet@production] ms-be eqiad: add 2 new backends, drain 2 old ones

https://gerrit.wikimedia.org/r/1153648

Change #1165851 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: add ms-be109[2-5] to swift::storagehosts

https://gerrit.wikimedia.org/r/1165851

Change #1165852 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift/eqiad: add ms-be109[2,3], drain ms-be1063

https://gerrit.wikimedia.org/r/1165852

Change #1165851 merged by MVernon:

[operations/puppet@production] hiera: add ms-be109[2-5] to swift::storagehosts

https://gerrit.wikimedia.org/r/1165851

Change #1165852 merged by MVernon:

[operations/puppet@production] swift/eqiad: add ms-be109[2,3], drain ms-be1063

https://gerrit.wikimedia.org/r/1165852

Change #1176253 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove old nodes, drain & reweight SM C-J nodes

https://gerrit.wikimedia.org/r/1176253

Change #1176253 merged by MVernon:

[operations/puppet@production] swift: remove old nodes, drain & reweight SM C-J nodes

https://gerrit.wikimedia.org/r/1176253

Change #1176258 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove ms-be106[1-3]

https://gerrit.wikimedia.org/r/1176258

Change #1176258 merged by MVernon:

[operations/puppet@production] swift: remove ms-be106[1-3]

https://gerrit.wikimedia.org/r/1176258

MatthewVernon updated the task description.

cookbooks.sre.hosts.decommission executed by mvernon@cumin1003 for hosts: ms-be[1061-1063].eqiad.wmnet

  • ms-be1061.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1062.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1063.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB