Q4 Thanos hardware refresh
Closed, Resolved; Public

Description

We are refreshing almost the entire Thanos estate this quarter.

  • Frontend refresh
    • codfw
      • Bring thanos-fe200[5-7] into service T389634
      • Decommission thanos-fe200[1-3] (handed over to DC-ops via T393870)
    • eqiad
      • Bring thanos-fe100[5-7] into service T389501 T389635
      • Decommission thanos-fe100[1-3] (handed over to DC-ops via T394894)
  • Backend refresh
    • codfw
      • Bring thanos-be200[6-9] into service T389836 T392908 (being loaded into the rings)
      • Decommission thanos-be200[1-4] (handed over to DC-ops via T398849)
    • eqiad
      • Bring thanos-be100[6-9] into service T389837 T392909
      • Decommission thanos-be100[1-4] (handed over to DC-ops via T397414)
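
The host lists above use a bracketed range shorthand (for example thanos-fe200[5-7]), and the cookbook runs in the timeline use a similar form with a domain suffix (for example thanos-fe[2001-2003].codfw.wmnet). As a reading aid, here is a minimal sketch of how such a spec expands into individual hostnames. The helper name `expand_hosts` is hypothetical and is not part of any Wikimedia tooling; it assumes at most one numeric range per spec and is not how the production host-selection tools parse their input.

```python
import re

# Matches a single numeric range such as "[5-7]" or "[2001-2003]".
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(spec: str) -> list[str]:
    """Expand one bracketed numeric range into a list of hostnames.

    Hypothetical helper for illustration only; handles at most one range.
    """
    match = _RANGE.search(spec)
    if match is None:
        return [spec]  # no range present: treat the spec as a single host
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve any zero padding in the range bounds
    prefix, suffix = spec[: match.start()], spec[match.end():]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    # Shorthand as written in the task description.
    print(expand_hosts("thanos-fe200[5-7]"))
    # Form used in the decommission cookbook entries.
    print(expand_hosts("thanos-be[1001-1004].eqiad.wmnet"))
```

Read this way, each frontend entry covers three hosts and each backend entry covers four, which matches the per-host PASS entries in the decommission runs recorded in the timeline.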

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2025-04-24T09:27:18Z] <Emperor> depool thanos-fe200[1-3] pending decommissioning T391352

RobH mentioned this in Unknown Object (Task). Apr 24 2025, 6:05 PM
RobH mentioned this in Unknown Object (Task).

Change #1143824 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: remove thanos-fe200[1-3]

https://gerrit.wikimedia.org/r/1143824

Change #1143824 merged by MVernon:

[operations/puppet@production] thanos: remove thanos-fe200[1-3]

https://gerrit.wikimedia.org/r/1143824

cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: thanos-fe[2001-2003].codfw.wmnet

  • thanos-fe2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-fe2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-fe2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change #1146511 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Thanos: add new thanos-fe100[5-7] nodes

https://gerrit.wikimedia.org/r/1146511

Change #1146511 merged by MVernon:

[operations/puppet@production] Thanos: add new thanos-fe100[5-7] nodes

https://gerrit.wikimedia.org/r/1146511

Mentioned in SAL (#wikimedia-operations) [2025-05-15T09:07:55Z] <Emperor> reboot thanos-fe100[5-7] prior to bringing into service T391352

Mentioned in SAL (#wikimedia-operations) [2025-05-15T10:08:26Z] <Emperor> depool thanos-fe100[1-3] prior to decom T391352

Change #1148330 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: remove old frontends thanos-fe100[1-3]

https://gerrit.wikimedia.org/r/1148330

Mentioned in SAL (#wikimedia-operations) [2025-05-21T08:29:30Z] <Emperor> disable puppet on thanos-fe1001 and thanos-fe1004 T391352

Change #1148330 merged by MVernon:

[operations/puppet@production] thanos: remove old frontends thanos-fe100[1-3]

https://gerrit.wikimedia.org/r/1148330

cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: thanos-fe[1001-1003].eqiad.wmnet

  • thanos-fe1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-fe1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-fe1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change #1151159 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: add new backends to hiera

https://gerrit.wikimedia.org/r/1151159

Change #1151160 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: add new backends, drain old ones

https://gerrit.wikimedia.org/r/1151160

Change #1151159 merged by MVernon:

[operations/puppet@production] thanos: add new backends to hiera

https://gerrit.wikimedia.org/r/1151159

Mentioned in SAL (#wikimedia-operations) [2025-05-27T11:52:45Z] <Emperor> reboot thanos-be100[6-9] before bringing into the rings T391352

Change #1151160 merged by MVernon:

[operations/puppet@production] thanos: add new backends, drain old ones

https://gerrit.wikimedia.org/r/1151160

Change #1160824 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: remove drained thanos-be100[1-4] from rings

https://gerrit.wikimedia.org/r/1160824

Change #1160824 merged by MVernon:

[operations/puppet@production] thanos: remove drained thanos-be100[1-4] from rings

https://gerrit.wikimedia.org/r/1160824

Change #1160855 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: add new backends, remove old ones gone from rings

https://gerrit.wikimedia.org/r/1160855

Change #1160856 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: add new nodes to ring, drain old ones

https://gerrit.wikimedia.org/r/1160856

Change #1160855 merged by MVernon:

[operations/puppet@production] thanos: add new backends, remove old ones gone from rings

https://gerrit.wikimedia.org/r/1160855

Change #1160856 merged by MVernon:

[operations/puppet@production] thanos: add new nodes to ring, drain old ones

https://gerrit.wikimedia.org/r/1160856

cookbooks.sre.hosts.decommission executed by mvernon@cumin1003 for hosts: thanos-be[1001-1004].eqiad.wmnet

  • thanos-be1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-be1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-be1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-be1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change #1166822 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: remove now-drained thanos-be200[1-4] from rings

https://gerrit.wikimedia.org/r/1166822

Change #1166823 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: remove thanos-be200[1-4] from thanos::swift::backends

https://gerrit.wikimedia.org/r/1166823

Change #1166822 merged by MVernon:

[operations/puppet@production] thanos: remove now-drained thanos-be200[1-4] from rings

https://gerrit.wikimedia.org/r/1166822

Change #1166823 merged by MVernon:

[operations/puppet@production] hiera: remove thanos-be200[1-4] from thanos::swift::backends

https://gerrit.wikimedia.org/r/1166823

cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: thanos-be[2001-2004].codfw.wmnet

  • thanos-be2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-be2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-be2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • thanos-be2004.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB