⚓ T302277 Decommission old AQS cluster nodes

	Subject	Repo	Branch	Lines +/-
	Remove absented resource definitions for aqs nodes	operations/puppet	production	+1 -43
	Remove legacy AQS host configuration	operations/puppet	production	+3 -170

Status	Assigned	Task
Resolved	BTullis	T249755 Cassandra3 migration for Analytics AQS
Resolved	BTullis	T302276 Stop ingesting data to the old AQS cluster
Resolved	Jclark-ctr	T302277 Decommission old AQS cluster nodes

BTullis created this task. Feb 22 2022, 12:24 PM

BTullis added a parent task: T302276: Stop ingesting data to the old AQS cluster.

• EChetty moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board. Feb 24 2022, 5:15 PM

• EChetty moved this task from Ops Week to Serve on the Data-Engineering board. Feb 24 2022, 5:18 PM

Removing this from the kanban board whilst we work on the dependent tickets, namely migrating the Cassandra loading jobs from Oozie to Airflow.

hnowlan removed a project: Platform Team Workboards (Platform Engineering Reliability). Aug 25 2022, 4:32 PM

Moving to planning, so that it can be discussed and added to the next sprint.

BTullis moved this task from Backlog to Shared Data Infra on the Data-Engineering-Planning board. Sep 27 2022, 10:20 AM

BTullis triaged this task as Medium priority. Sep 27 2022, 11:01 AM

• EChetty moved this task from Shared Data Infra to Ops Week on the Data-Engineering-Planning board. Sep 27 2022, 12:52 PM

• EChetty moved this task from Ops Week to Shared Data Infra on the Data-Engineering-Planning board.

• EChetty moved this task from Backlog to To be discussed on the Shared-Data-Infrastructure board. Sep 27 2022, 1:40 PM

• EChetty moved this task from To be discussed to Sprint 02 on the Shared-Data-Infrastructure board.

• EChetty edited projects, added Shared-Data-Infrastructure (Sprint 02); removed Shared-Data-Infrastructure.

• EChetty moved this task from Sprint 02 to Estimated/Discussed on the Shared-Data-Infrastructure board.

• EChetty edited projects, added Shared-Data-Infrastructure; removed Shared-Data-Infrastructure (Sprint 02).

• EChetty set the point value for this task to 1. Sep 27 2022, 1:57 PM

• EChetty moved this task from Estimated/Discussed to Sprint 02 on the Shared-Data-Infrastructure board.

• EChetty edited projects, added Shared-Data-Infrastructure (Sprint 02); removed Shared-Data-Infrastructure.

BTullis moved this task from Next Up to In Progress on the Shared-Data-Infrastructure (Sprint 02) board. Sep 28 2022, 9:25 AM

I have started the decommissioning now, beginning with aqs1004.

cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: aqs1004.eqiad.wmnet

aqs1004.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: aqs1005.eqiad.wmnet

aqs1005.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: aqs1006.eqiad.wmnet

aqs1006.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: aqs1007.eqiad.wmnet

aqs1007.eqiad.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: aqs1008.eqiad.wmnet

aqs1008.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by btullis@cumin1001 for hosts: aqs1009.eqiad.wmnet

aqs1009.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
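
For anyone reviewing the pasted cookbook summaries above, the hosts that need follow-up can be picked out mechanically. The following is only a rough sketch: `summarize`, `HOST_RESULT` and `SAMPLE` are hypothetical names, and it assumes the summary format seen in this task (a `host (PASS|FAIL)` line followed by `- ` step lines, with failed steps starting with `- Failed`). It is not part of the decommission cookbook itself.

```python
import re

# Matches summary lines such as "aqs1007.eqiad.wmnet (FAIL)".
# Assumption: the format is exactly what was pasted into this task.
HOST_RESULT = re.compile(r"^(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)$")


def summarize(log_text):
    """Return {host: (status, [failed step lines])} from pasted cookbook output."""
    results = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        match = HOST_RESULT.match(line)
        if match:
            current = match.group("host")
            results[current] = (match.group("status"), [])
        elif current and line.startswith("- Failed"):
            # Keep the step text without the leading "- " marker.
            results[current][1].append(line[2:])
    return results


# Shortened excerpt of the output pasted in this task.
SAMPLE = """\
aqs1006.eqiad.wmnet (PASS)
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
aqs1007.eqiad.wmnet (FAIL)
- Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
- Powered off
"""

summary = summarize(SAMPLE)
needs_follow_up = [host for host, (status, _) in summary.items() if status == "FAIL"]
print(needs_follow_up)  # ['aqs1007.eqiad.wmnet']
```

Run against the full output in this task, the same approach would flag only aqs1007, whose wipe step failed.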

Change 839605 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove legacy AQS host configuration

https://gerrit.wikimedia.org/r/839605

gerritbot added a project: Patch-For-Review. Oct 6 2022, 3:31 PM

All hosts decommissioned. I have submitted a patch to remove the role and various other small items from puppet.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/839605

BTullis updated the task description. Oct 7 2022, 9:45 AM

Change 839605 merged by Btullis:

[operations/puppet@production] Remove legacy AQS host configuration

https://gerrit.wikimedia.org/r/839605

BTullis updated the task description. Oct 7 2022, 10:18 AM

Maintenance_bot removed a project: Patch-For-Review. Oct 7 2022, 10:31 AM

Change 840096 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove absented resource definitions for aqs nodes

https://gerrit.wikimedia.org/r/840096

gerritbot added a project: Patch-For-Review. Oct 7 2022, 10:36 AM

Change 840096 merged by Btullis:

[operations/puppet@production] Remove absented resource definitions for aqs nodes

https://gerrit.wikimedia.org/r/840096

BTullis reassigned this task from BTullis to • Cmjohnson. Oct 7 2022, 1:55 PM

BTullis added a project: ops-eqiad.

BTullis updated the task description.

I believe that the service owner part of this task is all done, so I'm tagging ops-eqiad and assigning to @Cmjohnson as per the guidelines.

Please note that the wipefs step in the cookbook failed for aqs1007 so it may still be bootable. Please do let me know if there's anything else I can do to help.

Maintenance_bot added a project: SRE. Oct 7 2022, 2:29 PM

Maintenance_bot removed a project: Patch-For-Review.

BTullis mentioned this in T313936: aqs1004 low disk space warning. Oct 11 2022, 2:59 PM

• EChetty removed projects: Shared-Data-Infrastructure (Sprint 02), Data-Engineering-Planning. Oct 18 2022, 1:33 PM

Completed the steps for the decom process.

Decommission old AQS cluster nodes
Closed, Resolved. Public. 1 Estimated Story Points.

Description

aqs1004

aqs1005

aqs1006

aqs1007

aqs1008

aqs1009
