Decommission AQS 1.0
Closed, Resolved (Public)

Description

Now that AQS 2.0 has run without major incident for more than two months, Data Products
is comfortable with AQS 1.0 being fully decommissioned.

Removing this service matters to the Data-Persistence team: while it remains in place,
it adds maintenance overhead and consumes resources for no purpose.

There should be no traffic to AQS 1.0: https://grafana.wikimedia.org/d/2urzFwgIk/hnowlan-aqs-usage?orgId=1&from=1706540164273&to=1709132164274&viewPanel=2

Request

For Data SRE to evaluate what is required to remove AQS 1.0 from production and schedule removal.

Event Timeline

WDoranWMF created this task.

@WDoranWMF, just to make sure I understand: this is about decommissioning the AQS servers (/^aqs10(1[0-9]|2[0-1])\.eqiad\./ and /^aqs20(0[1-9]|1[0-2])\.codfw\./), including some cleanup of puppet code, with the service already not in use but still running. Cassandra is also running on those hosts, but that can be trashed too, with no need to preserve any of the data?
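
The host patterns quoted in this thread can be sanity-checked against a list of hostnames before acting on them. The following is only a sketch: the `.wmnet` FQDN suffix for the aqs hosts, the sample host list, and the helper name `matching` are assumptions made for illustration. It also shows why the codfw alternation needs to be grouped: an ungrouped `^aqs200[1-9]|aqs201[0-2]\.codfw\.` applies the `^` anchor only to the first branch and the `\.codfw\.` suffix only to the second.

```python
import re

# Patterns from the thread, without the surrounding slashes.
EQIAD = re.compile(r"^aqs10(1[0-9]|2[0-1])\.eqiad\.")
# Ungrouped alternation: "^" binds to the first branch only,
# "\.codfw\." to the second branch only.
CODFW_UNGROUPED = re.compile(r"^aqs200[1-9]|aqs201[0-2]\.codfw\.")
# Grouped alternation: anchor and suffix apply to both branches.
CODFW_GROUPED = re.compile(r"^aqs20(0[1-9]|1[0-2])\.codfw\.")


def matching(pattern, hosts):
    """Return the hosts that the compiled pattern matches, in input order."""
    return [h for h in hosts if pattern.search(h)]


# Hypothetical sample hosts; the ".wmnet" suffix is an assumption.
hosts = [
    "aqs1010.eqiad.wmnet",
    "aqs1021.eqiad.wmnet",
    "aqs1022.eqiad.wmnet",  # outside the eqiad range
    "aqs2001.codfw.wmnet",
    "aqs2012.codfw.wmnet",
    "aqs2013.codfw.wmnet",  # outside the codfw range
    "aqs2001.eqiad.wmnet",  # made-up name: codfw number, wrong site
]

eqiad_hosts = matching(EQIAD, hosts)
codfw_grouped = matching(CODFW_GROUPED, hosts)
codfw_ungrouped = matching(CODFW_UNGROUPED, hosts)

print(eqiad_hosts)
print(codfw_grouped)
print(codfw_ungrouped)
```

With this sample list, the ungrouped pattern also accepts the made-up `aqs2001.eqiad.wmnet`, because its first branch never checks the site; the grouped pattern rejects it.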

@Gehel Yes, though I think Data-Persistence can help with the Cassandra part if needed. Is that fair, @KOfori ?

This would be great. Long term, decoupling the service from the storage cluster is going to simplify maintenance a great deal. In the nearer term, this will unblock upgrading to Debian Bookworm (the transition to Bullseye was already problematic).

@WDoranWMF, just to make sure I understand: this is about decommissioning the AQS servers (/^aqs10(1[0-9]|2[0-1])\.eqiad\./ and /^aqs20(0[1-9]|1[0-2])\.codfw\./), including some cleanup of puppet code, with the service already not in use but still running. Cassandra is also running on those hosts, but that can be trashed too, with no need to preserve any of the data?

Nooooo, we need to leave the cluster in place, and de-deploy (un-deploy?) the AQS service only.

Please, no one decommission the servers/Cassandra :)

Change 1013063 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] AQS1.0: disable aqs service

https://gerrit.wikimedia.org/r/1013063

Change 1013063 merged by Brouberol:

[operations/puppet@production] AQS1.0: disable aqs service

https://gerrit.wikimedia.org/r/1013063

Change #1013325 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set aqs to non-paging

https://gerrit.wikimedia.org/r/1013325

Change #1013325 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set aqs to non-paging

https://gerrit.wikimedia.org/r/1013325

Change #1013500 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/dns@master] Decommission aqs records

https://gerrit.wikimedia.org/r/1013500

Change #1013501 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Decommission aqs realserver pool

https://gerrit.wikimedia.org/r/1013501

Change #1013500 merged by Brouberol:

[operations/dns@master] Decommission aqs records

https://gerrit.wikimedia.org/r/1013500

Change #1014023 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] aqs: remove the relserver pool from host

https://gerrit.wikimedia.org/r/1014023

Change #1013501 merged by Brouberol:

[operations/puppet@production] Set state of aqs service to lvs_setup

https://gerrit.wikimedia.org/r/1013501

Change #1014023 merged by Brouberol:

[operations/puppet@production] aqs: remove the relserver pool from host

https://gerrit.wikimedia.org/r/1014023

Mentioned in SAL (#wikimedia-operations) [2024-03-25T15:00:31Z] <brouberol> restarting pybal on lvs2014.codfw.wmnet - T358793

Mentioned in SAL (#wikimedia-operations) [2024-03-25T15:13:06Z] <brouberol> restarting pybal on lvs2013.codfw.wmnet - T358793

Mentioned in SAL (#wikimedia-operations) [2024-03-25T16:05:54Z] <brouberol> restarting pybal on lvs1020.eqiad.wmnet - T358793

Mentioned in SAL (#wikimedia-operations) [2024-03-25T16:12:27Z] <brouberol> restarting pybal on lvs1019.eqiad.wmnet - T358793

Change #1014042 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] aqs: Remove conftool data and service entry

https://gerrit.wikimedia.org/r/1014042

Change #1014100 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] aqs: fix puppet compilation error

https://gerrit.wikimedia.org/r/1014100

Change #1014042 merged by Brouberol:

[operations/puppet@production] aqs: Remove conftool data and service entry

https://gerrit.wikimedia.org/r/1014042

Change #1014100 merged by Brouberol:

[operations/puppet@production] aqs: fix puppet compilation error

https://gerrit.wikimedia.org/r/1014100

I've removed the VIP from all AQS hosts. All that remains is to remove the VIP from Netbox.

Mentioned in SAL (#wikimedia-operations) [2024-03-25T19:28:32Z] <brouberol> removing VIP from AQS hosts - T358793

Mentioned in SAL (#wikimedia-operations) [2024-03-26T08:18:45Z] <brouberol> deleting AQS codfw VIP (10.2.1.12/32) from Netbox - T358793

Mentioned in SAL (#wikimedia-operations) [2024-03-26T08:19:32Z] <brouberol> deleting AQS eqiad VIP (10.2.2.12/32) from Netbox - T358793

brouberol claimed this task.

Change #1014441 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] aqs: remove conftool data and envoy listener

https://gerrit.wikimedia.org/r/1014441

Change #1014441 merged by Brouberol:

[operations/puppet@production] aqs: remove conftool data and envoy listener

https://gerrit.wikimedia.org/r/1014441

Amazing, thank you @brouberol! Do you want to have the joy of resolving the task?

My pleasure :) I've already set it to Resolved 👍