Decommission ORES configurations and servers
Closed, Resolved | Public | 3 Estimated Story Points

Description

Now that ores-legacy handles all the traffic for ores.wikimedia.org, we can start decommissioning the old ORES infrastructure.

The starting point is https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Remove_from_production

High level idea:

  • Remove old Icinga alerts etc. Set LVS checks to not page if all nodes are down.
  • Remove the ORES nodes from pybal, plus its config (VIP etc.). This requires coordination with Traffic, since some pybal restarts are probably needed (to update LVS etc.).
  • Shut down all ORES nodes and set downtime.
  • Clean up DNS records.
  • Clean up ORES classes and configs from puppet, including POSIX groups. The ores-admin group was only ever applied to hosts that are now being fully decommissioned, so the group entry can be removed entirely and the GID added to the reclaim list in L71 of data.yaml.
  • Clean up TLS certificates for ores.discovery.wmnet.
  • Check Horizon for old VMs etc. We should probably shut down everything and close projects that we don't need, to release capacity.
  • Update the Wikitech/MediaWiki documentation.
  • Delete old Grafana dashboards.
  • Archive ORES-related repositories (tracked in T349632).

Details

Repo                          Branch      Lines +/-
operations/puppet             production  +0 -1
labs/private                  master      +0 -3
operations/puppet             production  +0 -16
operations/puppet             production  +0 -11
operations/puppet             production  +0 -6
operations/puppet             production  +0 -5
operations/puppet             production  +1 -10
operations/puppet             production  +0 -13
operations/puppet             production  +0 -12
operations/puppet             production  +0 -14
operations/puppet             production  +0 -58
operations/puppet             production  +0 -3
operations/puppet             production  +0 -1 K
operations/deployment-charts  master      +0 -2
operations/puppet             production  +2 -8
operations/puppet             production  +0 -48
operations/puppet             production  +31 -0
operations/puppet             production  +1 -0
operations/dns                master      +0 -2
operations/puppet             production  +0 -5
operations/puppet             production  +0 -57
operations/puppet             production  +1 -7
operations/puppet             production  +1 -1
operations/dns                master      +0 -2
operations/dns                master      +0 -4
operations/puppet             production  +1 -0

Event Timeline

Change 961802 merged by Klausman:

[operations/dns@master] wmnet: Remove ORES discovery entry

https://gerrit.wikimedia.org/r/961802

Change 961799 merged by Klausman:

[operations/puppet@production] services: Move ORES to state lvs_setup for turndown

https://gerrit.wikimedia.org/r/961799

Change 961805 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] services/lvs: Turn down ORES LVS setup

https://gerrit.wikimedia.org/r/961805

Change 961805 merged by Klausman:

[operations/puppet@production] services/lvs: Turn down ORES LVS setup

https://gerrit.wikimedia.org/r/961805

Mentioned in SAL (#wikimedia-operations) [2023-09-28T14:00:51Z] <klausman> restarted pybal on lvs1020 and lvs2014 (LVS low-traffic backups) for T347278 (ORES turndown)

Mentioned in SAL (#wikimedia-operations) [2023-09-28T14:13:21Z] <klausman> restarting pybal on lvs1019 and lvs2013 (LVS low-traffic actives) for T347278 (ORES turndown)

Change 961791 merged by Klausman:

[operations/puppet@production] Services: Remove pybal/LVS entry for ORES

https://gerrit.wikimedia.org/r/961791

Change 961825 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] ORES: remove profile::services_proxy::envoy::enabled_listeners role

https://gerrit.wikimedia.org/r/961825

Change 961825 merged by Klausman:

[operations/puppet@production] ORES: remove ORES from Envoy listeners list

https://gerrit.wikimedia.org/r/961825

Change 962642 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Remove ores.svc.{eqiad,codfw}.wmnet records

https://gerrit.wikimedia.org/r/962642

Change 962642 merged by Elukey:

[operations/dns@master] Remove ores.svc.{eqiad,codfw}.wmnet records

https://gerrit.wikimedia.org/r/962642

Change 963009 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] conftool-data: Add entry for recommendation-api-ng

https://gerrit.wikimedia.org/r/963009

Change 963009 merged by Klausman:

[operations/puppet@production] conftool-data: Add entry for recommendation-api-ng

https://gerrit.wikimedia.org/r/963009

Change 963013 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera/services: add service for recommendation-api-ng

https://gerrit.wikimedia.org/r/963013

Change 963013 abandoned by Klausman:

[operations/puppet@production] hiera/services: add service for recommendation-api-ng

Reason:

Not needed since this is a ml-k8s service

https://gerrit.wikimedia.org/r/963013

Icinga downtime and Alertmanager silence (ID=79579710-b671-4886-a85b-afefbf9b3afb) set by klausman@cumin1001 for 90 days, 0:00:00 on 22 host(s) and their services with reason: Downtime for graceful shutdown and later decom

ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet,orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet
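
The host expression above can be checked against the "22 host(s)" count with a small expansion script. This is a minimal sketch, not the tool that produced the log line: `expand_hosts` is a hypothetical helper, and it assumes each comma-separated pattern contains at most one `[start-end]` numeric range.

```python
import re

# Matches one bracketed numeric range such as "[2001-2009]".
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(expression: str) -> list[str]:
    """Expand comma-separated host patterns, each with at most one [start-end] range."""
    hosts = []
    for pattern in expression.split(","):
        match = _RANGE.search(pattern)
        if match is None:
            # No range in this pattern: it is already a single hostname.
            hosts.append(pattern)
            continue
        start, end = match.group(1), match.group(2)
        width = len(start)  # preserve zero padding, if any
        prefix, suffix = pattern[:match.start()], pattern[match.end():]
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{str(number).zfill(width)}{suffix}")
    return hosts


# Expression copied from the downtime entry in this task.
EXPRESSION = (
    "ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet,"
    "orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet"
)

hosts = expand_hosts(EXPRESSION)
print(len(hosts))  # 9 + 9 + 2 + 2 = 22
```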

The machines ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet have been shut down, except for ores1001 and ores2001, which are still running in case we need files from them.

After discussion on IRC, I have also shut down 1001 and 2001.

@klausman the DNS step is marked as done, but I see the ORES SVC records still exist in Netbox ( https://netbox.wikimedia.org/ipam/ip-addresses/?q=ores ). Is that a leftover, or is it pending some other step? (Once they are removed, a run of the sre.dns.netbox cookbook is needed.)

My bad! Removed the records and ran the cookbook :)
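
A quick way to double-check the DNS step after the records are removed is to confirm that the old names no longer resolve. The following is a minimal sketch under stated assumptions: `still_resolving` is a hypothetical helper, the name list is taken from the changes in this task, and the check only reflects the resolver of the host it runs on (internal names would need to be checked from inside the network). It says nothing about Netbox itself.

```python
import socket
from typing import Callable

# Record names taken from the DNS changes in this task.
REMOVED_NAMES = [
    "ores.discovery.wmnet",
    "ores.svc.eqiad.wmnet",
    "ores.svc.codfw.wmnet",
]


def still_resolving(
    names: list[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> list[str]:
    """Return the subset of names that the resolver can still look up."""
    leftovers = []
    for name in names:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError: the lookup failed,
            # which is the expected result for a removed record.
            continue
        leftovers.append(name)
    return leftovers
```

Calling `still_resolving(REMOVED_NAMES)` should return an empty list once the cleanup has propagated. The `resolve` parameter exists only so the function can be exercised with a stub instead of real DNS.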

Change 965124 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::redis::misc::{master,slave}: remove ORES configs

https://gerrit.wikimedia.org/r/965124

Mentioned in SAL (#wikimedia-releng) [2023-10-12T10:38:46Z] <elukey> delete ores proxy and instance in deployment-prep - T347278

Change 965124 merged by Effie Mouzeli:

[operations/puppet@production] role::redis::misc::{master,slave}: remove ORES configs

https://gerrit.wikimedia.org/r/965124

jijiki subscribed.

Data has been flushed from both rdb1011 and rdb2009

Change 967447 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove ores listener from mediawiki fixtures

https://gerrit.wikimedia.org/r/967447

Change 967447 merged by jenkins-bot:

[operations/deployment-charts@master] Remove ores listener from mediawiki fixtures

https://gerrit.wikimedia.org/r/967447

elukey@puppetmaster1001:~$ sudo puppet cert clean ores.discovery.wmnet
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
Notice: Revoked certificate with serial 6792
Notice: Removing file Puppet::SSL::Certificate ores.discovery.wmnet at '/var/lib/puppet/server/ssl/ca/signed/ores.discovery.wmnet.pem'
Notice: Removing file Puppet::SSL::Certificate ores.discovery.wmnet at '/var/lib/puppet/server/ssl/certs/ores.discovery.wmnet.pem'
calbon set the point value for this task to 3. (Nov 2 2023, 7:13 PM)
calbon moved this task from In Progress to Ready To Go on the Machine-Learning-Team board.
achou triaged this task as Medium priority. (Nov 2 2023, 7:28 PM)

Change 975213 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Remove ORES roles and configs

https://gerrit.wikimedia.org/r/975213

Change 975214 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::logstash: remove ORES configs

https://gerrit.wikimedia.org/r/975214

Change 975215 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Remove ORES deployment settings

https://gerrit.wikimedia.org/r/975215

Change 975216 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Remove ORES configs and clusters

https://gerrit.wikimedia.org/r/975216

Change 975217 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::prometheus::ops: remove ORES Redis configs

https://gerrit.wikimedia.org/r/975217

Change 975218 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cloud: Remove ores-beta ATS settings

https://gerrit.wikimedia.org/r/975218

Change 975219 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] admin: remove ores-admins group

https://gerrit.wikimedia.org/r/975219

Change 975220 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] contactgroups: remove old team-scoring

https://gerrit.wikimedia.org/r/975220

Change 975213 merged by Elukey:

[operations/puppet@production] Remove ORES roles and configs

https://gerrit.wikimedia.org/r/975213

Updated the Wikitech and MediaWiki documentation pages about ORES with a deprecation banner.

Change 975267 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::httpbb: remove ores_test configs

https://gerrit.wikimedia.org/r/975267

Change 975267 merged by Elukey:

[operations/puppet@production] profile::httpbb: remove ores_test configs

https://gerrit.wikimedia.org/r/975267

Change 975214 merged by Elukey:

[operations/puppet@production] profile::logstash: remove ORES configs

https://gerrit.wikimedia.org/r/975214

Change 975215 merged by Elukey:

[operations/puppet@production] Remove ORES deployment settings

https://gerrit.wikimedia.org/r/975215

Change 975216 merged by Elukey:

[operations/puppet@production] Remove ORES configs and clusters

https://gerrit.wikimedia.org/r/975216

Change 975217 merged by Elukey:

[operations/puppet@production] profile::prometheus::ops: remove ORES Redis configs

https://gerrit.wikimedia.org/r/975217

Change 975219 merged by Elukey:

[operations/puppet@production] admin: remove ores-admins group

https://gerrit.wikimedia.org/r/975219

Change 975220 merged by Elukey:

[operations/puppet@production] contactgroups: remove old team-scoring

https://gerrit.wikimedia.org/r/975220

Change 975218 merged by Elukey:

[operations/puppet@production] cloud: Remove ores-beta ATS settings

https://gerrit.wikimedia.org/r/975218

Change 975285 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Clean up ores configs not used anymore

https://gerrit.wikimedia.org/r/975285

@klausman everything should be done, except the work in T349632, lemme know if anything is missing, otherwise this is done.

Change 975285 merged by Elukey:

[operations/puppet@production] Clean up ores configs not used anymore

https://gerrit.wikimedia.org/r/975285

Thank you! I'll do another sweep of the assorted SRE repos (puppet, dns, private, ...) and close the ticket when done

Change 975780 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] Clean up additional ORES leftovers

https://gerrit.wikimedia.org/r/975780

Change 975780 merged by Klausman:

[operations/puppet@production] Clean up additional ORES leftovers

https://gerrit.wikimedia.org/r/975780

Deleted two Redis passwords in the private puppet repo.

Ping :)

These are the remaining hits in the puppet repo.

We need to keep these:

hieradata/common/profile/kubernetes/deployment_server.yaml
336:    ores-legacy:
338:        - name: ores-legacy
341:        - name: ores-legacy-deploy

hieradata/common/profile/trafficserver/backend.yaml
135:      target: http://ores.wikimedia.org
136:      replacement: https://ores-legacy.discovery.wmnet:31443
138:      target: http://ores-legacy.wikimedia.org
139:      replacement: https://ores-legacy.discovery.wmnet:31443

hieradata/role/common/cache/text.yaml
94:  ores.wikimedia.org:
96:  ores-legacy.wikimedia.org:

modules/profile/files/sre/buster.yaml
89:role::ores:

The last one is mostly for historical accuracy/tracking.

Remainder:

modules/profile/files/toolforge/legacy_redirector.lua
365:        'order-user-by-reg', 'ordia', 'orejasbot', 'ores', 'ores-afc',
366:        'ores-demos', 'ores-support-checklist', 'orphan-groups', 'orphantalk'

These probably can go.

modules/profile/manifests/mediawiki/maintenance/growthexperiments.pp
16:        command  => '/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/listTaskCounts.php --topictype ores --statsd --output none',

I don't know if this somehow ties into the ORES extension.

modules/profile/templates/wmcs/backy2/wmcs_backup_instances.yaml.erb
77:  ores: cloudbackup1003

This probably can go.

modules/service/manifests/uwsgi.pp
22:#   Note: this parameter will be removed onces ores.wmflabs.org stops using service::uwsgi

This one I would keep, as it breaks nothing and removing the parameter in question would likely take a bigger patch.

From the private repo (and thus also on the puppetmaster):

hieradata/role/common/deployment_server/kubernetes.yaml
102:    ores-legacy:

Need to keep this.

hieradata/role/common/ores.yaml
1:profile::ores::web::redis_password: apassword

hieradata/role/common/ores/redis.yaml
1:profile::ores::redis::password: dummypass

hieradata/role/common/scb.yaml
2:profile::ores::web::redis_password: apassword

These likely can go, as Redis is no longer used.

I will provide patches for puppet and private repo and link them to this bug.
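
A sweep like the one above can be reproduced with a short script that walks a repository checkout and groups matching lines by file. This is a minimal sketch, not the command actually used for the listing: `find_hits` and `report` are hypothetical helpers, and the whole-word pattern `\bores\b` is an assumption chosen to avoid matching words such as "stores".

```python
import os
import re

# Assumed search pattern: "ores" as a whole word, case-insensitive.
NEEDLE = re.compile(r"\bores\b", re.IGNORECASE)


def find_hits(root: str) -> dict[str, list[tuple[int, str]]]:
    """Map each file under root to the (line number, line) pairs that match."""
    hits = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into version-control metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    lines = handle.read().splitlines()
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable file
            matched = [
                (number, line)
                for number, line in enumerate(lines, start=1)
                if NEEDLE.search(line)
            ]
            if matched:
                hits[os.path.relpath(path, root)] = matched
    return hits


def report(hits: dict[str, list[tuple[int, str]]]) -> None:
    """Print hits as a file path followed by 'line number:content' rows."""
    for path in sorted(hits):
        print(path)
        for number, line in hits[path]:
            print(f"{number}:{line}")
        print()
```

Running `report(find_hits("."))` from the top of a checkout prints every remaining mention, including the ores-legacy entries that need to stay, so the result still has to be triaged by hand as in the comment above.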

Change 979915 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] hiera: clean up more ORES leftovers

https://gerrit.wikimedia.org/r/979915

Change 979916 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] profiles: Remove more ORES leftovers

https://gerrit.wikimedia.org/r/979916

Change 979915 merged by Klausman:

[labs/private@master] hiera: clean up more ORES leftovers

https://gerrit.wikimedia.org/r/979915

Change 979916 merged by Klausman:

[operations/puppet@production] profiles: Remove more ORES leftovers

https://gerrit.wikimedia.org/r/979916

We're all done here. We'll handle the to-be-archived repos in the separate ticket (T349632).