
Replace existing aphlict1001 with puppet-managed bullseye host
Closed, Resolved (Public)

Description

The current aphlict host (aphlict1001.eqiad.wmnet) runs buster and relies largely on hand-provisioned configs. A new host, aphlict1002, has been set up on bullseye with all configs and services managed by puppet. This ticket tracks the work to move aphlict1002 into production and decommission aphlict1001.

The plan is to test the new host during a maintenance window. If the new host works correctly, we will keep it as the production host and shut down the existing one.

  • Set up aphlict1002, running bullseye
  • Schedule maintenance window for phabricator: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230424T1900
  • During the maintenance window:
    • Remove phabricator::aphlict::ensure: absent from puppet/hierdata/hosts/aphlict1002.yaml
    • Update aphlict.discovery.wmnet to point to the new host in dns/templates/wmnet
    • sudo run-puppet-agent on aphlict1002.eqiad.wmnet
    • Test that notifications in phabricator work correctly (move tickets in a workboard, add comments to see popups, etc.) and check the logs to see traffic hitting the new host
    • If needed, revert the changes above to return to normal
  • If keeping the new host, turn aphlict1001.eqiad.wmnet off for ~2 weeks to allow time for recovery if needed, then decommission the host
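
One way to confirm the DNS step took effect is to compare what the discovery name resolves to against the new host's address. The following is a minimal sketch, not the check that was actually run for this ticket. The helper names `resolve_ipv4` and `discovery_points_to` are hypothetical, and it assumes both names resolve to IPv4 addresses from wherever it is run (these are internal names, so that means from inside the production network).

```python
import socket
from typing import Callable, Set


def resolve_ipv4(name: str) -> Set[str]:
    """Return the set of IPv4 addresses that `name` currently resolves to."""
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list).
    _, _, addresses = socket.gethostbyname_ex(name)
    return set(addresses)


def discovery_points_to(
    discovery_name: str,
    expected_host: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """True if every address behind `discovery_name` belongs to `expected_host`.

    `resolver` is injectable so the comparison can be exercised without
    real DNS (hypothetical helper, not part of any Wikimedia tooling).
    """
    discovery_addrs = resolver(discovery_name)
    host_addrs = resolver(expected_host)
    # An empty answer is treated as a failure rather than a vacuous success.
    return bool(discovery_addrs) and discovery_addrs <= host_addrs
```

Calling `discovery_points_to("aphlict.discovery.wmnet", "aphlict1002.eqiad.wmnet")` after the DNS change should return `True` once resolvers have picked up the new record. Cached answers may lag until the record's TTL expires.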

Once this is done and we've verified that the puppet-managed configs are sound, we can reimage the aphlict2001.codfw.wmnet host to ensure the remaining hand-rolled configs are correctly managed by puppet, and then close the parent task.

Event Timeline

Change 911352 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Move aphlict.discovery.wmnet over to aphlict1002

https://gerrit.wikimedia.org/r/911352

Change 911352 merged by EoghanGaffney:

[operations/dns@master] Move aphlict.discovery.wmnet over to aphlict1002

https://gerrit.wikimedia.org/r/911352

Change 911357 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Add aphlict service to new vm

https://gerrit.wikimedia.org/r/911357

Change 911357 merged by EoghanGaffney:

[operations/puppet@production] Add aphlict service to new vm

https://gerrit.wikimedia.org/r/911357

Icinga downtime and Alertmanager silence (ID=03511766-314f-4a97-9387-a1bbedb36faa) set by eoghan@cumin1001 for 5 days, 0:00:00 on 1 host(s) and their services with reason: aphlict1002 is now active for testing

aphlict1001.eqiad.wmnet

@brennen and I worked on this today. The new aphlict1002 host is currently active (after this DNS change). The service is stopped on the old host (aphlict1001), but not masked or removed. This appears to be functioning correctly, but in case there is a need to revert this change:

  1. Restart the aphlict service: ssh aphlict1001.eqiad.wmnet, then sudo systemctl start aphlict
  2. Revert https://gerrit.wikimedia.org/r/911352, then merge that
  3. sudo authdns-update on dns1001.eqiad.wmnet
  4. Stop aphlict on aphlict1002: ssh aphlict1002.eqiad.wmnet, then sudo systemctl stop aphlict
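
The revert steps above can be kept as an ordered checklist so nobody has to reconstruct the sequence under pressure. This is a minimal sketch that only prints the steps and does not execute anything. The names `revert_plan` and `format_plan` are hypothetical, the `"gerrit"` label is just a marker for the manual code-review step, and the hosts and commands are copied from the list above.

```python
from typing import List, Tuple

OLD_HOST = "aphlict1001.eqiad.wmnet"
NEW_HOST = "aphlict1002.eqiad.wmnet"
DNS_HOST = "dns1001.eqiad.wmnet"


def revert_plan() -> List[Tuple[str, str]]:
    """Return (where, action) pairs in the order listed in the comment above."""
    return [
        (OLD_HOST, "sudo systemctl start aphlict"),
        ("gerrit", "revert https://gerrit.wikimedia.org/r/911352 and merge the revert"),
        (DNS_HOST, "sudo authdns-update"),
        (NEW_HOST, "sudo systemctl stop aphlict"),
    ]


def format_plan(steps: List[Tuple[str, str]]) -> str:
    """Render the steps as a numbered checklist, one step per line."""
    return "\n".join(
        f"{number}. [{where}] {action}"
        for number, (where, action) in enumerate(steps, start=1)
    )
```

Running `print(format_plan(revert_plan()))` gives a four-line checklist. Note that the old service is started before the DNS change and the new one is stopped only afterwards, so a listener should be available on whichever host the discovery name points to throughout the revert.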

I intend to leave this in place as is until Friday. On Friday, I will shut down the aphlict1001 host and leave it off for one week. If no issues are raised after that week, we will decommission it.

I'm not sure, but it's possible there was some minor fallout here, manifesting as one task being in 2 columns on workboards: T335422

Maybe it happened because people were moving a task right during the maintenance window. It also seems fixed now, after moving the task again from one column to another.

Icinga downtime and Alertmanager silence (ID=569558b7-ab79-4627-a68d-ab4f6ec49f24) set by eoghan@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: aphlict1002 is now active

aphlict1001.eqiad.wmnet

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: aphlict1001.eqiad.wmnet

  • aphlict1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change 917873 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] [aphlict] Remove aphlict1001 CNAME

https://gerrit.wikimedia.org/r/917873

Change 917873 merged by EoghanGaffney:

[operations/dns@master] [aphlict] Remove aphlict1001 CNAME

https://gerrit.wikimedia.org/r/917873

aphlict1001 was decommissioned and references in dns and wikitech were removed.