
Migrate poolcounter hosts to bookworm
Closed, Resolved · Public

Description

The poolcounter hosts are still on buster; this is a tracking task for upgrading them to bullseye.

Notes:

  • Poolcounter is now packaged upstream in Debian bullseye, so we no longer need to package it ourselves.
  • poolcounter-prometheus-exporter will need to be packaged for bullseye.
  • MediaWiki now handles an unresponsive poolcounter gracefully, but it's still best to deploy a mediawiki-config change (see 0e9520b5d as an example); a conceptual sketch of that fail-open behaviour follows these notes.
  • Important: the upgrades must happen one host at a time, otherwise we risk a really big outage.
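The third note above refers to MediaWiki degrading gracefully when a poolcounter is unreachable. Below is a conceptual Python sketch of that fail-open behaviour; it is not MediaWiki's actual (PHP) client, and the port (7531), the ACQ4ME command syntax and the LOCKED reply are assumptions based on the upstream PoolCounter text protocol, to be double-checked against the daemon.

```python
import socket

POOLCOUNTER_PORT = 7531  # assumption: the port poolcounterd listens on


def acquire_or_fail_open(host, key, timeout=0.5):
    """Try to take a poolcounter lock; on any network error, fail open.

    Conceptual sketch only: 'ACQ4ME <key> <workers> <maxqueue> <timeout>'
    and the 'LOCKED' reply follow the upstream PoolCounter text protocol
    as far as I know; treat both as assumptions.
    """
    try:
        with socket.create_connection((host, POOLCOUNTER_PORT), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(f"ACQ4ME {key} 4 100 10\n".encode())
            reply = sock.recv(4096).decode().strip()
            return reply.startswith("LOCKED")
    except OSError:
        # Unreachable or unresponsive poolcounter: proceed without the
        # lock instead of failing the request; this is the graceful
        # degradation the note above refers to.
        return True
```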

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 14 2023, 2:23 PM
Dzahn subscribed.

T370458 is asking for help getting the same thing done in beta. Both will need the poolcounter-prometheus-exporter package for bullseye/bookworm.

poolcounter-prometheus-exporter will need to be packaged for bullseye

No need for that. It's a Go binary; I've copied it to bookworm-wikimedia. poolcounter itself is already available on bookworm.

Cool, I'll pass on the good news!

Hey folks, as far as I can tell both poolcounter (Debian upstream) and poolcounter-prometheus-exporter (bookworm-wikimedia) are already good to go, so could we attempt a reimage of one of the nodes? (I can take care of it.)

Procedure (to be run away from deployment windows):

  • Depool one node from MediaWiki (requires a deployment).
  • Reimage the node and verify that it works correctly afterwards (a verification sketch follows below).
  • Repool the node in MediaWiki.

Rinse and repeat for all the nodes, waiting a couple of days each time to catch any weirdness or corner cases.
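For the "verify that it works correctly" step, a minimal liveness check could look like the sketch below. It assumes poolcounterd answers a plain-text STATS FULL request on TCP 7531; verify both the port and the command against the running daemon.

```python
import socket


def poolcounter_smoke_test(host, port=7531, timeout=2.0):
    """Minimal liveness check for a (re)imaged poolcounter node.

    Assumes poolcounterd speaks its line-based text protocol on port
    7531 and answers 'STATS FULL'; both are assumptions to confirm.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        sock.sendall(b"STATS FULL\n")
        reply = sock.recv(65536).decode()
    if not reply.strip():
        raise RuntimeError(f"{host}: poolcounterd returned an empty reply")
    print(f"{host}: poolcounterd is answering:\n{reply}")


# Example:
# poolcounter_smoke_test("poolcounter1004.eqiad.wmnet")
```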

How does it sound?

MoritzMuehlenhoff renamed this task from "Migrate poolcounter hosts to bullseye" to "Migrate poolcounter hosts to bookworm". · Sep 9 2024, 9:44 AM

Better procedure after chatting with Moritz:

  • Create a new VM on Bookworm and test that everything is OK (see the exporter check below).
  • Swap a live Buster poolcounter with the new VM (via a MediaWiki deployment).
  • Let it bake for a bit, and then delete the old VM.

Once we are sure that everything is solid, we could proceed with T332015#10129086 for the other nodes (easier, and safer at that point given the above).
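For the "test that everything is OK" part of the first step, besides poking poolcounterd itself (see the earlier sketch), the prometheus exporter can be spot-checked over HTTP. A minimal sketch follows; the exporter's listening port is not stated in this task, so the value below is only a placeholder to replace with whatever the puppet/service config uses.

```python
import urllib.request

# Placeholder: the real poolcounter-prometheus-exporter port is not
# given in this task; take it from the puppet/service config.
EXPORTER_PORT = 9106


def check_exporter(host, port=EXPORTER_PORT, timeout=5):
    """Fetch the exporter's /metrics page and make sure it is non-empty."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode()
    if not body.strip():
        raise RuntimeError(f"{url} returned an empty metrics page")
    print(f"{url}: {len(body.splitlines())} metric lines")
```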

Sounds good.
From what I can see, poolcounter2004.codfw.wmnet and poolcounter1005.eqiad.wmnet are the least used; which one to start with depends on whether you plan on doing the update before or after T370962: Southward Datacenter Switchover (September 2024).
Thanks for taking care of this <3

Change #1072179 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Install poolcounter2005 with Puppet 7

https://gerrit.wikimedia.org/r/1072179

Change #1072179 merged by Elukey:

[operations/puppet@production] Install poolcounter2005 with Puppet 7

https://gerrit.wikimedia.org/r/1072179

The poolcounter2005 host is up on Bookworm; as far as I can see it is working fine.

If serviceops can confirm that the host is working, we can easily swap 2003 with 2005 in mediawiki-config (they are in the same Ganeti row, A).

Change #1072206 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/mediawiki-config@master] Swap poolcounter2003 with poolcounter2005

https://gerrit.wikimedia.org/r/1072206

Change #1072501 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: switch thumbor in codfw to poolcounter2005

https://gerrit.wikimedia.org/r/1072501

Change #1072501 merged by Elukey:

[operations/deployment-charts@master] services: switch thumbor in codfw to poolcounter2005

https://gerrit.wikimedia.org/r/1072501

Mentioned in SAL (#wikimedia-operations) [2024-09-12T12:41:26Z] <elukey> thumbor codfw on wikikube moved to poolcounter2005 - T332015

Moved Thumbor in codfw to poolcounter2005; everything worked nicely.

At this point I think that we can:

  • Create the 3 missing VMs (poolcounter2006, poolcounter1006 and poolcounter1007).
  • Move thumbor to the new VMs in eqiad.
  • Possibly move mw-debug's config to the new hosts.
  • Proceed with https://gerrit.wikimedia.org/r/1072206

Change #1072716 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update thumbor-eqiad to poolcounter1006

https://gerrit.wikimedia.org/r/1072716

Change #1072717 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: add new poolcounter nodes to MW configs

https://gerrit.wikimedia.org/r/1072717

All new VMs created!

Next steps:

  • Move thumbor-eqiad to poolcounter1006
  • Update MediaWiki's k8s network policy to allow the new nodes.
  • Test the new nodes via mw-debug (see the sketch after this list).
  • Roll out the change (via mediawiki-config) everywhere.
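For the mw-debug test, something along these lines could be used: it renders a page through an mwdebug backend via the X-Wikimedia-Debug header, so the request exercises MediaWiki (and therefore poolcounter) on the debug servers before the wider rollout. The backend hostname and target page are illustrative, and a 200 response only proves the request path works, so pair it with the poolcounter dashboards and logs.

```python
import urllib.request


def test_via_mwdebug(url="https://en.wikipedia.org/wiki/Special:Random",
                     backend="mwdebug1001.eqiad.wmnet"):
    """Render a page through an mwdebug backend (illustrative values).

    The 'backend=<host>' syntax of X-Wikimedia-Debug and the hostname
    are assumptions; use whatever the current mw-debug setup expects.
    """
    req = urllib.request.Request(url, headers={
        "X-Wikimedia-Debug": f"backend={backend}",
        "User-Agent": "poolcounter-migration-check (T332015)",
    })
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
    print(f"HTTP {resp.status} for {url} via {backend}: {len(body)} bytes")
```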

Change #1072716 merged by Elukey:

[operations/deployment-charts@master] services: update thumbor-eqiad to poolcounter1006

https://gerrit.wikimedia.org/r/1072716

Change #1072717 merged by Elukey:

[operations/deployment-charts@master] services: add new poolcounter nodes to MW configs

https://gerrit.wikimedia.org/r/1072717

Change #1073164 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: remove old poolcounter netpolicies for Thumbor

https://gerrit.wikimedia.org/r/1073164

Thumbor has been migrated to the new poolcounter VMs, and the MW network policies now allow the new VMs' IPs.

Next step: proceed with the mediawiki-config swap (https://gerrit.wikimedia.org/r/1072206).

Change #1072206 merged by Elukey:

[operations/mediawiki-config@master] Swap poolcounter2003 with poolcounter2005

https://gerrit.wikimedia.org/r/1072206

Mentioned in SAL (#wikimedia-operations) [2024-09-17T10:38:27Z] <elukey@deploy1003> Started scap sync-world: Backport for [[gerrit:1072206|Swap poolcounter2003 with poolcounter2005 (T332015)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-17T10:45:06Z] <elukey@deploy1003> elukey: Backport for [[gerrit:1072206|Swap poolcounter2003 with poolcounter2005 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-17T10:51:46Z] <elukey@deploy1003> Finished scap sync-world: Backport for [[gerrit:1072206|Swap poolcounter2003 with poolcounter2005 (T332015)]] (duration: 13m 19s)

Change #1073427 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/mediawiki-config@master] Swap poolcounter{2004,1004,1005} with the newer Bookworm-based hosts

https://gerrit.wikimedia.org/r/1073427

Change #1073502 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/mediawiki-config@master] Swap poolcounter1004 with poolcounter1006

https://gerrit.wikimedia.org/r/1073502

Change #1073503 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/mediawiki-config@master] Swap poolcounter1005 with poolcounter1007

https://gerrit.wikimedia.org/r/1073503

Change #1073164 merged by Elukey:

[operations/deployment-charts@master] services: remove old poolcounter netpolicies for Thumbor

https://gerrit.wikimedia.org/r/1073164

Change #1073427 merged by jenkins-bot:

[operations/mediawiki-config@master] Swap poolcounter2004 with poolcounter2006

https://gerrit.wikimedia.org/r/1073427

Mentioned in SAL (#wikimedia-operations) [2024-09-18T10:07:16Z] <elukey@deploy1003> Started scap sync-world: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-18T10:09:34Z] <elukey@deploy1003> elukey: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-18T10:14:24Z] <elukey@deploy1003> Finished scap sync-world: Backport for [[gerrit:1073427|Swap poolcounter2004 with poolcounter2006 (T332015)]] (duration: 07m 08s)

Change #1073502 merged by jenkins-bot:

[operations/mediawiki-config@master] Swap poolcounter1004 with poolcounter1006

https://gerrit.wikimedia.org/r/1073502

Mentioned in SAL (#wikimedia-operations) [2024-09-18T13:31:21Z] <elukey@deploy1003> Started scap sync-world: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-18T13:33:42Z] <elukey@deploy1003> elukey: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-18T13:38:37Z] <elukey@deploy1003> Finished scap sync-world: Backport for [[gerrit:1073502|Swap poolcounter1004 with poolcounter1006 (T332015)]] (duration: 07m 15s)

Change #1073503 merged by jenkins-bot:

[operations/mediawiki-config@master] Swap poolcounter1005 with poolcounter1007

https://gerrit.wikimedia.org/r/1073503

Mentioned in SAL (#wikimedia-operations) [2024-09-18T13:46:27Z] <elukey@deploy1003> Started scap sync-world: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-18T13:48:35Z] <elukey@deploy1003> elukey: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-18T13:53:51Z] <elukey@deploy1003> Finished scap sync-world: Backport for [[gerrit:1073503|Swap poolcounter1005 with poolcounter1007 (T332015)]] (duration: 07m 23s)

All poolcounter IPs for MediaWiki/Thumbor are now on Bookworm!

Next steps:

  1. Check for stale connections on all the old VMs during the next couple of days (see the sketch after this list), restart poolcounterd if needed, and make sure that nothing still opens new connections to the old nodes.
  2. File changes to remove network policies for the old VMs.
  3. Once we feel confident, decommission the old VMs.
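For step 1, a quick way to spot lingering clients is to list established connections to poolcounterd on each old VM (run locally on the host, or wrapped in cumin). A minimal sketch, assuming poolcounterd listens on TCP 7531:

```python
import subprocess

POOLCOUNTER_PORT = 7531  # assumption: adjust if poolcounterd listens elsewhere


def list_poolcounter_connections():
    """List established TCP connections to the local poolcounterd.

    An empty result over a couple of days means nothing still talks to
    this node and it should be safe to decommission. '-H' drops the ss
    header, '-t -n' limits output to numeric TCP sockets.
    """
    cmd = [
        "ss", "-H", "-t", "-n",
        "state", "established",
        f"( sport = :{POOLCOUNTER_PORT} )",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    conns = [line for line in out.splitlines() if line.strip()]
    print(f"{len(conns)} established connection(s) to poolcounterd")
    for line in conns:
        print(line)
    return conns
```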

Change #1073802 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: remove old poolcounter nodes from MW's net policies

https://gerrit.wikimedia.org/r/1073802

Change #1073802 merged by Elukey:

[operations/deployment-charts@master] services: remove old poolcounter nodes from MW's net policies

https://gerrit.wikimedia.org/r/1073802

The last remaining step is to decommission the old VMs!

Change #1074949 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::lvs::realserver: update poolcounter hosts

https://gerrit.wikimedia.org/r/1074949

Change #1074949 merged by Elukey:

[operations/puppet@production] profile::lvs::realserver: update poolcounter hosts

https://gerrit.wikimedia.org/r/1074949

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: poolcounter1004.eqiad.wmnet

  • poolcounter1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: poolcounter1005.eqiad.wmnet

  • poolcounter1005.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: poolcounter2003.codfw.wmnet

  • poolcounter2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1074953 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::poolcounter::server: cleanup after Bookworm migration

https://gerrit.wikimedia.org/r/1074953

cookbooks.sre.hosts.decommission executed by elukey@cumin1002 for hosts: poolcounter2004.codfw.wmnet

  • poolcounter2004.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1074953 merged by Elukey:

[operations/puppet@production] role::poolcounter::server: cleanup after Bookworm migration

https://gerrit.wikimedia.org/r/1074953

elukey claimed this task.