Upgrade Cumin hosts to Bookworm
Closed, Resolved, Public

Description

We can create a parallel cumin1003 VM on Bookworm, test/adapt everything, then reimage cumin2002 and finally decom cumin1002.

  • Create cumin1003
  • Upgrade cuminunpriv1001
  • Upgrade cumin2002
  • Migrate pwstore repository to cumin1003
  • Migrate backups to cumin1003
  • Copy old cookbook/spicerack logs
  • Decom cumin1002

Details

Related Changes in Gerrit:
Repo                            Branch      Lines +/-
operations/puppet               production  +1 -2
operations/puppet               production  +0 -2
operations/puppet               production  +1 -1
operations/puppet               production  +1 -1
operations/puppet               production  +0 -2
operations/puppet               production  +0 -1
operations/debs/wmf-laptop      master      +1 -1
operations/puppet               production  +0 -1
operations/puppet               production  +0 -3
operations/puppet               production  +0 -2
operations/puppet               production  +0 -7
operations/puppet               production  +1 -4
operations/puppet               production  +10 -0
operations/puppet               production  +1 -0
operations/software/transferpy  master      +6 -0
operations/puppet               production  +2 -2
operations/puppet               production  +3 -1
operations/puppet               production  +1 -1
operations/puppet               production  +4 -0
operations/puppet               production  +31 -26
operations/puppet               production  +16 -2
operations/software/spicerack   master      +17 -11
operations/puppet               production  +6 -1

Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host cumin1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host cumin1003.eqiad.wmnet with OS bookworm completed:

  • cumin1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503310832_jmm_380057_cumin1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
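
The log path in the list above packs several fields into its file name. A minimal sketch of splitting it apart, assuming the layout is timestamp, user, a numeric ID and short host name. The field meanings are inferred from this single example rather than from Spicerack documentation, and `parse_reimage_log_name` and `ReimageLog` are hypothetical names:

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # naive timestamp; the time zone is not stated in the name
    user: str
    run_id: int        # assumption: a numeric ID, possibly a process ID
    host: str


def parse_reimage_log_name(path: str) -> ReimageLog:
    """Split '<YYYYmmddHHMM>_<user>_<id>_<host>.out' into its parts."""
    stem = PurePosixPath(path).stem  # drops the directory and the '.out' suffix
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, int(run_id), host)


log = parse_reimage_log_name(
    "/var/log/spicerack/sre/hosts/reimage/202503310832_jmm_380057_cumin1003.out"
)
print(log.started.isoformat(), log.user, log.host)
```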

Change #1132577 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make cumin1003 a Cumin node

https://gerrit.wikimedia.org/r/1132577

Change #1133808 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] dnsdisc: make it compatible with bookworm

https://gerrit.wikimedia.org/r/1133808

Change #1133808 merged by jenkins-bot:

[operations/software/spicerack@master] dnsdisc: make it compatible with bookworm

https://gerrit.wikimedia.org/r/1133808

Mentioned in SAL (#wikimedia-operations) [2025-05-07T15:53:40Z] <moritzm> uploaded a python-pynetbox 7.4.1-1~wmf12u1 to bookworm-wikimedia (needed for Cumin update) T389380

Change #1143539 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/transferpy@master] transferpy: Build for Bookworm

https://gerrit.wikimedia.org/r/1143539

Mentioned in SAL (#wikimedia-operations) [2025-05-08T11:57:51Z] <moritzm> import transferpy 1.1+deb12u1 to bookworm-wikimedia T389380

Change #1132577 merged by Muehlenhoff:

[operations/puppet@production] Make cumin1003 a Cumin node

https://gerrit.wikimedia.org/r/1132577

Mentioned in SAL (#wikimedia-operations) [2025-05-08T14:45:06Z] <moritzm> imported ripe-atlas-sagan 1.3.1-1~wmf12u1 to apt.wikimedia.org/bookworm T389380

Mentioned in SAL (#wikimedia-operations) [2025-05-08T14:45:27Z] <moritzm> imported ripe-atlas-tools 2.3.0-3+wmf12u1 to apt.wikimedia.org/bookworm T389380

ripe-atlas-tools was in Bullseye but missed Bookworm because of a release-critical (RC) bug. It is part of Debian again as of trixie, so I prepared an internal build of 2.3.0-3+wmf12u1 for Bookworm and uploaded it to apt.wikimedia.org. This also needed a backport of ripe-atlas-sagan (1.3.1-1~wmf12u1).
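
The version suffixes used above rely on Debian's version ordering: `~` sorts before everything, including the end of the string, while `+` sorts after it. The sketch below is a simplified port of that comparison, written for illustration only (on a real host, `dpkg --compare-versions` is the authoritative tool). The unsuffixed versions `1.3.1-1`, `2.3.0-3` and `2.3.0-4` are hypothetical comparison points, not packages named in this task.

```python
def _order(ch: str) -> int:
    """Sort weight of one non-digit character ('' means end of string)."""
    if ch == "" or ch.isdigit():
        return 0
    if ch == "~":
        return -1
    if ch.isalpha():
        return ord(ch)
    return ord(ch) + 256


def _compare_fragment(a: str, b: str) -> int:
    """Compare two upstream-version or revision strings; returns -1, 0 or 1."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Compare the non-digit runs character by character.
        while (i < len(a) and not a[i].isdigit()) or (j < len(b) and not b[j].isdigit()):
            wa = _order(a[i] if i < len(a) else "")
            wb = _order(b[j] if j < len(b) else "")
            if wa != wb:
                return -1 if wa < wb else 1
            i += 1
            j += 1
        # Compare the following digit runs numerically (an empty run counts as 0).
        start_i, start_j = i, j
        while i < len(a) and a[i].isdigit():
            i += 1
        while j < len(b) and b[j].isdigit():
            j += 1
        na, nb = int(a[start_i:i] or 0), int(b[start_j:j] or 0)
        if na != nb:
            return -1 if na < nb else 1
    return 0


def _split(version: str) -> tuple[int, str, str]:
    """Split '[epoch:]upstream[-revision]' into its three parts."""
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a: str, b: str) -> int:
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_fragment(ua, ub) or _compare_fragment(ra, rb)


print(compare_versions("1.3.1-1~wmf12u1", "1.3.1-1"))  # -1: the backport sorts below the plain revision
print(compare_versions("2.3.0-3+wmf12u1", "2.3.0-3"))  # 1: the local build sorts above it
print(compare_versions("2.3.0-3+wmf12u1", "2.3.0-4"))  # -1: a later revision still wins
```

Under these rules, either suffix lets a later Debian revision of the same package take over on upgrade.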

Icinga downtime and Alertmanager silence (ID=4e0ba7e4-66f2-4cc7-8562-eed30c2476f1) set by jmm@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: WIP new Bookworm host

cumin1003.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-05-16T08:26:02Z] <moritzm> uploaded httpbb 0.0.5-1+deb12u1 to apt.wikimedia.org T393711 T389380

Change #1148268 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] homer: make private repo support multiple peers

https://gerrit.wikimedia.org/r/1148268

Change #1148268 merged by Volans:

[operations/puppet@production] homer: make private repo support multiple peers

https://gerrit.wikimedia.org/r/1148268

Change #1150675 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] homer: fix private repository config

https://gerrit.wikimedia.org/r/1150675

Change #1150675 merged by Volans:

[operations/puppet@production] homer: fix private repository config

https://gerrit.wikimedia.org/r/1150675

Change #1159439 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Advance snapshot dbbackups start time by 4 hours

https://gerrit.wikimedia.org/r/1159439

Change #1159439 merged by Jcrespo:

[operations/puppet@production] dbbackups: Advance snapshot dbbackups start time by 4 hours

https://gerrit.wikimedia.org/r/1159439

Change #1160133 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Also switch cumin2002 to nftables

https://gerrit.wikimedia.org/r/1160133

I had to deploy homer to cumin2002 after the upgrade:

sudo cookbook sre.deploy.python-code -r 'Release v0.10.1' homer 'cumin2002*'

Now it works fine.
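
The quoting in the command above keeps the release message as a single argument and stops the shell from expanding `cumin2002*` as a file glob before the cookbook sees it. A small sketch using Python's `shlex` to show how a POSIX shell would split that line into arguments (this only illustrates the tokenization; it does not run the cookbook):

```python
import shlex

command = "sudo cookbook sre.deploy.python-code -r 'Release v0.10.1' homer 'cumin2002*'"

# shlex.split follows POSIX shell quoting rules by default.
args = shlex.split(command)
for position, arg in enumerate(args):
    print(position, repr(arg))
```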

Change #1196886 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] cumin: Migrate cumin1002 mariadb remote backups to cumin1003

https://gerrit.wikimedia.org/r/1196886

I removed the "nftables on transfer.py" item from the checklist. It is important (and I am working on it now), but it only limits which hosts transfer.py can transfer to, not where it can be installed, so it is not a hard dependency.

Change #1196886 merged by Jcrespo:

[operations/puppet@production] cumin: Migrate cumin1002 mariadb remote backups to cumin1003

https://gerrit.wikimedia.org/r/1196886

The backup migration was deployed and tested successfully. Please keep cumin1002 for a few extra days while I continue monitoring, in case there is some subtler breakage that monitoring didn't catch.

Change #1197321 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] homer-diff-checker: move execution from cumin1002 to cumin1003

https://gerrit.wikimedia.org/r/1197321

Change #1197321 merged by Cathal Mooney:

[operations/puppet@production] homer-diff-checker: move execution from cumin1002 to cumin1003

https://gerrit.wikimedia.org/r/1197321

Change #1143539 abandoned by Jcrespo:

[operations/software/transferpy@master] transferpy: Build for Bookworm

Reason:

Integrated elsewhere

https://gerrit.wikimedia.org/r/1143539

No blockers from me to remove cumin1002 (db backup orchestration was migrated already).

elukey subscribed.

New spicerack release deployed; cumin1002 is no longer needed by the Data Platform folks.

Change #1204357 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add safe.directory directives for the pwstore repository

https://gerrit.wikimedia.org/r/1204357

Change #1160133 merged by Muehlenhoff:

[operations/puppet@production] Also switch cumin2002 to nftables

https://gerrit.wikimedia.org/r/1160133

Mentioned in SAL (#wikimedia-operations) [2025-11-12T11:36:47Z] <moritzm> migrated cumin2002 to nftables T389380

Change #1204368 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Enable nftables on cluster::management on the role level

https://gerrit.wikimedia.org/r/1204368

Change #1204357 merged by Muehlenhoff:

[operations/puppet@production] Add safe.directory directives for the pwstore repository

https://gerrit.wikimedia.org/r/1204357

Change #1204375 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/debs/wmf-laptop@master] Update pwstore docs to point to cumin1003

https://gerrit.wikimedia.org/r/1204375

Change #1204380 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Ganeti: Remove cumin1002 from allow list for RAPI access

https://gerrit.wikimedia.org/r/1204380

Change #1204574 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove grant from cumin1002

https://gerrit.wikimedia.org/r/1204574

Change #1204380 merged by Muehlenhoff:

[operations/puppet@production] Ganeti: Remove cumin1002 from allow list for RAPI access

https://gerrit.wikimedia.org/r/1204380

Mentioned in SAL (#wikimedia-operations) [2025-11-12T15:17:48Z] <moritzm> migrated pwstore repository from cumin1002 to cumin1003 T389380

Change #1204609 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove cumin1002 from tcpircbot config

https://gerrit.wikimedia.org/r/1204609

Change #1204574 merged by Marostegui:

[operations/puppet@production] Remove grant from cumin1002

https://gerrit.wikimedia.org/r/1204574

Change #1204617 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production-ms.sql.erb: Remove root@10.64.48.98

https://gerrit.wikimedia.org/r/1204617

Change #1204617 merged by Marostegui:

[operations/puppet@production] production-ms.sql.erb: Remove root@10.64.48.98

https://gerrit.wikimedia.org/r/1204617

Change #1204620 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove cumin1002 from alertmanager access

https://gerrit.wikimedia.org/r/1204620

Change #1204622 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove cumin1002 as Homer git peer

https://gerrit.wikimedia.org/r/1204622

Change #1204628 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] wmf_root_client.pp: Remove cumin1002

https://gerrit.wikimedia.org/r/1204628

Change #1204609 merged by Muehlenhoff:

[operations/puppet@production] Remove cumin1002 from tcpircbot config

https://gerrit.wikimedia.org/r/1204609

Change #1204620 merged by Muehlenhoff:

[operations/puppet@production] Remove cumin1002 from alertmanager access

https://gerrit.wikimedia.org/r/1204620

Change #1204375 merged by Muehlenhoff:

[operations/debs/wmf-laptop@master] Update pwstore docs to point to cumin1003

https://gerrit.wikimedia.org/r/1204375

Change #1204797 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove cumin1002 from list of Cumin masters

https://gerrit.wikimedia.org/r/1204797

Change #1204799 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove cumin1002 from mysql root list

https://gerrit.wikimedia.org/r/1204799

Change #1204799 merged by Muehlenhoff:

[operations/puppet@production] Remove cumin1002 from mysql root list

https://gerrit.wikimedia.org/r/1204799

Change #1204622 merged by Muehlenhoff:

[operations/puppet@production] Remove cumin1002 as Homer git peer

https://gerrit.wikimedia.org/r/1204622

Change #1204818 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Setup cumin1002 to insetup

https://gerrit.wikimedia.org/r/1204818

Change #1204818 merged by Muehlenhoff:

[operations/puppet@production] Setup cumin1002 to insetup

https://gerrit.wikimedia.org/r/1204818

Change #1204628 merged by Marostegui:

[operations/puppet@production] wmf_root_client.pp: Remove cumin1002

https://gerrit.wikimedia.org/r/1204628

Change #1204797 merged by Muehlenhoff:

[operations/puppet@production] Remove cumin1002 from list of Cumin masters

https://gerrit.wikimedia.org/r/1204797

Change #1204368 merged by Muehlenhoff:

[operations/puppet@production] Enable nftables on cluster::management on the role level

https://gerrit.wikimedia.org/r/1204368

@Volans made a copy of the old Spicerack/Cumin logs; they are available in /var/log/cumin100[12] on cumin1003 in case anyone needs them.
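
`/var/log/cumin100[12]` is shell bracket notation covering two directories. A short sketch with Python's `fnmatch` shows which paths the pattern matches; the candidate list is made up for the example, and `/var/log/cumin1003` is included only as a non-matching case:

```python
from fnmatch import fnmatchcase

PATTERN = "/var/log/cumin100[12]"

# Hypothetical candidates: cumin1001 through cumin1003.
candidates = [f"/var/log/cumin100{n}" for n in range(1, 4)]
matches = [path for path in candidates if fnmatchcase(path, PATTERN)]
print(matches)  # ['/var/log/cumin1001', '/var/log/cumin1002']
```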

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: cumin1002.eqiad.wmnet

  • cumin1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description.

All done