Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm
Open, Needs Triage, Public

Description

< T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye | NOTYETCREATED >

With Stretch mostly removed, it's time to start removing Buster from deployment-prep, either by migrating services to newer Debian versions or by removing unused services entirely.

Tracking task for production migrations: T291916: Tracking task for Bullseye migrations in production

Instances to migrate

(live report)

  • deployment-acme-chief03 (replaced by deployment-acme-chief05)
  • deployment-acme-chief04 (replaced by deployment-acme-chief06)
  • deployment-cache-text07 (replaced by deployment-cache-text08)
  • deployment-cache-upload07 (replaced by deployment-cache-upload08)
  • deployment-cumin T361380
  • deployment-deploy03
  • deployment-docker-api-gateway01
  • deployment-docker-changeprop01
  • deployment-docker-cpjobqueue01
  • deployment-docker-mobileapps01
  • deployment-docker-proton01
  • deployment-echostore02 T361383
  • deployment-etcd02
  • deployment-eventlog08
  • deployment-ircd02
  • deployment-jobrunner04
  • deployment-kafka-jumbo-[5, 8-9] T361382
  • deployment-kafka-logging01 T361382
  • deployment-kafka-main-[5-6] T361382
  • deployment-maps-master01 T361381
  • deployment-mediawiki[11-12] T361387
  • deployment-memc[08-10] T361384
  • deployment-mwlog01
  • deployment-mwmaint02
  • deployment-ores02 T361385
  • deployment-parsoid12 T361386
  • deployment-poolcounter06
  • deployment-puppetdb[03-04] (replaced by deployment-puppetdb05)
  • deployment-puppetmaster04 (replaced by deployment-puppetserver-1)
  • deployment-push-notifications01
  • deployment-restbase04
  • deployment-sessionstore04
  • deployment-shellbox
  • deployment-snapshot03
  • deployment-urldownloader03
  • deployment-xhgui03
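
Several entries in the list use bracketed ranges such as deployment-memc[08-10] or deployment-kafka-jumbo-[5, 8-9]. When splitting the list into per-instance subtasks, it can help to expand these into individual hostnames. The following is a minimal sketch in Python; the function name expand_hosts is hypothetical, and it assumes the brackets denote inclusive numeric ranges with zero padding preserved, which the task does not state explicitly.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand 'name[08-10]' or 'name-[5, 8-9]' into individual hostnames.

    Assumption: ranges are inclusive and the width of the first number
    in each range gives the zero padding.
    """
    match = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression: a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.strip().partition("-")
        end = end or start  # a lone number is a one-element range
        width = len(start)  # preserve zero padding such as '08'
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts


if __name__ == "__main__":
    for entry in ("deployment-memc[08-10]", "deployment-kafka-jumbo-[5, 8-9]"):
        print(entry, "->", expand_hosts(entry))
```

Entries without brackets, such as deployment-deploy03, are returned unchanged.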

Event Timeline

Change 931550 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge

https://gerrit.wikimedia.org/r/931550

Change 931550 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge

https://gerrit.wikimedia.org/r/931550

Change 932189 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: Added new bullseye instance for cache-upload in deployment-prep

https://gerrit.wikimedia.org/r/932189

Change 932189 merged by Fabfur:

[operations/puppet@production] hiera: Added new bullseye instance for cache-upload in deployment-prep

https://gerrit.wikimedia.org/r/932189

Mentioned in SAL (#wikimedia-cloud) [2023-06-22T11:02:17Z] <fabfur> switch floating IP from deployment-cache-upload07 to deployment-cache-upload08 (bullseye upgrade: T327742)

Change 932380 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge for new cache server

https://gerrit.wikimedia.org/r/932380

Change 932380 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge for new cache server

https://gerrit.wikimedia.org/r/932380

Change 933085 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: Added new bullseye instance for cache-text in deployment-prep

https://gerrit.wikimedia.org/r/933085

Change 933085 merged by Fabfur:

[operations/puppet@production] hiera: Added new bullseye instance for cache-text in deployment-prep

https://gerrit.wikimedia.org/r/933085

Mentioned in SAL (#wikimedia-cloud) [2023-06-26T12:13:59Z] <fabfur> switch floating IP from deployment-cache-text07 to deployment-cache-text08 (bullseye upgrade: T327742)

Mentioned in SAL (#wikimedia-cloud) [2023-06-26T12:41:49Z] <fabfur> switch floating IP from deployment-cache-text07 to deployment-cache-text08 (bullseye upgrade: T327742) (fix sec group)

Change 933436 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] hiera: Removed unused cache instances from deployment-prep

https://gerrit.wikimedia.org/r/933436

Change 933463 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge to remove unused cache servers

https://gerrit.wikimedia.org/r/933463

Change 933436 merged by Fabfur:

[operations/puppet@production] hiera: Removed unused cache instances from deployment-prep

https://gerrit.wikimedia.org/r/933436

Change 933463 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge to remove unused cache servers

https://gerrit.wikimedia.org/r/933463

Mentioned in SAL (#wikimedia-cloud) [2023-06-28T09:54:41Z] <fabfur> removed (text|upload) instance references from wgCdnServersNoPurge (T327742)

Andrew updated the task description.

We talked about this a bit in the Release-Engineering-Team meeting on Wednesday and discussed whether we had the ability to help. I don't think we're well-positioned to do all this work, as it encompasses a clone of most of Wikimedia's production footprint (where SRE has the most expertise).

I think one path forward here would be for multiple teams to pitch in and fix the instances they know how to fix.


Looking at the instance list, I do wonder whether all of them are needed in beta (echostore?). A place to start might be a set of subtasks verifying:

  1. Needed for beta? (i.e., critical to some team's workflow, if so: who?)
  2. Ported to Bullseye in production? If not, any foreseeable blockers?
  3. Can you help?

Other notes:

  • Quota: deployment-prep project is currently at its quota for vCPU, so it'd need some extra headroom to spin up new Bullseye instances before shutting down the "working" ones.
  • Production Debian packages: T291916: Tracking task for Bullseye migrations in production may be a blocker in some cases. deployment-prep depends on the same Debian packages used for production, and if they're not yet packaged for Bullseye, there's no way to get a Bullseye image to run using production puppet code.

The average instance on deployment-prep is ~2 vCPU, 20 GB disk, 4 GB RAM.

deployment-prep is at its CPU quota and almost at its RAM quota. Bumping those up to allow 5 new instances should let folks at least get started. So 10 vCPU/20 GB RAM? Possible? It'll probably be possible to reclaim some of that post-migration.
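
The requested headroom follows directly from the average instance size. As a rough sketch in Python (the Flavor class and headroom function are hypothetical names used only for illustration, not part of any Cloud VPS tooling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    vcpu: int
    ram_gb: int
    disk_gb: int


# Approximate average deployment-prep instance, per the comment above.
AVERAGE = Flavor(vcpu=2, ram_gb=4, disk_gb=20)


def headroom(flavor: Flavor, count: int) -> Flavor:
    """Extra quota needed to run `count` new instances alongside the old ones."""
    return Flavor(
        vcpu=flavor.vcpu * count,
        ram_gb=flavor.ram_gb * count,
        disk_gb=flavor.disk_gb * count,
    )


extra = headroom(AVERAGE, 5)
print(f"{extra.vcpu} vCPU / {extra.ram_gb} GB RAM")  # 10 vCPU / 20 GB RAM
```

The same multiplication gives 100 GB of disk for five average instances, although the comment above only asks about vCPU and RAM.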

Andrew updated the task description.

Added as subtask: T361477: requests to increase quotas deployment-prep

I have asked on IRC (#wikimedia-cloud) about a way to speed up the process. The puppet repo has dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is OK to use it, we'd avoid re-creating VMs and having to copy stateful data around (for example Kafka, databases, etc.). The only downside would be for the Cloud team, which would need to fix the VM's metadata (such as the reported OS) after the dist-upgrade.