
replace production buster deployment servers
Open, Needs Triage, Public

Description

status quo

The production deployment servers, deploy1002 in eqiad and deploy2002 in codfw, are running Debian buster (oldoldstable), which reaches its LTS end of life (EOL) on 2024-06-30.

goal

This ticket is about replacing the buster machines with bullseye, part of a larger global effort tracked in T291916.

At the same time, and unrelatedly, the hardware of at least deploy1002 is also scheduled to be replaced (T361355).

combination with hardware replacement

Just recently T364416 was created, which tracks the installation of this new hardware, followed by implementing the services on it in T364417.

This could be considered a lucky coincidence, because it allows setting up the new server with bullseye and testing it without touching the existing setup or causing any downtime until we actually switch over.

Afterwards we would have both new hardware and a new OS version at the same time.

prep work / puppet buster support / usage in cloud VPS

One aspect of an upgrade like this is making the puppet role (deployment_server) support the new OS version.

For staging purposes, the collaboration services team (as well as the maintainers of the Beta cluster, and maybe others) also maintains deployment servers in Cloud VPS.

These are needed to locally deploy services that are deployed via scap, such as Gerrit, Phabricator, or MediaWiki.

Recently the collaboration services team also wanted to replace all buster VMs (T360964), and T363415 was created to replace a buster deployment server, which required adding bullseye support to the puppet role.

The topic branch deploy-bullseye added that support and also unblocks using the role on the new production deployment servers.

Some other errors were fixed along the way.

scap init

When new deployment servers (scap masters) are created and have not yet been used for deploying, we sometimes run into issues with the bootstrapping of scap itself. This issue is tracked in T257317.

A manual fix was to run `scap deploy --init -Dblock_deployments:False` once in each deployment repo.
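That manual step could be scripted roughly like this (a sketch only: the repo root path and the convention that a `scap/` subdirectory marks a deployment repo are assumptions, not confirmed by this ticket):

```shell
# Sketch: run the one-time "scap deploy --init" in every deployment repo.
# The repo root and the scap/ marker-directory convention are assumptions.
scap_init_all() {
    local root="$1" scap_cmd="${2:-scap}"   # second arg allows a stub for dry runs
    # Treat every directory containing a scap/ config dir as a deployment repo.
    find "$root" -maxdepth 3 -type d -name scap -printf '%h\n' |
    while read -r repo; do
        echo "initializing $repo"
        (cd "$repo" && "$scap_cmd" deploy --init -Dblock_deployments:False)
    done
}

# Usage (on the new deployment server, assuming repos live under /srv/deployment):
#   scap_init_all /srv/deployment
```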

required reboot

The first time the deployment_server role is applied, there will be a permission issue with the mw-cgroup service. This can be fixed by rebooting the machine and is mentioned in T363957.

technical switch

The DNS name deployment.eqiad.wmnet defines the currently active deployment server; as of now it points to deploy1002.eqiad.wmnet.

The actual switch would be deploying a DNS change to point it to deploy1003.eqiad.wmnet instead.
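In plain BIND zone-file terms the change would look roughly like the following diff (a sketch; the actual record type, TTL, and file layout in the DNS repository may differ):

```
; hypothetical excerpt of the eqiad.wmnet zone
-deployment    5M  IN CNAME  deploy1002
+deployment    5M  IN CNAME  deploy1003
```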

Additionally, the hostname appears hardcoded in other places:

```
hieradata/common.yaml:deployment_server: deploy1002.eqiad.wmnet
hieradata/common.yaml:- 10.64.32.28                 # deploy1002.eqiad.wmnet
hieradata/common.yaml:- 2620:0:861:103:10:64:32:28  # deploy1002.eqiad.wmnet
hieradata/common/scap/dsh.yaml:  - "deploy1002.eqiad.wmnet"
hieradata/common/scap.yaml:scap::deployment_server: "deploy1002.eqiad.wmnet"
```

and maybe in unexpected places like:

```
modules/imagecatalog/manifests/init.pp:# but only use deploy1002. Soon it'll instead be active-passive behind a service hostname.
modules/profile/manifests/tcpircbot.pp:        'deploy1002.eqiad.wmnet',       # deployment eqiad
```
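Before the switch, all remaining hardcoded occurrences can be enumerated from a checkout of the puppet repository, for example with a helper like this (a sketch; the file-type filters are an assumption and may miss other formats):

```shell
# Sketch: list hardcoded references to the old deployment server
# in a puppet repo checkout (yaml and puppet manifests only).
find_hardcoded() {
    local repo="$1" host="${2:-deploy1002}"
    grep -rn --include='*.yaml' --include='*.pp' "$host" "$repo"
}

# Usage, from the root of a puppet.git checkout:
#   find_hardcoded . deploy1002
```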

coordination with teams

The switchover and the scap init need to be coordinated with the release engineering team and should be added to the deployment calendar.

Deployers of all services across teams need to be made aware of the new server, and the changed SSH host key should be published on Wikitech. An email should be sent to the appropriate mailing lists.

And since a deployment server is nowadays both a scap and a Kubernetes deployment server, this also includes all deployers of Kubernetes services and the serviceops team.