upgrade releases hosts to bookworm
Closed, Resolved (Public)

Description

Hosts releases1003 and releases2003 should be upgraded or replaced with bookworm hosts.

failing over the backend of https://releases.wikimedia.org - the plan

status quo:

the service has 2 backends; one in eqiad and one in codfw; as of 2025-11-13 the host names are:
releases1003.eqiad.wmnet & releases2003.codfw.wmnet

releases1003 is still on bullseye while releases2003 has already been upgraded to bookworm (though not trixie)

The DNS name releases.discovery.wmnet determines which of the backends gets the traffic and it is currently an alias for releases1003.eqiad.wmnet, making eqiad the active DC.

actual goal: no backends are still on outdated bullseye.

via steps: fail-over traffic from 1003 to 2003; verify 2003 works fine; reimage 1003

A-1) Stop rsync and puppet on both hosts

A) Hiera: in hieradata/common.yaml switch the definition of "releases_server" and "releases_servers_failover"

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204933

what this does: changes which of the servers is the source for rsyncing data between servers; so the server that releases should be uploaded to or be created on.

https://puppet-compiler.wmflabs.org/output/1204933/7609/releases1003.eqiad.wmnet/index.html

B) jenkins service:

prepare: control the service by DC name, not hostname: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204980

then simply switch eqiad - codfw to mask/stop service in inactive DC and unmask/start service in active DC

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1204982

C) DNS: in templates/wmnet switch the releases.discovery.wmnet name to the other backend

https://gerrit.wikimedia.org/r/c/operations/dns/+/1204684

what this does: changes which of the servers gets the traffic from the CDN/caching servers. The discovery name is what is used in Apache Traffic Server config which maps releases.wikimedia.org to it. (trafficserver/backend.yaml). Therefore ATS config does not have to be changed, only DNS.

actual steps:

- update docs, create wiki fingerprint / host pages (https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org#Documentation_updates)
- schedule downtime(s)
- disable puppet on both backends
- merge and deploy change A (Hiera)
- merge and deploy change B (jenkins service)
- ensure no users are uploading / are informed of maintenance / new server name
- re-enable puppet on both
-- verify rsync changes look good
-- verify jenkins service masked on old, unmasked on new
- merge and deploy change C (DNS)
-- verify discovery name points to new backend
-- tail apache logs on 2003 while making some requests to releases.wikimedia.org
-- run httpbb tests from deployment server for releases services
-- delete downtimes
-- announce to releasers-* admin groups members (get email addresses from admin.yaml)
-- end maintenance - close tickets as resolved

- (later)
-- reimage 1003 with bookworm

- (later)
-- reimage 1003 with trixie ...(and continue the cycle)..
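
Two of the verification steps in the plan (jenkins masked on the old backend and unmasked on the new one, and the discovery name pointing at the new backend) could be scripted roughly as below. This is only a sketch: the helper names are made up for illustration, the unit name `jenkins` is assumed, and it relies on `socket.gethostbyname_ex` returning the canonical host name for an alias, which depends on the resolver in use. The string passed to `jenkins_state_ok` is assumed to be the output of `systemctl is-enabled jenkins` collected on the host in question.

```python
import socket

ACTIVE = "releases2003.codfw.wmnet"    # backend that takes over in the plan
STANDBY = "releases1003.eqiad.wmnet"   # previous backend, reimaged later
DISCOVERY = "releases.discovery.wmnet"


def normalize(hostname):
    """Lower-case a DNS name and drop a trailing dot so names compare equal."""
    return hostname.rstrip(".").lower()


def discovery_points_to(name, expected, resolver=socket.gethostbyname_ex):
    """Return True if `name` resolves to the canonical host name `expected`.

    `resolver` must behave like socket.gethostbyname_ex and return
    (canonical_name, alias_list, address_list). It is injectable so the
    check can be exercised without access to the internal DNS.
    """
    canonical, _aliases, _addresses = resolver(name)
    return normalize(canonical) == normalize(expected)


def jenkins_state_ok(host, is_enabled_output, active=ACTIVE):
    """Check the `systemctl is-enabled jenkins` output collected on `host`.

    The active backend must not be masked; any other backend must be masked.
    """
    masked = is_enabled_output.strip() == "masked"
    if normalize(host) == normalize(active):
        return not masked
    return masked
```

After change C is deployed, `discovery_points_to(DISCOVERY, ACTIVE)` would be expected to return True when run from a host that uses the internal resolvers.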

Event Timeline

In T391590 we debugged why the first puppet run on a freshly reimaged releases host would not work, figured out that jenkins is deployed by scap on releases hosts (but by puppet on contint hosts) and fixed a puppet error.

The error was that a systemd override file was to be installed before any systemd service directory existed. The directory was missing because, on releases hosts, scap pulls in a Debian package which then installs the systemd unit, and that had not happened yet.

We ran a scap deployment but it turns out the Debian package is missing.

As pointed out by Hashar in T391590#10746526 another thing that is needed here is an upstream jenkins package to be imported into APT for bookworm.
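
The ordering problem described above can be illustrated outside Puppet. The sketch below is not the actual fix (the patch is not shown in this task); it only demonstrates why writing an override fails when the unit's directory has not been created yet, and that creating the directory first avoids the failure. It assumes the "systemd service directory" is the usual drop-in location `etc/systemd/system/<unit>.service.d/`; the function names and the `root` parameter are hypothetical.

```python
import os
import tempfile


def install_override(root, unit, content, create_dir=False):
    """Write a systemd drop-in override for `unit` below `root`.

    With create_dir=False this mirrors the failing case: the write raises
    FileNotFoundError when the drop-in directory does not exist yet.
    """
    dropin_dir = os.path.join(root, "etc", "systemd", "system", unit + ".service.d")
    if create_dir:
        # Ensure the directory exists before the override is written.
        os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, "override.conf")
    with open(path, "w") as handle:
        handle.write(content)
    return path


def demo():
    """Return (failed_without_dir, written_with_dir) using a temporary root."""
    with tempfile.TemporaryDirectory() as root:
        try:
            install_override(root, "jenkins", "[Service]\n")
            failed = False
        except FileNotFoundError:
            failed = True
        path = install_override(root, "jenkins", "[Service]\n", create_dir=True)
        return failed, os.path.exists(path)
```

Running `demo()` exercises both cases in a throwaway directory, so nothing under the real `/etc` is touched.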

Change #1137060 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] aptrepo: add jenkins to bookworm section in distributions-wikimedia

https://gerrit.wikimedia.org/r/1137060

I am adding @jnuche, our in-house expert when it comes to Jenkins :]

Thanks for following up on this @Dzahn!

Thanks for the compliment Antoine but I never claimed to be, nor am, a Jenkins expert :) However I will help with this if I can.

Change #1137060 merged by Dzahn:

[operations/puppet@production] aptrepo: add jenkins to bookworm section in distributions-wikimedia

https://gerrit.wikimedia.org/r/1137060

17:11 <+icinga-wm> PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/f6f5517444c0e6ac6856ee72ce652871a39bd66d0e257e08749d2e18dbdeec17/merged is not 
                   accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
                   https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
..

17:13 < mutante> re: disk space on releases1003 - meh.. it's the issue again with docker overlay fs. but dont know what changed. probably people working on it yesterday. we need some config change to exclude that or .. something
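
One way to read "exclude that" is to filter the Docker overlay mounts out of the list of paths the disk space check inspects. The sketch below shows that idea with a regular expression. It is an assumption about the approach, not the monitoring check's real configuration, and the names are illustrative.

```python
import re

# Matches per-container overlay mounts like the one in the alert:
# /srv/docker/overlay2/<hex id>/merged
OVERLAY_MOUNT = re.compile(r"^/srv/docker/overlay2/[0-9a-f]+/merged$")


def mounts_to_check(mount_points):
    """Return the mount points that a disk space check should still inspect."""
    return [mount for mount in mount_points if not OVERLAY_MOUNT.match(mount)]
```

Given a list containing `/`, `/srv` and the overlay path from the alert, only the first two would remain.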

Re: the missing Debian package. After adding jenkins to the bookworm section of the distributions-wikimedia config, checkupdate runs cleanly for bullseye-wikimedia but fails for bookworm-wikimedia:

[apt1002:/srv/wikimedia] $  sudo -i reprepro -C thirdparty/ci --restrict=jenkins checkupdate bullseye-wikimedia
Calculating packages to get...
Updates needed for 'bullseye-wikimedia|thirdparty/ci|amd64':

[apt1002:/srv/wikimedia] $  sudo -i reprepro -C thirdparty/ci --restrict=jenkins checkupdate bookworm-wikimedia
Error: distribution 'bookworm-wikimedia' uses flat update pattern 'jenkins'
with target component 'thirdparty/ci' which it does not contain!
There have been errors!
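
The error says the update rule `jenkins` targets the component `thirdparty/ci`, which `bookworm-wikimedia` does not list. A pre-flight check for that kind of mismatch might look like the sketch below. The sample stanzas are invented for illustration and are not the real distributions-wikimedia or updates files. The parser only handles simple `Key: value` paragraphs, and the code assumes the target component of a flat update rule is given by a `Flat:` field.

```python
# Hypothetical, simplified stand-ins for the reprepro config files.
SAMPLE_DISTRIBUTIONS = """Codename: bullseye-wikimedia
Components: main thirdparty/ci
Update: jenkins

Codename: bookworm-wikimedia
Components: main
Update: jenkins
"""

SAMPLE_UPDATES = """Name: jenkins
Flat: thirdparty/ci
"""


def parse_stanzas(text):
    """Parse blank-line separated 'Key: value' paragraphs into dicts."""
    stanzas, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                stanzas.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        stanzas.append(current)
    return stanzas


def missing_components(distributions_text, updates_text):
    """List (codename, rule, component) where a rule's target is not declared."""
    flat_target = {
        rule["Name"]: rule["Flat"]
        for rule in parse_stanzas(updates_text)
        if "Flat" in rule
    }
    problems = []
    for dist in parse_stanzas(distributions_text):
        components = dist.get("Components", "").split()
        for rule in dist.get("Update", "").split():
            target = flat_target.get(rule)
            if target is not None and target not in components:
                problems.append((dist["Codename"], rule, target))
    return problems
```

With the sample data, `missing_components(SAMPLE_DISTRIBUTIONS, SAMPLE_UPDATES)` flags only the bookworm distribution, matching the situation in the transcript.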

Change #1137361 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] aptrepo: add thirdparty/ci component to bookworm-wikimedia

https://gerrit.wikimedia.org/r/1137361

LSobanski triaged this task as Medium priority.
LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

I wanted to kindly poke this task.

releases2003 is the backup server for https://releases-jenkins.wikimedia.org, which runs several important workflows, most notably the weekly train branch cut, which is a critical part of the Wikimedia train. Not being able to install Jenkins on releases2003 means any problem with releases1003 could become a blocker for the train.

Other functionalities, like doc publishing from repos to doc.wikimedia.org, would also be affected.

Change #1137361 merged by Dzahn:

[operations/puppet@production] aptrepo: add thirdparty/ci component to bookworm-wikimedia

https://gerrit.wikimedia.org/r/1137361

I wanted to kindly poke this task.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137361 has been merged now

This gives us a thirdparty/jenkins APT component for bookworm which should further unblock this task.

on it!

Change #1153349 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Also configure new update config for bookworm

https://gerrit.wikimedia.org/r/1153349

Change #1153349 merged by Dzahn:

[operations/puppet@production] Also configure new update config for bookworm

https://gerrit.wikimedia.org/r/1153349

The initial update of the new thirdparty/jenkins component is now successful. Thanks, Moritz.

[apt1002:~] $  sudo -i reprepro -C thirdparty/jenkins --restrict=jenkins checkupdate bookworm-wikimedia
Calculating packages to get...
Updates needed for 'bookworm-wikimedia|thirdparty/jenkins|amd64':
'jenkins': newly installed as '2.504.2' (from 'jenkins-bookworm'):
 files needed: pool/thirdparty/jenkins/j/jenkins/jenkins_2.504.2_all.deb

[apt1002:~] $  sudo -i reprepro -C thirdparty/jenkins --restrict=jenkins update bookworm-wikimedia
Calculating packages to get...
Getting packages...
Installing (and possibly deleting) packages...
Exporting indices...
..
[apt1002:~] $ sudo -i reprepro -C thirdparty/jenkins ls jenkins 
jenkins | 2.504.2 | bookworm-wikimedia | amd64
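
The `ls` output above is pipe-separated, so a script can confirm that the expected version landed in the expected distribution. A minimal sketch, assuming the four-column layout shown in the transcript (package, version, distribution, architectures); the function names are illustrative.

```python
def parse_reprepro_ls(output):
    """Turn `reprepro ls` output into (package, version, distribution, archs) tuples."""
    rows = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) == 4:
            rows.append(tuple(fields))
    return rows


def has_version(output, package, version, distribution):
    """Return True if `package` at `version` is listed for `distribution`."""
    return any(
        row[:3] == (package, version, distribution)
        for row in parse_reprepro_ls(output)
    )
```

Feeding it the line from the transcript, `has_version(output, "jenkins", "2.504.2", "bookworm-wikimedia")` is true, while the same query for `bullseye-wikimedia` is not.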

Change #1167823 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Use thirdparty/jenkins on Bookworm

https://gerrit.wikimedia.org/r/1167823

Following the renaming of thirdparty/ci to thirdparty/jenkins, the documentation on importing the Jenkins package needs to be updated: https://wikitech.wikimedia.org/wiki/Jenkins#Upgrading

Change #1167823 merged by Muehlenhoff:

[operations/puppet@production] Use thirdparty/jenkins on Bookworm

https://gerrit.wikimedia.org/r/1167823

releases2003 (Bookworm) now has thirdparty/jenkins in /etc/apt/sources.list.d/thirdparty-jenkins.sources.

The other hosts running Bullseye (contint1002, releases1003) are unaffected.

I have updated the reprepro documentation at https://wikitech.wikimedia.org/wiki/Jenkins#Get_the_package :)

Change #1204684 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] fail-over releases.wikimedia.org backend

https://gerrit.wikimedia.org/r/1204684

Change #1204933 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: flip the active backend from eqiad to codfw

https://gerrit.wikimedia.org/r/1204933

Dzahn renamed this task from "upgrade releases hosts to bookworm" to "upgrade releases hosts to bookworm". Nov 13 2025, 8:39 PM
Dzahn updated the task description.

Here is a plan for failing over the backend from eqiad to codfw on the coming Tuesday. The full text is in the task description above.

Dzahn changed the task status from Open to In Progress. Nov 13 2025, 9:07 PM

Change #1204980 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: control jenkins service by DC name, not host name

https://gerrit.wikimedia.org/r/1204980

Change #1204982 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: stop/mask jenkins in eqiad, start/unmask jenkins in codfw

https://gerrit.wikimedia.org/r/1204982

Change #1204933 merged by Dzahn:

[operations/puppet@production] releases: flip the active backend from eqiad to codfw

https://gerrit.wikimedia.org/r/1204933

Change #1204980 merged by Dzahn:

[operations/puppet@production] releases: control jenkins service by DC name, not host name

https://gerrit.wikimedia.org/r/1204980

Change #1204982 merged by Dzahn:

[operations/puppet@production] releases: stop/mask jenkins in eqiad, start/unmask jenkins in codfw

https://gerrit.wikimedia.org/r/1204982

Change #1206959 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] releases: control jenkins service by host name

https://gerrit.wikimedia.org/r/1206959

Change #1206959 merged by Dzahn:

[operations/puppet@production] releases: control jenkins service by host name

https://gerrit.wikimedia.org/r/1206959

Change #1204684 merged by Dzahn:

[operations/dns@master] fail-over releases.wikimedia.org backend

https://gerrit.wikimedia.org/r/1204684

The bookworm machine, releases2003, is now the production backend.

Waiting a day before reimaging the previous prod machine with bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host releases1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host releases1003.eqiad.wmnet with OS bookworm completed:

  • releases1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511191743_dzahn_507924_releases1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

releases1003 has been upgraded to bookworm and is the standby host now.

Puppet set up the rsync correctly: releases1003 pulls from the active server, releases2003.
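
The rsync direction follows from the two Hiera keys named in the plan. The sketch below models only that relationship. How the Puppet module actually consumes `releases_server` and `releases_servers_failover` is not shown in this task, so the function and its return shape are assumptions.

```python
def rsync_pulls(releases_server, releases_servers_failover):
    """Return (destination, source) pairs: each failover host pulls from the active one."""
    return [
        (standby, releases_server)
        for standby in releases_servers_failover
        if standby != releases_server
    ]
```

Swapping the two values, as change A did, reverses the direction: with releases2003 as `releases_server`, the only pair is releases1003 pulling from releases2003.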